{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear regression exercises\n", "\n", "Credits: Matthew Graham, Pavlos Protopapas\n", "\n", "This notebook provides exercises on linear regression. \n", "\n", "First we need to do some Python setup." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import sys\n", "import numpy as np\n", "import scipy as sp\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import sklearn as sk\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LinearRegression\n", "sns.set(style=\"ticks\")\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and Explore Data\n", "\n", "For these exercises, we're going to use a data set of galaxies with known (spectroscopically confirmed) redshifts and SDSS magnitudes. We're interested in determining the redshift of a galaxy from its colors (photometric redshift).\n", "First we will load the data and have a look at it. The data can be downloaded from: http://www.astro.caltech.edu/~mjg/sdss_gal.csv.gz\n", "\n", "Note that you will need to uncompress the file before using it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sdss_gal_df = pd.read_csv('sdss_gal.csv', low_memory=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sdss_gal_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sdss_gal_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sdss_gal_df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll plot a random subsample of the data to get an idea of what it looks like." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sdss_gal_sample = sdss_gal_df.sample(n=1000, random_state=0)\n", "redshift = sdss_gal_sample['redshift'].values\n", "gr = sdss_gal_sample['g-r'].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n", "\n", "ax.scatter(gr, redshift, color='gray', alpha=0.1)\n", "\n", "ax.set_xlabel('g-r')\n", "ax.set_ylabel('Redshift')\n", "ax.set_title('SDSS galaxy redshifts:\\n SDSS g-r color vs Redshift Scatter Plot')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The syntax of a regressor\n", "\n", "Scikit-learn models have a fixed syntax so it is the same for a simple linear regression operation as it is for something much more complex such as random forest. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The syntax of a regressor\n", "\n", "Scikit-learn models share a common syntax, so a simple linear regression is used in the same way as something much more complex such as a random forest. The specific model is represented as a class with model parameters defined in the class constructor:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Constructor signature, shown for reference:\n", "# sklearn.linear_model.LinearRegression(\n", "#     fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The class will also have a fit method for fitting (training) the model, which takes the data (X) as a Numpy array of shape [n_samples, n_features] and the response values (y) as a Numpy array of shape [n_samples, n_responses]:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def fit(X, y):\n", "    # each regressor provides its own implementation\n", "    pass" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Modeling the Data\n", "\n", "We're going to use the train_test_split method to create our training and test data sets from a random subsample. With test_size=0.33, the test set is half the size of the training set:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sdss_gal_sample = sdss_gal_df.sample(n=1000, random_state=0)\n", "\n", "y = sdss_gal_sample['redshift'].values\n", "X = sdss_gal_sample['g-r'].values\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X.reshape((len(X), 1)), y, test_size=0.33, random_state=0)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we define our basic linear regressor and fit it to the data. You can confirm that the values of the slope and intercept are what you would expect from an MSE loss function." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression = LinearRegression(fit_intercept=True)\n", "regression.fit(X_train, y_train)\n", "\n", "regression_line = lambda x: regression.intercept_ + regression.coef_ * x\n", "print 'The equation of the regression line is: {} + {} * x'.format(regression.intercept_, regression.coef_[0])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "And here is what the regression line looks like with the data:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n", "\n", "x_vals = np.linspace(0, 3, 100)\n", "ax.plot(x_vals, regression_line(x_vals), color='red', linewidth=1.0, label='regression line')\n", "ax.scatter(X_train, y_train, color='gray', alpha=0.1, label='data')\n", "\n", "ax.set_xlabel('g-r')\n", "ax.set_ylabel('Redshift')\n", "ax.set_title('SDSS Galaxy Redshift Data:\\n g-r Color vs Redshift with Regression Line')\n", "ax.legend(loc='best')\n", "\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate and Interpret the Model\n", "\n", "Let's have a bit more of a look at the linear model we've fitted." ] },
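{ "cell_type": "markdown", "metadata": {}, "source": [ "Fitted regressors likewise share a predict method, mapping an array of shape [n_samples, n_features] to predicted responses. A minimal sketch using the model fitted above:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict redshifts for the held-out test colors (sketch)\n", "y_pred = regression.predict(X_test)\n", "print 'First five predicted redshifts:', y_pred[:5]\n", "print 'First five true redshifts:', y_test[:5]" ] },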
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Train vs Test Error\n", "\n", "First, let's see how the MSE values look for the training and the test data sets:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_MSE = np.mean((y_train - regression.predict(X_train))**2)\n", "test_MSE = np.mean((y_test - regression.predict(X_test))**2)\n", "print 'The train MSE is {}, the test MSE is {}'.format(train_MSE, test_MSE)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Uncertainty in the Model Parameter Estimates\n", "\n", "Now we're going to draw a large number of random subsamples, fit each one, and look at the distributions of the model parameters:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def find_regression_params(regression_model, samples):\n", "    sdss_gal_sample = sdss_gal_df.sample(n=samples)\n", "\n", "    y = sdss_gal_sample['redshift'].values\n", "    X = sdss_gal_sample['g-r'].values\n", "\n", "    X_train, X_test, y_train, y_test = train_test_split(X.reshape((len(X), 1)), y, test_size=0.33, random_state=0)\n", "\n", "    regression_model.fit(X_train, y_train)\n", "\n", "    return regression_model.intercept_, regression_model.coef_[0]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression_model = LinearRegression(fit_intercept=True)\n", "\n", "total_draws = 500\n", "samples = 1000\n", "regression_params = []\n", "\n", "for i in range(total_draws):\n", "    if i % 10 == 0:\n", "        out = i * 1. / total_draws * 100\n", "        sys.stdout.write(\"\\r%d%%\" % out)\n", "        sys.stdout.flush()\n", "\n", "    regression_params.append(find_regression_params(regression_model, samples))\n", "\n", "sys.stdout.write(\"\\r%d%%\" % 100)\n", "regression_params = np.array(regression_params)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(10, 5))\n", "\n", "ax[0].hist(regression_params[:, 0], bins=50, color='blue', edgecolor='white', linewidth=1, alpha=0.2)\n", "ax[0].axvline(x=regression_params[:, 0].mean(), color='red', label='mean = {0:.2f}'.format(regression_params[:, 0].mean()))\n", "ax[0].axvline(x=regression_params[:, 0].mean() - 2 * regression_params[:, 0].std(), color='green', linestyle='--', label='std = {0:.2f}'.format(regression_params[:, 0].std()))\n", "ax[0].axvline(x=regression_params[:, 0].mean() + 2 * regression_params[:, 0].std(), color='green', linestyle='--')\n", "\n", "ax[0].set_xlabel('Intercept')\n", "ax[0].set_ylabel('Frequency')\n", "ax[0].set_title('Histogram of Estimates of Intercept')\n", "ax[0].legend(loc='best')\n", "\n", "ax[1].hist(regression_params[:, 1], bins=50, color='gray', edgecolor='white', linewidth=1, alpha=0.2)\n", "ax[1].axvline(x=regression_params[:, 1].mean(), color='red', label='mean = {0:.2f}'.format(regression_params[:, 1].mean()))\n", "ax[1].axvline(x=regression_params[:, 1].mean() - 2 * regression_params[:, 1].std(), color='green', linestyle='--', label='std = {0:.2f}'.format(regression_params[:, 1].std()))\n", "ax[1].axvline(x=regression_params[:, 1].mean() + 2 * regression_params[:, 1].std(), color='green', linestyle='--')\n", "\n", "ax[1].set_xlabel('Slope')\n", "ax[1].set_ylabel('Frequency')\n", "ax[1].set_title('Histogram of Estimates of Slope')\n", "ax[1].legend(loc='best')\n", "\n", "plt.tight_layout()\n", "plt.show()" ] },
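{ "cell_type": "markdown", "metadata": {}, "source": [ "As a cross-check, the resampled standard deviations should be close to the analytic standard errors of an OLS fit. A sketch using statsmodels (assuming it is installed; it is used again later in this notebook):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "from statsmodels.tools import add_constant\n", "\n", "# Analytic OLS standard errors on the original training sample (sketch)\n", "ols_fit = sm.OLS(y_train, add_constant(X_train)).fit()\n", "print 'Analytic SEs (intercept, slope):', ols_fit.bse\n", "print 'Resampled SEs (intercept, slope):', regression_params[:, 0].std(), regression_params[:, 1].std()" ] },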
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3. The Effect of Sample Size on Uncertainty\n", "\n", "We can also look at the effect of sample size on the model parameters:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regression_model = LinearRegression(fit_intercept=True)\n", "\n", "def compute_SE(total_draws, samples, regression_model):\n", "\n", "    regression_params = []\n", "\n", "    for i in range(total_draws):\n", "        regression_params.append(find_regression_params(regression_model, samples))\n", "\n", "    regression_params = np.array(regression_params)\n", "    return np.std(regression_params[:, 0]), np.std(regression_params[:, 1])\n", "\n", "total_draws = 100\n", "samples = range(100, 10001, 900)\n", "ses = []\n", "\n", "for i in range(len(samples)):\n", "    out = i * 1. / len(samples) * 100\n", "    sys.stdout.write(\"\\r%d%%\" % out)\n", "    sys.stdout.flush()\n", "    ses.append(compute_SE(total_draws, samples[i], regression_model))\n", "\n", "sys.stdout.write(\"\\r%d%%\" % 100)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ses = np.array(ses)\n", "\n", "fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n", "\n", "ax.plot(samples, ses[:, 0], color='teal', label='SE of intercept')\n", "ax.plot(samples, ses[:, 1], color='brown', label='SE of slope')\n", "\n", "ax.set_xlabel('Number of Samples')\n", "ax.set_ylabel('SE')\n", "ax.set_title('Uncertainty in Parameter Estimates vs Sample Size')\n", "ax.legend(loc='best')\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Residual Analysis\n", "\n", "In residual analysis we want to check that the residuals are uncorrelated and normally distributed." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sdss_gal_df = pd.read_csv('sdss_gal.csv', low_memory=False)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sdss_gal_sample = sdss_gal_df.sample(n=1000, random_state=0)\n", "\n", "y = sdss_gal_sample['redshift'].values\n", "X = sdss_gal_sample['g-r'].values\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X.reshape((len(X), 1)), y, test_size=0.33, random_state=0)\n", "\n", "regression = LinearRegression(fit_intercept=True)\n", "regression.fit(X_train, y_train)\n", "\n", "regression_line = lambda x: regression.intercept_ + regression.coef_ * x\n", "print 'The equation of the regression line is: {} + {} * x'.format(regression.intercept_, regression.coef_[0])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also check the value of $R^2$ for both the training and test data sets." ] },
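{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, $R^2$ compares the residual sum of squares to the total sum of squares around the mean:\n", "\n", "$R^2 = 1 - \\frac{\\sum_i (y_i - \\hat{y}_i)^2}{\\sum_i (y_i - \\bar{y})^2}$\n", "\n", "so $R^2 = 1$ is a perfect fit and $R^2 = 0$ is no better than predicting the mean." ] },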
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_R_sq = regression.score(X_train, y_train)\n", "test_R_sq = regression.score(X_test, y_test)\n", "print 'The train R^2 is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(10, 5))\n", "\n", "errors = y_train - regression.predict(X_train)\n", "ax[0].scatter(X_train, errors, color='blue', alpha=0.1, label='residuals')\n", "ax[0].axhline(y=0, color='gold', label='zero error')\n", "\n", "\n", "ax[0].set_xlabel('g-r')\n", "ax[0].set_ylabel('redshift')\n", "ax[0].set_title('SDSS Galaxy Redshift Data:\\n Predictor vs Residual')\n", "ax[0].legend(loc='best')\n", "\n", "ax[1].hist(errors, color='blue', alpha=0.1, label='residuals', bins=50, edgecolor='white', linewidth=2)\n", "ax[1].axvline(x=0, color='gold', label='zero error')\n", "\n", "\n", "ax[1].set_xlabel('g-r')\n", "ax[1].set_ylabel('redshift')\n", "ax[1].set_title('SDSS Galaxy Redshift Data:\\n Residual Histogram')\n", "ax[1].legend(loc='best')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple Linear Regression\n", "\n", "Clearly redshift is not just a function of one color but of several so here we will look at fitting a linear model wiht multiple parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sdss_gal_sample = sdss_gal_df.sample(n=1000, random_state=0)\n", "#sdss_gal_sample['lpep_pickup_datetime'] = nyc_cab_sample['lpep_pickup_datetime'].apply(lambda dt: pd.to_datetime(dt).hour)\n", "#sdss_gal_sample['Lpep_dropoff_datetime'] = nyc_cab_sample['Lpep_dropoff_datetime'].apply(lambda dt: pd.to_datetime(dt).hour)\n", "msk = np.random.rand(len(sdss_gal_sample)) < 0.8\n", "train = sdss_gal_sample[msk]\n", "test = sdss_gal_sample[~msk]\n", "\n", "y_train = train['redshift'].values\n", "X_train = train[['g-r', 'r-i', 'i-z']].values\n", "\n", "y_test = test['redshift'].values\n", "X_test = test[['g-r', 'r-i', 'i-z']].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "multi_regression_model = LinearRegression(fit_intercept=True)\n", "multi_regression_model.fit(X_train, y_train)\n", "\n", "print 'The equation of the regression plane is: {} + {}^T . x'.format(multi_regression_model.intercept_, multi_regression_model.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Train vs Test Error\n", "\n", "Again we can look at various model diagnostics: first comparing the MSE and $R^2$ values for the training and test data sets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_MSE= np.mean((y_train - multi_regression_model.predict(X_train))**2)\n", "test_MSE= np.mean((y_test - multi_regression_model.predict(X_test))**2)\n", "print 'The train MSE is {}, the test MSE is {}'.format(train_MSE, test_MSE)\n", "\n", "train_R_sq = multi_regression_model.score(X_train, y_train)\n", "test_R_sq = multi_regression_model.score(X_test, y_test)\n", "print 'The train R^2 is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Uncertainty in the Model Parameter Estimates\n", "\n", "And the uncertainties in each parameter estimate (the intercept and the slopes of g-r, r-i and i-z):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def find_regression_params(regression_model, samples, cols):\n", "    sdss_gal_sample = sdss_gal_df.sample(n=samples)\n", "\n", "    y = sdss_gal_sample['redshift'].values\n", "    X = sdss_gal_sample[cols].values\n", "\n", "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)\n", "\n", "    regression_model.fit(X_train, y_train)\n", "\n", "    return np.hstack((np.array([regression_model.intercept_]), regression_model.coef_))\n", "\n", "\n", "def plot_hist_se(vals, bins, title, xlabel, ax):\n", "    mean = vals.mean()\n", "    std = vals.std()\n", "    ax.hist(vals, bins=bins, color='blue', edgecolor='white', linewidth=1, alpha=0.2)\n", "    ax.axvline(mean, color='red', label='mean = {0:.2f}'.format(mean))\n", "    ax.axvline(mean - 2 * std, color='green', linestyle='--', label='std = {0:.2f}'.format(std))\n", "    ax.axvline(mean + 2 * std, color='green', linestyle='--')\n", "\n", "    ax.set_xlabel(xlabel)\n", "    ax.set_ylabel('Frequency')\n", "    ax.set_title(title)\n", "    ax.legend(loc='best')\n", "\n", "    return ax" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total_draws = 500\n", "samples = 1000\n", "regression_params = []\n", "\n", "for i in range(total_draws):\n", "    if i % 10 == 0:\n", "        out = i * 1. / total_draws * 100\n", "        sys.stdout.write(\"\\r%d%%\" % out)\n", "        sys.stdout.flush()\n", "\n", "    regression_params.append(find_regression_params(multi_regression_model, samples, ['g-r', 'r-i', 'i-z']))\n", "\n", "sys.stdout.write(\"\\r%d%%\" % 100)\n", "regression_params = np.array(regression_params)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 4, figsize=(20, 5))\n", "\n", "ax[0] = plot_hist_se(regression_params[:, 0], 50, 'Histogram of Estimates of Intercept', 'Intercept', ax[0])\n", "ax[1] = plot_hist_se(regression_params[:, 1], 50, 'Histogram of Estimates of Slope of g-r', 'Slope of g-r', ax[1])\n", "ax[2] = plot_hist_se(regression_params[:, 2], 50, 'Histogram of Estimates of Slope of r-i', 'Slope of r-i', ax[2])\n", "ax[3] = plot_hist_se(regression_params[:, 3], 50, 'Histogram of Estimates of Slope of i-z', 'Slope of i-z', ax[3])\n", "\n", "plt.tight_layout()\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating the Significance of Predictors" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from statsmodels.tools import add_constant\n", "\n", "predictors_multiple = ['g-r', 'r-i', 'i-z']\n", "predictors_simple = ['g-r']\n", "\n", "X_train_multi = add_constant(train[predictors_multiple].values)\n", "X_test_multi = add_constant(test[predictors_multiple].values)\n", "\n", "X_train_simple = add_constant(train[predictors_simple].values)\n", "X_test_simple = add_constant(test[predictors_simple].values)" ] },
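{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference before the subsections below: the overall F-statistic tests the null hypothesis that all slope coefficients are zero. With $n$ observations, $p$ predictors, total sum of squares TSS and residual sum of squares RSS,\n", "\n", "$F = \\frac{(\\mathrm{TSS} - \\mathrm{RSS})/p}{\\mathrm{RSS}/(n - p - 1)}$\n", "\n", "while the reported p-values come from per-coefficient t-tests." ] },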
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Measuring Significance Using F-Stat, p-Values" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "\n", "multi_regression_model = sm.OLS(y_train, X_train_multi).fit()\n", "print 'F-stat:', multi_regression_model.fvalue\n", "print 'p-values: {} (intercept), {} (g-r), {} (r-i), {} (i-z)'.format(*multi_regression_model.pvalues)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Measuring Significance Using AIC/BIC" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print \"AIC for ['g-r', 'r-i', 'i-z']:\", multi_regression_model.aic\n", "print \"BIC for ['g-r', 'r-i', 'i-z']:\", multi_regression_model.bic" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simple_regression_model = sm.OLS(y_train, X_train_simple).fit()\n", "print \"AIC for ['g-r']:\", simple_regression_model.aic\n", "print \"BIC for ['g-r']:\", simple_regression_model.bic" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Measuring Significance Using R^2" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simple_model = LinearRegression(fit_intercept=False)\n", "simple_model.fit(X_train_simple, y_train)\n", "\n", "print \"Simple Model: train R^2 = {}, test R^2 = {}\".format(simple_model.score(X_train_simple, y_train), simple_model.score(X_test_simple, y_test))\n", "\n", "multiple_model = LinearRegression(fit_intercept=False)\n", "multiple_model.fit(X_train_multi, y_train)\n", "\n", "print \"Multiple Predictor Model: train R^2 = {}, test R^2 = {}\".format(multiple_model.score(X_train_multi, y_train), multiple_model.score(X_test_multi, y_test))" ] },
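{ "cell_type": "markdown", "metadata": {}, "source": [ "Train $R^2$ can only go up as predictors are added, which motivates the comparison in the next subsection. One common correction is the adjusted $R^2$, which penalizes model size; a minimal sketch for the two models above ($n$ samples, $k$ predictors):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)  (sketch)\n", "n = len(y_train)\n", "for name, model, X, k in [('simple', simple_model, X_train_simple, 1),\n", "                          ('multiple', multiple_model, X_train_multi, 3)]:\n", "    R_sq = model.score(X, y_train)\n", "    adj_R_sq = 1 - (1 - R_sq) * (n - 1.) / (n - k - 1)\n", "    print '{} model: train R^2 = {}, adjusted R^2 = {}'.format(name, R_sq, adj_R_sq)" ] },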
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 4. The Effect of Number of Predictors on R^2" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "multi_regression_model = LinearRegression(fit_intercept=True)\n", "\n", "cols = ['u-g', 'g-r', 'r-i', 'i-z']\n", "train_R_sq = []\n", "test_R_sq = []\n", "for i in range(1, len(cols) + 1):\n", "    predictors = cols[:i]\n", "    X_train = train[predictors].values\n", "    X_test = test[predictors].values\n", "\n", "    multi_regression_model.fit(X_train, y_train)\n", "\n", "    train_R_sq.append(multi_regression_model.score(X_train, y_train))\n", "    test_R_sq.append(multi_regression_model.score(X_test, y_test))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n", "\n", "ax.plot(range(1, len(cols) + 1), train_R_sq, color='blue', label='train R^2')\n", "ax.plot(range(1, len(cols) + 1), test_R_sq, color='red', label='test R^2')\n", "\n", "ax.set_title('Number of Predictors vs Model Fitness')\n", "ax.set_xlabel('Number of Predictors')\n", "ax.set_ylabel('R^2')\n", "ax.legend(loc='best')\n", "\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple Linear Regression with Interaction Terms\n", "\n", "Now we'll consider multiple linear models with cross terms:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "\n", "y_train = train['redshift'].values\n", "X_train = train[['g-r', 'r-i', 'i-z']].values\n", "\n", "y_test = test['redshift'].values\n", "X_test = test[['g-r', 'r-i', 'i-z']].values\n", "\n", "# note: the transformed output already includes a bias column and the original features\n", "gen_cross_terms = PolynomialFeatures(degree=3, interaction_only=True)\n", "cross_terms = gen_cross_terms.fit_transform(X_train)\n", "X_train_with_cross = np.hstack((X_train, cross_terms))\n", "cross_terms = gen_cross_terms.transform(X_test)\n", "X_test_with_cross = np.hstack((X_test, cross_terms))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's get fit statistics for the model with cross terms:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "multi_regression_model = LinearRegression(fit_intercept=True)\n", "multi_regression_model.fit(X_train_with_cross, y_train)\n", "\n", "train_MSE = np.mean((y_train - multi_regression_model.predict(X_train_with_cross))**2)\n", "test_MSE = np.mean((y_test - multi_regression_model.predict(X_test_with_cross))**2)\n", "print 'The train MSE with interaction terms is {}, the test MSE is {}'.format(train_MSE, test_MSE)\n", "\n", "train_R_sq = multi_regression_model.score(X_train_with_cross, y_train)\n", "test_R_sq = multi_regression_model.score(X_test_with_cross, y_test)\n", "print 'The train R^2 with interaction terms is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "And compare to models without the cross terms:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "multi_regression_model.fit(X_train, y_train)\n", "\n", "train_MSE = np.mean((y_train - multi_regression_model.predict(X_train))**2)\n", "test_MSE = np.mean((y_test - multi_regression_model.predict(X_test))**2)\n", "print 'The train MSE without interaction terms is {}, the test MSE is {}'.format(train_MSE, test_MSE)\n", "\n", "train_R_sq = multi_regression_model.score(X_train, y_train)\n", "test_R_sq = multi_regression_model.score(X_test, y_test)\n", "print 'The train R^2 without interaction terms is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Polynomial Regression" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_train = train['redshift'].values\n", "X_train = train[['g-r', 'r-i', 'i-z']].values\n", "\n", "y_test = test['redshift'].values\n", "X_test = test[['g-r', 'r-i', 'i-z']].values\n", "\n", "gen_poly_terms = PolynomialFeatures(degree=2, interaction_only=False)\n", "X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n", "X_test_with_poly = gen_poly_terms.transform(X_test)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "poly_regression_model = LinearRegression(fit_intercept=True)\n", "poly_regression_model.fit(X_train_with_poly, y_train)\n", "\n", "train_MSE = np.mean((y_train - poly_regression_model.predict(X_train_with_poly))**2)\n", "test_MSE = np.mean((y_test - poly_regression_model.predict(X_test_with_poly))**2)\n", "print 'The train MSE for degree 2 poly model is {}, the test MSE is {}'.format(train_MSE, test_MSE)\n", "\n", "train_R_sq = poly_regression_model.score(X_train_with_poly, y_train)\n", "test_R_sq = poly_regression_model.score(X_test_with_poly, y_test)\n", "print 'The train R^2 for degree 2 poly model is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)" ] },
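{ "cell_type": "markdown", "metadata": {}, "source": [ "How many terms does a degree-2 expansion of three colors produce? A quick sanity check of the design-matrix shapes (the expansion includes a bias column, the 3 linear terms and 6 quadratic terms, i.e. 10 columns):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1 bias + 3 linear + 6 quadratic terms = 10 columns (sketch)\n", "print 'X_train_with_poly shape:', X_train_with_poly.shape\n", "print 'X_test_with_poly shape:', X_test_with_poly.shape" ] },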
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Effect of Polynomial Degree on Model Performance" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "train_R_sq = []\n", "test_R_sq = []\n", "max_deg = 10\n", "\n", "min_max_scaler = MinMaxScaler()\n", "X_train = min_max_scaler.fit_transform(X_train)\n", "X_test = min_max_scaler.transform(X_test)\n", "\n", "for d in range(max_deg + 1):\n", "\n", "    out = d * 1. / max_deg * 100\n", "    sys.stdout.write(\"\\r%d%%\" % out)\n", "    sys.stdout.flush()\n", "\n", "    gen_poly_terms = PolynomialFeatures(degree=d, interaction_only=False)\n", "    X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n", "    X_test_with_poly = gen_poly_terms.transform(X_test)\n", "\n", "    poly_regression_model = LinearRegression(fit_intercept=False)\n", "    poly_regression_model.fit(X_train_with_poly, y_train)\n", "\n", "    train_R_sq.append(poly_regression_model.score(X_train_with_poly, y_train))\n", "    test_R_sq.append(poly_regression_model.score(X_test_with_poly, y_test))\n", "\n", "sys.stdout.write(\"\\r%d%%\" % 100)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(15, 5))\n", "\n", "ax[0].plot(range(max_deg + 1), np.array(train_R_sq), color='blue', label='train R^2')\n", "\n", "ax[0].set_title('Polynomial Degree vs Model Fitness')\n", "ax[0].set_xlabel('Degree of Polynomial')\n", "ax[0].set_ylabel('Train R^2')\n", "ax[0].legend(loc='best')\n", "\n", "ax[1].plot(range(max_deg + 1), test_R_sq, color='red', label='test R^2')\n", "\n", "ax[1].set_title('Polynomial Degree vs Model Fitness')\n", "ax[1].set_xlabel('Degree of Polynomial')\n", "ax[1].set_ylabel('Test R^2')\n", "ax[1].legend(loc='best')\n", "\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Examples of Overfitting" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We're going to replace the train_test_split method with our own version that takes a number of samples and can also return a validation set if requested (note that this shadows the sklearn function of the same name for the rest of the notebook):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sdss_gal_df = pd.read_csv('sdss_gal.csv', low_memory=False)\n", "\n", "def train_test_split(df, n_samples, validation=False):\n", "    sdss_gal_sample = df.sample(n=n_samples)\n", "\n", "    msk = np.random.rand(len(sdss_gal_sample)) < 0.8\n", "    non_test = sdss_gal_sample[msk]\n", "    test = sdss_gal_sample[~msk]\n", "\n", "    if validation:\n", "        msk = np.random.rand(len(non_test)) < 0.7\n", "        train = non_test[msk]\n", "        validation = non_test[~msk]\n", "\n", "        return train, validation, test\n", "\n", "    else:\n", "        return non_test, test" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Too Many Features\n", "\n", "The SDSS galaxy redshift data set only has four color features, but with a very small training set even these can overfit; a higher-dimensional data set would be more prone to this effect:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train, test = train_test_split(sdss_gal_df, 50)\n", "y_train = train['redshift'].values\n", "y_test = test['redshift'].values\n", "\n", "multi_regression_model = LinearRegression(fit_intercept=True)\n", "\n", "cols = ['u-g', 'g-r', 'r-i', 'i-z']\n", "\n", "train_R_sq = []\n", "test_R_sq = []\n", "for i in range(1, len(cols) + 1):\n", "    predictors = cols[:i]\n", "    X_train = train[predictors].values\n", "    X_test = test[predictors].values\n", "\n", "    multi_regression_model.fit(X_train, y_train)\n", "\n", "    train_R_sq.append(multi_regression_model.score(X_train, y_train))\n", "    test_R_sq.append(multi_regression_model.score(X_test, y_test))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n", "\n", "ax.plot(range(1, len(cols) + 1), train_R_sq, color='blue', label='train R^2')\n", "ax.plot(range(1, len(cols) + 1), test_R_sq, color='red', label='test R^2')\n", "\n", "ax.set_title('Number of Predictors vs Model Fitness')\n", "ax.set_xlabel('Number of Predictors')\n", "ax.set_ylabel('R^2')\n", "ax.legend(loc='best')\n", "\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Degree Too High\n", "\n", "We've already seen this for the polynomial with multiple predictors, but we can demonstrate it even with just one predictor:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train, test = train_test_split(sdss_gal_df, 5000)\n", "y_train = train['redshift'].values\n", "y_test = test['redshift'].values\n", "\n", "poly_regression_model = LinearRegression(fit_intercept=False)\n", "\n", "X_train = train[['g-r']].values\n", "X_test = test[['g-r']].values\n", "\n", "train_R_sq = []\n", "test_R_sq = []\n", "max_deg = 30\n", "\n", "min_max_scaler = MinMaxScaler()\n", "X_train = min_max_scaler.fit_transform(X_train)\n", "X_test = min_max_scaler.transform(X_test)\n", "\n", "for d in range(1, max_deg):\n", "\n", "    gen_poly_terms = PolynomialFeatures(degree=d)\n", "    X_train_poly = gen_poly_terms.fit_transform(X_train)\n", "    X_test_poly = gen_poly_terms.transform(X_test)\n", "\n", "    poly_regression_model.fit(X_train_poly, y_train)\n", "\n", "    train_R_sq.append(poly_regression_model.score(X_train_poly, y_train))\n", "    test_R_sq.append(poly_regression_model.score(X_test_poly, y_test))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(15, 5))\n", "\n", "ax[0].plot(range(1, max_deg), np.array(train_R_sq), color='blue', label='train R^2')\n", "\n", "ax[0].set_title('Polynomial Degree vs Model Fitness')\n", "ax[0].set_xlabel('Degree of Polynomial')\n", "ax[0].set_ylabel('Train R^2')\n", "ax[0].legend(loc='best')\n", "\n", "ax[1].plot(range(1, max_deg), test_R_sq, color='red', label='test R^2')\n", "\n", "ax[1].set_title('Polynomial Degree vs Model Fitness')\n", "ax[1].set_xlabel('Degree of Polynomial')\n", "ax[1].set_ylabel('Test R^2')\n", "ax[1].legend(loc='best')\n", "\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters are Too Extreme" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train = np.linspace(0, 40, 20)\n", "y_train = np.hstack((2 * X_train + 10, np.array([(2 * 45 + 10) * 10000])))\n", "X_train = np.hstack((X_train, np.array([45])))\n", "\n", "regression_model = LinearRegression(fit_intercept=True)\n", "regression_model.fit(X_train.reshape((len(X_train), 1)), y_train)" ] },
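{ "cell_type": "markdown", "metadata": {}, "source": [ "The single extreme point dominates the least-squares fit. A quick sketch comparing the fitted slope with and without that point (the clean data were generated with slope 2):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Refit without the last (extreme) point and compare slopes (sketch)\n", "clean_model = LinearRegression(fit_intercept=True)\n", "clean_model.fit(X_train[:-1].reshape((len(X_train) - 1, 1)), y_train[:-1])\n", "print 'slope with outlier: {}, slope without: {}'.format(regression_model.coef_[0], clean_model.coef_[0])" ] },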
"ax.set_title('Number of Polynomial Degree vs Model Fitness')\n", "ax.set_xlabel('Degree of Polynomial')\n", "ax.set_ylabel('Train R^2')\n", "ax.legend(loc='best')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross Validation For Selecting Polynomial Model Degree\n", "\n", "We're going to use k-fold cross validation to determine the degree of the best polynomial for our data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import KFold\n", "\n", "non_test, test = train_test_split(sdss_gal_df, 5000)\n", "regression_model = LinearRegression(fit_intercept=False)\n", "min_max_scaler = MinMaxScaler()\n", "\n", "y_non_test = non_test['redshift'].values\n", "y_test = test['redshift'].values\n", "\n", "X_non_test = non_test['g-r'].values.reshape((len(non_test), 1))\n", "X_test = test['g-r'].values.reshape((len(test), 1))\n", "\n", "X_non_test = min_max_scaler.fit_transform(X_non_test)\n", "X_test = min_max_scaler.fit_transform(X_test)\n", "\n", "kf = KFold(n_splits=10)\n", "\n", "x_val_scores = []\n", "\n", "for d in range(1, 20):\n", "\n", " gen_poly_terms = PolynomialFeatures(degree=d, interaction_only=False)\n", " X_non_test_poly = gen_poly_terms.fit_transform(X_non_test)\n", "\n", " validation_R_sqs = []\n", " for train_index, val_index in kf.split(X_non_test_poly):\n", " X_train, X_val = X_non_test_poly[train_index], X_non_test_poly[val_index]\n", " y_train, y_val = y_non_test[train_index], y_non_test[val_index]\n", "\n", " regression_model.fit(X_train, y_train)\n", " validation_R_sqs.append(regression_model.score(X_val, y_val))\n", " \n", " x_val_scores.append(np.mean(validation_R_sqs))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n", "\n", "ax.plot(range(1, 20), x_val_scores, color='blue')\n", "\n", "ax.set_title('Polynomial Degree vs Cross Validation Score')\n", "ax.set_xlabel('Degree of Polynomial')\n", "ax.set_ylabel('Cross Validation Score')\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_degree = range(1, 20)[np.argmax(x_val_scores)]\n", "\n", "gen_poly_terms = PolynomialFeatures(degree=d, interaction_only=False)\n", "X_non_test_poly = gen_poly_terms.fit_transform(X_non_test)\n", "X_test_poly = gen_poly_terms.fit_transform(X_test)\n", "\n", "regression_model.fit(X_non_test_poly, y_non_test)\n", "test_R_sq = (regression_model.score(X_test_poly, y_test))\n", "\n", "print 'best degree is:', best_degree\n", "print 'the test R^2 for a degree {} model is: {}'.format(best_degree, test_R_sq)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing Ridge and LASSO Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train, validation, test = train_test_split(sdss_gal_df, 5000, validation=True)\n", "\n", "y_train = train['redshift'].values\n", "y_val = validation['redshift'].values\n", "y_test = test['redshift'].values\n", "\n", "regression_model = LinearRegression(fit_intercept=True)\n", "\n", "all_predictors = ['u-g', 'g-r', 'r-i', 'i-z']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Ridge Regression" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train = train[all_predictors].values\n", "X_val = validation[all_predictors].values\n", "X_test = test[all_predictors].values\n", "\n", "ridge_regression = Ridge(alpha=1.0, fit_intercept=True)\n", "ridge_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n", "\n", "print 'Ridge regression model:\\n {} + {}^T . x'.format(ridge_regression.intercept_, ridge_regression.coef_)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print 'Train R^2: {}, test R^2: {}'.format(ridge_regression.score(np.vstack((X_train, X_val)), np.hstack((y_train, y_val))), ridge_regression.score(X_test, y_test))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. LASSO Regression" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_regression = Lasso(alpha=1.0, fit_intercept=True)\n", "lasso_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n", "\n", "print 'Lasso regression model:\\n {} + {}^T . x'.format(lasso_regression.intercept_, lasso_regression.coef_)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print 'Train R^2: {}, test R^2: {}'.format(lasso_regression.score(np.vstack((X_train, X_val)), np.hstack((y_train, y_val))), lasso_regression.score(X_test, y_test))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The Effect of the Regularization Parameter" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reg_params = np.hstack((10.**np.arange(-7, 0), 10**np.arange(0, 4) + 0.01))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Ridge Regression" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train_R_sq = []\n", "test_R_sq = []\n", "\n", "for reg in reg_params:\n", "    ridge_regression = Ridge(alpha=reg, fit_intercept=True)\n", "    ridge_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n", "\n", "    train_R_sq.append(ridge_regression.score(np.vstack((X_train, X_val)), np.hstack((y_train, y_val))))\n", "    test_R_sq.append(ridge_regression.score(X_test, y_test))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n", "\n", "ax.plot(reg_params, train_R_sq, color='blue', label='train')\n", "ax.plot(reg_params, test_R_sq, color='red', label='test')\n", "\n", "ax.set_xscale('log')\n", "ax.set_title('Regularization Parameter vs R^2')\n", "ax.set_xlabel('Regularization parameter alpha')\n", "ax.set_ylabel('R^2')\n", "ax.legend(loc='best')\n", "\n", "plt.show()" ] },
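{ "cell_type": "markdown", "metadata": {}, "source": [ "To see the shrinkage directly, a quick sketch printing the ridge coefficients for a weak, a moderate and a strong penalty:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Coefficients shrink toward zero as alpha grows (sketch)\n", "for reg in [1e-7, 1.0, 1000.0]:\n", "    ridge_regression = Ridge(alpha=reg, fit_intercept=True)\n", "    ridge_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n", "    print 'alpha = {}: coefficients = {}'.format(reg, ridge_regression.coef_)" ] },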
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. LASSO Regression" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train_R_sq = []\n", "test_R_sq = []\n", "\n", "for reg in reg_params:\n", "    lasso_regression = Lasso(alpha=reg, fit_intercept=True)\n", "    lasso_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n", "\n", "    train_R_sq.append(lasso_regression.score(np.vstack((X_train, X_val)), np.hstack((y_train, y_val))))\n", "    test_R_sq.append(lasso_regression.score(X_test, y_test))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n", "\n", "ax.plot(reg_params, train_R_sq, color='blue', label='train')\n", "ax.plot(reg_params, test_R_sq, color='red', label='test')\n", "\n", "ax.set_xscale('log')\n", "ax.set_title('Regularization Parameter vs R^2')\n", "ax.set_xlabel('Regularization parameter alpha')\n", "ax.set_ylabel('R^2')\n", "ax.legend(loc='best')\n", "\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Cross Validation: Selecting the Regularization Parameter" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_non_test = np.vstack((X_train, X_val))\n", "y_non_test = np.hstack((y_train, y_val))\n", "\n", "kf = KFold(n_splits=10)\n", "\n", "x_val_scores = []\n", "\n", "for reg in reg_params:\n", "    ridge_regression = Ridge(alpha=reg, fit_intercept=True)\n", "\n", "    validation_R_sqs = []\n", "    for train_index, val_index in kf.split(X_non_test):\n", "        X_train, X_val = X_non_test[train_index], X_non_test[val_index]\n", "        y_train, y_val = y_non_test[train_index], y_non_test[val_index]\n", "\n", "        ridge_regression.fit(X_train, y_train)\n", "        validation_R_sqs.append(ridge_regression.score(X_val, y_val))\n", "\n", "    x_val_scores.append(np.mean(validation_R_sqs))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n", "\n", "ax.plot(reg_params, x_val_scores, color='blue')\n", "\n", "ax.set_xscale('log')\n", "ax.set_title('Regularization Strength vs Cross Validation Score')\n", "ax.set_xlabel('Strength of Regularization')\n", "ax.set_ylabel('Cross Validation Score')\n", "\n", "plt.show()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_alpha = reg_params[np.argmax(x_val_scores)]\n", "\n", "ridge_regression = Ridge(alpha=best_alpha, fit_intercept=True)\n", "ridge_regression.fit(X_non_test, y_non_test)\n", "test_R_sq = ridge_regression.score(X_test, y_test)\n", "\n", "print 'best regularization param is:', best_alpha\n", "print 'the test R^2 for ridge regression with alpha = {} is: {}'.format(best_alpha, test_R_sq)" ] },
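{ "cell_type": "markdown", "metadata": {}, "source": [ "scikit-learn also packages this search: RidgeCV cross-validates over a grid of regularization strengths internally. A minimal sketch on the same data:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import RidgeCV\n", "\n", "# Built-in cross-validated ridge over the same alpha grid (sketch)\n", "ridge_cv = RidgeCV(alphas=reg_params, fit_intercept=True)\n", "ridge_cv.fit(X_non_test, y_non_test)\n", "print 'RidgeCV selected alpha:', ridge_cv.alpha_" ] },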
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Bias-Variance Tradeoff" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Consider a regression model that fits the *training data* perfectly (e.g. a high-order polynomial). This complex model will have a high variance when regressing new data. On the other hand, a model that is too simple will have very low variance, but at the same time a high bias. There is a tradeoff between bias and variance in terms of the MSE of the model: writing $y = f(x) + \\epsilon$ with noise variance $\\sigma^2$,\n", "\n", "$\\mathrm{MSE}(x) = \\mathbb{E}\\left[(y - \\hat{f}(x))^2\\right] = \\mathrm{bias}_x^2 + \\mathrm{variance}_x + \\sigma^2$, where\n", "\n", "$\\mathrm{bias}_x = \\mathbb{E}(\\hat{f}(x)) - f(x)$\n", "\n", "$\\mathrm{variance}_x = \\mathrm{Var}(\\hat{f}(x)) = \\mathbb{E}\\left[(\\hat{f}(x) - \\mathbb{E}(\\hat{f}(x)))^2\\right]$" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.arange(0, 2*np.pi, 0.01)\n", "y = np.cos(2*x)\n", "\n", "sigma_noise = 0.1\n", "N = 5\n", "\n", "plt.plot(x, y, \"k\")\n", "colors = ['r', 'b', 'g', 'c', 'm', 'y']\n", "for i in range(N):\n", "    x_data = np.random.random(10)*2*np.pi\n", "    y_data = np.cos(2*x_data) + np.random.normal(scale=sigma_noise, size=10)\n", "\n", "    # degree-0 polynomial: just the mean of the sample\n", "    p0 = np.poly1d(np.polyfit(x_data, y_data, 0))\n", "    y0 = p0(x)\n", "\n", "    plt.plot(x_data, y_data, \"o\" + colors[i])\n", "    plt.plot(x, y0, colors[i])\n", "plt.title(\"Degree-0 poly: high bias, low variance\")\n", "plt.ylim([-3, 3])\n", "plt.show()\n", "\n", "plt.plot(x, y, \"k\")\n", "for i in range(N):\n", "    x_data = np.random.random(10)*2*np.pi\n", "    y_data = np.cos(2*x_data) + np.random.normal(scale=sigma_noise, size=10)\n", "\n", "    # degree-8 polynomial: follows each noisy sample closely\n", "    p8 = np.poly1d(np.polyfit(x_data, y_data, 8))\n", "    y8 = p8(x)\n", "\n", "    plt.plot(x_data, y_data, \"o\" + colors[i])\n", "    plt.plot(x, y8, colors[i])\n", "plt.title(\"Degree-8 poly: low bias, high variance\")\n", "plt.ylim([-3, 3])\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In general, the more biased our model is, the lower the variance, and vice versa. The good news is that we can balance bias and variance using the RMSE over a _test set_. This is the same cross-validation idea used in classification." ] }
], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" }, "name": "_merged" }, "nbformat": 4, "nbformat_minor": 1 }