{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised Learning II\n", "by Mauricio Araya\n", "## (A.K.A. Supervised Learning Lab)\n", "\n", "Credit: Guillermo Cabrera, Matthew Graham, and Scikit Learn\n", "\n", "(Yes... I will force you to code now...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.- Regularized Regression... or ridding alone\n", "\n", "### Objective\n", "* Understand the effect of the regularization parameters\n", "* Compare Ridge and Lasso\n", "* Warm up!\n", "\n", "### Regularization Reminder (slide)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "### a) Forging your own Galaxy Photometry/Redshift dataset\n", "* Use my Tuesday's notebook to load the `sdss_gal.csv`\n", "* Downsample the data (use $n=10000$ for example)\n", "* Divide into training and test/validation\n", "* Select the 'u-g' feature (and hack it to be a matrix)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "galaxy_feat = pd.read_csv('sdss_gal.csv', low_memory=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### b) Use the scikit-learn to perform a ridge regression \n", "* Use now a polynomial model (degree = 10 for example)\n", "* Plot your data and the curve line (use the `.predict()` function, not manually)\n", "* Plot the parameters of the regression using a bar plot " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge,Lasso\n", "from sklearn.preprocessing import PolynomialFeatures" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### c) See how regularization works\n", "* Pack the code into a function that recieves the degree of the polynomial and the alpha parameter for regularizing Ridge.\n", "* This function should overplot the regression line to the data, and in a separate plot the parameters of the regression using a bar plot.\n", "* Use interact to play with the two values ($d \\in [2,15]$, $\\alpha \\in [0,1]$)\n", "* Do the same for LASSO regularization" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def train_and_plot(d=10,a=0.0):\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.- Linear Regression... without training wheels\n", "\n", "### Objective\n", "\n", "As an exercise lets do a linear regression without training wheels:\n", "* Left wheel: $x \\in \\mathbb{R}$ --> $\\mathbf{x} \\in \\mathbb{R}^n$ variable\n", "* Right wheel: `scikit-learn` package --> just `pandas`, `numpy` and `scipy`\n", "\n", "### Theory\n", "What is a linear model?\n", "$$ y = f(\\mathbf{x};\\mathbf{w}) = \\sum_j w_j \\phi_j(\\mathbf{x}) = \\mathbf{w}^\\textrm{T} \\boldsymbol{\\phi}(\\mathbf{x}) $$\n", "\n", "Examples:\n", "* Polinomial: $\\phi_j(\\mathbf{x}) = \\|\\mathbf{x}\\|^j$\n", "* Gaussian: $\\phi_j(\\mathbf{x}) = \\exp\\left\\{\\frac{- \\|\\mathbf{x} - \\boldsymbol{\\mu}_j\\|^2}{2s^2}\\right\\}$\n", "* Sigmoidal: $\\phi_j(\\mathbf{x}) = \\sigma\\left(\\frac{\\|\\mathbf{x} - \\boldsymbol{\\mu}_j\\|}{s}\\right)= \\frac{1}{1 + \\exp\\left( \\frac{-\\|\\mathbf{x} - \\boldsymbol{\\mu}_j\\|}{s}\\right)}$" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "28373cb2f3aa4eb297d6b28e9e3154d8", "version_major": 2, "version_minor": 0 }, "text/html": [ "

Failed to display Jupyter Widget of type interactive.

\n", "

\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the Jupyter\n", " Widgets Documentation for setup instructions.\n", "

\n", "

\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or NBViewer),\n", " it may mean that your frontend doesn't currently support widgets.\n", "

\n" ], "text/plain": [ "interactive(children=(FloatSlider(value=0.0, description='mu', max=2.0, min=-2.0), FloatSlider(value=1.0, description='s', max=10.0, min=0.1), Output()), _dom_classes=('widget-interact',))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from ipywidgets import interact\n", "\n", "def show_gaussian(mu=0.0,s=1.0):\n", " a = np.arange(-10, 10, 0.1)\n", " plt.plot(a, np.exp(-(a-mu)**2/(2*s**2)))\n", "interact(show_gaussian,mu=(-2.0,2.0),s=(0.1,10))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "fd21d36b0d5b4102a3979ffdf2df8b70", "version_major": 2, "version_minor": 0 }, "text/html": [ "

Failed to display Jupyter Widget of type interactive.

\n", "

\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the Jupyter\n", " Widgets Documentation for setup instructions.\n", "

\n", "

\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or NBViewer),\n", " it may mean that your frontend doesn't currently support widgets.\n", "

\n" ], "text/plain": [ "interactive(children=(FloatSlider(value=0.0, description='mu', max=2.0, min=-2.0), FloatSlider(value=1.0, description='s', max=10.0, min=0.1), Output()), _dom_classes=('widget-interact',))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def show_sigmoid(mu=0.0,s=1.0):\n", " a = np.arange(-10, 10, 0.1)\n", " plt.plot(a, 1./(1. + np.exp(-(a-mu)/s)))\n", "interact(show_sigmoid,mu=(-2.0,2.0),s=(0.1,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine you fix every $\\mu_j$ and $s$. \n", "\n", "How we can learn $\\mathbf{w}$ (assuming Gaussian noise $\\epsilon \\sim N(\\mu,\\sigma)$)?\n", "\n", "$$\\begin{align} Pr(Y \\mid \\mathbf{X},\\mathbf{y}) &= \\prod_i \\mathcal{N}(Y = y_i \\mid f(\\mathbf{x}_i;\\mathbf{w}),\\sigma)\\\\ &= \\prod_i \\mathcal{N}(Y = y_i \\mid \\mathbf{w}^\\textrm{T} \\boldsymbol{\\phi}(\\mathbf{x}_i),\\sigma)\\end{align} $$\n", "\n", "We want to maximize this probability\n", "\n", "$$ \\begin{align} \\ln Pr(Y \\mid \\mathbf{X},\\mathbf{y}) & = \\sum_i \\ln \\mathcal{N}(Y = y_i \\mid \\mathbf{w}^\\textrm{T} \\boldsymbol{\\phi}(\\mathbf{x}_i),\\sigma)\\\\ & \\propto \\sum_i\\left( y_i - \\mathbf{w}^\\textrm{T} \\boldsymbol{\\phi}(\\mathbf{x}_i)\\right)^2 \\end{align} $$\n", "\n", "If we compute the gradient of this\n", "$$ \\begin{align} \\nabla\\ln Pr(Y \\mid \\mathbf{X},\\mathbf{y}) & = \\sum_i\\left( y_i - \\mathbf{w}^\\textrm{T}\\boldsymbol{\\phi}(\\mathbf{x}_i)\\right)\\boldsymbol{\\phi}(\\mathbf{x}_i)^\\textrm{T} \\\\\n", "\\nabla\\ln Pr(Y \\mid \\mathbf{X},\\mathbf{y}) &= \\mathbf{0} \\\\\n", "\\sum_i y_i \\boldsymbol{\\phi}(\\mathbf{x}_i)^\\textrm{T} &= \\mathbf{w}^\\textrm{T}\\sum_i\\boldsymbol{\\phi}(\\mathbf{x}_i)\\boldsymbol{\\phi}(\\mathbf{x}_i)^\\textrm{T}\\\\\n", "\\mathbf{y}^\\textrm{T}\\boldsymbol{\\Phi} &= \\mathbf{w}^\\textrm{T}\\boldsymbol{\\Phi}\\boldsymbol{\\Phi}^\\textrm{T}\\\\\n", "\\mathbf{w} &= (\\boldsymbol{\\Phi}^\\textrm{T}\\boldsymbol{\\Phi})^{-1}\\boldsymbol{\\Phi}^\\textrm{T}\\mathbf{y}\n", "\\end{align}$$\n", "\n", "The $\\boldsymbol{\\Phi}$ is called the Design Matrix and have the form:\n", "$$\\begin{bmatrix}\n", "1 & \\phi_1(\\mathbf{x}_1) & \\phi_2(\\mathbf{x}_1) & \\ldots & \\phi_m(\\mathbf{x}_1)\\\\\n", "1 & \\phi_1(\\mathbf{x}_2) & \\phi_2(\\mathbf{x}_2) & \\ldots & \\phi_m(\\mathbf{x}_2)\\\\\n", "\\vdots & \\vdots & \\vdots & \\ddots & \\vdots\\\\\n", "1 & \\phi_1(\\mathbf{x}_n) & \\phi_2(\\mathbf{x}_n) & \\ldots & \\phi_m(\\mathbf{x}_n)\n", "\\end{bmatrix}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "### a) Use now ALL the dimensions now and select $\\mu$'s \n", "* Check if you have all the dimensions of the training and validation sets (not one like in the previous example)\n", "* Resample to $n=1000$ o similar for working\n", "* Select $m=20$ points of the training sample to use a $\\mu$ values\n", "* It will be handy for later doing this in a function, recieving the number of total samples $(n,m)$" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def get_matrices(n,m):\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b) Construct your own design matrix\n", "* Choose a not so trivial kernel (i.e. Gaussian/Sigmoidal)\n", "* Do not obsess with speed... yet. Just give a solution.\n", "* If you have dissmissed the previous recommendation, use the L2-norm $\\|\\cdot\\|_2$ and remember that\n", "$$ (\\mathbf{a} - \\mathbf{b})^\\textrm{T}(\\mathbf{a} - \\mathbf{b}) = \\mathbf{a}^\\textrm{T}\\mathbf{a} + \\mathbf{b}^\\textrm{T}\\mathbf{b} - 2\\mathbf{a}^\\textrm{T}\\mathbf{b} $$ and remember to round up to 10 decimals\n", "* (Optional) Plot the 20 first rows of the matrix using `plt.imshow`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c) Train and Predict\n", "* Get a design matrix\n", "* Compute $\\mathbf{w} = (\\boldsymbol{\\Phi}^\\textrm{T}\\boldsymbol{\\Phi})^{-1}\\boldsymbol{\\Phi}^\\textrm{T}\\mathbf{y}$ using `scipy.linal` package for computing the inverse. Remember that this model is not regularized, so compute the pseudo-inverse.\n", "* Get the \"predictions\" for the training data $\\mathbf{w}^\\textrm{T}\\boldsymbol{\\Phi}_\\mathbf{X}$\n", "* Get the predictions for the test/validation data $\\mathbf{w}^\\textrm{T}\\boldsymbol{\\Phi}_\\mathbf{X'}$\n", "* Do a scatter plot of the real redshifts of the training set against the 'u-g' dimension in blue\n", "* Use the same plot to show the predicted redshifts values of the training set.\n", "* Create a new figure for the test/validation data\n", "* Did you put all this in a function?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### d) Computing the RMS errors\n", "* Compute the RMS error for the training and validation/test set: $RMS = \\sqrt{\\frac{\\sum_i^n (y_i - y_i')^2}{n}}$\n", "* Plot the error for $s \\in [0.1,5]$\n", "* Plot the error for $n \\in [100,10000]$ step = 100\n", "* Plot the error for $m \\in [10,100] $ step = 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.- Logistic \"Regression\"... into the bikeway\n", "### Objective\n", "* Use regression models to classify (everything is connected)\n", "* Exercise the use of metrics\n", "* Learn the importance of understanding the model\n", "\n", "### Theory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us explore the following idea:\n", "$$\\sigma(\\mathbf{w}^\\textrm{T}\\mathbf{x}) = \\frac{1}{1+\\exp(-\\mathbf{w}^\\textrm{T}\\mathbf{x})}$$" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4cacd803a0de4f91885759ceb5399716", "version_major": 2, "version_minor": 0 }, "text/html": [ "

Failed to display Jupyter Widget of type interactive.

\n", "

\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the Jupyter\n", " Widgets Documentation for setup instructions.\n", "

\n", "

\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or NBViewer),\n", " it may mean that your frontend doesn't currently support widgets.\n", "

\n" ], "text/plain": [ "interactive(children=(IntSlider(value=1, description='w1', max=5, min=-5), FloatSlider(value=1.0, description='w2', max=1.0, min=0.001), FloatSlider(value=1.0, description='w3', max=1.0, min=0.001), Output()), _dom_classes=('widget-interact',))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def logistic2D(w1=1,w2=1,w3=1):\n", " w = [w1, w2, w3]\n", " x1_g, x2_g = np.meshgrid(np.arange(-5., 5.0, 0.1),np.arange(-5., 5.0, 0.1))\n", " y = w[0] + w[1]*x1_g + w[2]*x2_g\n", " plt.contourf(x1_g, x2_g, 1./(1.+np.exp(-y)), cmap=plt.cm.seismic, levels = np.arange(0, 1.1, 0.05))\n", " plt.xlim([-5,5])\n", " plt.ylim([-5,5])\n", " plt.colorbar()\n", "interact(logistic2D,w1=(-5,5),w2=(0.001,1.),w3=(0.001,1.))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets use our linear model under a sigmoid function $\\sigma$ to separate classes then! \n", "\n", "Let $\\phi(\\boldsymbol{x}) = \\boldsymbol{\\phi}$\n", "\n", "$Pr(Y = c_1 \\mid \\boldsymbol{x}) = \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi})$\n", "\n", "Reciprocaly\n", "\n", "$Pr(Y = c_2 \\mid \\boldsymbol{x}) = 1-\\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi})$\n", "\n", "Note that:\n", "\n", "$\\sigma(-a) = 1 - \\sigma(a)$\n", "\n", "So...\n", "\n", "$\\boldsymbol{w}^\\top\\boldsymbol{\\phi} = \\ln\\left(\\frac{Pr(Y=c_1\\mid\\boldsymbol{x})}{Pr(y=c_2\\mid\\boldsymbol{x})}\\right)$,\n", "\n", "Let us define $c_1 = 1$ and $c_2 = 0$ just because... then,\n", "\n", "$Pr(Y=y\\mid\\boldsymbol{x}) = \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi})^y (1 - \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi})^{1-y}$\n", "\n", "If we try to compute the log likelihood for a training data\n", "\n", "$$ E(\\boldsymbol{w}) = \\ln Pr(\\mathbf{Y}=\\mathbf{y}\\mid\\boldsymbol{X},\\boldsymbol{w}) = \\sum_i \\left\\{y_i \\ln( \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi}_i)) + (1-y_i)\\ln(1 - \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi}_i))\\right\\}$$\n", "\n", "Then, the gradient looks nice...\n", "\n", "$\\nabla E(\\boldsymbol{w}) = \\sum_{i}(\\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi}_i)) - y_n)\\boldsymbol{\\phi} = 0$ \n", "\n", "but is no closed form to solve this, so we use iterative methods to solve them (numerical optimization methods such as Newton-Raphson). \n", "\n", "Obviously, Mr. Scikit Learn have coded that for us... [link](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)\n", "\n", "## Exercise\n", "\n", "### a) Load and balance the SDSS stars dataset (RRLyrae)\n", "* Load the dataset and put names to the columns\n", "* Find out how imbalanced is the dataset\n", "* Subsample the class of non-variable stars to the same size of the other class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b) Train and Evaluate a Logistic Regression Classifier (LRC)\n", "* Train the LRC with the balanced data\n", "* Predict the classes for the same data\n", "* Compute the completeness and contamination " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c) Incrementally include the rest of the data\n", "* Create a function that trains and predicts depending on the number of samples used for subsampling the non-variables class. This function should return the completeness and contamination.\n", "* Plot the behaviour of the completeness and contamination as you increase the number of samples of the non-variables class. I recommend steps of 500. \n", "* Plot also the parameters of the model using bars\n", "* What is your explanation for this result? can we do something about it?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def get_measures(ns):\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }