{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Supervised Learning II\n",
    "by Mauricio Araya\n",
    "## (A.K.A. Supervised Learning Lab)\n",
    "\n",
    "<b>Credit:</b> Guillermo Cabrera, Matthew Graham, and Scikit Learn\n",
    "\n",
    "(Yes... I will force you to code now...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.- Regularized Regression... or ridding alone\n",
    "\n",
    "### Objective\n",
    "* Understand the effect of the regularization parameters\n",
    "* Compare Ridge and Lasso\n",
    "* Warm up!\n",
    "\n",
    "### Regularization Reminder (slide)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise\n",
    "### a) Forging your own Galaxy Photometry/Redshift dataset\n",
    "* Use my Tuesday's notebook to load the `sdss_gal.csv`\n",
    "* Downsample the data (use $n=10000$ for example)\n",
    "* Divide into training and test/validation\n",
    "* Select the 'u-g' feature (and hack it to be a matrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "galaxy_feat = pd.read_csv('sdss_gal.csv', low_memory=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### b) Use the scikit-learn to perform a ridge regression \n",
    "* Use now a polynomial model (degree = 10 for example)\n",
    "* Plot your data and the curve line (use the `.predict()` function, not manually)\n",
    "* Plot the parameters of the regression using a bar plot "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge,Lasso\n",
    "from sklearn.preprocessing import PolynomialFeatures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### c) See how regularization works\n",
    "* Pack the code into a function that recieves the degree of the polynomial and the alpha parameter for regularizing Ridge.\n",
    "* This function should overplot the regression line to the data, and in a separate plot the parameters of the regression using a bar plot.\n",
    "* Use interact to play with the two values ($d \\in [2,15]$, $\\alpha \\in [0,1]$)\n",
    "* Do the same for LASSO regularization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_and_plot(d=10,a=0.0):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.- Linear Regression... without training wheels\n",
    "\n",
    "### Objective\n",
    "\n",
    "As an exercise lets do a linear regression without training wheels:\n",
    "* <b>Left wheel</b>: $x \\in \\mathbb{R}$ --> $\\mathbf{x} \\in \\mathbb{R}^n$ variable\n",
    "* <b>Right wheel</b>: `scikit-learn` package --> just `pandas`, `numpy` and `scipy`\n",
    "\n",
    "### Theory\n",
    "What is a linear model?\n",
    "$$ y = f(\\mathbf{x};\\mathbf{w}) = \\sum_j w_j \\phi_j(\\mathbf{x}) = \\mathbf{w}^\\textrm{T} \\boldsymbol{\\phi}(\\mathbf{x}) $$\n",
    "\n",
    "Examples:\n",
    "* Polinomial: $\\phi_j(\\mathbf{x}) = \\|\\mathbf{x}\\|^j$\n",
    "* Gaussian: $\\phi_j(\\mathbf{x}) = \\exp\\left\\{\\frac{- \\|\\mathbf{x} - \\boldsymbol{\\mu}_j\\|^2}{2s^2}\\right\\}$\n",
    "* Sigmoidal: $\\phi_j(\\mathbf{x}) = \\sigma\\left(\\frac{\\|\\mathbf{x} - \\boldsymbol{\\mu}_j\\|}{s}\\right)= \\frac{1}{1 + \\exp\\left( \\frac{-\\|\\mathbf{x} - \\boldsymbol{\\mu}_j\\|}{s}\\right)}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "28373cb2f3aa4eb297d6b28e9e3154d8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>interactive</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "interactive(children=(FloatSlider(value=0.0, description='mu', max=2.0, min=-2.0), FloatSlider(value=1.0, description='s', max=10.0, min=0.1), Output()), _dom_classes=('widget-interact',))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<function __main__.show_gaussian>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from ipywidgets import interact\n",
    "\n",
    "def show_gaussian(mu=0.0,s=1.0):\n",
    "    a = np.arange(-10, 10, 0.1)\n",
    "    plt.plot(a, np.exp(-(a-mu)**2/(2*s**2)))\n",
    "interact(show_gaussian,mu=(-2.0,2.0),s=(0.1,10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fd21d36b0d5b4102a3979ffdf2df8b70",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>interactive</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "interactive(children=(FloatSlider(value=0.0, description='mu', max=2.0, min=-2.0), FloatSlider(value=1.0, description='s', max=10.0, min=0.1), Output()), _dom_classes=('widget-interact',))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<function __main__.show_sigmoid>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def show_sigmoid(mu=0.0,s=1.0):\n",
    "    a = np.arange(-10, 10, 0.1)\n",
    "    plt.plot(a, 1./(1. + np.exp(-(a-mu)/s)))\n",
    "interact(show_sigmoid,mu=(-2.0,2.0),s=(0.1,10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Imagine you fix every $\\mu_j$ and $s$. \n",
    "\n",
    "<b>How we can learn</b> $\\mathbf{w}$ (assuming Gaussian noise $\\epsilon \\sim N(\\mu,\\sigma)$)?\n",
    "\n",
    "$$\\begin{align} Pr(Y \\mid \\mathbf{X},\\mathbf{y}) &= \\prod_i \\mathcal{N}(Y = y_i \\mid f(\\mathbf{x}_i;\\mathbf{w}),\\sigma)\\\\ &= \\prod_i \\mathcal{N}(Y = y_i \\mid \\mathbf{w}^\\textrm{T} \\boldsymbol{\\phi}(\\mathbf{x}_i),\\sigma)\\end{align} $$\n",
    "\n",
    "We want to maximize this probability\n",
    "\n",
    "$$ \\begin{align} \\ln Pr(Y \\mid \\mathbf{X},\\mathbf{y}) & = \\sum_i \\ln \\mathcal{N}(Y = y_i \\mid \\mathbf{w}^\\textrm{T} \\boldsymbol{\\phi}(\\mathbf{x}_i),\\sigma)\\\\ & \\propto \\sum_i\\left( y_i - \\mathbf{w}^\\textrm{T} \\boldsymbol{\\phi}(\\mathbf{x}_i)\\right)^2 \\end{align} $$\n",
    "\n",
    "If we compute the gradient of this\n",
    "$$ \\begin{align} \\nabla\\ln Pr(Y \\mid \\mathbf{X},\\mathbf{y}) & = \\sum_i\\left( y_i - \\mathbf{w}^\\textrm{T}\\boldsymbol{\\phi}(\\mathbf{x}_i)\\right)\\boldsymbol{\\phi}(\\mathbf{x}_i)^\\textrm{T} \\\\\n",
    "\\nabla\\ln Pr(Y \\mid \\mathbf{X},\\mathbf{y}) &= \\mathbf{0} \\\\\n",
    "\\sum_i y_i \\boldsymbol{\\phi}(\\mathbf{x}_i)^\\textrm{T} &= \\mathbf{w}^\\textrm{T}\\sum_i\\boldsymbol{\\phi}(\\mathbf{x}_i)\\boldsymbol{\\phi}(\\mathbf{x}_i)^\\textrm{T}\\\\\n",
    "\\mathbf{y}^\\textrm{T}\\boldsymbol{\\Phi} &= \\mathbf{w}^\\textrm{T}\\boldsymbol{\\Phi}\\boldsymbol{\\Phi}^\\textrm{T}\\\\\n",
    "\\mathbf{w} &= (\\boldsymbol{\\Phi}^\\textrm{T}\\boldsymbol{\\Phi})^{-1}\\boldsymbol{\\Phi}^\\textrm{T}\\mathbf{y}\n",
    "\\end{align}$$\n",
    "\n",
    "The $\\boldsymbol{\\Phi}$ is called the <b>Design Matrix</b> and have the form:\n",
    "$$\\begin{bmatrix}\n",
    "1 & \\phi_1(\\mathbf{x}_1) & \\phi_2(\\mathbf{x}_1) & \\ldots & \\phi_m(\\mathbf{x}_1)\\\\\n",
    "1 & \\phi_1(\\mathbf{x}_2) & \\phi_2(\\mathbf{x}_2) & \\ldots & \\phi_m(\\mathbf{x}_2)\\\\\n",
    "\\vdots & \\vdots & \\vdots & \\ddots & \\vdots\\\\\n",
    "1 & \\phi_1(\\mathbf{x}_n) & \\phi_2(\\mathbf{x}_n) & \\ldots & \\phi_m(\\mathbf{x}_n)\n",
    "\\end{bmatrix}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise\n",
    "### a) Use now ALL the dimensions now and select $\\mu$'s \n",
    "* Check if you have all the dimensions of the training and validation sets (not one like in the previous example)\n",
    "* Resample to $n=1000$ o similar for working\n",
    "* Select $m=20$ points of the training sample to use a $\\mu$ values\n",
    "* It will be handy for later doing this in a function, recieving the number of total samples $(n,m)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_matrices(n,m):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### b) Construct your own design matrix\n",
    "* Choose a not so trivial kernel (i.e. Gaussian/Sigmoidal)\n",
    "* Do not obsess with speed... yet. Just give a solution.\n",
    "* If you have dissmissed the previous recommendation, use the L2-norm $\\|\\cdot\\|_2$ and remember that\n",
    "$$ (\\mathbf{a} - \\mathbf{b})^\\textrm{T}(\\mathbf{a} - \\mathbf{b}) = \\mathbf{a}^\\textrm{T}\\mathbf{a} + \\mathbf{b}^\\textrm{T}\\mathbf{b} - 2\\mathbf{a}^\\textrm{T}\\mathbf{b} $$ and remember to round up to 10 decimals\n",
    "* (Optional) Plot the 20 first rows of the matrix using `plt.imshow`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### c) Train and Predict\n",
    "* Get a design matrix\n",
    "* Compute $\\mathbf{w} = (\\boldsymbol{\\Phi}^\\textrm{T}\\boldsymbol{\\Phi})^{-1}\\boldsymbol{\\Phi}^\\textrm{T}\\mathbf{y}$ using `scipy.linal` package for computing the inverse. Remember that this model is not regularized, so compute the pseudo-inverse.\n",
    "* Get the \"predictions\" for the training data $\\mathbf{w}^\\textrm{T}\\boldsymbol{\\Phi}_\\mathbf{X}$\n",
    "* Get the predictions for the test/validation data $\\mathbf{w}^\\textrm{T}\\boldsymbol{\\Phi}_\\mathbf{X'}$\n",
    "* Do a scatter plot of the real redshifts of the training set against the 'u-g' dimension in blue\n",
    "* Use the same plot to show the predicted redshifts values of the training set.\n",
    "* Create a new figure for the test/validation data\n",
    "* Did you put all this in a function?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### d) Computing the RMS errors\n",
    "* Compute the RMS error for the training and validation/test set: $RMS = \\sqrt{\\frac{\\sum_i^n (y_i - y_i')^2}{n}}$\n",
    "* Plot the error for $s \\in [0.1,5]$\n",
    "* Plot the error for $n \\in [100,10000]$ step = 100\n",
    "* Plot the error for $m \\in [10,100] $ step = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.- Logistic \"Regression\"... into the bikeway\n",
    "### Objective\n",
    "* Use regression models to classify (everything is connected)\n",
    "* Exercise the use of metrics\n",
    "* Learn the importance of understanding the model\n",
    "\n",
    "### Theory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us explore the following idea:\n",
    "$$\\sigma(\\mathbf{w}^\\textrm{T}\\mathbf{x}) = \\frac{1}{1+\\exp(-\\mathbf{w}^\\textrm{T}\\mathbf{x})}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4cacd803a0de4f91885759ceb5399716",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>interactive</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "interactive(children=(IntSlider(value=1, description='w1', max=5, min=-5), FloatSlider(value=1.0, description='w2', max=1.0, min=0.001), FloatSlider(value=1.0, description='w3', max=1.0, min=0.001), Output()), _dom_classes=('widget-interact',))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<function __main__.logistic2D>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def logistic2D(w1=1,w2=1,w3=1):\n",
    "    w = [w1, w2, w3]\n",
    "    x1_g, x2_g = np.meshgrid(np.arange(-5., 5.0, 0.1),np.arange(-5., 5.0, 0.1))\n",
    "    y = w[0] + w[1]*x1_g + w[2]*x2_g\n",
    "    plt.contourf(x1_g, x2_g, 1./(1.+np.exp(-y)), cmap=plt.cm.seismic, levels = np.arange(0, 1.1, 0.05))\n",
    "    plt.xlim([-5,5])\n",
    "    plt.ylim([-5,5])\n",
    "    plt.colorbar()\n",
    "interact(logistic2D,w1=(-5,5),w2=(0.001,1.),w3=(0.001,1.))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets use our linear model under a sigmoid function $\\sigma$ to separate classes then! \n",
    "\n",
    "Let $\\phi(\\boldsymbol{x}) = \\boldsymbol{\\phi}$\n",
    "\n",
    "$Pr(Y = c_1 \\mid \\boldsymbol{x}) = \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi})$\n",
    "\n",
    "Reciprocaly\n",
    "\n",
    "$Pr(Y = c_2 \\mid \\boldsymbol{x}) = 1-\\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi})$\n",
    "\n",
    "Note that:\n",
    "\n",
    "$\\sigma(-a) = 1 - \\sigma(a)$\n",
    "\n",
    "So...\n",
    "\n",
    "$\\boldsymbol{w}^\\top\\boldsymbol{\\phi} = \\ln\\left(\\frac{Pr(Y=c_1\\mid\\boldsymbol{x})}{Pr(y=c_2\\mid\\boldsymbol{x})}\\right)$,\n",
    "\n",
    "Let us define $c_1 = 1$ and $c_2 = 0$ just because... then,\n",
    "\n",
    "$Pr(Y=y\\mid\\boldsymbol{x}) = \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi})^y (1 - \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi})^{1-y}$\n",
    "\n",
    "If we try to compute the log likelihood for a training data\n",
    "\n",
    "$$ E(\\boldsymbol{w}) = \\ln Pr(\\mathbf{Y}=\\mathbf{y}\\mid\\boldsymbol{X},\\boldsymbol{w}) = \\sum_i \\left\\{y_i \\ln( \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi}_i)) + (1-y_i)\\ln(1 - \\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi}_i))\\right\\}$$\n",
    "\n",
    "Then, the gradient looks nice...\n",
    "\n",
    "$\\nabla E(\\boldsymbol{w}) = \\sum_{i}(\\sigma(\\boldsymbol{w}^\\top\\boldsymbol{\\phi}_i)) - y_n)\\boldsymbol{\\phi} = 0$ \n",
    "\n",
    "but is no closed form to solve this, so we use iterative methods to solve them (numerical optimization methods such as Newton-Raphson). \n",
    "\n",
    "Obviously, Mr. Scikit Learn have coded that for us... [link](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)\n",
    "\n",
    "## Exercise\n",
    "\n",
    "### a) Load and balance the SDSS stars dataset (RRLyrae)\n",
    "* Load the dataset and put names to the columns\n",
    "* Find out how imbalanced is the dataset\n",
    "* Subsample the class of non-variable stars to the same size of the other class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### b) Train and Evaluate a Logistic Regression Classifier (LRC)\n",
    "* Train the LRC with the balanced data\n",
    "* Predict the classes for the same data\n",
    "* Compute the completeness and contamination "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import linear_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### c) Incrementally include the rest of the data\n",
    "* Create a function that trains and predicts depending on the number of samples used for subsampling the non-variables class. This function should return the completeness and contamination.\n",
    "* Plot the behaviour of the completeness and contamination as you increase the number of samples of the non-variables class. I recommend steps of 500. \n",
    "* Plot also the parameters of the model using bars\n",
    "* What is your explanation for this result? can we do something about it?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_measures(ns):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}