{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Practical Introduction to Machine Learning\n",
    "by <b>Mauricio Araya</b>\n",
    "\n",
    "<b>Credits:</b> Francisco Foster, Matthew Graham, Pavlos Protopapas\n",
    "\n",
    "## 1.- SDSS Data \n",
    "  \n",
    "<img src=\"https://www.sdss.org/wp-content/uploads/2014/11/SDSS_telescope_new.jpg\" alt=\"SLOAN\" width=\"300\">\n",
    "\n",
    "Lets download data from the Sloan Digital Sky Survey, the all-time favorite dataset for Machine Learning in Astronomy. We could have used data from UCI or Kaggle, but I think SDSS data is very appropaite for this school ;).\n",
    "\n",
    "## 1.1.- Download Star Photometry Data using AstroML\n",
    "We will start cheating a little bit by using the AstroML package\n",
    "\n",
    "`conda install -c astropy astroml`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from astroML.datasets import fetch_rrlyrae_combined\n",
    "sdss_star_feat, sdss_star_type = fetch_rrlyrae_combined()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sdss_star_type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and use the Pandas package..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.simplefilter(action='ignore', category=Warning)\n",
    "%matplotlib inline\n",
    "pd.set_option('display.max_rows',10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "as our data manager"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "star_feat=pd.DataFrame(sdss_star_feat)\n",
    "star_feat.columns=['u-g', 'g-r', 'r-i', 'i-z']\n",
    "star_feat.plot.scatter('u-g', 'g-r')\n",
    "star_feat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This data also have labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import matplotlib.pyplot as plt\n",
    "star_label=pd.DataFrame(sdss_star_type)\n",
    "star_label.columns=['Type']\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "star_feat[star_label['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n",
    "star_feat[star_label['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "star_feat['Type']=star_label['Type']\n",
    "star_feat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2.- Download Galaxy Photometry Data (with Redshifts)\n",
    "This dataset are galaxies with known (spectroscopically confirmed) redshifts and colour magnitudes. We're interested in determining the redshift of a galaxy from its colors (photometric redshift). The data can be downloaded from: http://www.astro.caltech.edu/~mjg/sdss_gal.csv.gz, and we will use `urllib`to do this from the notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import urllib\n",
    "urllib.request.urlretrieve(\"http://www.astro.caltech.edu/~mjg/sdss_gal.csv.gz\", \"sdss_gal.csv.gz\")\n",
    "!gunzip sdss_gal.csv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "galaxy_feat = pd.read_csv('sdss_gal.csv', low_memory=False)\n",
    "galaxy_feat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gal_sample = galaxy_feat.sample(n=1000)\n",
    "gal_sample.plot.scatter('g-r','redshift',color='gray',alpha=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.- Regression\n",
    "Regression is about predicting values of continous variables.\n",
    "\n",
    "$$y' = f(x' \\mid \\mathbf{X},\\mathbf{y})$$\n",
    "\n",
    "where $y \\in \\mathbb{R}^n$, $x \\in \\mathbb{R}^m$ and $\\mathbf{y}$ and $\\mathbf{X}$ are the target and non-target features for all the samples respectively.\n",
    "\n",
    "We will use the SDSS Galaxy data, well... a portion of it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simple... not correct\n",
    "train_data = gal_sample[:750]\n",
    "test_data = gal_sample[750:]\n",
    "y_train = train_data['redshift']\n",
    "X_train = train_data['g-r']\n",
    "# Formatting hack...\n",
    "X_train=X_train.values.reshape(len(X_train), 1);\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1.- Parametric Regression\n",
    "\n",
    "We can condense the information found in $\\mathbf{X}$ and $\\mathbf{y}$ by imposing a *parametric model*, meaning to optimize certain parameters for the given data. Now our model is\n",
    "$$y' = f(x' ; \\theta^*)$$\n",
    "where \n",
    "$$\\theta^* = \\underset{\\theta}{\\operatorname{argmax}} \\left\\{ Pr(Y = f(X;\\theta) \\mid \\mathbf{X}, \\mathbf{y}) \\right\\}$$\n",
    "which under a <b>linear model</b> and a Gaussian noise $\\epsilon$ assumption ($Y = f(X) + \\epsilon $) it becomes\n",
    "$$ \\theta^* = \\underset{\\theta}{\\operatorname{argmin}} \\left\\{ \\sum_i (y_i - f(X_i;\\theta))^2 \\right\\}$$.\n",
    "\n",
    "Consider now a straight line as our model,\n",
    "$$ f(x;\\theta) = a x + b$$\n",
    "where our parameters are $\\theta = \\{a,b\\}$."
   ]
  },
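  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the straight-line model this minimization has a direct solution. A minimal sketch on synthetic data (the data, the seed and the variable names are assumptions made for illustration, not part of the SDSS sample), using `np.linalg.lstsq` to recover $a$ and $b$:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise\n",
    "rng = np.random.default_rng(0)\n",
    "x = np.linspace(0.0, 1.0, 50)\n",
    "y = 2.0 * x + 1.0 + rng.normal(scale=0.01, size=x.size)\n",
    "\n",
    "# Design matrix: one column for x and one column of ones for the intercept\n",
    "A = np.column_stack([x, np.ones_like(x)])\n",
    "\n",
    "# Least-squares solution of A @ [a, b] ~= y, i.e. the argmin of the squared residuals\n",
    "(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)\n",
    "print(a, b)\n",
    "```\n",
    "\n",
    "With the SDSS sample, the same call on `X_train` and `y_train` should give essentially the same line as `LinearRegression`."
   ]
  },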
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "regression = LinearRegression(fit_intercept=True)\n",
    "regression.fit(X_train, y_train)\n",
    "\n",
    "regression_line = lambda x: regression.intercept_ + regression.coef_ * x\n",
    "print('The equation of the regression line is: {} + {} * x'.format(regression.intercept_, regression.coef_[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "x_vals = np.linspace(0, 3, 100)\n",
    "\n",
    "train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n",
    "ax.plot(x_vals, regression_line(x_vals), color='red', linewidth=1.0, label='regression line')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not very good... lets try another <b>linear</b> model!\n",
    "$$ y = ax^3 + bx^2 + cx + d $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gen_poly_terms = PolynomialFeatures(degree=2)\n",
    "X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n",
    "poly_regression = LinearRegression(fit_intercept=True)\n",
    "poly_regression.fit(X_train_with_poly, y_train)\n",
    "display(poly_regression.coef_)\n",
    "poly_regression.intercept_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "coef = poly_regression.coef_\n",
    "inter = poly_regression.intercept_\n",
    "poly = lambda x: inter + coef[1] * x + coef[2] * x*x \n",
    "train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n",
    "ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gen_poly_terms = PolynomialFeatures(degree=3)\n",
    "X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n",
    "poly_regression = LinearRegression(fit_intercept=True)\n",
    "poly_regression.fit(X_train_with_poly, y_train)\n",
    "display(poly_regression.coef_)\n",
    "poly_regression.intercept_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "coef = poly_regression.coef_\n",
    "inter = poly_regression.intercept_\n",
    "poly = lambda x: inter + coef[1] * x + coef[2] * x*x + coef[3]*x*x*x\n",
    "train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n",
    "ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "test_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n",
    "ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have basically *learned* the parameters!\n",
    "\n",
    "This is not the best we can do of course!, we can:\n",
    "* Change the function/model\n",
    "* Use more dimensions\n",
    "* Go non-linear...\n",
    "* Use more/better data\n",
    "* Use regularized models\n",
    "* etc...\n",
    "\n",
    "## 2.2 Non-parametric Regression\n",
    "\n",
    "Consolidating data into model parameters have some advantages and drawbacks. An alternative is to use non-parametric models. Now, we want to predict \n",
    "$$ y' = f(x'; \\mathbf{X}, \\mathbf{y}, \\theta_0) $$\n",
    "For example, consider a model based on assigning the same Gaussian function (Kernel) to each sample:\n",
    "$$ K_\\sigma(x)=\\frac{1}{\\sqrt{2\\pi}\\sigma}exp\\left(\\frac{-x^2}{2\\sigma^2}\\right)$$\n",
    "$$ y'=\\frac{\\sum_{i=1}^n K_\\sigma(x'-x_i)y_i}{\\sum_{i=1}^nK_\\sigma(x'-x_i)}$$\n",
    "Please note that $\\theta_0 = \\sigma$: \n",
    "\n",
    "<b>Non-parametric $\\neq$ no parameters!</b>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def GKR(x_predict,x_data,y_data,s):\n",
    "    dmat = np.tile(x_data,len(x_predict))\n",
    "    dmat = dmat - np.tile(x_predict,(len(x_data),1))\n",
    "    K = np.exp(-(dmat*dmat)/(2*s*s))/(np.sqrt(2*np.pi)*s)\n",
    "    return(K.T.dot(y_data) / K.sum(axis=0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_gkr(sigma=0.1):\n",
    "    y_gkr=GKR(x_vals,X_train,y_train,sigma)\n",
    "    fig, ax = plt.subplots()\n",
    "    train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',s=sigma*500,ax=ax)\n",
    "    ax.plot(x_vals, y_gkr, color='red', linewidth=1.0, label='regression line')\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ipywidgets import interact\n",
    "interact(plot_gkr,sigma=(0.01,1.0,0.01))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are much smarter ways to do this... for example Gaussian Processes!\n",
    "\n",
    "## 3.- Labelling\n",
    "\n",
    "Consider now the SDSS star photometry data. \n",
    "\n",
    "<b>Warning:</b> we will do this *naively* (i.e., wrongly). During the rest of the week we will improve this...\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N=2000\n",
    "star_sample=star_feat[-1:-N-1:-1]\n",
    "star_sample = star_sample.sample(n=N)\n",
    "star_train = star_sample[:int(N*0.75)]\n",
    "star_test = star_sample[int(N*0.75):]\n",
    "star_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "star_train[star_train['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n",
    "star_train[star_train['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "star_test[star_test['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n",
    "star_test[star_test['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(star_train['Type'].sum()/len(star_train))\n",
    "display(star_test['Type'].sum()/len(star_test))\n",
    "display(star_feat['Type'].sum()/len(star_feat))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1.- Classification (Supervised)\n",
    "\n",
    "Classification is labelling based on previously annotated samples.\n",
    "\n",
    "### Discriminative Classification Models\n",
    "Think on a boundary dividing data. In 2 dimensions is a line/curve, in 3 dimensions a surface, in 4 dimensions a volume, and so on. The boundary divides data into classes. This is what is called a <b>discriminative model</b>. \n",
    "\n",
    "#### Support Vector Machines\n",
    "*Vocabulary:* This is a  <font color='blue'>discriminative</font> <font color='green'>(non-parametric)</font> <font color='magenta'>linear</font> model for a <font color='red'>supervised</font> <font color='orange'>batch</font>  learning problem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.svm import SVC\n",
    "clf = SVC(kernel='linear')\n",
    "clf.fit(star_train[['u-g','g-r']], star_train['Type'])\n",
    "y_pred = clf.predict(star_test[['u-g','g-r']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "star_test['Predict']=y_pred\n",
    "star_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "star_test['Predict'].sum()/len(star_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "star_test[star_test['Predict']==0.0].plot.scatter('u-g','g-r',c='red',ax=ax)\n",
    "star_test[star_test['Predict']==1.0].plot.scatter('u-g','g-r',c='blue',ax=ax)\n",
    "\n",
    "# Compute the boundary\n",
    "w = clf.coef_[0]\n",
    "a = -w[1] / w[0]\n",
    "yy = np.linspace(-0.1, 0.4)\n",
    "xx = a * yy - clf.intercept_[0] / w[0]\n",
    "\n",
    "ax.plot(xx, yy, '-k')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "FP = star_test[star_test['Predict']==1.0]; FP = FP[FP['Type']==0.0]\n",
    "FN = star_test[star_test['Predict']==0.0]; FN = FN[FN['Type']==1.0]\n",
    "TP = star_test[star_test['Predict']==1.0]; TP = TP[TP['Type']==1.0]\n",
    "TN = star_test[star_test['Predict']==0.0]; TN = TN[TN['Type']==0.0]\n",
    "fig, ax = plt.subplots()\n",
    "TP.plot.scatter('u-g','g-r',c='blue',ax=ax,label=\"TP\")\n",
    "TN.plot.scatter('u-g','g-r',c='red',ax=ax,label=\"TN\")\n",
    "FP.plot.scatter('u-g','g-r',c='magenta',ax=ax,label=\"FP\",marker='+',s=100)\n",
    "FN.plot.scatter('u-g','g-r',c='green',ax=ax,label=\"FN\",marker='+',s=100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Radial Basis Function Kernel\n",
    "We can construct a hyperplane (line) in other space by transforming data to that space, and then come back. This is done using kernels\n",
    "\n",
    "$${\\displaystyle K(\\mathbf {x} ,\\mathbf {x'} )=\\exp \\left(-{\\frac {\\|\\mathbf {x} -\\mathbf {x'} \\|^{2}}{2\\sigma ^{2}}}\\right)}$$"
   ]
  },
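  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the formula above, the sketch below evaluates the kernel by hand for two made-up points and compares it with scikit-learn's `rbf_kernel`, which takes `gamma` instead of $\\sigma$. The relation $\\gamma = 1/(2\\sigma^2)$ used here comes from matching the formula with scikit-learn's $\\exp(-\\gamma\\|x-x'\\|^2)$; the points and the value of $\\sigma$ are arbitrary choices for illustration:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.metrics.pairwise import rbf_kernel\n",
    "\n",
    "def rbf(x, x_prime, sigma):\n",
    "    # Gaussian (RBF) kernel as written in the formula: exp(-|x - x'|^2 / (2 sigma^2))\n",
    "    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)\n",
    "    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))\n",
    "\n",
    "sigma = 0.5\n",
    "gamma = 1.0 / (2.0 * sigma ** 2)  # scikit-learn's parameterization\n",
    "\n",
    "p = [1.0, 0.2]  # hypothetical (u-g, g-r) colors\n",
    "q = [0.7, 0.1]\n",
    "k_manual = rbf(p, q, sigma)\n",
    "k_sklearn = rbf_kernel([p], [q], gamma=gamma)[0, 0]\n",
    "print(k_manual, k_sklearn)\n",
    "```\n",
    "\n",
    "A large `gamma` means a small $\\sigma$: each training sample only influences its immediate neighbourhood, and the boundary can become very wiggly."
   ]
  },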
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.ndimage import gaussian_filter\n",
    "def plot_svm_rbf(gamma=20.0):\n",
    "    clf_rbf = SVC(kernel='rbf', gamma=gamma)\n",
    "    clf_rbf.fit(star_train[['u-g','g-r']], star_train['Type'])\n",
    "    y_pred_rbf = clf_rbf.predict(star_test[['u-g','g-r']])\n",
    "    star_test['PredictRBF']=y_pred_rbf\n",
    "    xlim = (0.7, 1.35)\n",
    "    ylim = (-0.15, 0.4)\n",
    "    xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 101),\n",
    "                     np.linspace(ylim[0], ylim[1], 101))\n",
    "    Z = clf_rbf.predict(np.c_[ xx.ravel(),yy.ravel()])\n",
    "    Z = Z.reshape(xx.shape)\n",
    "    Z = gaussian_filter(Z, 2)\n",
    "    FP = star_test[star_test['PredictRBF']==1.0]; FP = FP[FP['Type']==0.0]\n",
    "    FN = star_test[star_test['PredictRBF']==0.0]; FN = FN[FN['Type']==1.0]\n",
    "    TP = star_test[star_test['PredictRBF']==1.0]; TP = TP[TP['Type']==1.0]\n",
    "    TN = star_test[star_test['PredictRBF']==0.0]; TN = TN[TN['Type']==0.0]\n",
    "    fig, ax = plt.subplots()\n",
    "    TP.plot.scatter('u-g','g-r',c='red',ax=ax,label=\"TP\")\n",
    "    TN.plot.scatter('u-g','g-r',c='blue',ax=ax,label=\"TN\")\n",
    "    FP.plot.scatter('u-g','g-r',c='green',ax=ax,label=\"FP\",marker='+',s=100)\n",
    "    FN.plot.scatter('u-g','g-r',c='magenta',ax=ax,label=\"FN\",marker='+',s=100)\n",
    "    ax.contour(xx, yy, Z, [0.5], colors='k')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "interact(plot_svm_rbf,gamma=(0.1,300,10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2. Clustering (Unsupervised)\n",
    "\n",
    "Now think trying to put labels but without knowing previous examples on the Galaxy data... but using now all the dimensions!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gal_sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gal_4proj(axes):\n",
    "    ((ax1, ax2), (ax3, ax4)) = axes\n",
    "    gal_sample.plot.scatter('u-g','redshift',color='gray',alpha=0.1,ax=ax1)\n",
    "    gal_sample.plot.scatter('g-r','redshift',color='gray',alpha=0.1,ax=ax2)\n",
    "    gal_sample.plot.scatter('r-i','redshift',color='gray',alpha=0.1,ax=ax3)\n",
    "    gal_sample.plot.scatter('i-z','redshift',color='gray',alpha=0.1,ax=ax4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes =plt.subplots(2,2)\n",
    "gal_4proj(axes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gaussian Mixture Model \n",
    "\n",
    "Consider a Gaussian Mixture Model:\n",
    "$$ \\mathcal{N}(x; \\mu, \\Sigma) = \\frac{\\exp \\left(-{\\frac{1}{2}}( x - \\mu )^{\\mathrm {T}}\\Sigma^{-1}(x - \\mu )\\right)}{\\sqrt {(2\\pi )^{k}|\\Sigma| }}$$\n",
    "$$ p(x) = \\displaystyle\\sum_{j=1}^{k} \\phi_j\\mathcal{N}(x; \\mu_j, \\Sigma_j)$$\n",
    "$$\\displaystyle\\sum_{j=1}^{k} \\phi_j = 1 $$"
   ]
  },
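  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the mixture density concrete, here is a minimal sketch that evaluates $p(x)$ for a hypothetical two-component mixture in two dimensions. The weights, means and covariances are made-up values, and `scipy.stats.multivariate_normal` is used for the Gaussian density:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.stats import multivariate_normal\n",
    "\n",
    "# Hypothetical mixture parameters (chosen only for illustration)\n",
    "weights = np.array([0.3, 0.7])  # phi_j, must sum to 1\n",
    "means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]\n",
    "covs = [np.eye(2), np.array([[0.5, 0.1], [0.1, 0.3]])]\n",
    "\n",
    "def mixture_pdf(x):\n",
    "    # p(x) = sum_j phi_j * N(x; mu_j, Sigma_j)\n",
    "    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)\n",
    "               for w, m, c in zip(weights, means, covs))\n",
    "\n",
    "print(mixture_pdf([0.0, 0.0]))\n",
    "```\n",
    "\n",
    "Fitting a `GaussianMixture` amounts to estimating these weights, means and covariances from the data."
   ]
  },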
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.mixture import GaussianMixture\n",
    "colors = ['red','blue','green','magenta','cyan','orange']\n",
    "def clust_4proj(mix,axes,n):\n",
    "    for dim in range(4):\n",
    "        ax = axes[int(dim/2),dim%2]\n",
    "        labels=mix.predict(gal_sample)\n",
    "        for i in range(n):\n",
    "            gal_sample[labels==i].plot.scatter(dim,'redshift',color=colors[i],alpha=0.1,ax=ax)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n=4\n",
    "mix = GaussianMixture(n_components=n,covariance_type='full', max_iter=100)\n",
    "mix.fit(gal_sample)\n",
    "fig, axes =plt.subplots(2,2)\n",
    "clust_4proj(mix,axes,n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Vocabulary:* This is a <font color='blue'>generative</font> <font color='green'>parametric</font> <font color='magenta'>linear</font> model for a <font color='red'>unsupervised</font> <font color='orange'>batch</font>  learning problem"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import matplotlib as mpl\n",
    "def GMM_4proj(gmm,axes,n):\n",
    "    for clust in range(n):\n",
    "        for dim in range(4):\n",
    "            dims=[dim,4]\n",
    "            ax = axes[int(dim/2),dim%2]\n",
    "            cov = gmm.covariances_[clust]\n",
    "            cov = cov[dims][:,dims]\n",
    "            v, w = np.linalg.eigh(cov)\n",
    "            u = w[0] / np.linalg.norm(w[0])\n",
    "            angle = np.arctan2(u[1], u[0])\n",
    "            angle = 180 * angle / np.pi  # convert to degrees\n",
    "            v = 2. * np.sqrt(2.) * np.sqrt(v)\n",
    "            ell = mpl.patches.Ellipse(gmm.means_[clust,dims], v[0], v[1],\n",
    "                     180 + angle, color=colors[clust])\n",
    "            ell.set_clip_box(ax.bbox)\n",
    "            ell.set_alpha(0.3)\n",
    "            ax.add_artist(ell)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_clusters(n=2):\n",
    "    mix = GaussianMixture(n_components=n,covariance_type='full', max_iter=100)\n",
    "    mix.fit(gal_sample)\n",
    "    fig, axes =plt.subplots(2,2)\n",
    "    gal_4proj(axes)\n",
    "    GMM_4proj(mix,axes,n)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "interact(show_clusters,n=(2,6,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.- Characterizing\n",
    "\n",
    "## Dimensionality Reduction (PCA)\n",
    "Consider the Singular Value Decomposition of your data (in matrix form)\n",
    "$$\\mathbf{X} = \\mathbf{U}\\mathbf{\\Sigma}\\mathbf{W}^T$$\n",
    "Then, you can compute an affine transformation of your data such that\n",
    "$${\\displaystyle {\\begin{aligned}\\mathbf {X} ^{T}\\mathbf {X} &=\\mathbf {W} \\mathbf {\\Sigma } ^{T}\\mathbf {U} ^{T}\\mathbf {U} \\mathbf {\\Sigma } \\mathbf {W} ^{T}\\\\&=\\mathbf {W} \\mathbf {\\Sigma } ^{T}\\mathbf {\\Sigma } \\mathbf {W} ^{T}\\\\&=\\mathbf {W} \\mathbf {\\Sigma'}\\mathbf {W} ^{T}\\end{aligned}}}$$\n",
    "Meaning that\n",
    "$$\\begin{align}\n",
    "\\mathbf{T} & = \\mathbf{X} \\mathbf{W} \\\\\n",
    "           & = \\mathbf{U}\\mathbf{\\Sigma}\\mathbf{W}^T \\mathbf{W} \\\\\n",
    "           & = \\mathbf{U}\\mathbf{\\Sigma}\n",
    "\\end{align}$$\n",
    "PCA for dimensionality reduction is basically \n",
    "$$ \\mathbf{T}_L = \\mathbf{U}_L\\mathbf{\\Sigma}_L = \\mathbf{X} \\mathbf{W}_L  $$"
   ]
  },
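  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The identities above can be checked numerically. The sketch below uses synthetic data (an assumption for illustration), computes the scores both as $\\mathbf{X}\\mathbf{W}_L$ and as $\\mathbf{U}_L\\mathbf{\\Sigma}_L$ with NumPy's SVD, and compares them with scikit-learn's `PCA`, which centers the data and may flip the sign of a component:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # synthetic correlated data\n",
    "Xc = X - X.mean(axis=0)  # PCA works on mean-centered data\n",
    "\n",
    "# Thin SVD: Xc = U @ diag(S) @ Wt, where the rows of Wt are the principal directions\n",
    "U, S, Wt = np.linalg.svd(Xc, full_matrices=False)\n",
    "\n",
    "L = 3\n",
    "T_L = Xc @ Wt[:L].T         # X W_L\n",
    "T_L_alt = U[:, :L] * S[:L]  # U_L Sigma_L\n",
    "\n",
    "# scikit-learn should give the same scores up to the sign of each component\n",
    "T_sklearn = PCA(n_components=L).fit_transform(X)\n",
    "print(np.allclose(T_L, T_L_alt), np.allclose(np.abs(T_L), np.abs(T_sklearn)))\n",
    "```"
   ]
  },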
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import decomposition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n=4\n",
    "mix = GaussianMixture(n_components=n,covariance_type='full', max_iter=100)\n",
    "mix.fit(gal_sample)\n",
    "labels=mix.predict(gal_sample) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca = decomposition.PCA(n_components=3)\n",
    "pca.fit(gal_sample)\n",
    "lowd = pca.transform(gal_sample)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib notebook\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "fig = plt.figure(1, figsize=(7, 5))\n",
    "ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n",
    "ax.scatter(lowd[:, 0], lowd[:, 1], lowd[:, 2], c=labels, cmap=plt.cm.gist_rainbow,\n",
    "           edgecolor='k')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca_comp=pd.DataFrame(pca.components_)\n",
    "pca_comp.columns=[['u-g', 'g-r', 'r-i', 'i-z','redshift']]\n",
    "pca_comp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mix = GaussianMixture(n_components=n,covariance_type='full', max_iter=100)\n",
    "mix.fit(lowd)\n",
    "labels_low=mix.predict(lowd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(1, figsize=(7, 5))\n",
    "ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n",
    "ax.scatter(lowd[:, 0], lowd[:, 1], lowd[:, 2], c=labels_low, cmap=plt.cm.gist_rainbow,\n",
    "           edgecolor='k')\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}