{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Practical Introduction to Machine Learning\n", "by Mauricio Araya\n", "\n", "Credits: Francisco Foster, Matthew Graham, Pavlos Protopapas\n", "\n", "## 1.- SDSS Data \n", " \n", "\"SLOAN\"\n", "\n", "Lets download data from the Sloan Digital Sky Survey, the all-time favorite dataset for Machine Learning in Astronomy. We could have used data from UCI or Kaggle, but I think SDSS data is very appropaite for this school ;).\n", "\n", "## 1.1.- Download Star Photometry Data using AstroML\n", "We will start cheating a little bit by using the AstroML package\n", "\n", "`conda install -c astropy astroml`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from astroML.datasets import fetch_rrlyrae_combined\n", "sdss_star_feat, sdss_star_type = fetch_rrlyrae_combined()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sdss_star_type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and use the Pandas package..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=Warning)\n", "%matplotlib inline\n", "pd.set_option('display.max_rows',10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "as our data manager" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "star_feat=pd.DataFrame(sdss_star_feat)\n", "star_feat.columns=['u-g', 'g-r', 'r-i', 'i-z']\n", "star_feat.plot.scatter('u-g', 'g-r')\n", "star_feat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data also have labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "import matplotlib.pyplot as plt\n", "star_label=pd.DataFrame(sdss_star_type)\n", "star_label.columns=['Type']\n", "\n", "fig, ax = plt.subplots()\n", "star_feat[star_label['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n", "star_feat[star_label['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "star_feat['Type']=star_label['Type']\n", "star_feat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.2.- Download Galaxy Photometry Data (with Redshifts)\n", "This dataset are galaxies with known (spectroscopically confirmed) redshifts and colour magnitudes. We're interested in determining the redshift of a galaxy from its colors (photometric redshift). 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "galaxy_feat = pd.read_csv('sdss_gal.csv', low_memory=False)\n", "galaxy_feat" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gal_sample = galaxy_feat.sample(n=1000)\n", "gal_sample.plot.scatter('g-r','redshift',color='gray',alpha=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.- Regression\n", "Regression is about predicting values of continuous variables.\n", "\n", "$$y' = f(x' \\mid \\mathbf{X},\\mathbf{y})$$\n", "\n", "where $y' \\in \\mathbb{R}$, $x' \\in \\mathbb{R}^m$, and $\\mathbf{y} \\in \\mathbb{R}^n$ and $\\mathbf{X} \\in \\mathbb{R}^{n \\times m}$ are the target and non-target features for all $n$ samples, respectively.\n", "\n", "We will use the SDSS Galaxy data, well... a portion of it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simple split... not the correct way (we will improve on this later)\n", "train_data = gal_sample[:750]\n", "test_data = gal_sample[750:]\n", "y_train = train_data['redshift']\n", "X_train = train_data['g-r']\n", "# Formatting hack: sklearn expects a 2D array of shape (n_samples, n_features)\n", "X_train = X_train.values.reshape(len(X_train), 1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1.- Parametric Regression\n", "\n", "We can condense the information found in $\\mathbf{X}$ and $\\mathbf{y}$ by imposing a *parametric model*, meaning that we optimize certain parameters for the given data. Now our model is\n", "$$y' = f(x' ; \\theta^*)$$\n", "where\n", "$$\\theta^* = \\underset{\\theta}{\\operatorname{argmax}} \\left\\{ Pr(Y = f(X;\\theta) \\mid \\mathbf{X}, \\mathbf{y}) \\right\\}$$\n", "which, under a linear model and a Gaussian noise assumption ($Y = f(X) + \\epsilon$), becomes\n", "$$ \\theta^* = \\underset{\\theta}{\\operatorname{argmin}} \\left\\{ \\sum_i (y_i - f(X_i;\\theta))^2 \\right\\}.$$\n", "\n", "Consider now a straight line as our model,\n", "$$ f(x;\\theta) = a x + b$$\n", "where our parameters are $\\theta = \\{a,b\\}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import PolynomialFeatures\n", "\n", "import numpy as np\n", "\n", "regression = LinearRegression(fit_intercept=True)\n", "regression.fit(X_train, y_train)\n", "\n", "regression_line = lambda x: regression.intercept_ + regression.coef_ * x\n", "print('The equation of the regression line is: {} + {} * x'.format(regression.intercept_, regression.coef_[0]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "x_vals = np.linspace(0, 3, 100)\n", "\n", "train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n", "ax.plot(x_vals, regression_line(x_vals), color='red', linewidth=1.0, label='regression line')\n", "plt.legend()" ] },
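{ "cell_type": "markdown", "metadata": {}, "source": [ "How bad is the fit, numerically? A minimal sketch (an addition to the original flow) scoring the line on both splits with scikit-learn's `mean_squared_error`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "# Mirror the training-set formatting hack for the held-out data\n", "X_test = test_data['g-r'].values.reshape(-1, 1)\n", "y_test = test_data['redshift']\n", "print('train MSE:', mean_squared_error(y_train, regression.predict(X_train)))\n", "print('test MSE :', mean_squared_error(y_test, regression.predict(X_test)))" ] },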
{ "cell_type": "markdown", "metadata": {}, "source": [ "Not very good... let's try another linear model! Polynomials are still linear models (linear in the parameters $a,b,c,d$):\n", "$$ y = ax^3 + bx^2 + cx + d $$\n", "We fit a quadratic first, and then this cubic." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gen_poly_terms = PolynomialFeatures(degree=2)\n", "X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n", "poly_regression = LinearRegression(fit_intercept=True)\n", "poly_regression.fit(X_train_with_poly, y_train)\n", "display(poly_regression.coef_)\n", "poly_regression.intercept_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "coef = poly_regression.coef_\n", "inter = poly_regression.intercept_\n", "poly = lambda x: inter + coef[1] * x + coef[2] * x*x\n", "train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n", "ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gen_poly_terms = PolynomialFeatures(degree=3)\n", "X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n", "poly_regression = LinearRegression(fit_intercept=True)\n", "poly_regression.fit(X_train_with_poly, y_train)\n", "display(poly_regression.coef_)\n", "poly_regression.intercept_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "coef = poly_regression.coef_\n", "inter = poly_regression.intercept_\n", "poly = lambda x: inter + coef[1] * x + coef[2] * x*x + coef[3]*x*x*x\n", "train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n", "ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "test_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n", "ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have basically *learned* the parameters!\n", "\n", "This is not the best we can do, of course! We can:\n", "* Change the function/model\n", "* Use more dimensions\n", "* Go non-linear...\n", "* Use more/better data\n", "* Use regularized models\n", "* etc...\n", "\n", "## 2.2.- Non-parametric Regression\n", "\n", "Consolidating data into model parameters has some advantages and drawbacks. An alternative is to use non-parametric models. Now, we want to predict\n", "$$ y' = f(x'; \\mathbf{X}, \\mathbf{y}, \\theta_0) $$\n", "For example, consider a model based on assigning the same Gaussian function (kernel) to each sample:\n", "$$ K_\\sigma(x)=\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp\\left(\\frac{-x^2}{2\\sigma^2}\\right)$$\n", "$$ y'=\\frac{\\sum_{i=1}^n K_\\sigma(x'-x_i)y_i}{\\sum_{i=1}^nK_\\sigma(x'-x_i)}$$\n", "Please note that $\\theta_0 = \\sigma$:\n", "\n", "Non-parametric $\\neq$ no parameters!" ] },
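{ "cell_type": "markdown", "metadata": {}, "source": [ "Before the vectorized implementation below, here is a minimal scalar sketch of the same estimator on a made-up three-point dataset (the numbers are for illustration only):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "# The Gaussian kernel, exactly as in the formula above\n", "K = lambda x, s: np.exp(-x**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)\n", "x_toy = np.array([0.0, 1.0, 2.0])\n", "y_toy = np.array([0.0, 1.0, 0.0])\n", "w = K(0.5 - x_toy, 0.5)        # kernel weight of each sample for the query x' = 0.5\n", "print(w @ y_toy / w.sum())     # weighted average = prediction at x'" ] },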
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def GKR(x_predict, x_data, y_data, s):\n", "    # Gaussian kernel regression (the estimator defined above)\n", "    # Matrix of pairwise differences x_i - x'_j\n", "    dmat = np.tile(x_data, len(x_predict))\n", "    dmat = dmat - np.tile(x_predict, (len(x_data), 1))\n", "    # Gaussian kernel weights\n", "    K = np.exp(-(dmat*dmat)/(2*s*s))/(np.sqrt(2*np.pi)*s)\n", "    # Kernel-weighted average of the targets at each prediction point\n", "    return K.T.dot(y_data) / K.sum(axis=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_gkr(sigma=0.1):\n", "    y_gkr = GKR(x_vals, X_train, y_train, sigma)\n", "    fig, ax = plt.subplots()\n", "    train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',s=sigma*500,ax=ax)\n", "    ax.plot(x_vals, y_gkr, color='red', linewidth=1.0, label='regression line')\n", "    plt.legend()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ipywidgets import interact\n", "interact(plot_gkr, sigma=(0.01,1.0,0.01))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are much smarter ways to do this... for example, Gaussian Processes!\n", "\n", "## 3.- Labelling\n", "\n", "Consider now the SDSS star photometry data.\n", "\n", "Warning: we will do this *naively* (i.e., wrongly). During the rest of the week we will improve this...\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = 2000\n", "# Take the last N rows (reversed) and shuffle them\n", "star_sample = star_feat[-1:-N-1:-1]\n", "star_sample = star_sample.sample(n=N)\n", "star_train = star_sample[:int(N*0.75)]\n", "star_test = star_sample[int(N*0.75):]\n", "star_train" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "star_train[star_train['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n", "star_train[star_train['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "star_test[star_test['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n", "star_test[star_test['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fraction of Type-1 stars in each split and in the full dataset\n", "display(star_train['Type'].sum()/len(star_train))\n", "display(star_test['Type'].sum()/len(star_test))\n", "display(star_feat['Type'].sum()/len(star_feat))" ] },
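{ "cell_type": "markdown", "metadata": {}, "source": [ "These proportions suggest a sanity check before training anything: a minimal baseline sketch of the accuracy obtained by always predicting the majority class (any classifier below should beat this):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = star_train['Type'].mean()   # fraction of Type-1 stars in the training split\n", "print('majority-class baseline accuracy:', max(p, 1 - p))" ] },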
\n", "\n", "#### Support Vector Machines\n", "*Vocabulary:* This is a discriminative (non-parametric) linear model for a supervised batch learning problem" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVC\n", "clf = SVC(kernel='linear')\n", "clf.fit(star_train[['u-g','g-r']], star_train['Type'])\n", "y_pred = clf.predict(star_test[['u-g','g-r']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "star_test['Predict']=y_pred\n", "star_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "star_test['Predict'].sum()/len(star_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "star_test[star_test['Predict']==0.0].plot.scatter('u-g','g-r',c='red',ax=ax)\n", "star_test[star_test['Predict']==1.0].plot.scatter('u-g','g-r',c='blue',ax=ax)\n", "\n", "# Compute the boundary\n", "w = clf.coef_[0]\n", "a = -w[1] / w[0]\n", "yy = np.linspace(-0.1, 0.4)\n", "xx = a * yy - clf.intercept_[0] / w[0]\n", "\n", "ax.plot(xx, yy, '-k')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "FP = star_test[star_test['Predict']==1.0]; FP = FP[FP['Type']==0.0]\n", "FN = star_test[star_test['Predict']==0.0]; FN = FN[FN['Type']==1.0]\n", "TP = star_test[star_test['Predict']==1.0]; TP = TP[TP['Type']==1.0]\n", "TN = star_test[star_test['Predict']==0.0]; TN = TN[TN['Type']==0.0]\n", "fig, ax = plt.subplots()\n", "TP.plot.scatter('u-g','g-r',c='blue',ax=ax,label=\"TP\")\n", "TN.plot.scatter('u-g','g-r',c='red',ax=ax,label=\"TN\")\n", "FP.plot.scatter('u-g','g-r',c='magenta',ax=ax,label=\"FP\",marker='+',s=100)\n", "FN.plot.scatter('u-g','g-r',c='green',ax=ax,label=\"FN\",marker='+',s=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Radial Basis Function Kernel\n", "We can construct a hyperplane (line) in other space by transforming data to that space, and then come back. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Radial Basis Function Kernel\n", "We can construct a hyperplane (line) in another space by transforming the data to that space and then coming back. This is done using kernels:\n", "\n", "$${\\displaystyle K(\\mathbf {x} ,\\mathbf {x'} )=\\exp \\left(-{\\frac {\\|\\mathbf {x} -\\mathbf {x'} \\|^{2}}{2\\sigma ^{2}}}\\right)}$$\n", "\n", "Note that scikit-learn parametrizes this kernel as $\\exp(-\\gamma \\|x-x'\\|^2)$, so the `gamma` argument below corresponds to $1/(2\\sigma^2)$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.ndimage import gaussian_filter\n", "def plot_svm_rbf(gamma=20.0):\n", "    clf_rbf = SVC(kernel='rbf', gamma=gamma)\n", "    clf_rbf.fit(star_train[['u-g','g-r']], star_train['Type'])\n", "    y_pred_rbf = clf_rbf.predict(star_test[['u-g','g-r']])\n", "    star_test['PredictRBF'] = y_pred_rbf\n", "    xlim = (0.7, 1.35)\n", "    ylim = (-0.15, 0.4)\n", "    xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 101),\n", "                         np.linspace(ylim[0], ylim[1], 101))\n", "    # Predict on a grid and smooth it so the contour looks cleaner\n", "    Z = clf_rbf.predict(np.c_[xx.ravel(), yy.ravel()])\n", "    Z = Z.reshape(xx.shape)\n", "    Z = gaussian_filter(Z, 2)\n", "    FP = star_test[(star_test['PredictRBF']==1.0) & (star_test['Type']==0.0)]\n", "    FN = star_test[(star_test['PredictRBF']==0.0) & (star_test['Type']==1.0)]\n", "    TP = star_test[(star_test['PredictRBF']==1.0) & (star_test['Type']==1.0)]\n", "    TN = star_test[(star_test['PredictRBF']==0.0) & (star_test['Type']==0.0)]\n", "    fig, ax = plt.subplots()\n", "    TP.plot.scatter('u-g','g-r',c='blue',ax=ax,label=\"TP\")\n", "    TN.plot.scatter('u-g','g-r',c='red',ax=ax,label=\"TN\")\n", "    FP.plot.scatter('u-g','g-r',c='magenta',ax=ax,label=\"FP\",marker='+',s=100)\n", "    FN.plot.scatter('u-g','g-r',c='green',ax=ax,label=\"FN\",marker='+',s=100)\n", "    ax.contour(xx, yy, Z, [0.5], colors='k')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "interact(plot_svm_rbf, gamma=(0.1,300,10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2.- Clustering (Unsupervised)\n", "\n", "Now think of assigning labels to the Galaxy data without any previously labelled examples... and this time using all the dimensions!" ] },
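{ "cell_type": "markdown", "metadata": {}, "source": [ "Gaussian mixtures are sensitive to the relative scales of the features, so a quick look at them is worthwhile before clustering (a minimal sketch):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gal_sample.describe()" ] },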
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gal_sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def gal_4proj(axes):\n", "    ((ax1, ax2), (ax3, ax4)) = axes\n", "    gal_sample.plot.scatter('u-g','redshift',color='gray',alpha=0.1,ax=ax1)\n", "    gal_sample.plot.scatter('g-r','redshift',color='gray',alpha=0.1,ax=ax2)\n", "    gal_sample.plot.scatter('r-i','redshift',color='gray',alpha=0.1,ax=ax3)\n", "    gal_sample.plot.scatter('i-z','redshift',color='gray',alpha=0.1,ax=ax4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2,2)\n", "gal_4proj(axes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gaussian Mixture Model\n", "\n", "Consider a Gaussian Mixture Model with $k$ components over $d$-dimensional data:\n", "$$ \\mathcal{N}(x; \\mu, \\Sigma) = \\frac{\\exp \\left(-{\\frac{1}{2}}( x - \\mu )^{\\mathrm {T}}\\Sigma^{-1}(x - \\mu )\\right)}{\\sqrt {(2\\pi )^{d}|\\Sigma| }}$$\n", "$$ p(x) = \\displaystyle\\sum_{j=1}^{k} \\phi_j\\mathcal{N}(x; \\mu_j, \\Sigma_j)$$\n", "$$\\displaystyle\\sum_{j=1}^{k} \\phi_j = 1 $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.mixture import GaussianMixture\n", "colors = ['red','blue','green','magenta','cyan','orange']\n", "def clust_4proj(mix, axes, n):\n", "    # Color each of the four color-vs-redshift projections by cluster label\n", "    for dim in range(4):\n", "        ax = axes[int(dim/2), dim%2]\n", "        labels = mix.predict(gal_sample)\n", "        for i in range(n):\n", "            gal_sample[labels==i].plot.scatter(dim,'redshift',color=colors[i],alpha=0.1,ax=ax)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 4\n", "mix = GaussianMixture(n_components=n, covariance_type='full', max_iter=100)\n", "mix.fit(gal_sample)\n", "fig, axes = plt.subplots(2,2)\n", "clust_4proj(mix, axes, n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Vocabulary:* this is a generative parametric model for an unsupervised batch learning problem." ] },
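{ "cell_type": "markdown", "metadata": {}, "source": [ "How many components should we use? A common heuristic, sketched below, compares the Bayesian Information Criterion (`GaussianMixture.bic`) across candidate values of $n$ (lower is better):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for n_comp in range(2, 7):\n", "    m = GaussianMixture(n_components=n_comp, covariance_type='full', max_iter=100)\n", "    m.fit(gal_sample)\n", "    print(n_comp, 'components -> BIC:', m.bic(gal_sample))" ] },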
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib as mpl\n", "def GMM_4proj(gmm, axes, n):\n", "    # Draw a 2-sigma ellipse of each Gaussian component on every projection\n", "    for clust in range(n):\n", "        for dim in range(4):\n", "            dims = [dim, 4]\n", "            ax = axes[int(dim/2), dim%2]\n", "            cov = gmm.covariances_[clust]\n", "            cov = cov[dims][:, dims]\n", "            v, w = np.linalg.eigh(cov)\n", "            u = w[0] / np.linalg.norm(w[0])\n", "            angle = np.arctan2(u[1], u[0])\n", "            angle = 180 * angle / np.pi  # convert to degrees\n", "            v = 2. * np.sqrt(2.) * np.sqrt(v)\n", "            ell = mpl.patches.Ellipse(gmm.means_[clust, dims], v[0], v[1],\n", "                                      180 + angle, color=colors[clust])\n", "            ell.set_clip_box(ax.bbox)\n", "            ell.set_alpha(0.3)\n", "            ax.add_artist(ell)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def show_clusters(n=2):\n", "    mix = GaussianMixture(n_components=n, covariance_type='full', max_iter=100)\n", "    mix.fit(gal_sample)\n", "    fig, axes = plt.subplots(2,2)\n", "    gal_4proj(axes)\n", "    GMM_4proj(mix, axes, n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interact(show_clusters, n=(2,6,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4.- Characterizing\n", "\n", "## 4.1.- Dimensionality Reduction (PCA)\n", "Consider the Singular Value Decomposition of your data (in matrix form)\n", "$$\\mathbf{X} = \\mathbf{U}\\mathbf{\\Sigma}\\mathbf{W}^T$$\n", "Then, you can compute an affine transformation of your data such that\n", "$${\\displaystyle {\\begin{aligned}\\mathbf {X} ^{T}\\mathbf {X} &=\\mathbf {W} \\mathbf {\\Sigma } ^{T}\\mathbf {U} ^{T}\\mathbf {U} \\mathbf {\\Sigma } \\mathbf {W} ^{T}\\\\&=\\mathbf {W} \\mathbf {\\Sigma } ^{T}\\mathbf {\\Sigma } \\mathbf {W} ^{T}\\\\&=\\mathbf {W} \\mathbf {\\Sigma'}\\mathbf {W} ^{T}\\end{aligned}}}$$\n", "meaning that\n", "$$\\begin{align}\n", "\\mathbf{T} & = \\mathbf{X} \\mathbf{W} \\\\\n", " & = \\mathbf{U}\\mathbf{\\Sigma}\\mathbf{W}^T \\mathbf{W} \\\\\n", " & = \\mathbf{U}\\mathbf{\\Sigma}\n", "\\end{align}$$\n", "PCA for dimensionality reduction keeps only the first $L$ components:\n", "$$ \\mathbf{T}_L = \\mathbf{U}_L\\mathbf{\\Sigma}_L = \\mathbf{X} \\mathbf{W}_L $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import decomposition" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 4\n", "mix = GaussianMixture(n_components=n, covariance_type='full', max_iter=100)\n", "mix.fit(gal_sample)\n", "labels = mix.predict(gal_sample)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca = decomposition.PCA(n_components=3)\n", "pca.fit(gal_sample)\n", "lowd = pca.transform(gal_sample)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib notebook\n", "from mpl_toolkits.mplot3d import Axes3D\n", "fig = plt.figure(1, figsize=(7, 5))\n", "ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n", "ax.scatter(lowd[:, 0], lowd[:, 1], lowd[:, 2], c=labels, cmap=plt.cm.gist_rainbow,\n", "           edgecolor='k')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca_comp = pd.DataFrame(pca.components_)\n", "pca_comp.columns = ['u-g', 'g-r', 'r-i', 'i-z', 'redshift']\n", "pca_comp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mix = GaussianMixture(n_components=n, covariance_type='full', max_iter=100)\n", "mix.fit(lowd)\n", "labels_low = mix.predict(lowd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(1, figsize=(7, 5))\n", "ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n", "ax.scatter(lowd[:, 0], lowd[:, 1], lowd[:, 2], c=labels_low, cmap=plt.cm.gist_rainbow,\n", "           edgecolor='k')\n", "plt.show()" ] } ], "metadata": {
"kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }