{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Practical Introduction to Machine Learning\n",
"by Mauricio Araya\n",
"\n",
"Credits: Francisco Foster, Matthew Graham, Pavlos Protopapas\n",
"\n",
"## 1.- SDSS Data \n",
" \n",
"\n",
"\n",
"Lets download data from the Sloan Digital Sky Survey, the all-time favorite dataset for Machine Learning in Astronomy. We could have used data from UCI or Kaggle, but I think SDSS data is very appropaite for this school ;).\n",
"\n",
"## 1.1.- Download Star Photometry Data using AstroML\n",
"We will start cheating a little bit by using the AstroML package\n",
"\n",
"`conda install -c astropy astroml`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from astroML.datasets import fetch_rrlyrae_combined\n",
"sdss_star_feat, sdss_star_type = fetch_rrlyrae_combined()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdss_star_type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and use the Pandas package..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import warnings\n",
"warnings.simplefilter(action='ignore', category=Warning)\n",
"%matplotlib inline\n",
"pd.set_option('display.max_rows',10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"as our data manager"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"star_feat=pd.DataFrame(sdss_star_feat)\n",
"star_feat.columns=['u-g', 'g-r', 'r-i', 'i-z']\n",
"star_feat.plot.scatter('u-g', 'g-r')\n",
"star_feat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data also have labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"import matplotlib.pyplot as plt\n",
"star_label=pd.DataFrame(sdss_star_type)\n",
"star_label.columns=['Type']\n",
"\n",
"fig, ax = plt.subplots()\n",
"star_feat[star_label['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n",
"star_feat[star_label['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"star_feat['Type']=star_label['Type']\n",
"star_feat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2.- Download Galaxy Photometry Data (with Redshifts)\n",
"This dataset are galaxies with known (spectroscopically confirmed) redshifts and colour magnitudes. We're interested in determining the redshift of a galaxy from its colors (photometric redshift). The data can be downloaded from: http://www.astro.caltech.edu/~mjg/sdss_gal.csv.gz, and we will use `urllib`to do this from the notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import urllib\n",
"urllib.request.urlretrieve(\"http://www.astro.caltech.edu/~mjg/sdss_gal.csv.gz\", \"sdss_gal.csv.gz\")\n",
"!gunzip sdss_gal.csv.gz"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"galaxy_feat = pd.read_csv('sdss_gal.csv', low_memory=False)\n",
"galaxy_feat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gal_sample = galaxy_feat.sample(n=1000)\n",
"gal_sample.plot.scatter('g-r','redshift',color='gray',alpha=0.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.- Regression\n",
"Regression is about predicting values of continous variables.\n",
"\n",
"$$y' = f(x' \\mid \\mathbf{X},\\mathbf{y})$$\n",
"\n",
"where $y \\in \\mathbb{R}^n$, $x \\in \\mathbb{R}^m$ and $\\mathbf{y}$ and $\\mathbf{X}$ are the target and non-target features for all the samples respectively.\n",
"\n",
"We will use the SDSS Galaxy data, well... a portion of it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple... not correct\n",
"train_data = gal_sample[:750]\n",
"test_data = gal_sample[750:]\n",
"y_train = train_data['redshift']\n",
"X_train = train_data['g-r']\n",
"# Formatting hack...\n",
"X_train=X_train.values.reshape(len(X_train), 1);\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.1.- Parametric Regression\n",
"\n",
"We can condense the information found in $\\mathbf{X}$ and $\\mathbf{y}$ by imposing a *parametric model*, meaning to optimize certain parameters for the given data. Now our model is\n",
"$$y' = f(x' ; \\theta^*)$$\n",
"where \n",
"$$\\theta^* = \\underset{\\theta}{\\operatorname{argmax}} \\left\\{ Pr(Y = f(X;\\theta) \\mid \\mathbf{X}, \\mathbf{y}) \\right\\}$$\n",
"which under a linear model and a Gaussian noise $\\epsilon$ assumption ($Y = f(X) + \\epsilon $) it becomes\n",
"$$ \\theta^* = \\underset{\\theta}{\\operatorname{argmin}} \\left\\{ \\sum_i (y_i - f(X_i;\\theta))^2 \\right\\}$$.\n",
"\n",
"Consider now a straight line as our model,\n",
"$$ f(x;\\theta) = a x + b$$\n",
"where our parameters are $\\theta = \\{a,b\\}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"import numpy as np\n",
"\n",
"regression = LinearRegression(fit_intercept=True)\n",
"regression.fit(X_train, y_train)\n",
"\n",
"regression_line = lambda x: regression.intercept_ + regression.coef_ * x\n",
"print('The equation of the regression line is: {} + {} * x'.format(regression.intercept_, regression.coef_[0]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"x_vals = np.linspace(0, 3, 100)\n",
"\n",
"train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n",
"ax.plot(x_vals, regression_line(x_vals), color='red', linewidth=1.0, label='regression line')\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not very good... lets try another linear model!\n",
"$$ y = ax^3 + bx^2 + cx + d $$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gen_poly_terms = PolynomialFeatures(degree=2)\n",
"X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n",
"poly_regression = LinearRegression(fit_intercept=True)\n",
"poly_regression.fit(X_train_with_poly, y_train)\n",
"display(poly_regression.coef_)\n",
"poly_regression.intercept_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"coef = poly_regression.coef_\n",
"inter = poly_regression.intercept_\n",
"poly = lambda x: inter + coef[1] * x + coef[2] * x*x \n",
"train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n",
"ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n",
"plt.legend()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gen_poly_terms = PolynomialFeatures(degree=3)\n",
"X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n",
"poly_regression = LinearRegression(fit_intercept=True)\n",
"poly_regression.fit(X_train_with_poly, y_train)\n",
"display(poly_regression.coef_)\n",
"poly_regression.intercept_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"coef = poly_regression.coef_\n",
"inter = poly_regression.intercept_\n",
"poly = lambda x: inter + coef[1] * x + coef[2] * x*x + coef[3]*x*x*x\n",
"train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n",
"ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n",
"plt.legend()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"test_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',ax=ax)\n",
"ax.plot(x_vals, poly(x_vals), color='red', linewidth=1.0, label='regression line')\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have basically *learned* the parameters!\n",
"\n",
"This is not the best we can do of course!, we can:\n",
"* Change the function/model\n",
"* Use more dimensions\n",
"* Go non-linear...\n",
"* Use more/better data\n",
"* Use regularized models\n",
"* etc...\n",
"\n",
"## 2.2 Non-parametric Regression\n",
"\n",
"Consolidating data into model parameters have some advantages and drawbacks. An alternative is to use non-parametric models. Now, we want to predict \n",
"$$ y' = f(x'; \\mathbf{X}, \\mathbf{y}, \\theta_0) $$\n",
"For example, consider a model based on assigning the same Gaussian function (Kernel) to each sample:\n",
"$$ K_\\sigma(x)=\\frac{1}{\\sqrt{2\\pi}\\sigma}exp\\left(\\frac{-x^2}{2\\sigma^2}\\right)$$\n",
"$$ y'=\\frac{\\sum_{i=1}^n K_\\sigma(x'-x_i)y_i}{\\sum_{i=1}^nK_\\sigma(x'-x_i)}$$\n",
"Please note that $\\theta_0 = \\sigma$: \n",
"\n",
"Non-parametric $\\neq$ no parameters!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def GKR(x_predict,x_data,y_data,s):\n",
" dmat = np.tile(x_data,len(x_predict))\n",
" dmat = dmat - np.tile(x_predict,(len(x_data),1))\n",
" K = np.exp(-(dmat*dmat)/(2*s*s))/(np.sqrt(2*np.pi)*s)\n",
" return(K.T.dot(y_data) / K.sum(axis=0))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def plot_gkr(sigma=0.1):\n",
" y_gkr=GKR(x_vals,X_train,y_train,sigma)\n",
" fig, ax = plt.subplots()\n",
" train_data.plot.scatter('g-r','redshift',color='gray',alpha=0.1,label='data',s=sigma*500,ax=ax)\n",
" ax.plot(x_vals, y_gkr, color='red', linewidth=1.0, label='regression line')\n",
" plt.legend()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ipywidgets import interact\n",
"interact(plot_gkr,sigma=(0.01,1.0,0.01))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are much smarter ways to do this... for example Gaussian Processes!\n",
"\n",
"## 3.- Labelling\n",
"\n",
"Consider now the SDSS star photometry data. \n",
"\n",
"Warning: we will do this *naively* (i.e., wrongly). During the rest of the week we will improve this...\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"N=2000\n",
"star_sample=star_feat[-1:-N-1:-1]\n",
"star_sample = star_sample.sample(n=N)\n",
"star_train = star_sample[:int(N*0.75)]\n",
"star_test = star_sample[int(N*0.75):]\n",
"star_train"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"star_train[star_train['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n",
"star_train[star_train['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"star_test[star_test['Type']==0].plot.scatter('u-g','g-r',c='red',ax=ax)\n",
"star_test[star_test['Type']==1].plot.scatter('u-g','g-r',c='blue',ax=ax)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(star_train['Type'].sum()/len(star_train))\n",
"display(star_test['Type'].sum()/len(star_test))\n",
"display(star_feat['Type'].sum()/len(star_feat))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1.- Classification (Supervised)\n",
"\n",
"Classification is labelling based on previously annotated samples.\n",
"\n",
"### Discriminative Classification Models\n",
"Think on a boundary dividing data. In 2 dimensions is a line/curve, in 3 dimensions a surface, in 4 dimensions a volume, and so on. The boundary divides data into classes. This is what is called a discriminative model. \n",
"\n",
"#### Support Vector Machines\n",
"*Vocabulary:* This is a discriminative (non-parametric) linear model for a supervised batch learning problem"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.svm import SVC\n",
"clf = SVC(kernel='linear')\n",
"clf.fit(star_train[['u-g','g-r']], star_train['Type'])\n",
"y_pred = clf.predict(star_test[['u-g','g-r']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"star_test['Predict']=y_pred\n",
"star_test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"star_test['Predict'].sum()/len(star_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"star_test[star_test['Predict']==0.0].plot.scatter('u-g','g-r',c='red',ax=ax)\n",
"star_test[star_test['Predict']==1.0].plot.scatter('u-g','g-r',c='blue',ax=ax)\n",
"\n",
"# Compute the boundary\n",
"w = clf.coef_[0]\n",
"a = -w[1] / w[0]\n",
"yy = np.linspace(-0.1, 0.4)\n",
"xx = a * yy - clf.intercept_[0] / w[0]\n",
"\n",
"ax.plot(xx, yy, '-k')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"FP = star_test[star_test['Predict']==1.0]; FP = FP[FP['Type']==0.0]\n",
"FN = star_test[star_test['Predict']==0.0]; FN = FN[FN['Type']==1.0]\n",
"TP = star_test[star_test['Predict']==1.0]; TP = TP[TP['Type']==1.0]\n",
"TN = star_test[star_test['Predict']==0.0]; TN = TN[TN['Type']==0.0]\n",
"fig, ax = plt.subplots()\n",
"TP.plot.scatter('u-g','g-r',c='blue',ax=ax,label=\"TP\")\n",
"TN.plot.scatter('u-g','g-r',c='red',ax=ax,label=\"TN\")\n",
"FP.plot.scatter('u-g','g-r',c='magenta',ax=ax,label=\"FP\",marker='+',s=100)\n",
"FN.plot.scatter('u-g','g-r',c='green',ax=ax,label=\"FN\",marker='+',s=100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Radial Basis Function Kernel\n",
"We can construct a hyperplane (line) in other space by transforming data to that space, and then come back. This is done using kernels\n",
"\n",
"$${\\displaystyle K(\\mathbf {x} ,\\mathbf {x'} )=\\exp \\left(-{\\frac {\\|\\mathbf {x} -\\mathbf {x'} \\|^{2}}{2\\sigma ^{2}}}\\right)}$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.ndimage import gaussian_filter\n",
"def plot_svm_rbf(gamma=20.0):\n",
" clf_rbf = SVC(kernel='rbf', gamma=gamma)\n",
" clf_rbf.fit(star_train[['u-g','g-r']], star_train['Type'])\n",
" y_pred_rbf = clf_rbf.predict(star_test[['u-g','g-r']])\n",
" star_test['PredictRBF']=y_pred_rbf\n",
" xlim = (0.7, 1.35)\n",
" ylim = (-0.15, 0.4)\n",
" xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 101),\n",
" np.linspace(ylim[0], ylim[1], 101))\n",
" Z = clf_rbf.predict(np.c_[ xx.ravel(),yy.ravel()])\n",
" Z = Z.reshape(xx.shape)\n",
" Z = gaussian_filter(Z, 2)\n",
" FP = star_test[star_test['PredictRBF']==1.0]; FP = FP[FP['Type']==0.0]\n",
" FN = star_test[star_test['PredictRBF']==0.0]; FN = FN[FN['Type']==1.0]\n",
" TP = star_test[star_test['PredictRBF']==1.0]; TP = TP[TP['Type']==1.0]\n",
" TN = star_test[star_test['PredictRBF']==0.0]; TN = TN[TN['Type']==0.0]\n",
" fig, ax = plt.subplots()\n",
" TP.plot.scatter('u-g','g-r',c='red',ax=ax,label=\"TP\")\n",
" TN.plot.scatter('u-g','g-r',c='blue',ax=ax,label=\"TN\")\n",
" FP.plot.scatter('u-g','g-r',c='green',ax=ax,label=\"FP\",marker='+',s=100)\n",
" FN.plot.scatter('u-g','g-r',c='magenta',ax=ax,label=\"FN\",marker='+',s=100)\n",
" ax.contour(xx, yy, Z, [0.5], colors='k')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"interact(plot_svm_rbf,gamma=(0.1,300,10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2. Clustering (Unsupervised)\n",
"\n",
"Now think trying to put labels but without knowing previous examples on the Galaxy data... but using now all the dimensions!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gal_sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def gal_4proj(axes):\n",
" ((ax1, ax2), (ax3, ax4)) = axes\n",
" gal_sample.plot.scatter('u-g','redshift',color='gray',alpha=0.1,ax=ax1)\n",
" gal_sample.plot.scatter('g-r','redshift',color='gray',alpha=0.1,ax=ax2)\n",
" gal_sample.plot.scatter('r-i','redshift',color='gray',alpha=0.1,ax=ax3)\n",
" gal_sample.plot.scatter('i-z','redshift',color='gray',alpha=0.1,ax=ax4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, axes =plt.subplots(2,2)\n",
"gal_4proj(axes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gaussian Mixture Model \n",
"\n",
"Consider a Gaussian Mixture Model:\n",
"$$ \\mathcal{N}(x; \\mu, \\Sigma) = \\frac{\\exp \\left(-{\\frac{1}{2}}( x - \\mu )^{\\mathrm {T}}\\Sigma^{-1}(x - \\mu )\\right)}{\\sqrt {(2\\pi )^{k}|\\Sigma| }}$$\n",
"$$ p(x) = \\displaystyle\\sum_{j=1}^{k} \\phi_j\\mathcal{N}(x; \\mu_j, \\Sigma_j)$$\n",
"$$\\displaystyle\\sum_{j=1}^{k} \\phi_j = 1 $$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.mixture import GaussianMixture\n",
"colors = ['red','blue','green','magenta','cyan','orange']\n",
"def clust_4proj(mix,axes,n):\n",
" for dim in range(4):\n",
" ax = axes[int(dim/2),dim%2]\n",
" labels=mix.predict(gal_sample)\n",
" for i in range(n):\n",
" gal_sample[labels==i].plot.scatter(dim,'redshift',color=colors[i],alpha=0.1,ax=ax)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n=4\n",
"mix = GaussianMixture(n_components=n,covariance_type='full', max_iter=100)\n",
"mix.fit(gal_sample)\n",
"fig, axes =plt.subplots(2,2)\n",
"clust_4proj(mix,axes,n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Vocabulary:* This is a generative parametric linear model for a unsupervised batch learning problem"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"import matplotlib as mpl\n",
"def GMM_4proj(gmm,axes,n):\n",
" for clust in range(n):\n",
" for dim in range(4):\n",
" dims=[dim,4]\n",
" ax = axes[int(dim/2),dim%2]\n",
" cov = gmm.covariances_[clust]\n",
" cov = cov[dims][:,dims]\n",
" v, w = np.linalg.eigh(cov)\n",
" u = w[0] / np.linalg.norm(w[0])\n",
" angle = np.arctan2(u[1], u[0])\n",
" angle = 180 * angle / np.pi # convert to degrees\n",
" v = 2. * np.sqrt(2.) * np.sqrt(v)\n",
" ell = mpl.patches.Ellipse(gmm.means_[clust,dims], v[0], v[1],\n",
" 180 + angle, color=colors[clust])\n",
" ell.set_clip_box(ax.bbox)\n",
" ell.set_alpha(0.3)\n",
" ax.add_artist(ell)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def show_clusters(n=2):\n",
" mix = GaussianMixture(n_components=n,covariance_type='full', max_iter=100)\n",
" mix.fit(gal_sample)\n",
" fig, axes =plt.subplots(2,2)\n",
" gal_4proj(axes)\n",
" GMM_4proj(mix,axes,n)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"interact(show_clusters,n=(2,6,1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.- Characterizing\n",
"\n",
"## Dimensionality Reduction (PCA)\n",
"Consider the Singular Value Decomposition of your data (in matrix form)\n",
"$$\\mathbf{X} = \\mathbf{U}\\mathbf{\\Sigma}\\mathbf{W}^T$$\n",
"Then, you can compute an affine transformation of your data such that\n",
"$${\\displaystyle {\\begin{aligned}\\mathbf {X} ^{T}\\mathbf {X} &=\\mathbf {W} \\mathbf {\\Sigma } ^{T}\\mathbf {U} ^{T}\\mathbf {U} \\mathbf {\\Sigma } \\mathbf {W} ^{T}\\\\&=\\mathbf {W} \\mathbf {\\Sigma } ^{T}\\mathbf {\\Sigma } \\mathbf {W} ^{T}\\\\&=\\mathbf {W} \\mathbf {\\Sigma'}\\mathbf {W} ^{T}\\end{aligned}}}$$\n",
"Meaning that\n",
"$$\\begin{align}\n",
"\\mathbf{T} & = \\mathbf{X} \\mathbf{W} \\\\\n",
" & = \\mathbf{U}\\mathbf{\\Sigma}\\mathbf{W}^T \\mathbf{W} \\\\\n",
" & = \\mathbf{U}\\mathbf{\\Sigma}\n",
"\\end{align}$$\n",
"PCA for dimensionality reduction is basically \n",
"$$ \\mathbf{T}_L = \\mathbf{U}_L\\mathbf{\\Sigma}_L = \\mathbf{X} \\mathbf{W}_L $$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import decomposition"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n=4\n",
"mix = GaussianMixture(n_components=n,covariance_type='full', max_iter=100)\n",
"mix.fit(gal_sample)\n",
"labels=mix.predict(gal_sample) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pca = decomposition.PCA(n_components=3)\n",
"pca.fit(gal_sample)\n",
"lowd = pca.transform(gal_sample)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib notebook\n",
"from mpl_toolkits.mplot3d import Axes3D\n",
"fig = plt.figure(1, figsize=(7, 5))\n",
"ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n",
"ax.scatter(lowd[:, 0], lowd[:, 1], lowd[:, 2], c=labels, cmap=plt.cm.gist_rainbow,\n",
" edgecolor='k')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pca_comp=pd.DataFrame(pca.components_)\n",
"pca_comp.columns=[['u-g', 'g-r', 'r-i', 'i-z','redshift']]\n",
"pca_comp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mix = GaussianMixture(n_components=n,covariance_type='full', max_iter=100)\n",
"mix.fit(lowd)\n",
"labels_low=mix.predict(lowd)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig = plt.figure(1, figsize=(7, 5))\n",
"ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n",
"ax.scatter(lowd[:, 0], lowd[:, 1], lowd[:, 2], c=labels_low, cmap=plt.cm.gist_rainbow,\n",
" edgecolor='k')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}