{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Examples - Decision trees\n",
"--- \n",
"\n",
"### La Serena School for Data Science\n",
"\n",
"August 2017
\n",
"Instructors: P. Protopapas\n",
"\n",
"\n",
"\n",
"\n",
"***\n",
"\n",
"* Decision Trees\n",
"* Random Forests\n",
"* Ensemble Methods\n",
"\n",
"For more reading, check out An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (available through [Springer](http://www-bcf.usc.edu/~gareth/ISL/), as well as the scikit-learn [documentation](http://scikit-learn.org/stable/modules/tree.html).\n",
"\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# IMPORT STUFF\n",
"%matplotlib inline \n",
"\n",
"import io\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scipy.stats as stats\n",
"import matplotlib.pyplot as plt\n",
"import sklearn\n",
"import statsmodels.api as sm\n",
"from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz\n",
"\n",
"import seaborn as sns\n",
"sns.set_style(\"whitegrid\")\n",
"sns.set_context(\"poster\")\n",
"\n",
"# special matplotlib argument for improved plots\n",
"from matplotlib import rcParams\n",
"from IPython.display import Image\n",
"import pydotplus\n",
"\n",
"\n",
"\n",
"from sklearn.grid_search import GridSearchCV\n",
"from sklearn.cross_validation import train_test_split\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"from sklearn import tree\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# COLOR STUFF \n",
"from matplotlib.colors import ListedColormap\n",
"# cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])\n",
"cmap_light = ListedColormap(['#FFAAAA', '#AAAAFF'])\n",
"cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n",
"cm = plt.cm.RdBu\n",
"cm_bright = ListedColormap(['#FF0000', '#0000FF'])"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# A generic function to do CV\n",
"\n",
"def cv_optimize(clf, parameters, X, y, n_jobs=1, n_folds=5, score_func=None):\n",
" if score_func:\n",
" gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)\n",
" else:\n",
" gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)\n",
" gs.fit(X, y)\n",
"\n",
" best = gs.best_estimator_\n",
" return best"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #\n",
"# Important parameters\n",
"# indf - Input dataframe\n",
"# featurenames - vector of names of predictors\n",
"# targetname - name of column you want to predict (e.g. 0 or 1, 'M' or 'F', \n",
"# 'yes' or 'no')\n",
"# target1val - particular value you want to have as a 1 in the target\n",
"# mask - boolean vector indicating test set (~mask is training set)\n",
"# reuse_split - dictionary that contains traning and testing dataframes \n",
"# (we'll use this to test different classifiers on the same \n",
"# test-train splits)\n",
"# score_func - we've used the accuracy as a way of scoring algorithms but \n",
"# this can be more general later on\n",
"# n_folds - Number of folds for cross validation ()\n",
"# n_jobs - used for parallelization\n",
"# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #\n",
"\n",
"def do_classify(clf, parameters, indf, featurenames, targetname, target1val, mask=None, reuse_split=None, score_func=None, n_folds=5, n_jobs=1):\n",
" subdf=indf[featurenames]\n",
" X=subdf.values\n",
" y=(indf[targetname].values==target1val)*1\n",
" if mask.any() !=None:\n",
" print(\"using mask\")\n",
" Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]\n",
" if reuse_split !=None:\n",
" print(\"using reuse split\")\n",
" Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']\n",
" if parameters:\n",
" clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)\n",
" clf=clf.fit(Xtrain, ytrain)\n",
" training_accuracy = clf.score(Xtrain, ytrain)\n",
" test_accuracy = clf.score(Xtest, ytest)\n",
" print(\"############# based on standard predict ################\")\n",
" print(\"Accuracy on training data: %0.2f\" % (training_accuracy))\n",
" print(\"Accuracy on test data: %0.2f\" % (test_accuracy))\n",
" print(confusion_matrix(ytest, clf.predict(Xtest)))\n",
" print(\"########################################################\")\n",
" return(clf, Xtrain, ytrain, Xtest, ytest)\n"
]
},
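{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"*Aside:* a minimal sketch of how `do_classify` can be called. The dataframe `demo_df`, its columns `x1`, `x2`, `label`, and the mask `demo_mask` are made-up placeholder names used only to exercise the helper; with the wine data you would pass the real feature and target column names instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# Usage sketch on synthetic data (hypothetical names: demo_df, demo_mask, demo_clf)\n",
"rng = np.random.RandomState(0)\n",
"demo_df = pd.DataFrame({'x1': rng.randn(200), 'x2': rng.randn(200)})\n",
"demo_df['label'] = (demo_df['x1'] + demo_df['x2'] > 0).astype(int)\n",
"\n",
"# Boolean mask: True marks the rows used for training\n",
"demo_mask = rng.rand(len(demo_df)) < 0.7\n",
"\n",
"# Cross-validate max_depth, refit on the training rows, report train/test accuracy\n",
"demo_clf, demo_Xtr, demo_ytr, demo_Xte, demo_yte = do_classify(\n",
"    DecisionTreeClassifier(), {'max_depth': [1, 2, 3, 4]},\n",
"    demo_df, ['x1', 'x2'], 'label', 1, mask=demo_mask)\n"
]
},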
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"\n",
"# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #\n",
"# Plot tree containing only two covariates\n",
"# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - #\n",
"\n",
"def plot_2tree(ax, Xtr, Xte, ytr, yte, clf, plot_train = True, plot_test = True, lab = ['Feature 1', 'Feature 2'], mesh=True, colorscale=cmap_light, cdiscrete=cmap_bold, alpha=0.3, psize=10, zfunc=False):\n",
" # Create a meshgrid as our test data\n",
" plt.figure(figsize=(15,10))\n",
" plot_step= 0.05\n",
" xmin, xmax= Xtr[:,0].min(), Xtr[:,0].max()\n",
" ymin, ymax= Xtr[:,1].min(), Xtr[:,1].max()\n",
" xx, yy = np.meshgrid(np.arange(xmin, xmax, plot_step), np.arange(ymin, ymax, plot_step) )\n",
"\n",
" # Re-cast every coordinate in the meshgrid as a 2D point\n",
" Xplot= np.c_[xx.ravel(), yy.ravel()]\n",
"\n",
"\n",
" # Predict the class\n",
" Z = clf.predict( Xplot )\n",
"\n",
" # Re-shape the results\n",
" Z= Z.reshape( xx.shape )\n",
" cs = plt.contourf(xx, yy, Z, cmap= cmap_light, alpha=0.3)\n",
" \n",
" # Overlay training samples\n",
" if (plot_train == True):\n",
" plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cmap_bold, alpha=alpha,edgecolor=\"k\") \n",
" # and testing points\n",
" if (plot_test == True):\n",
" plt.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cmap_bold, alpha=alpha, marker=\"s\")\n",
"\n",
" plt.xlabel(lab[0])\n",
" plt.ylabel(lab[1])\n",
" plt.title(\"Boundary for decision tree classifier\",fontsize=7.5)"
]
},
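{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"*Aside:* `plot_2tree` expects exactly two feature columns. A minimal sketch, reusing the hypothetical `demo_*` objects from the `do_classify` sketch above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# Draw the decision boundary of the demo tree over the two synthetic features.\n",
"# (The first argument is unused by plot_2tree, which creates its own figure.)\n",
"plot_2tree(plt, demo_Xtr, demo_Xte, demo_ytr, demo_yte, demo_clf, lab=['x1', 'x2'])\n"
]
},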
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# This function creates images of tree models using pydotplus\n",
"# https://github.com/JWarmenhoven/ISLR-python\n",
"def print_tree(estimator, features, class_names=None, filled=True):\n",
" tree = estimator\n",
" names = features\n",
" color = filled\n",
" classn = class_names\n",
" \n",
" dot_data = io.StringIO()\n",
" export_graphviz(estimator, out_file=dot_data, feature_names=features, proportion=True, class_names=classn, filled=filled)\n",
" graph = pydotplus.graph_from_dot_data(dot_data.getvalue())\n",
" return(graph)"
]
},
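{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"*Aside:* `print_tree` returns a `pydotplus` graph, which can be rendered inline with `IPython.display.Image`, assuming the GraphViz binaries are installed on the system. A sketch using the hypothetical demo tree from above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# Render the fitted demo tree inline (requires the GraphViz system package).\n",
"demo_graph = print_tree(demo_clf, features=['x1', 'x2'], class_names=['0', '1'])\n",
"Image(demo_graph.create_png())\n"
]
},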
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Print decision tree model 'dt'\n",
"def display_dt(dt):\n",
" dummy_io = io.StringIO() \n",
" tree.export_graphviz(dt, out_file = dummy_io, proportion=True) \n",
" print(dummy_io.getvalue())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The wine aficionado "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can a wine maker predict how a wine will be received based on the chemical properties of the wine? Are there chemical indicators that correlate more strongly with the perceived \"quality\" of a wine?\n",
"\n",
"We examine the wine quality dataset hosted on the UCI website. This data records 11 chemical properties (such as the concentrations of sugar, citric acid, alcohol, pH etc.) of thousands of red and white wines from northern Portugal, as well as the quality of the wines, recorded on a scale from 1 to 10. In this problem, we will only look at the data for *red* wine."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import only the data for **red** wine from the dataset repository."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | fixed acidity | \n", "volatile acidity | \n", "citric acid | \n", "residual sugar | \n", "chlorides | \n", "free sulfur dioxide | \n", "total sulfur dioxide | \n", "density | \n", "pH | \n", "sulphates | \n", "alcohol | \n", "quality | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "7.4 | \n", "0.70 | \n", "0.00 | \n", "1.9 | \n", "0.076 | \n", "11.0 | \n", "34.0 | \n", "0.9978 | \n", "3.51 | \n", "0.56 | \n", "9.4 | \n", "5 | \n", "
1 | \n", "7.8 | \n", "0.88 | \n", "0.00 | \n", "2.6 | \n", "0.098 | \n", "25.0 | \n", "67.0 | \n", "0.9968 | \n", "3.20 | \n", "0.68 | \n", "9.8 | \n", "5 | \n", "
2 | \n", "7.8 | \n", "0.76 | \n", "0.04 | \n", "2.3 | \n", "0.092 | \n", "15.0 | \n", "54.0 | \n", "0.9970 | \n", "3.26 | \n", "0.65 | \n", "9.8 | \n", "5 | \n", "
3 | \n", "11.2 | \n", "0.28 | \n", "0.56 | \n", "1.9 | \n", "0.075 | \n", "17.0 | \n", "60.0 | \n", "0.9980 | \n", "3.16 | \n", "0.58 | \n", "9.8 | \n", "6 | \n", "
4 | \n", "7.4 | \n", "0.70 | \n", "0.00 | \n", "1.9 | \n", "0.076 | \n", "11.0 | \n", "34.0 | \n", "0.9978 | \n", "3.51 | \n", "0.56 | \n", "9.4 | \n", "5 | \n", "