{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 - Reactive vs Learning Agent\n", "\n", "NOTE: We will use the `numpy` package for vector-based computations\n", "\n", "### 1.1 - A very simple machine that learns a pattern\n", "\n", "* A reactive agent is a program that reacts to a predefined set of rules (patterns), for example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def reactive_agent(x):\n", " if x > 10.0:\n", " return True\n", " else:\n", " return False\n", " \n", "vreact = np.vectorize(reactive_agent)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Given some data, it applies the rules:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.array([10.9, 5.34, 8.32, 12.43, 20.32, 7.24])\n", "y = vreact(X)\n", "print(y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vtrue = np.mean(X[y==True])\n", "vfalse = np.mean(X[y==False])\n", "x = 10.75\n", "print(np.abs(x - vtrue))\n", "print(np.abs(x - vfalse))\n", "print((vtrue + vfalse)/2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* A learning agent, learns from data (in this case labels) to infer the pattern." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def learning_agent(x,Data,labels):\n", " v_true = np.mean(Data[labels==True])\n", " v_false = np.mean(Data[labels==False])\n", " d_true = np.abs(x - v_true)\n", " d_false = np.abs(x - v_false)\n", " if d_true < d_false:\n", " return True\n", " else:\n", " return False\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us define a random vector of data to test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vect = np.random.random(10)*20\n", "learnag = lambda x: learning_agent(x,X,y)\n", "vlearn = np.vectorize(learnag)\n", "print(vect)\n", "vlearn(vect)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which pattern is the learning machine using?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_pattern():\n", " v_true = np.mean(X[y==True])\n", " v_false = np.mean(X[y==False])\n", " return (v_true + v_false)/2\n", "\n", "get_pattern()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - How the pattern evolve with the data size?\n", "\n", "* Let us now change `X` and `y` for a random vector and the output of the ractive agent respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scale = 20.0\n", "def generate_data(n):\n", " X = np.random.rand(n)*scale\n", " y = vreact(X)\n", " return(X,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* And see how the pattern behave..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X,y = generate_data(100)\n", "get_pattern()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Now, we will repeat that for several values of `n` many times. We will record the mean and the standard deviation for every `n` value." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reps = 100\n", "nvals =np.arange(10,1000,10)\n", "pmean = []\n", "pvar = []\n", "for n in nvals:\n", " pm = 0\n", " pv = 0\n", " for i in range(reps):\n", " X,y = generate_data(int(n))\n", " val = get_pattern()\n", " pm += val \n", " pv += val*val\n", " pm=pm/np.float(reps)\n", " pv=np.sqrt(pv/np.float(reps)- pm*pm)\n", " pmean.append(pm) \n", " pvar.append(pv)\n", "pmean=np.array(pmean)\n", "pvar=np.array(pvar)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* And plot the mean and variance using the `matplotlib` library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib notebook\n", "import matplotlib.pyplot as plt\n", "plt.plot(nvals,np.ones(len(nvals))*10,color='red',label='true pattern')\n", "plt.plot(nvals,pmean,color='blue',label='mean')\n", "plt.fill_between(nvals, pmean + pvar, pmean-pvar, facecolor='green', alpha=0.3,label=\"std\")\n", "plt.legend()\n", "plt.ylabel(\"pattern\")\n", "plt.xlabel(\"n\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 - what about wrong labels?\n", "* Labels are not always reliable. \n", "* To simulate this, let us fix `n = 1000` and randomly modify labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reps = 200\n", "n = 1000\n", "jvals =np.arange(1,1000,5)\n", "pmean = []\n", "pvar = []\n", "for j in jvals: \n", " pm = 0\n", " pv = 0\n", " for i in range(reps):\n", " X,y = generate_data(int(n))\n", " inds = np.random.choice(y.size, size=j,replace=False)\n", " y[inds]=np.invert(y[inds])\n", " val = get_pattern()\n", " pm += val \n", " pv += val*val\n", " pm=pm/np.float(reps)\n", " pv=np.sqrt(pv/np.float(reps)- pm*pm)\n", " pmean.append(pm) \n", " pvar.append(pv)\n", "pmean=np.array(pmean)\n", "pvar=np.array(pvar)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* And plot the mean and variance w.r.t. the number of random changes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.plot(jvals,np.ones(len(jvals))*10,color='red',label='true pattern')\n", "plt.plot(jvals,pmean,color='blue',label='mean')\n", "plt.fill_between(jvals, pmean + pvar, pmean-pvar, facecolor='green', alpha=0.3,label=\"std\")\n", "plt.legend()\n", "plt.ylabel(\"pattern\")\n", "plt.xlabel(\"modified elements\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 - Exploratory Analysis in 7 Questions About Data\n", "\n", "We will explore the FIFA 2019 data (you can find it in Kaggle). \n", "\n", "Here is the textbook data science process.\n", "\n", "![Data Science](Data-Science-Process.png)\n", "\n", "However, in practice one goes back and forward to achieve an exploratory data analysis.\n", "\n", "### 2.1 -What is the data made of?\n", "\n", "* FIFA data comes in a `.csv` format\n", "* We will use pandas package as our data manager, and it can read CSVs!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can read a comma separated values file as a pandas dataframe (i.e. a Table object)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "table=pandas.read_csv(\"data.csv\")\n", "table['CAM']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* To explore this data, first we need to check the column names and be sure about the semantics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "table[['Name','Age','Nationality','Overall','Potential','Value']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "v=table['Value'][0]\n", "v" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(v)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "# HTML hack to see images\n", "img_lst = []\n", "for purl in table['Photo']:\n", " img_lst.append('')\n", "table['Picture']=img_lst\n", "img_lst = []\n", "for purl in table['Flag']:\n", " img_lst.append('')\n", "table['Country']=img_lst\n", "img_lst = []\n", "for purl in table['Club Logo']:\n", " img_lst.append('')\n", "table['FCLogo']=img_lst\n", "pandas.set_option('display.max_colwidth', -1)\n", "t100 = table[1:100]\n", "HTML(t100[['Picture','Name','Age','Nationality','Country','Club','FCLogo','Overall','Potential','Value']].to_html(escape=False))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "table.plot.scatter(\"Overall\",\"Potential\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.scatter(table.Age,table.Potential)\n", "plt.xlabel(\"Age\")\n", "plt.ylabel(\"Potential\")\n", "plt.title(\"All\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2.2 - What we need to fix of the Data?\n", "* Usually, not all fields are used for every sample, and some values are in human-readable form (not numerical, i.e., String).\n", "* Let us fix the currency values first" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "# Convert currency to floats\n", "table['Unit'] = table['Value'].str[-1]\n", "table['ValueNum'] = np.where(table['Unit'] == '0', 0, \n", " table['Value'].str[1:-1].replace(r'[a-zA-Z]',''))\n", "table['ValueNum'] = table['ValueNum'].astype(float)\n", "table['ValueNum'] = np.where(table['Unit'] == 'M', \n", " table['ValueNum'], \n", " table['ValueNum']/1000)\n", "\n", "table['Unit2'] = table['Wage'].str[-1]\n", "table['WageNum'] = np.where(table['Unit2'] == '0', 0, \n", " table['Wage'].str[1:-1].replace(r'[a-zA-Z]',''))\n", "table['WageNum'] = table['WageNum'].astype(float)\n", "table['WageNum'] = np.where(table['Unit2'] == 'M', \n", " table['WageNum'], \n", " table['WageNum']/1000)\n", "table[['Value','ValueNum','Wage','WageNum']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* That allowed us reach more data!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "plt.scatter(table['Overall'],table['ValueNum'],alpha=0.3)\n", "plt.xlabel(\"Overall\")\n", "plt.ylabel(\"Price\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 - How to we organize the data?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = table.groupby('Nationality')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cant=grouped.size()\n", "top15 = cant.sort_values(ascending=False)[:15]\n", "top15" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "import numpy as np\n", "color=plt.cm.rainbow(np.linspace(0,1,top15.size))\n", "\n", "i=0\n", "for country in top15.keys():\n", " plt.figure()\n", " elms=grouped.groups[country]\n", " plt.scatter(table['Overall'][elms],table['ValueNum'][elms],c=color[i],alpha=0.3)\n", " plt.title(country)\n", " plt.xlabel(\"Overall\")\n", " plt.ylabel(\"Price\")\n", " i+=1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2.4 - How do we clean/select the data?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fulltab=table.dropna(axis=1)\n", "print(str(len(table.columns) - len(fulltab.columns)) + \" columns removed for incompleteness\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fulltab.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "num_feat = ['Age', 'Overall', 'Potential', 'Special',\n", " 'Acceleration', 'Aggression', 'Agility', 'Balance', 'BallControl',\n", " 'Composure', 'Crossing', 'Curve', 'Dribbling',\n", " 'FKAccuracy', 'Finishing', 'GKDiving', 'GKHandling', 'GKKicking',\n", " 'GKPositioning', 'GKReflexes', 'HeadingAccuracy', 'Interceptions',\n", " 'Jumping', 'LongPassing', 'LongShots', 'Marking', 'Penalties',\n", " 'Positioning', 'Reactions',\n", " 'ShortPassing', 'ShotPower', 'Skill Moves', 'SlidingTackle',\n", " 'SprintSpeed', 'Stamina', 'StandingTackle', 'Strength', 'Vision',\n", " 'Volleys','ValueNum','WageNum']\n", "santab=fulltab[num_feat].astype(float)\n", "santab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 - Does the content of our data make sense?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "def plot_corr_matrix(data,features=None,annot=True,s=(16,10)):\n", " fig= plt.figure(figsize=s)\n", " ax= fig.add_subplot(111)\n", " if features is None:\n", " corr = data.corr()\n", " else:\n", " corr = data[features].corr()\n", " ax= sns.heatmap(corr,annot=annot,\n", " xticklabels=corr.columns,\n", " yticklabels=corr.columns, cmap=\"seismic\",vmin=-1,vmax=1)\n", " plt.title(\"Correlation Matrix\", fontsize = 15)\n", " plt.show()\n", " \n", "plot_corr_matrix(santab,annot=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feat_select = ['Age','Overall',\n", " 'Potential', 'Special','ValueNum','WageNum']\n", "plot_corr_matrix(santab,features=feat_select)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6 -Can we simplify things?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler().fit(santab)\n", "stdtab = pandas.DataFrame(scaler.transform(santab))\n", "n = len(stdtab.columns)\n", "sklearn_pca = PCA(n_components=n,random_state=1)\n", "xpca = sklearn_pca.fit_transform(stdtab)\n", "varx=sklearn_pca.explained_variance_ratio_\n", "plt.plot(np.arange(1,n+1),varx.cumsum())\n", "plt.ylabel(\"% of variance\")\n", "plt.xlabel(\"components\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "components = sklearn_pca.components_\n", "ind=[]\n", "for i in range(components.shape[0]):\n", " ind.append(\"PC\"+str(i+1))\n", "feature_weights= pandas.DataFrame(np.abs(components),columns=santab.columns,index=ind)\n", "fig= plt.figure(figsize=(16,10))\n", "ax= fig.add_subplot(111)\n", "ax = sns.heatmap(feature_weights,cmap=\"jet\",vmin=0,vmax=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 5\n", "sklearn_pca = PCA(n_components=n,random_state=1)\n", "ind=[]\n", "for i in range(n):\n", " ind.append(\"PC\"+str(i+1))\n", "xpca = sklearn_pca.fit_transform(stdtab)\n", "varx=sklearn_pca.explained_variance_ratio_\n", "plt.plot(np.arange(1,n+1),varx.cumsum())\n", "plt.ylabel(\"% of variance\")\n", "plt.xlabel(\"components\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transtab = pandas.DataFrame(xpca,columns=ind)\n", "sns.pairplot(transtab,diag_kind=\"kde\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.7 -Can we automatize the pattern recognition?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans, DBSCAN\n", "from ipywidgets import interact\n", "rad = 5.0\n", "db = DBSCAN(rad,min_samples=50).fit(xpca) \n", "transtab['cluster']=db.labels_\n", "sns.pairplot(transtab,vars=ind, hue=\"cluster\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tclust2 = table[transtab['cluster']==0]\n", "HTML(tclust2[['Unnamed: 0','Picture','Name','Age','Country','FCLogo']].to_html(escape=False))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tclust1 = xpca[transtab['cluster']==0]\n", "torig1 = table[transtab['cluster']==0].copy()\n", "km = KMeans(5).fit(tclust1) \n", "newtab = pandas.DataFrame(tclust1,columns=ind)\n", "newtab['cluster']=km.labels_\n", "sns.pairplot(newtab,vars=ind, hue=\"cluster\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#newtab\n", "#HTML(tclust2[['Unnamed: 0','Picture','Name','Age','Country','FCLogo']].to_html(escape=False))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }