{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1 - Reactive vs Learning Agent\n",
"\n",
"NOTE: We will use the `numpy` package for vector-based computations\n",
"\n",
"### 1.1 - A very simple machine that learns a pattern\n",
"\n",
"* A reactive agent is a program that reacts to a predefined set of rules (patterns), for example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def reactive_agent(x):\n",
" if x > 10.0:\n",
" return True\n",
" else:\n",
" return False\n",
" \n",
"vreact = np.vectorize(reactive_agent)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Given some data, it applies the rules:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = np.array([10.9, 5.34, 8.32, 12.43, 20.32, 7.24])\n",
"y = vreact(X)\n",
"print(y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vtrue = np.mean(X[y==True])\n",
"vfalse = np.mean(X[y==False])\n",
"x = 10.75\n",
"print(np.abs(x - vtrue))\n",
"print(np.abs(x - vfalse))\n",
"print((vtrue + vfalse)/2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* A learning agent, learns from data (in this case labels) to infer the pattern."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def learning_agent(x,Data,labels):\n",
" v_true = np.mean(Data[labels==True])\n",
" v_false = np.mean(Data[labels==False])\n",
" d_true = np.abs(x - v_true)\n",
" d_false = np.abs(x - v_false)\n",
" if d_true < d_false:\n",
" return True\n",
" else:\n",
" return False\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us define a random vector of data to test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vect = np.random.random(10)*20\n",
"learnag = lambda x: learning_agent(x,X,y)\n",
"vlearn = np.vectorize(learnag)\n",
"print(vect)\n",
"vlearn(vect)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which pattern is the learning machine using?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_pattern():\n",
" v_true = np.mean(X[y==True])\n",
" v_false = np.mean(X[y==False])\n",
" return (v_true + v_false)/2\n",
"\n",
"get_pattern()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 - How the pattern evolve with the data size?\n",
"\n",
"* Let us now change `X` and `y` for a random vector and the output of the ractive agent respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scale = 20.0\n",
"def generate_data(n):\n",
" X = np.random.rand(n)*scale\n",
" y = vreact(X)\n",
" return(X,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* And see how the pattern behave..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X,y = generate_data(100)\n",
"get_pattern()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Now, we will repeat that for several values of `n` many times. We will record the mean and the standard deviation for every `n` value."
]
},
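{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the full experiment, here is a minimal self-contained sketch of the accumulator trick used in the loop: accumulate the sum of the values and of their squares, then the standard deviation is $\\sqrt{E[x^2] - E[x]^2}$. (The tiny vector below is purely illustrative.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"# Accumulate sum(x) and sum(x^2); std = sqrt(E[x^2] - E[x]^2)\n",
"vals = np.array([1.0, 2.0, 3.0, 4.0])\n",
"pm = vals.sum() / len(vals)\n",
"pv = np.sqrt((vals**2).sum() / len(vals) - pm * pm)\n",
"print(pm, pv)  # same as np.mean(vals) and np.std(vals)"
]
},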
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reps = 100\n",
"nvals =np.arange(10,1000,10)\n",
"pmean = []\n",
"pvar = []\n",
"for n in nvals:\n",
" pm = 0\n",
" pv = 0\n",
" for i in range(reps):\n",
" X,y = generate_data(int(n))\n",
" val = get_pattern()\n",
" pm += val \n",
" pv += val*val\n",
" pm=pm/np.float(reps)\n",
" pv=np.sqrt(pv/np.float(reps)- pm*pm)\n",
" pmean.append(pm) \n",
" pvar.append(pv)\n",
"pmean=np.array(pmean)\n",
"pvar=np.array(pvar)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* And plot the mean and variance using the `matplotlib` library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib notebook\n",
"import matplotlib.pyplot as plt\n",
"plt.plot(nvals,np.ones(len(nvals))*10,color='red',label='true pattern')\n",
"plt.plot(nvals,pmean,color='blue',label='mean')\n",
"plt.fill_between(nvals, pmean + pvar, pmean-pvar, facecolor='green', alpha=0.3,label=\"std\")\n",
"plt.legend()\n",
"plt.ylabel(\"pattern\")\n",
"plt.xlabel(\"n\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 - what about wrong labels?\n",
"* Labels are not always reliable. \n",
"* To simulate this, let us fix `n = 1000` and randomly modify labels."
]
},
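{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below flips labels with `np.invert`, which on a boolean array is an element-wise logical NOT (the same as `~`). A minimal sketch on illustrative data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"labels = np.array([True, False, True, True])\n",
"flip = np.array([0, 2])\n",
"# np.invert on booleans negates each selected entry\n",
"labels[flip] = np.invert(labels[flip])\n",
"print(labels)  # [False False False  True]"
]
},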
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reps = 200\n",
"n = 1000\n",
"jvals =np.arange(1,1000,5)\n",
"pmean = []\n",
"pvar = []\n",
"for j in jvals: \n",
" pm = 0\n",
" pv = 0\n",
" for i in range(reps):\n",
" X,y = generate_data(int(n))\n",
" inds = np.random.choice(y.size, size=j,replace=False)\n",
" y[inds]=np.invert(y[inds])\n",
" val = get_pattern()\n",
" pm += val \n",
" pv += val*val\n",
" pm=pm/np.float(reps)\n",
" pv=np.sqrt(pv/np.float(reps)- pm*pm)\n",
" pmean.append(pm) \n",
" pvar.append(pv)\n",
"pmean=np.array(pmean)\n",
"pvar=np.array(pvar)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* And plot the mean and variance w.r.t. the number of random changes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.plot(jvals,np.ones(len(jvals))*10,color='red',label='true pattern')\n",
"plt.plot(jvals,pmean,color='blue',label='mean')\n",
"plt.fill_between(jvals, pmean + pvar, pmean-pvar, facecolor='green', alpha=0.3,label=\"std\")\n",
"plt.legend()\n",
"plt.ylabel(\"pattern\")\n",
"plt.xlabel(\"modified elements\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2 - Exploratory Analysis in 7 Questions About Data\n",
"\n",
"We will explore the FIFA 2019 data (you can find it in Kaggle). \n",
"\n",
"Here is the textbook data science process.\n",
"\n",
"![Data Science](Data-Science-Process.png)\n",
"\n",
"However, in practice one goes back and forward to achieve an exploratory data analysis.\n",
"\n",
"### 2.1 -What is the data made of?\n",
"\n",
"* FIFA data comes in a `.csv` format\n",
"* We will use pandas package as our data manager, and it can read CSVs!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* We can read a comma separated values file as a pandas dataframe (i.e. a Table object)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"table=pandas.read_csv(\"data.csv\")\n",
"table['CAM']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* To explore this data, first we need to check the column names and be sure about the semantics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"table.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"table[['Name','Age','Nationality','Overall','Potential','Value']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"v=table['Value'][0]\n",
"v"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(v)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from IPython.core.display import display, HTML\n",
"# HTML hack to see images\n",
"img_lst = []\n",
"for purl in table['Photo']:\n",
" img_lst.append('')\n",
"table['Picture']=img_lst\n",
"img_lst = []\n",
"for purl in table['Flag']:\n",
" img_lst.append('')\n",
"table['Country']=img_lst\n",
"img_lst = []\n",
"for purl in table['Club Logo']:\n",
" img_lst.append('')\n",
"table['FCLogo']=img_lst\n",
"pandas.set_option('display.max_colwidth', -1)\n",
"t100 = table[1:100]\n",
"HTML(t100[['Picture','Name','Age','Nationality','Country','Club','FCLogo','Overall','Potential','Value']].to_html(escape=False))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"table.plot.scatter(\"Overall\",\"Potential\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"plt.scatter(table.Age,table.Potential)\n",
"plt.xlabel(\"Age\")\n",
"plt.ylabel(\"Potential\")\n",
"plt.title(\"All\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2.2 - What we need to fix of the Data?\n",
"* Usually, not all fields are used for every sample, and some values are in human-readable form (not numerical, i.e., String).\n",
"* Let us fix the currency values first"
]
},
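{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a self-contained sketch of the conversion (the helper `parse_money` and the sample strings are illustrative, not from the dataset): drop the currency symbol, read the number, and scale by the `M`/`K` unit suffix so everything ends up in millions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"def parse_money(s):\n",
"    # '€110.5M' -> 110.5, '€565K' -> 0.565, '€0' -> 0.0 (values in millions)\n",
"    unit = s[-1]\n",
"    if unit == 'M':\n",
"        return float(s[1:-1])\n",
"    if unit == 'K':\n",
"        return float(s[1:-1]) / 1000.0\n",
"    return float(s[1:])\n",
"\n",
"pd.Series(['€110.5M', '€565K', '€0']).map(parse_money)"
]
},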
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"# Convert currency to floats\n",
"table['Unit'] = table['Value'].str[-1]\n",
"table['ValueNum'] = np.where(table['Unit'] == '0', 0, \n",
" table['Value'].str[1:-1].replace(r'[a-zA-Z]',''))\n",
"table['ValueNum'] = table['ValueNum'].astype(float)\n",
"table['ValueNum'] = np.where(table['Unit'] == 'M', \n",
" table['ValueNum'], \n",
" table['ValueNum']/1000)\n",
"\n",
"table['Unit2'] = table['Wage'].str[-1]\n",
"table['WageNum'] = np.where(table['Unit2'] == '0', 0, \n",
" table['Wage'].str[1:-1].replace(r'[a-zA-Z]',''))\n",
"table['WageNum'] = table['WageNum'].astype(float)\n",
"table['WageNum'] = np.where(table['Unit2'] == 'M', \n",
" table['WageNum'], \n",
" table['WageNum']/1000)\n",
"table[['Value','ValueNum','Wage','WageNum']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* That allowed us reach more data!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"plt.scatter(table['Overall'],table['ValueNum'],alpha=0.3)\n",
"plt.xlabel(\"Overall\")\n",
"plt.ylabel(\"Price\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 - How to we organize the data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"grouped = table.groupby('Nationality')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cant=grouped.size()\n",
"top15 = cant.sort_values(ascending=False)[:15]\n",
"top15"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"import numpy as np\n",
"color=plt.cm.rainbow(np.linspace(0,1,top15.size))\n",
"\n",
"i=0\n",
"for country in top15.keys():\n",
" plt.figure()\n",
" elms=grouped.groups[country]\n",
" plt.scatter(table['Overall'][elms],table['ValueNum'][elms],c=color[i],alpha=0.3)\n",
" plt.title(country)\n",
" plt.xlabel(\"Overall\")\n",
" plt.ylabel(\"Price\")\n",
" i+=1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2.4 - How do we clean/select the data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fulltab=table.dropna(axis=1)\n",
"print(str(len(table.columns) - len(fulltab.columns)) + \" columns removed for incompleteness\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fulltab.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"num_feat = ['Age', 'Overall', 'Potential', 'Special',\n",
" 'Acceleration', 'Aggression', 'Agility', 'Balance', 'BallControl',\n",
" 'Composure', 'Crossing', 'Curve', 'Dribbling',\n",
" 'FKAccuracy', 'Finishing', 'GKDiving', 'GKHandling', 'GKKicking',\n",
" 'GKPositioning', 'GKReflexes', 'HeadingAccuracy', 'Interceptions',\n",
" 'Jumping', 'LongPassing', 'LongShots', 'Marking', 'Penalties',\n",
" 'Positioning', 'Reactions',\n",
" 'ShortPassing', 'ShotPower', 'Skill Moves', 'SlidingTackle',\n",
" 'SprintSpeed', 'Stamina', 'StandingTackle', 'Strength', 'Vision',\n",
" 'Volleys','ValueNum','WageNum']\n",
"santab=fulltab[num_feat].astype(float)\n",
"santab"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.5 - Does the content of our data make sense?"
]
},
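{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder of how to read the heatmap below, a tiny synthetic example (column names are illustrative): a correlation near $+1$ means two columns move together, near $-1$ that they move oppositely."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# 'b' is a multiple of 'a' (corr = +1); 'c' decreases as 'a' increases (corr = -1)\n",
"df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 3, 2, 1]})\n",
"df.corr()"
]
},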
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"def plot_corr_matrix(data,features=None,annot=True,s=(16,10)):\n",
" fig= plt.figure(figsize=s)\n",
" ax= fig.add_subplot(111)\n",
" if features is None:\n",
" corr = data.corr()\n",
" else:\n",
" corr = data[features].corr()\n",
" ax= sns.heatmap(corr,annot=annot,\n",
" xticklabels=corr.columns,\n",
" yticklabels=corr.columns, cmap=\"seismic\",vmin=-1,vmax=1)\n",
" plt.title(\"Correlation Matrix\", fontsize = 15)\n",
" plt.show()\n",
" \n",
"plot_corr_matrix(santab,annot=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"feat_select = ['Age','Overall',\n",
" 'Potential', 'Special','ValueNum','WageNum']\n",
"plot_corr_matrix(santab,features=feat_select)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.6 -Can we simplify things?"
]
},
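{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind PCA in one synthetic example (the scale factors and seed are arbitrary): for a point cloud stretched along one direction, the first principal component captures almost all of the variance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"rng = np.random.RandomState(1)\n",
"# 200 points, far more spread along the first axis than the second\n",
"pts = rng.randn(200, 2) * np.array([5.0, 0.5])\n",
"ratios = PCA(n_components=2).fit(pts).explained_variance_ratio_\n",
"ratios  # the first component explains most of the variance"
]
},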
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"scaler = StandardScaler().fit(santab)\n",
"stdtab = pandas.DataFrame(scaler.transform(santab))\n",
"n = len(stdtab.columns)\n",
"sklearn_pca = PCA(n_components=n,random_state=1)\n",
"xpca = sklearn_pca.fit_transform(stdtab)\n",
"varx=sklearn_pca.explained_variance_ratio_\n",
"plt.plot(np.arange(1,n+1),varx.cumsum())\n",
"plt.ylabel(\"% of variance\")\n",
"plt.xlabel(\"components\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"components = sklearn_pca.components_\n",
"ind=[]\n",
"for i in range(components.shape[0]):\n",
" ind.append(\"PC\"+str(i+1))\n",
"feature_weights= pandas.DataFrame(np.abs(components),columns=santab.columns,index=ind)\n",
"fig= plt.figure(figsize=(16,10))\n",
"ax= fig.add_subplot(111)\n",
"ax = sns.heatmap(feature_weights,cmap=\"jet\",vmin=0,vmax=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n = 5\n",
"sklearn_pca = PCA(n_components=n,random_state=1)\n",
"ind=[]\n",
"for i in range(n):\n",
" ind.append(\"PC\"+str(i+1))\n",
"xpca = sklearn_pca.fit_transform(stdtab)\n",
"varx=sklearn_pca.explained_variance_ratio_\n",
"plt.plot(np.arange(1,n+1),varx.cumsum())\n",
"plt.ylabel(\"% of variance\")\n",
"plt.xlabel(\"components\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"transtab = pandas.DataFrame(xpca,columns=ind)\n",
"sns.pairplot(transtab,diag_kind=\"kde\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.7 -Can we automatize the pattern recognition?"
]
},
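{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before clustering the FIFA data, here is what a clustering step does on synthetic data (two well-separated blobs; the sizes and seed are arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.cluster import KMeans\n",
"rng = np.random.RandomState(0)\n",
"# Two blobs of 50 points each, centred 8 units apart\n",
"pts = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 8.0])\n",
"km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)\n",
"counts = np.bincount(km.labels_)\n",
"counts  # roughly 50 points in each cluster"
]
},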
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import KMeans, DBSCAN\n",
"from ipywidgets import interact\n",
"rad = 5.0\n",
"db = DBSCAN(rad,min_samples=50).fit(xpca) \n",
"transtab['cluster']=db.labels_\n",
"sns.pairplot(transtab,vars=ind, hue=\"cluster\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tclust2 = table[transtab['cluster']==0]\n",
"HTML(tclust2[['Unnamed: 0','Picture','Name','Age','Country','FCLogo']].to_html(escape=False))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tclust1 = xpca[transtab['cluster']==0]\n",
"torig1 = table[transtab['cluster']==0].copy()\n",
"km = KMeans(5).fit(tclust1) \n",
"newtab = pandas.DataFrame(tclust1,columns=ind)\n",
"newtab['cluster']=km.labels_\n",
"sns.pairplot(newtab,vars=ind, hue=\"cluster\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#newtab\n",
"#HTML(tclust2[['Unnamed: 0','Picture','Name','Age','Country','FCLogo']].to_html(escape=False))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}