# Introduction to Machine Learning

## 1 - Reactive vs Learning Agent

NOTE: We will use the `numpy` package for vector-based computations

### 1.1 - A very simple machine that learns a pattern

* A reactive agent is a program that reacts to a predefined set of rules (patterns), for example:

In [None]:
import numpy as np

def reactive_agent(x):
    # Fixed rule: the agent answers True for values above 10
    if x > 10.0:
        return True
    else:
        return False
 
vreact = np.vectorize(reactive_agent)

* Given some data, it applies the rules:



In [None]:
X = np.array([10.9, 5.34, 8.32, 12.43, 20.32, 7.24])
y = vreact(X)
print(y)

In [None]:
vtrue = np.mean(X[y==True])
vfalse = np.mean(X[y==False])
x = 10.75
print(np.abs(x - vtrue))
print(np.abs(x - vfalse))
print((vtrue + vfalse)/2)

* A learning agent learns from data (in this case, labeled examples) to infer the pattern.

In [None]:
def learning_agent(x, Data, labels):
    # Mean of the samples labeled True and of the samples labeled False
    v_true = np.mean(Data[labels==True])
    v_false = np.mean(Data[labels==False])
    # Distance from x to each class mean
    d_true = np.abs(x - v_true)
    d_false = np.abs(x - v_false)
    if d_true < d_false:
        return True
    else:
        return False
 

Let us define a random vector of data to test

In [None]:
vect = np.random.random(10)*20
learnag = lambda x: learning_agent(x,X,y)
vlearn = np.vectorize(learnag)
print(vect)
vlearn(vect)

Which pattern is the learning machine using?

In [None]:
def get_pattern():
    # Uses the global X and y; the decision threshold is the midpoint of the class means
    v_true = np.mean(X[y==True])
    v_false = np.mean(X[y==False])
    return (v_true + v_false)/2

get_pattern()
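
The nearest-mean rule in `learning_agent` answers `True` only when `x` is closer to the mean of the `True` samples, which (as long as that mean is the larger one) is the same as saying that `x` lies above the midpoint returned by `get_pattern`. The following is a minimal, self-contained sketch that checks this on the six sample values from section 1.1; the helper names `midpoint_threshold` and `nearest_mean_label` are illustrative and not part of the notebook.

```python
import numpy as np

# Sample data from section 1.1, labeled with the rule x > 10.
X_demo = np.array([10.9, 5.34, 8.32, 12.43, 20.32, 7.24])
y_demo = X_demo > 10.0

def midpoint_threshold(data, labels):
    """Midpoint between the two class means (illustrative helper)."""
    return (data[labels].mean() + data[~labels].mean()) / 2

def nearest_mean_label(x, data, labels):
    """True when x is strictly closer to the mean of the True class."""
    return abs(x - data[labels].mean()) < abs(x - data[~labels].mean())

threshold = midpoint_threshold(X_demo, y_demo)
print(threshold)  # about 10.76, not the value 10 used by the reactive agent

# Points just below and just above the midpoint receive different labels.
print(nearest_mean_label(threshold - 0.01, X_demo, y_demo))  # False
print(nearest_mean_label(threshold + 0.01, X_demo, y_demo))  # True
```

With only six samples the learned threshold differs from the true rule, which is what the next section explores as the data size grows.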

### 1.2 - How does the pattern evolve with the data size?

* Let us now replace `X` with a random vector and `y` with the output of the reactive agent on that vector.

In [None]:
scale = 20.0
def generate_data(n):
    # n uniform samples in [0, scale), labeled by the reactive agent
    X = np.random.rand(n)*scale
    y = vreact(X)
    return X, y

* And see how the pattern behaves...

In [None]:
X,y = generate_data(100)
get_pattern()

* Now, we will repeat that for several values of `n` many times. We will record the mean and the standard deviation for every `n` value.

In [None]:
reps = 100
nvals = np.arange(10, 1000, 10)
pmean = []
pvar = []
for n in nvals:
    pm = 0
    pv = 0
    for i in range(reps):
        X, y = generate_data(int(n))
        val = get_pattern()
        pm += val
        pv += val*val
    pm = pm/float(reps)
    # Standard deviation of the estimates: sqrt(E[v^2] - E[v]^2)
    pv = np.sqrt(pv/float(reps) - pm*pm)
    pmean.append(pm)
    pvar.append(pv)
pmean = np.array(pmean)
pvar = np.array(pvar)
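
The loop above accumulates the sum and the sum of squares of the estimates and then applies sqrt(E[v^2] - E[v]^2). As a sanity check, the sketch below compares that running-sum formula with `np.mean` and `np.std` on a stand-in sample; the seed and the array `vals` are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
vals = rng.random(100) * 20.0    # stand-in for 100 repeated pattern estimates

# Running-sum version, as in the loop above.
pm = vals.sum() / vals.size
pv = np.sqrt((vals * vals).sum() / vals.size - pm * pm)

# Library version: np.std uses the population formula (ddof=0) by default.
print(np.isclose(pm, np.mean(vals)), np.isclose(pv, np.std(vals)))
```

Both comparisons should agree up to floating-point rounding, so the inner bookkeeping could also be replaced by collecting the values in a list and calling `np.mean` and `np.std` on it.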

* And plot the mean and the standard deviation using the `matplotlib` library.

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
plt.plot(nvals,np.ones(len(nvals))*10,color='red',label='true pattern')
plt.plot(nvals,pmean,color='blue',label='mean')
plt.fill_between(nvals, pmean + pvar, pmean-pvar, facecolor='green', alpha=0.3,label="std")
plt.legend()
plt.ylabel("pattern")
plt.xlabel("n")
plt.show()

### 1.3 - What about wrong labels?
* Labels are not always reliable. 
* To simulate this, let us fix `n = 1000` and randomly modify labels.

In [None]:
reps = 200
n = 1000
jvals = np.arange(1, 1000, 5)
pmean = []
pvar = []
for j in jvals:
    pm = 0
    pv = 0
    for i in range(reps):
        X, y = generate_data(int(n))
        # Flip the labels of j distinct, randomly chosen samples
        inds = np.random.choice(y.size, size=j, replace=False)
        y[inds] = np.invert(y[inds])
        val = get_pattern()
        pm += val
        pv += val*val
    pm = pm/float(reps)
    # Standard deviation of the estimates: sqrt(E[v^2] - E[v]^2)
    pv = np.sqrt(pv/float(reps) - pm*pm)
    pmean.append(pm)
    pvar.append(pv)
pmean = np.array(pmean)
pvar = np.array(pvar)
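
The label corruption step can be looked at in isolation. The sketch below flips `j` labels in a small hand-written boolean array and counts how many entries changed; the array, the seed, and the name `noisy` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed for the sketch
labels = np.array([True, True, False, False, True, False, True, False])
noisy = labels.copy()

j = 3  # number of labels to corrupt
# Sampling without replacement guarantees j distinct positions.
inds = rng.choice(noisy.size, size=j, replace=False)
noisy[inds] = ~noisy[inds]

print(np.sum(noisy != labels))  # 3: exactly j labels differ
```

Because the positions are distinct, no label is flipped twice, so the number of wrong labels is exactly `j`.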

* And plot the mean and the standard deviation w.r.t. the number of randomly flipped labels.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(jvals,np.ones(len(jvals))*10,color='red',label='true pattern')
plt.plot(jvals,pmean,color='blue',label='mean')
plt.fill_between(jvals, pmean + pvar, pmean-pvar, facecolor='green', alpha=0.3,label="std")
plt.legend()
plt.ylabel("pattern")
plt.xlabel("modified elements")
plt.show()

## 2 - Exploratory Analysis in 7 Questions About Data

We will explore the FIFA 2019 data (you can find it on Kaggle).

Here is the textbook data science process.

![Data Science](Data-Science-Process.png)

However, in practice one goes back and forth between these steps during an exploratory data analysis.

### 2.1 - What is the data made of?

* FIFA data comes in a `.csv` format
* We will use pandas package as our data manager, and it can read CSVs!

In [None]:
import pandas

* We can read a comma separated values file as a pandas dataframe (i.e. a Table object).

In [None]:
table=pandas.read_csv("data.csv")
table['CAM']
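
For readers without `data.csv` at hand, the sketch below shows the same `read_csv` call on a tiny in-memory CSV. The three rows are made up and only mimic a few of the column names used in this notebook.

```python
import io
import pandas

# Made-up rows that only mimic a few column names of the real file.
csv_text = """Name,Age,Overall,Value
Player A,31,94,€110.5M
Player B,19,64,€600K
Player C,40,60,€0
"""
toy = pandas.read_csv(io.StringIO(csv_text))
print(toy.dtypes)    # Age and Overall are parsed as integers, Value stays text
print(toy['Value'])  # selecting one column returns a pandas Series
```

Note that a column such as `Value` is read as text because of the currency symbol and the suffix, which is why a conversion step is needed before plotting it.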

* To explore this data, first we need to check the column names and be sure about the semantics.

In [None]:
table.columns

In [None]:
table[['Name','Age','Nationality','Overall','Potential','Value']]

In [None]:
v=table['Value'][0]
v

In [None]:
type(v)

In [None]:
from IPython.display import display, HTML
# HTML hack to see images: wrap each image URL in an <img> tag
# (assumed markup: a plain <img> element pointing at the URL)
img_lst = []
for purl in table['Photo']:
    img_lst.append('<img src="' + str(purl) + '"/>')
table['Picture'] = img_lst
img_lst = []
for purl in table['Flag']:
    img_lst.append('<img src="' + str(purl) + '"/>')
table['Country'] = img_lst
img_lst = []
for purl in table['Club Logo']:
    img_lst.append('<img src="' + str(purl) + '"/>')
table['FCLogo'] = img_lst
# None disables column-width truncation so the full tag is rendered
pandas.set_option('display.max_colwidth', None)
t100 = table[1:100]
HTML(t100[['Picture','Name','Age','Nationality','Country','Club','FCLogo','Overall','Potential','Value']].to_html(escape=False))

In [None]:
%matplotlib inline
table.plot.scatter("Overall","Potential")

In [None]:
import matplotlib.pyplot as plt
plt.scatter(table.Age,table.Potential)
plt.xlabel("Age")
plt.ylabel("Potential")
plt.title("All")

### 2.2 - What do we need to fix in the data?
* Usually, not every field is filled in for every sample, and some values are stored in human-readable form (strings rather than numbers).
* Let us fix the currency values first.

In [None]:
import numpy as np
# Convert currency strings to floats expressed in millions
# The last character is the unit: 'M' (millions), 'K' (thousands) or '0' for a zero amount
table['Unit'] = table['Value'].str[-1]
table['ValueNum'] = np.where(table['Unit'] == '0', 0,
                             table['Value'].str[1:-1].str.replace(r'[a-zA-Z]', '', regex=True))
table['ValueNum'] = table['ValueNum'].astype(float)
table['ValueNum'] = np.where(table['Unit'] == 'M',
                             table['ValueNum'],
                             table['ValueNum']/1000)

# Same conversion for the wages
table['Unit2'] = table['Wage'].str[-1]
table['WageNum'] = np.where(table['Unit2'] == '0', 0,
                            table['Wage'].str[1:-1].str.replace(r'[a-zA-Z]', '', regex=True))
table['WageNum'] = table['WageNum'].astype(float)
table['WageNum'] = np.where(table['Unit2'] == 'M',
                            table['WageNum'],
                            table['WageNum']/1000)
table[['Value','ValueNum','Wage','WageNum']]
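
The same conversion can be written as a small per-string function, which makes the unit handling easier to read and test. This is only a sketch: `money_to_millions` is a hypothetical helper, and it assumes every string starts with a one-character currency symbol and ends with `M`, `K`, or a digit.

```python
def money_to_millions(text):
    """Convert strings such as '€110.5M', '€600K' or '€0' to millions (float)."""
    number = text[1:]                     # drop the leading currency symbol
    if number.endswith('M'):
        return float(number[:-1])         # already in millions
    if number.endswith('K'):
        return float(number[:-1]) / 1000  # thousands to millions
    return float(number) / 1_000_000      # plain amount, e.g. '€0'

for sample in ['€110.5M', '€600K', '€0']:
    print(sample, '->', money_to_millions(sample))
```

Under those assumptions, a column without missing entries could be converted with `table['Value'].map(money_to_millions)`; the vectorized `np.where` version above avoids the Python-level loop.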

* With numeric values, we can now explore more of the data!

In [None]:
plt.figure()
plt.scatter(table['Overall'],table['ValueNum'],alpha=0.3)
plt.xlabel("Overall")
plt.ylabel("Price")

### 2.3 - How do we organize the data?

In [None]:
grouped = table.groupby('Nationality')

In [None]:
cant=grouped.size()
top15 = cant.sort_values(ascending=False)[:15]
top15
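
To see what `groupby(...).size()` followed by `sort_values` produces, here is a minimal sketch on a made-up table; the names and nationalities are placeholders.

```python
import pandas

# Placeholder data: six players from three made-up nationalities.
toy = pandas.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Nationality': ['X', 'Y', 'X', 'Z', 'X', 'Y'],
})
# Count the rows in each group and sort from most to least frequent.
counts = toy.groupby('Nationality').size().sort_values(ascending=False)
print(counts.head(2))   # the two largest groups: X (3 rows) and Y (2 rows)
```

The result is a Series indexed by the group key, so slicing it with `[:15]` keeps the 15 most frequent nationalities.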

In [None]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
color=plt.cm.rainbow(np.linspace(0,1,top15.size))

i = 0
for country in top15.keys():
    # One scatter plot per country, each with its own color
    plt.figure()
    elms = grouped.groups[country]
    plt.scatter(table['Overall'][elms], table['ValueNum'][elms], c=color[i], alpha=0.3)
    plt.title(country)
    plt.xlabel("Overall")
    plt.ylabel("Price")
    i += 1

### 2.4 - How do we clean/select the data?

In [None]:
fulltab=table.dropna(axis=1)
print(str(len(table.columns) - len(fulltab.columns)) + " columns removed for incompleteness")
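
Note that `dropna(axis=1)` removes every column that contains at least one missing value and keeps all the rows. A minimal sketch on a made-up table:

```python
import numpy as np
import pandas

# Placeholder table: two of the three columns have a missing entry.
toy = pandas.DataFrame({
    'Age': [31, 19, 40],
    'Club': ['C1', None, 'C3'],
    'Overall': [94.0, np.nan, 60.0],
})
full = toy.dropna(axis=1)
print(list(full.columns))  # ['Age']
print(len(full))           # 3: no rows are removed
```

This is a strict filter: a single missing entry is enough to discard a whole column, which is why the number of removed columns is printed above.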

In [None]:
fulltab.columns

In [None]:
num_feat = ['Age', 'Overall', 'Potential', 'Special',
            'Acceleration', 'Aggression', 'Agility', 'Balance', 'BallControl',
            'Composure', 'Crossing', 'Curve', 'Dribbling',
            'FKAccuracy', 'Finishing', 'GKDiving', 'GKHandling', 'GKKicking',
            'GKPositioning', 'GKReflexes', 'HeadingAccuracy', 'Interceptions',
            'Jumping', 'LongPassing', 'LongShots', 'Marking', 'Penalties',
            'Positioning', 'Reactions',
            'ShortPassing', 'ShotPower', 'Skill Moves', 'SlidingTackle',
            'SprintSpeed', 'Stamina', 'StandingTackle', 'Strength', 'Vision',
            'Volleys', 'ValueNum', 'WageNum']
santab=fulltab[num_feat].astype(float)
santab

### 2.5 - Does the content of our data make sense?

In [None]:
import seaborn as sns
def plot_corr_matrix(data, features=None, annot=True, s=(16,10)):
    # Heatmap of the pairwise correlations, optionally restricted to some features
    fig = plt.figure(figsize=s)
    ax = fig.add_subplot(111)
    if features is None:
        corr = data.corr()
    else:
        corr = data[features].corr()
    ax = sns.heatmap(corr, annot=annot,
                     xticklabels=corr.columns,
                     yticklabels=corr.columns, cmap="seismic", vmin=-1, vmax=1)
    plt.title("Correlation Matrix", fontsize=15)
    plt.show()
 
plot_corr_matrix(santab,annot=False)

In [None]:
feat_select = ['Age','Overall',
 'Potential', 'Special','ValueNum','WageNum']
plot_corr_matrix(santab,features=feat_select)
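
The entries of the correlation matrix are Pearson coefficients (the default of `DataFrame.corr`), which range from -1 to 1. A minimal sketch with three hand-made variables shows the two extremes; it uses `np.corrcoef` rather than pandas, and the arrays are invented for the illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2.0 * a + 1.0   # increasing linear function of a
c = -a              # decreasing linear function of a

# Each row passed to corrcoef is treated as one variable.
corr = np.corrcoef([a, b, c])
print(np.round(corr, 3))
```

Here `a` and `b` have correlation 1 and `a` and `c` have correlation -1, which correspond to the two ends of the `seismic` color scale used in the heatmap.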

### 2.6 - Can we simplify things?

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(santab)
stdtab = pandas.DataFrame(scaler.transform(santab))
n = len(stdtab.columns)
sklearn_pca = PCA(n_components=n,random_state=1)
xpca = sklearn_pca.fit_transform(stdtab)
varx=sklearn_pca.explained_variance_ratio_
plt.plot(np.arange(1,n+1),varx.cumsum())
plt.ylabel("% of variance")
plt.xlabel("components")
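
To make the cumulative explained-variance curve less of a black box, the sketch below computes the same kind of ratio with a plain NumPy SVD instead of scikit-learn's `PCA`. The synthetic data (three features, the third almost a copy of the first) and the seed are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for the sketch
# Synthetic data: 200 samples, 3 features; the third nearly duplicates the first.
base = rng.normal(size=(200, 2))
data = np.column_stack([base[:, 0], base[:, 1],
                        base[:, 0] + 0.01 * rng.normal(size=200)])

# Standardize each column (zero mean, unit variance), as StandardScaler does.
std = (data - data.mean(axis=0)) / data.std(axis=0)

# The squared singular values are proportional to the variance of each component.
s = np.linalg.svd(std, compute_uv=False)
ratio = s**2 / np.sum(s**2)
print(ratio.cumsum())   # cumulative fraction of variance, ending at 1
```

Since one feature is redundant, two components already carry almost all the variance; the curve plotted above can be read the same way to decide how many components to keep.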

In [None]:
components = sklearn_pca.components_
ind=[]
for i in range(components.shape[0]):
    ind.append("PC"+str(i+1))
feature_weights= pandas.DataFrame(np.abs(components),columns=santab.columns,index=ind)
fig= plt.figure(figsize=(16,10))
ax= fig.add_subplot(111)
ax = sns.heatmap(feature_weights,cmap="jet",vmin=0,vmax=1)

In [None]:
n = 5
sklearn_pca = PCA(n_components=n,random_state=1)
ind=[]
for i in range(n):
    ind.append("PC"+str(i+1))
xpca = sklearn_pca.fit_transform(stdtab)
varx=sklearn_pca.explained_variance_ratio_
plt.plot(np.arange(1,n+1),varx.cumsum())
plt.ylabel("% of variance")
plt.xlabel("components")

In [None]:
transtab = pandas.DataFrame(xpca,columns=ind)
sns.pairplot(transtab,diag_kind="kde")

### 2.7 - Can we automate the pattern recognition?

In [None]:
from sklearn.cluster import KMeans, DBSCAN
from ipywidgets import interact
rad = 5.0
db = DBSCAN(rad,min_samples=50).fit(xpca) 
transtab['cluster']=db.labels_
sns.pairplot(transtab,vars=ind, hue="cluster")

In [None]:
tclust2 = table[transtab['cluster']==0]
HTML(tclust2[['Unnamed: 0','Picture','Name','Age','Country','FCLogo']].to_html(escape=False))

In [None]:
tclust1 = xpca[transtab['cluster']==0]
torig1 = table[transtab['cluster']==0].copy()
km = KMeans(5).fit(tclust1) 
newtab = pandas.DataFrame(tclust1,columns=ind)
newtab['cluster']=km.labels_
sns.pairplot(newtab,vars=ind, hue="cluster")
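
The idea behind k-means is close to the nearest-mean agent of section 1: each point is assigned to the closest center, and each center is then moved to the mean of its points. The sketch below implements these plain Lloyd iterations in NumPy on six hand-made points. It is not scikit-learn's implementation (which, among other things, uses a more careful initialization), and `lloyd_kmeans` is a hypothetical helper.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations (illustrative helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers

# Two well-separated groups of three points each.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = lloyd_kmeans(pts, k=2)
print(labels)   # the first three points share one label, the last three the other
```

Which group receives label 0 depends on the random starting centers, so only the grouping itself is meaningful.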

In [None]:
#newtab
#HTML(tclust2[['Unnamed: 0','Picture','Name','Age','Country','FCLogo']].to_html(escape=False))