"# Introduction to Bayesian Methods for Data Science\n",
"**Tamás Budavári** - budavari@jhu.edu \n",
"\n",
"- Bayesian inference\n",
"- Prior: proper vs improper\n",
"- Likelihood function\n",
"- Maximum Likelihood Estimation\n",
"- Link to least squares\n",
"- Online learning\n",
"- Sampling for posteriors\n",
"- Hypothesis testing\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bayesian Inference\n",
"\n",
"Rev. Thomas Bayes (c.1701-1761)\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "9e13cc51-e810-49c3-ac4c-82398c17211d"
}
},
"source": [
"### Joint & Conditional Probability\n",
"- Consider two events $X$ and $Y$. In general, their **joint probability**\n",
"\n",
">$\\displaystyle P(X, Y) \\neq P(X)\\,P(Y)$ \n",
">\n",
"> instead\n",
">\n",
">$\\displaystyle P(X, Y) = P(X)\\,P(Y \\lvert X)$ \n",
">\n",
"> where $P(Y \\lvert X)$ is the **conditional probability** of $Y$ given $X$\n",
"\n",
"- For example, if $X$ is the event of flipping heads and $Y$ is flipping tails on the same trial, then $P(X,Y)=0$ because $P(Y \\lvert X)=0$. \n",
"\n",
"- But on separate trials, the events would be independent and we would have $P(Y \\lvert X)=P(Y)$.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "4349fff5-7a9a-49be-a174-0f302cc3e27b"
}
},
"source": [
"### Bayes' Theorem\n",
"- For discrete events $X$ and $Y$, the joint probability\n",
"\n",
">$\\displaystyle P(X,Y) = P(X)\\,P(Y \\lvert X)$ \n",
">\n",
"> and \n",
">\n",
">$\\displaystyle P(Y,X) = P(Y)\\,P(X \\lvert Y)$ \n",
">\n",
"> Their equality yields\n",
">\n",
">$\\displaystyle P(X \\lvert Y) = \\frac{P(X)\\,P(Y \\lvert X)}{P(Y)}$ \n"
]
},
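{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numerical check of the theorem; the probabilities below are made up for illustration:\n",
"\n",
"```python\n",
"# hypothetical inputs: P(X), P(Y|X), P(Y|not X)\n",
"p_x = 0.3\n",
"p_y_given_x = 0.8\n",
"p_y_given_notx = 0.4\n",
"\n",
"# law of total probability for P(Y)\n",
"p_y = p_x * p_y_given_x + (1 - p_x) * p_y_given_notx\n",
"\n",
"# Bayes' theorem: P(X|Y) = P(X) P(Y|X) / P(Y)\n",
"p_x_given_y = p_x * p_y_given_x / p_y\n",
"\n",
"print(p_x_given_y)                            # about 0.4615\n",
"print(p_x * p_y_given_x, p_y * p_x_given_y)   # the same joint probability both ways\n",
"```"
]
},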
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "a4ed11a3-064c-4aba-afc5-27020039d755"
}
},
"source": [
"### Probability Densities\n",
"- The same holds in the continuous case, with PDFs\n",
"\n",
">$\\displaystyle P(X \\lvert y) = \\frac{P(X)\\,p(y \\lvert X)}{p(y)}$ \n",
">\n",
"> and\n",
">\n",
">$\\displaystyle p(x \\lvert Y) = \\frac{p(x)\\,P(Y \\lvert x)}{P(Y)}$ \n",
">"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "be3e9a65-83ef-4b03-a7e1-3fdc67433261"
}
},
"source": [
"- Also\n",
"\n",
">$\\displaystyle p(x \\lvert y) = \\frac{p(x)\\,p(y \\lvert x)}{p(y)}$ \n",
">\n",
"> where\n",
">\n",
">$\\displaystyle p(y) = \\int p(x)\\,p(y \\lvert x)\\,dx$ \n",
">\n",
"> to ensure that\n",
">\n",
">$\\displaystyle \\int p(x \\lvert y)\\,dx = 1$ "
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "c0cae685-cadf-4c80-abd1-5bfdaa093971"
}
},
"source": [
"### Probabilistic Model\n",
"- From data $D$ we can **infer** the parameters $\\theta$ of model $M$ \n",
"\n",
">$\\displaystyle p(\\theta \\lvert D) = \\frac{p(\\theta)\\,p(D \\lvert \\theta)}{p(D)}$ \n",
">\n",
"> or including the model $M$ explicitly\n",
">\n",
">$\\displaystyle p(\\theta \\lvert D,M) = \\frac{p(\\theta \\lvert M)\\,p(D \\lvert \\theta,M)}{p(D \\lvert M)}$ \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "6f060d06-d47e-4e84-9d27-313e2e9f2515"
}
},
"source": [
"### Terminology\n",
"- From data $D$ we can **infer** the parameters $\\theta$ of model $M$ \n",
"\n",
">$\\displaystyle p(\\theta \\lvert D) = \\frac{\\pi(\\theta)\\,{\\cal{}L}_{\\!D}(\\theta)}{Z}$ \n",
">\n",
"> where the normalization\n",
">\n",
">$\\displaystyle Z = \\int \\pi(\\theta)\\,{\\cal{}L}_{\\!D}(\\theta)\\ d\\theta $ \n",
"\n",
"- The **posterior** is proportional to the **prior** times the **likelihood function** "
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "b8920874-4b6e-4946-a26e-461506850628"
}
},
"source": [
"### Data\n",
"- A set of independent measurements\n",
"\n",
">$\\displaystyle D = \\Big\\{x_i\\Big\\}_{i=1}^N$\n",
"\n",
"- E.g., measuring the temperature in $N$ cities"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "28ccfbf6-3a33-4e28-a9f3-93e3e230a3fe"
}
},
"source": [
"### Model Parameterization\n",
"\n",
"- For example, the model is that all cities have the same temperature\n",
"\n",
"> We also need to state our prior knowledge about the temperature\n",
"\n",
"- Let $\\mu$ represent that temperature in all cities (same for all)\n",
"\n",
"> We pick an appropriate prior - often people say we use a \"flat\" prior because we don't know..."
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "fa034999-d4ca-4a7a-bfe1-738e03123fac"
}
},
"source": [
"### Alternative Parameterization\n",
"\n",
"- We could have chosen another parameterization, say $\\tan \\phi$ with $\\phi \\in \\left(-\\frac{\\pi}{2},\\frac{\\pi}{2} \\right)$\n",
"\n",
"> Clearly a \"flat prior\" means something different!\n",
" \n",
"> What should be the prior? Needs careful consideration!\n",
"\n",
"- Non-informative prior?\n",
"\n",
"> For more, see [Jeffreys](https://en.wikipedia.org/wiki/Harold_Jeffreys) prior"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "16aa483a-9bdc-419c-ab32-ddb981316486"
}
},
"source": [
"### What is the likelihood function?\n",
"\n",
"- For a set of independent measurements\n",
"\n",
">$\\displaystyle {\\cal L}_{\\!D}(\\mu) = p(D \\lvert \\mu) = p(\\{x_i\\!\\}\\lvert\\mu) = \\prod_{i=1}^N p(x_i\\lvert\\mu) = \\prod_{i=1}^N \\ell_{\\!i}(\\mu)$\n",
"\n",
"- For example, Gaussian uncertainties\n",
"\n",
">$\\displaystyle \\ell_{\\!i}(\\mu) = \\frac{1}{\\sqrt{2\\pi\\sigma_i^2}}\\ \\exp\\left\\{-\\frac{(x_i-\\mu)^2}{2\\sigma_i^2}\\right\\}$\n",
"\n"
]
},
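{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brute-force evaluation of this likelihood on a grid; the data values and uncertainties below are made up for illustration:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# hypothetical measurements with per-point uncertainties\n",
"x = np.array([1.2, 0.8, 1.1])\n",
"sigma = np.array([0.5, 0.3, 0.4])\n",
"\n",
"mu = np.linspace(-1, 3, 401)   # grid over the parameter\n",
"\n",
"# log-likelihood: sum of log ell_i(mu), up to a mu-independent constant\n",
"loglike = -0.5 * (((x[:, None] - mu[None, :]) / sigma[:, None])**2).sum(axis=0)\n",
"\n",
"mu_hat = mu[np.argmax(loglike)]   # maximum of the likelihood on the grid\n",
"print(mu_hat)                     # about 0.96\n",
"```"
]
},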
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "821311cd-c763-4c57-be59-e545cd81188e"
}
},
"source": [
"### Detour: Improper Priors\n",
"\n",
"- The posterior PDF is\n",
"\n",
">$\\displaystyle p(\\mu|D) = \\frac{\\pi(\\mu) \\prod {\\ell}_{\\!i}(\\mu)}{\\int \\pi(\\mu) \\prod {\\ell}_{\\!i}(\\mu)\\,d\\mu}\\ $ \n",
"\n",
"- Uniform prior?\n",
"\n",
"> Using $\\pi(\\mu)\\!=\\!1$ is not a normalizable density (an **improper prior**), but if the prior is flat over the interval where the likelihood function is non-zero, its normalization cancels from the ratio\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "37cdbbee-2d2a-4214-b26b-ba7a91ba566e"
}
},
"source": [
"### Estimation\n",
"\n",
"- Expected value\n",
"\n",
">$\\displaystyle \\int \\mu\\, p(\\mu \\lvert D)\\, d\\mu$\n",
"\n",
"- Variance: 2nd central moment\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "3ee5d8bb-95e3-48f1-bc1d-89b2874f040f"
}
},
"source": [
"### Maximum Likelihood Estimation\n",
"\n",
"- Maximizing ${\\cal{}L}$ is the same as minimizing $-\\log{\\cal{}L}$ \n",
"\n",
"> $\\displaystyle -\\log{\\cal{}L(\\mu)} = \\mathrm{const.} + \\sum_{i=1}^N \\frac{(x_i\\!-\\!\\mu)^2}{2\\sigma_i^2}$\n",
"\n",
"> Cf. the method of least squares\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"nbpresent": {
"id": "14fae3f4-c070-495a-a7b6-2253f5539878"
}
},
"source": [
"### Result\n",
"\n",
"- Weighted average! Using $w_i = 1 \\big/ \\sigma_i^2$\n",
"\n",
"> $\\displaystyle \\hat{\\mu} = \\frac{\\sum w_i x_i}{\\sum w_i}$\n",
"\n",
"- Also variance!\n",
"\n",
">$\\displaystyle \\frac{1}{\\sigma_{\\mu}^2} = \\sum w_i = \\sum \\frac{1}{\\sigma_i^2}$\n",
"\n",
"> If all measurements share the same $\\sigma$, we have\n",
" \n",
">$\\displaystyle \\frac{1}{\\sigma_{\\mu}^2} = \\frac{N}{\\sigma^2}\\ \\ \\ \\rightarrow\\ \\ \\ \\sigma_{\\mu} = \\sigma \\big/ \\sqrt{N}$\n",
"\n"
]
},
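{
"cell_type": "markdown",
"metadata": {},
"source": [
"The closed-form result above, sketched with made-up numbers:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# hypothetical measurements and their uncertainties\n",
"x = np.array([1.2, 0.8, 1.1])\n",
"sigma = np.array([0.5, 0.3, 0.4])\n",
"\n",
"w = 1 / sigma**2                   # inverse-variance weights\n",
"mu_hat = (w * x).sum() / w.sum()   # weighted average (the MLE)\n",
"sigma_mu = 1 / np.sqrt(w.sum())    # uncertainty of the estimate\n",
"\n",
"print(mu_hat, sigma_mu)            # about 0.963 and 0.216\n",
"```\n",
"\n",
"With heteroscedastic errors the weighted mean differs from the plain average, and $\\sigma_{\\mu}$ is smaller than any single $\\sigma_i$."
]
},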
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: your 1st classification problem \n",
"\n",
"Among the observed objects, 1% belong to a special type, e.g., quasars mixed in with many stars. A classification method correctly selects 99% of these special objects, but it also erroneously selects 0.5% of the other objects.\n",
"\n",
"What is the probability that an object selected by the method is of the special type?\n"
]
},
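{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check your answer with Bayes' rule (try it on paper first); the numbers below simply restate the problem:\n",
"\n",
"```python\n",
"p_special = 0.01       # prior fraction of the special type\n",
"p_sel_special = 0.99   # selection (true-positive) rate for special objects\n",
"p_sel_other = 0.005    # false-positive rate for the rest\n",
"\n",
"# total probability of being selected\n",
"p_sel = p_special * p_sel_special + (1 - p_special) * p_sel_other\n",
"\n",
"# Bayes: P(special | selected)\n",
"print(p_special * p_sel_special / p_sel)   # about 2/3\n",
"```"
]
},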
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"from matplotlib.pyplot import plot, xlabel, ylabel, rcParams\n",
"rcParams['figure.figsize'] = (4,3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: numerical integration in 1D \n",
"\n",
"Implement Bayes' rule to infer a constant based on $N$ (independent) measurements\n",
"1. Assume Gaussian likelihood with $\\sigma=1$ and improper prior\n",
"1. Use function `np.trapz(f,x)` for numerical integration\n",
"1. Start from the code below "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = np.random.randn(5) # fake data points from normal distribution\n",
"mu = np.linspace(-2,2,1000) # grid over the parameter\n",
"\n",
"# missing code here...\n",
"\n",
"plot(mu,pdf); xlabel('mu'); ylabel('posterior');\n",
"np.trapz(mu*pdf,mu) # expectation value"
]
},
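{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to fill in the missing step, shown self-contained below (write your own version first). With the flat improper prior the posterior is just the normalized likelihood:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"data = np.random.randn(5)       # fake data points from a normal distribution\n",
"mu = np.linspace(-2, 2, 1000)   # grid over the parameter\n",
"\n",
"# Gaussian likelihood with sigma = 1; flat improper prior pi(mu) = 1\n",
"lk = np.exp(-0.5 * ((data[:, None] - mu[None, :])**2).sum(axis=0))\n",
"\n",
"pdf = lk / np.trapz(lk, mu)          # normalized posterior on the grid\n",
"\n",
"post_mean = np.trapz(mu * pdf, mu)   # expectation value\n",
"print(post_mean)\n",
"```"
]
},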
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Another improper prior\n",
"\n",
"Uniform but cannot be negative, e.g., temperature in Kelvin\n",
"> $\n",
"\\pi(\\mu) = \\left\\{ \\begin{array}{ll}\n",
" 0 & \\mbox{if $\\mu < 0$} \\\\\n",
" 1 & \\mbox{if $\\mu \\geq 0$} \n",
"\\end{array}\\right. \n",
"$\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prior = np.ones_like(lk)\n",
"prior[mu < 0] = 0\n",
"\n",
"numerator = prior * lk\n",
"pdf0 = numerator / np.trapz(numerator,mu)\n",
"\n",
"plot(mu,pdf,'r')\n",
"plot(mu,pdf0,'g--') \n",
"xlabel('mu'); ylabel('posterior');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normal prior\n",
"\n",
"Compare with previous results for different $\\sigma$ values, i.e., `scale`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import norm as gauss\n",
"\n",
"plot(mu,pdf, 'r', alpha=0.9)\n",
"plot(mu,pdf0,'g--', alpha=0.9) \n",
"\n",
"for s in [5,0.2]:\n",
" numerator = lk * gauss.pdf(mu,scale=s)\n",
" pdfG = numerator / np.trapz(numerator,mu)\n",
" plot(mu,pdfG,'k:',lw=2, alpha=0.9)\n",
" \n",
"xlabel('mu'); ylabel('posterior');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiple Datasets\n",
"\n",
"- If the data set $D$ consists of two subsets $D_1$ and $D_2$, we can consider them together or separately\n",
"\n",
">$\\displaystyle p(\\theta \\lvert D_1,D_2) = \\frac{p(\\theta)\\, p(D_1, D_2 \\lvert \\theta)}{p(D_1, D_2)}$\n",
" \n",
"> also \n",
" \n",
">$\\displaystyle p(\\theta \\lvert D_1, D_2) = \\frac{p(\\theta \\lvert D_1)\\, p(D_2 \\lvert \\theta, D_1)}{p(D_2 \\lvert D_1)}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Because\n",
"\n",
">$\\displaystyle p(\\theta \\lvert D) = p(\\theta \\lvert \\color{green}{D_1}, \\color{red}{D_2}) = \\frac{p(\\theta \\lvert \\color{green}{D_1})\\, p(\\color{red}{D_2} \\lvert \\theta, \\color{green}{D_1})}{p(\\color{red}{D_2} \\lvert \\color{green}{D_1})}$\n",
">$\\displaystyle = \\frac{p(\\theta)\\,p(\\color{green}{D_1} \\lvert \\theta)\\, p(\\color{red}{D_2} \\lvert \\theta, \\color{green}{D_1})}{p(\\color{green}{D_1})\\,p(\\color{red}{D_2} \\lvert \\color{green}{D_1})}$\n",
">$\\displaystyle = \\frac{p(\\theta)\\,p(\\color{green}{D_1},\\color{red}{D_2} \\lvert \\theta)}{p(\\color{green}{D_1}, \\color{red}{D_2})}$\n",
">$\\displaystyle = \\frac{p(\\theta)\\,p(D \\lvert \\theta)}{p(D)}$\n",
"\n",
"- Incremental learning\n",
"\n",
"\n",
">$\\displaystyle D = \\big\\{ \\color{green}{D_1},\\ \\color{red}{D_2},\\ \\color{darkblue}{D_3}, \\dots, \\color{black}{D_N} \\big\\}$"
]
},
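{
"cell_type": "markdown",
"metadata": {},
"source": [
"A grid-based sketch of incremental learning: updating one measurement at a time gives the same posterior as using all the data at once. The data are hypothetical, with a Gaussian likelihood, $\\sigma=1$:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"theta = np.linspace(-3, 3, 1201)         # parameter grid\n",
"prior = np.ones_like(theta)              # flat prior on the grid\n",
"prior = prior / np.trapz(prior, theta)\n",
"\n",
"data = np.array([0.5, -0.2, 0.9, 0.1])   # hypothetical measurements\n",
"\n",
"# incremental: today's posterior is tomorrow's prior\n",
"post = prior.copy()\n",
"for x in data:\n",
"    post = post * np.exp(-0.5 * (x - theta)**2)\n",
"    post = post / np.trapz(post, theta)\n",
"\n",
"# batch: all measurements at once\n",
"batch = prior * np.exp(-0.5 * ((data[:, None] - theta[None, :])**2).sum(axis=0))\n",
"batch = batch / np.trapz(batch, theta)\n",
"\n",
"print(np.abs(post - batch).max())   # agree to floating-point precision\n",
"```"
]
},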
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Characterization of Posterior PDF\n",
"\n",
"- Mode, Mean, Covariance, etc... For example,\n",
"\n",
">$ \\displaystyle \\bar{\\theta} = \\int \\theta\\ p(\\theta \\lvert D)\\ d\\theta$\n",
">$ \\displaystyle = \\frac{\\int \\theta\\,\\pi(\\theta)\\,{\\cal{}L}(\\theta)\\,d\\theta}{\\int \\pi(\\theta)\\,{\\cal{}L}(\\theta)\\,d\\theta }$\n",
"\n",
"\n",
"- In general, numerical evaluation is required \n",
"\n",
"> Randomized algorithms;\n",
"> Sampling from distributions\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Caution!\n",
"\n",
"- Noisy likelihood function with false peak(s)\n",
" \n",
"> A spurious spike can produce a misleading MLE\n",
" \n",
"- Mean could be completely off\n",
"\n",
"> E.g., center of a ring "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sampling \n",
"\n",
"- How to calculate an integral such as\n",
"\n",
">$ \\displaystyle \\langle f(\\theta)\\rangle = \\int f(\\theta)\\,p(\\theta)\\,d\\theta $\n",
"\n",
"- Approximation using a sample $\\{\\theta_i\\}$ drawn from $p(\\cdot)$\n",
"\n",
">$ \\displaystyle \\langle f(\\theta)\\rangle \\approx \\frac{1}{n}\\sum_{i=1}^{n} f(\\theta_i) $\n",
"\n",
"- But we really don't know the posterior that well!"
]
},
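{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the sample average, using a standard normal as a stand-in for the posterior so the exact answer is known:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(42)\n",
"\n",
"theta = rng.standard_normal(100000)   # stand-in posterior samples\n",
"\n",
"# Monte Carlo estimate of <theta^2>; for N(0,1) the exact value is 1\n",
"estimate = (theta**2).mean()\n",
"print(estimate)\n",
"```"
]
},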
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sampling from Prior\n",
"\n",
"- The prior is better known \n",
"\n",
"> $ \\displaystyle \\langle f(\\theta)\\rangle =$\n",
">$ \\displaystyle \\frac{\\int f(\\theta)\\,\\pi(\\theta)\\,{\\cal{}L}(\\theta)\\,d\\theta}{\\int \\pi(\\theta)\\,{\\cal{}L}(\\theta)\\,d\\theta }$\n",
"\n",
"\n",
"- Approximation using a sample $\\{\\theta_i\\}$ drawn from $\\pi(\\cdot)$\n",
"\n",
">$ \\displaystyle \\langle f(\\theta)\\rangle \\approx \\frac{\\sum f(\\theta_i)\\,{\\cal{}L}(\\theta_i)}{\\sum {\\cal{}L}(\\theta_i)} $"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sampling from ...\n",
"\n",
"- E.g., likelihood?\n",
"\n",
">$ \\displaystyle \\langle f(\\theta)\\rangle \\approx \\frac{\\sum f(\\theta_i)\\,\\pi(\\theta_i)}{\\sum \\pi(\\theta_i)} $\n",
"\n",
"- What about something \"similar\"?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Importance Sampling\n",
"\n",
"- We usually have integrals such as\n",
"\n",
">$ \\displaystyle \\langle f(\\theta)\\rangle = \\int f(\\theta)\\,g(\\theta)\\,d\\theta$\n",
"\n",
"- If we cannot sample from $g(\\cdot)$ but can sample from some $h(\\cdot)$ \n",
"\n",
"> s.t. $\\ \\ \\ g(\\theta) \\leq K \\cdot h(\\theta) \\ \\ \\ $ for any $\\theta$ and a suitably large $K$\n",
"\n",
">$\\displaystyle \\langle f(\\theta)\\rangle \\approx \\frac{1}{n} \\sum_i^n f(\\theta_i)\\,\\frac{g(\\theta_i)}{h(\\theta_i)}$"
]
},
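{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of importance sampling with known densities: the target $g$ is a standard normal, $f(\\theta)=\\theta^2$ (exact answer 1), and the proposal $h$ is a wider normal that we can sample from:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n = 200000\n",
"\n",
"# proposal h: normal with scale 2, heavier-tailed than the target g\n",
"samples = 2 * rng.standard_normal(n)\n",
"\n",
"g = np.exp(-0.5 * samples**2) / np.sqrt(2 * np.pi)              # target density\n",
"h = np.exp(-0.5 * (samples / 2)**2) / (2 * np.sqrt(2 * np.pi))  # proposal density\n",
"\n",
"# weight each sample by g/h\n",
"estimate = (samples**2 * g / h).mean()\n",
"print(estimate)   # close to 1\n",
"```"
]
},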
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Markov-chain Monte Carlo a.k.a. MCMC\n",
"\n",
"- Instead of independent samples, produce a chain of samples in a special way\n",
"\n",
"> **Metropolis-Hastings**\n",
"> 1. Start from a random initial parameter set $\\theta_t = \\theta_0$\n",
"> 1. Obtain a new $\\theta'$ from a proposal distribution $Q(\\theta;\\theta_t)$\n",
"> 1. Accept $\\theta_{t+1} = \\theta'$ with probability $\\min\\big(1,\\ g(\\theta')/g(\\theta_t)\\big)$ for a symmetric $Q$; otherwise keep $\\theta_{t+1} = \\theta_t$\n",
"> 1. Let $t\\leftarrow t\\!+\\!1$ and go to Step 2.\n",
"\n",
"- Use the samples of the chain as if taken from the posterior PDF\n",
"\n",
" - Many other variants \n",
"\n",
"  - Watch out for burn-in, correlations, etc..."
]
},
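{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal Metropolis-Hastings sketch for a standard normal target; the proposal is a symmetric Gaussian step, so the $Q$ ratio drops out of the acceptance probability:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(1)\n",
"\n",
"def g(theta):                  # unnormalized target density\n",
"    return np.exp(-0.5 * theta**2)\n",
"\n",
"theta_t = 3.0                  # deliberately poor starting point\n",
"chain = []\n",
"for _ in range(50000):\n",
"    prop = theta_t + rng.normal()              # symmetric proposal Q\n",
"    if rng.uniform() < g(prop) / g(theta_t):   # accept with prob min(1, ratio)\n",
"        theta_t = prop\n",
"    chain.append(theta_t)\n",
"\n",
"chain = np.array(chain[5000:])   # discard burn-in\n",
"print(chain.mean(), chain.std()) # near 0 and 1 for this target\n",
"```"
]
},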
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### For example\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model Comparison\n",
"\n",
"- Bayesian hypothesis testing\n",
"\n",
"> Posterior probability of a model given the data vs another (odds)\n",
"