Bayesian Linear Regression

The Bayesian linear regression formulation allows us to obtain uncertainty estimates for the predictive distribution that are not available from its point-estimate counterpart. This notebook is based on Chapter 3 of Bishop's Pattern Recognition and Machine Learning book.
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt
%matplotlib inline
Generate sample dataset
Generate $N$ pairs $(x_i, y_i)$ from the line $y = 0.3x - 0.8$, corrupted with additive Gaussian noise of standard deviation $\sigma = 0.1$:
N = 12
sigma = 0.1
x = np.random.uniform(low=-1, high=1, size=N)
n = np.random.normal(loc=0, scale=sigma, size=N)
y = 0.3*x - 0.8 + n
plt.plot(x, y, 'r.')
plt.show()
Point estimate
We are trying to design a model of the form

$$\hat{y} = \mathbf{w}^T\mathbf{x} = w_0 x + w_1$$

where $\mathbf{x} = (x, 1)^T$ and the weight vector $\mathbf{w} = (w_0, w_1)^T$ holds the slope and the intercept of the line.
Note that this model and noise assumption result in the following likelihood function:

$$p(y|x,\mathbf{w}) = \mathcal{N}(y \mid \mathbf{w}^T\mathbf{x}, \sigma^2)$$

In general we aim for the Least Squares (LS) solution:

$$\mathbf{w}_{LS} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2$$

Note that the LS solution is equivalent to the Maximum Likelihood Estimator under this Gaussian noise model. The solution can be obtained by minimizing the loss function through Gradient Descent. However, in the case of this simple linear model it is possible to use the normal equations (the closed-form result of the minimization):

$$\mathbf{w} = (X^TX)^{-1}X^TY$$

where $X$ stacks the vectors $\mathbf{x}_i^T$ as rows and $Y$ stacks the targets $y_i$.
#Design matrix: each row is (x_i, 1); the constant column models the intercept
X = np.zeros((x.shape[0], 2))
X[:,0] = x
X[:,1] = 1
X
array([[ 0.07747538, 1. ],
[-0.72983355, 1. ],
[-0.08385175, 1. ],
[ 0.04152017, 1. ],
[-0.27236207, 1. ],
[-0.16471106, 1. ],
[ 0.43409736, 1. ],
[-0.33582112, 1. ],
[-0.48323886, 1. ],
[ 0.54369188, 1. ],
[-0.29194542, 1. ],
[ 0.98406384, 1. ]])
#Normal equations: w = (X^T X)^{-1} X^T Y
w = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)
w
array([ 0.28106915, -0.75913605])
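As mentioned above, the same solution can also be reached by minimizing the loss with Gradient Descent; a minimal sketch (the learning rate and iteration count are illustrative choices, not part of the original notebook):

#Gradient descent on the mean squared error
w_gd = np.zeros(2)
lr = 0.1
for _ in range(1000):
    grad = 2/N*np.dot(X.T, np.dot(X, w_gd) - y)
    w_gd -= lr*grad
w_gd  #converges towards the closed-form solution above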
However, this solution only provides a point estimate and lacks uncertainty information.
Bayesian inference
In turn, a Bayesian approach treats the weight vector $\mathbf{w}$ as a random variable and infers its full posterior distribution $p(\mathbf{w}|X,Y)$ rather than a single point estimate.
In order to keep the solution in closed form, we use a Gaussian prior, which is conjugate to the Gaussian likelihood, for the vector $\mathbf{w}$:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$$

which then results in a Gaussian posterior:

$$p(\mathbf{w}|X,Y) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

with

$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \frac{1}{\sigma^2}X^TX, \qquad \mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \frac{1}{\sigma^2}X^TY\right)$$

For simplicity, let's assume a zero-mean isotropic prior, $\mathbf{m}_0 = \mathbf{0}$ and $\mathbf{S}_0^{-1} = \alpha\mathbf{I}$, with $\alpha = 0.2$:
#prior parameters
a = 0.2
m0 = np.zeros(2)

def getPosterior(n):
    #Get n points from sample dataset
    x_ = X[:n]
    y_ = y[:n]
    #Covariance matrix: S_N^{-1} = S_0^{-1} + (1/sigma^2) X^T X
    S0I = a*np.identity(2)
    SnI = S0I + 1/sigma**2*np.dot(x_.T, x_)
    Sn = np.linalg.inv(SnI)
    #Mean: m_N = S_N (S_0^{-1} m_0 + (1/sigma^2) X^T Y)
    tt = np.dot(S0I, m0) + 1/sigma**2*np.dot(x_.T, y_)
    Mn = np.dot(Sn, tt)
    return multivariate_normal(mean=Mn, cov=Sn)
def plot_dist2D(dist):
    #Grid over the two weight dimensions
    w0, w1 = np.mgrid[-1:1:.01, -1:1:.01]
    pos = np.empty(w0.shape + (2,))
    pos[:, :, 0] = w0
    pos[:, :, 1] = w1
    plt.contourf(w0, w1, dist.pdf(pos))
    plt.title('Posterior Distribution $p(w|X,Y)$')
    plt.xlabel('w0')
    plt.ylabel('w1')
Posterior distribution plots
We can plot the posterior after aggregating different numbers of points. Observe how the posterior distribution becomes narrower as more observations are incorporated:
plot_dist2D(getPosterior(1))
plot_dist2D(getPosterior(4))
plot_dist2D(getPosterior(6))
plot_dist2D(getPosterior(10))
The full posterior (when all points are incorporated) will have its peak at the mean $\mathbf{m}_N$, close to the true parameters $(0.3, -0.8)$ used to generate the data.
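As a quick numerical check (a sketch, not in the original notebook), the posterior mean computed from all $N$ points should land close to the true parameters used to generate the data:

getPosterior(N).mean  #expected to be close to (0.3, -0.8)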
The predictive distribution
Although we have estimated the posterior over the parameters $\mathbf{w}$, what we are ultimately interested in is the prediction $\hat{y}$ for a new input $x$. This is obtained by marginalizing over the posterior:

$$p(\hat{y}|x, X, Y) = \int p(\hat{y}|x, \mathbf{w})\, p(\mathbf{w}|X, Y)\, d\mathbf{w}$$

Since both the likelihood and the posterior are Gaussian, this predictive distribution is also Gaussian:

$$p(\hat{y}|x, X, Y) = \mathcal{N}\left(\hat{y} \mid \mathbf{m}_N^T\mathbf{x},\ \sigma^2 + \mathbf{x}^T\mathbf{S}_N\mathbf{x}\right)$$

Note that the variance of the predictive distribution depends both on the assumed noise model ($\sigma^2$) and on the remaining uncertainty about the parameters ($\mathbf{x}^T\mathbf{S}_N\mathbf{x}$), which shrinks as more data is observed.
def predictive(x, nTrainingSamples):
    #Feature vector (x, 1) for the new input
    xp = np.array([x, 1.0])
    #Get posterior given nTrainingSamples
    posterior = getPosterior(nTrainingSamples)
    Mn = posterior.mean
    Sn = posterior.cov
    #Predictive mean: m_N^T x
    m = np.dot(Mn, xp)
    #Predictive variance: sigma^2 + x^T S_N x
    s = sigma**2 + np.dot(xp, np.dot(Sn, xp))
    return multivariate_normal(mean=m, cov=s)
def plot_dist1D(dist):
    x = np.linspace(-4, 4, 100)
    y = dist.pdf(x)
    plt.plot(y, x)
    plt.title(r'Predictive Distribution $p(\hat{y}|x, X,Y)$')
    plt.xlabel('pdf')
    plt.ylabel(r'$\hat{y}$')
We now observe how the predictive distribution becomes more certain as more training data is incorporated:
#New value of x where we want to predict y
#(named x_new so it does not overwrite the training inputs x)
x_new = 1.2
plot_dist1D(predictive(x_new, 2))
plot_dist1D(predictive(x_new, 6))
plot_dist1D(predictive(x_new, 12))
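The shrinking uncertainty can also be read directly off the predictive standard deviation; a small sketch (not part of the original notebook):

#Predictive standard deviation at x=1.2 for increasing amounts of training data
for n in (2, 6, 12):
    s2 = np.asarray(predictive(1.2, n).cov).item()
    print(f'n={n}: predictive std = {np.sqrt(s2):.3f}')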
We can also observe how the uncertainty changes with the value of $x$:
plot_dist1D(predictive(1.2, 12))
plot_dist1D(predictive(2, 12))
plot_dist1D(predictive(3, 12))
plot_dist1D(predictive(6, 12))
The predictive distribution variance grows as $x$ increases, as expected from the $\mathbf{x}^T\mathbf{S}_N\mathbf{x}$ term: the further a query point lies from the observed data, the less certain the model is about its prediction.
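To visualize this, here is a sketch that plots the predictive mean together with a ±2 standard deviation band over a range of inputs (the input grid and the band width are illustrative choices, not part of the original notebook):

#Predictive mean and +/- 2 std band over a range of inputs
xs = np.linspace(-1, 6, 100)
means = np.array([np.asarray(predictive(xv, N).mean).item() for xv in xs])
stds = np.array([np.sqrt(np.asarray(predictive(xv, N).cov).item()) for xv in xs])
plt.plot(x, y, 'r.', label='training data')
plt.plot(xs, means, 'b-', label='predictive mean')
plt.fill_between(xs, means - 2*stds, means + 2*stds, alpha=0.3, label=r'$\pm 2\sigma$')
plt.legend()
plt.show()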