Locally Weighted Linear Regression

Recap

LR: Fit $\theta$ to minimize $\sum_{i=1}^m(y^{(i)}-\theta^Tx^{(i)})^2$.

LWLR: To evaluate $h_\theta(x)$ at a certain $x$, fit $\theta$ to minimize $\sum_{i=1}^m w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2$, where $w^{(i)} = \exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^2}\right)$.

$\tau$ is the bandwidth parameter; it controls how fast the weight changes with the distance $|x^{(i)}-x|$. For a fixed $\tau$, the weight is close to 1 when $|x^{(i)}-x|$ is small, and close to 0 when $|x^{(i)}-x|$ is large.

When all $w^{(i)}$'s are equal to 1, every example has the same weight, and that is exactly basic linear regression.
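A minimal numeric sketch of how the weight behaves (the function name and the sample distances are just for illustration):

import numpy as np

def lwlr_weight(x_i, x, tau):
    # Gaussian weight of training point x_i with respect to query point x.
    return np.exp(-(x_i - x) ** 2 / (2 * tau ** 2))

# With tau fixed (here 5), nearby points get weight close to 1, far points close to 0.
for dist in [0, 1, 5, 20, 100]:
    print(dist, lwlr_weight(0.0 + dist, 0.0, tau=5))
# prints roughly: 0 -> 1.0, 1 -> 0.98, 5 -> 0.61, 20 -> 0.0003, 100 -> ~0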

(a)

Consider a linear regression problem in which we want to "weight" different training examples differently. Specifically, suppose we want to minimize

$$J(\theta) = \frac{1}{2} \sum_{i=1}^m w^{(i)}(\theta^Tx^{(i)}-y^{(i)})^2$$
(i)

Show that $J(\theta)$ can also be written

$$J(\theta) = (X\theta-\vec y)^TW(X\theta-\vec y)$$

for an appropriate diagonal matrix $W$, and state clearly what $W$ is.

$$J(\theta) = (X\theta-\vec y)^TW(X\theta-\vec y) = \sum_{i,j}^m (X\theta-\vec y)_i W_{ij} (X\theta-\vec y)_j$$

Since $W$ is a diagonal matrix,

$$J(\theta) = \sum_{i,j}^m (X\theta-\vec y)_i W_{ij} (X\theta-\vec y)_j = \sum_i^m (X\theta-\vec y)_i^2 W_{ii}$$

So $W$ is the diagonal matrix with entries

$$W_{ii} = \frac{1}{2}w^{(i)} = \frac{1}{2} \exp\left( -\frac{(x^{(i)}-x)^2}{2\tau^2} \right)$$
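A quick numerical check of this identity, on random data with hypothetical shapes (not the quasar dataset):

import numpy as np

m, n = 5, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
theta = rng.normal(size=n)
w = rng.uniform(size=m)               # the w^(i)'s

W = np.diag(0.5 * w)                  # W_ii = w^(i) / 2
r = X @ theta - y                     # residual vector X*theta - y

J_sum = 0.5 * np.sum(w * r ** 2)      # (1/2) sum_i w^(i) (theta^T x^(i) - y^(i))^2
J_mat = r @ W @ r                     # (X*theta - y)^T W (X*theta - y)
assert np.isclose(J_sum, J_mat)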
(ii)

If all the $w^{(i)}$'s equal 1, then the normal equation is

$$X^TX\theta = X^T \vec y$$

and the value of $\theta$ that minimizes $J(\theta)$ is given by $(X^TX)^{-1}X^T\vec y$. By finding the derivative $\nabla_\theta J(\theta)$ and setting it to zero, generalize the normal equation to this weighted setting, and give the new value of $\theta$ that minimizes $J(\theta)$ in closed form as a function of $X$, $W$ and $\vec y$.

Linear algebra lemma: for a symmetric matrix $A$, $\nabla_x\, x^TAx = 2x^TA$ (writing the gradient as a row vector).

$$J(\theta) = (X\theta-\vec y)^TW(X\theta-\vec y)$$

By the chain rule, applying the lemma with $x = X\theta-\vec y$ and $A = W$:

$$\nabla_\theta J(\theta) = \nabla_{X\theta-\vec y} J(\theta) \cdot \nabla_\theta(X\theta-\vec y) = 2(X\theta-\vec y)^TW \cdot X = \left[2W^T(X\theta-\vec y)\right]^T X = \left[2W(X\theta-\vec y)\right]^T X$$

where the last step uses that $W$ is diagonal, hence symmetric. Setting the gradient to zero:

$$\left[X^T \cdot 2W(X\theta-\vec y)\right]^T = 0$$
$$X^T W(X\theta-\vec y) = 0$$
$$X^TWX\theta - X^TW\vec y = 0$$
$$\theta = (X^TWX)^{-1}X^TW\vec y$$

When all the $w^{(i)}$'s equal 1 (i.e. $W = I$), this reduces to

$$\theta = (X^TX)^{-1}X^T\vec y$$

which is exactly the $\theta$ from ordinary (unweighted) linear regression.
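A small sanity check of the closed form on random, hypothetical data: the gradient should vanish at the solution, and with $W = I$ it should match the unweighted normal-equation solution.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
W = np.diag(rng.uniform(0.1, 1.0, size=20))

theta_w = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y

# The gradient X^T W (X theta - y) should be ~0 at the minimizer.
assert np.allclose(X.T @ W @ (X @ theta_w - y), 0)

# With W = I we recover the ordinary normal-equation solution.
theta_ols = np.linalg.inv(X.T @ X) @ X.T @ y
theta_eye = np.linalg.inv(X.T @ np.eye(20) @ X) @ X.T @ np.eye(20) @ y
assert np.allclose(theta_ols, theta_eye)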

(iii)

Suppose we have a training set $\{(x^{(i)},y^{(i)});\ i=1,2,...,m\}$ of $m$ independent examples, but in which the $y^{(i)}$'s were observed with differing variances. Specifically, suppose that

$$p(y^{(i)}|x^{(i)};\theta) = \frac{1}{(2\pi)^{\frac{1}{2}}\sigma^{(i)}} \exp\left( -\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2(\sigma^{(i)})^2} \right)$$

I.e., $y^{(i)}$ has mean $\theta^Tx^{(i)}$ and variance $(\sigma^{(i)})^2$ (where the $\sigma^{(i)}$'s are fixed constants). Show that finding the maximum likelihood estimate of $\theta$ reduces to solving a weighted linear regression problem. State clearly what the $w^{(i)}$'s are in terms of the $\sigma^{(i)}$'s.

$$\ell(\theta) = \log \prod_{i=1}^m p(y^{(i)}|x^{(i)};\theta) = \sum_{i=1}^m \log\left( \frac{1}{(2\pi)^{\frac{1}{2}}\sigma^{(i)}} \right) - \sum_{i=1}^m \frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2(\sigma^{(i)})^2}$$

Since the $\sigma^{(i)}$'s are fixed, maximizing $\ell(\theta)$ is equivalent to minimizing

$$\sum_{i=1}^m \frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2(\sigma^{(i)})^2} = \frac{1}{2}\sum_{i=1}^m \frac{1}{(\sigma^{(i)})^2}(y^{(i)}-\theta^Tx^{(i)})^2$$

which is exactly the weighted least-squares objective $J(\theta)$ from part (i), with $w^{(i)} = \frac{1}{(\sigma^{(i)})^2}$ (the extra $\frac{1}{2}$ is the factor already present in $J(\theta)$).

So fitting under a per-example Gaussian noise model is essentially the same as fitting a weighted linear regression?
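A small sketch of that connection on synthetic data (the sigmas, sample size and seed are hypothetical): generate $y^{(i)}$ with per-example noise $\sigma^{(i)}$, then fit with weights $w^{(i)} = 1/(\sigma^{(i)})^2$.

import numpy as np

rng = np.random.default_rng(2)
m = 200
x = rng.uniform(-5, 5, size=m)
X = np.stack([x, np.ones(m)]).T            # feature column first, then intercept
theta_true = np.array([2.0, -1.0])

sigma = rng.uniform(0.1, 3.0, size=m)      # known, differing noise levels
y = X @ theta_true + sigma * rng.normal(size=m)

W = np.diag(1.0 / sigma ** 2)              # w^(i) = 1 / (sigma^(i))^2
theta_mle = np.linalg.inv(X.T @ W @ X) @ X.T @ W @ y
theta_ols = np.linalg.inv(X.T @ X) @ X.T @ y
# theta_mle is typically closer to theta_true than the unweighted fit.
print(theta_true, theta_mle, theta_ols)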

(b)

The training dataset is provided here and the testing data is provided here.

(i)

Use the normal equations to implement (unweighted) linear regression on the first training example. On one figure, plot both the raw data and the straight line resulting from your fit. State the optimal $\theta$ resulting from linear regression.

First, let's plot the raw data.

import pandas as pd
import matplotlib.pyplot as plt

# Each column of the CSV is a wavelength; each row is one spectrum (training example).
quasar_train = pd.read_csv("quasar_train.csv")
wave_lengthes = quasar_train.columns.values.astype(float).astype(int)

# Raw flux values of the first training example against wavelength.
plt.plot(wave_lengthes, quasar_train.values[0])
plt.show()

Now let's do the math.

First, a function that computes $\theta$ from the normal equation, optionally with a weight matrix: for a single query point $x$, each training example gets its own weight depending on its distance from $x$.

And again, here's the $\theta$ formula.

$$\theta = (X^TWX)^{-1}X^TW \vec y$$

When the $w^{(i)}$'s all equal 1, it is the same as ordinary linear regression.

import numpy as np

def getWeightedTheta(X, Y_vector, Weight_mat = None):
    # Normal equation: theta = (X^T X)^-1 X^T y, or the weighted version
    # theta = (X^T W X)^-1 X^T W y when a weight matrix is given.
    if Weight_mat is None:
        return np.linalg.inv(X.T.dot(X)).dot(X.T).dot(Y_vector)
    else:
        return np.linalg.inv(X.T.dot(Weight_mat).dot(X)).dot(X.T).dot(Weight_mat).dot(Y_vector)


# Design matrix: first column is the wavelength, second column is the intercept term.
X1 = np.ones(shape=[len(wave_lengthes)])
X = np.stack([wave_lengthes, X1]).T

theta = getWeightedTheta(X, quasar_train.values[0], None)

plt.plot(wave_lengthes, quasar_train.values[0])
plt.plot(wave_lengthes, X.dot(theta), 'r')
plt.show()

Looks like this
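The problem also asks to state the optimal $\theta$; the exact numbers depend on the dataset, so here we just print whatever the fit returned:

# theta[0] multiplies the wavelength and theta[1] is the intercept,
# given the column order [wave_lengthes, ones] used above.
print("optimal theta:", theta)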

(ii)

Implement locally weighted linear regression on the first training example. Use the normal equations you derived. When evaluating $h(\cdot)$ at a query point $x$, use weights

$$w^{(i)} = \exp\left( -\frac{(x-x^{(i)})^2}{2\tau^2} \right)$$

with bandwidth parameter $\tau=5$.

def getLocalWeights(x, X_range, Tau):
    # Diagonal matrix of Gaussian weights of every training point
    # with respect to the query point x.
    return np.diag(
        np.exp(
            (-(x - X_range) ** 2) / (2 * Tau ** 2)
        )
    )

def getWeightedRegression(X_range, Y_vector, tau):
    # Evaluate the locally weighted fit at every training point:
    # for each query point, solve a separate weighted normal equation.
    X1 = np.ones(shape=[len(X_range)])
    X = np.stack([X_range, X1]).T
    Weighted_Y = []

    for x in X:
        localWeights = getLocalWeights(x[0], X_range, tau)
        localTheta = getWeightedTheta(X, Y_vector, Weight_mat=localWeights)
        Weighted_Y.append(x.dot(localTheta))

    return Weighted_Y
    
first_example = quasar_train.values[0]
weighted_wave = getWeightedRegression(wave_lengthes, first_example,5)

plt.plot(wave_lengthes, first_example,'b')
plt.plot(wave_lengthes, weighted_wave,'r')
plt.show()

For each query point $x$ we perform a separate weighted linear regression; note that when computing the weighted $\theta$, we still use the full design matrix $X$ (uppercase), only the weights change per query point.

And the weighted regression looks like this:

For the sake of layout, we move

def getLocalWeights(x, X_range, Tau):
def getWeightedTheta(X, Y_vector, Weight_mat = None):
def getWeightedRegression(X_range, Y_vector, tau):

into functions.py
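As a side note, explicitly inverting $X^TWX$ with np.linalg.inv can be numerically less stable than solving the corresponding linear system. A possible variant of getWeightedTheta (just a sketch, not part of the original functions.py):

import numpy as np

def getWeightedThetaSolve(X, Y_vector, Weight_mat=None):
    # Solve (X^T W X) theta = X^T W y directly instead of forming the inverse.
    if Weight_mat is None:
        return np.linalg.solve(X.T.dot(X), X.T.dot(Y_vector))
    return np.linalg.solve(X.T.dot(Weight_mat).dot(X),
                           X.T.dot(Weight_mat).dot(Y_vector))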

(iii)

Repeat (b)(ii) four more times with $\tau = 1, 10, 100$ and $1000$, and plot the resulting curves. In 2-3 sentences, comment on what happens to the locally weighted linear regression line as $\tau$ varies.

import functions as fn
import pandas as pd
import matplotlib.pyplot as plt

quasar_train = pd.read_csv('quasar_train.csv')
first_example = quasar_train.values[0]
wave_lengthes = quasar_train.columns.values.astype(float).astype(int)

# One subplot per bandwidth value.
fig, axes = plt.subplots(2, 2)
axes = axes.ravel()
for k, tau in enumerate([1, 10, 100, 1000]):
    ax = axes[k]
    weightedWave = fn.getWeightedRegression(wave_lengthes, first_example, tau)
    ax.plot(wave_lengthes, first_example, 'r', label='raw data')
    ax.plot(wave_lengthes, weightedWave, 'b', label='LWLR fit')
    ax.set_title('Tau={}'.format(tau))
    ax.legend(loc='best')

plt.show()

And it looks like this:

As noted, $\tau$ is the bandwidth parameter; from the formula

$$w^{(i)} = \exp\left( -\frac{(x-x^{(i)})^2}{2\tau^2} \right)$$

we can see that $\tau$ controls how fast $w^{(i)}$ changes as the distance $|x-x^{(i)}|$ changes. When $\tau$ is very small, only points close to the query carry any weight, so the fitted curve follows the raw data (and its noise) closely; when $\tau$ is very large, the weights barely differ across the dataset, so the line becomes smoother and approaches the ordinary linear fit.
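To make that concrete, here is a small sketch of the weight assigned to a point 50 wavelength units away from the query, for each of the four $\tau$ values (the distance of 50 is an arbitrary example):

import numpy as np

dist = 50.0                                  # arbitrary example distance
for tau in [1, 10, 100, 1000]:
    w = np.exp(-dist ** 2 / (2 * tau ** 2))
    print(tau, w)
# tau=1 and tau=10 give essentially 0, tau=100 gives about 0.88, tau=1000 about 1.0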