Gaussian Discriminant Analysis

Suppose we are given a dataset $\{(x^{(i)}, y^{(i)});\, i=1,2,\ldots,m\}$ consisting of $m$ independent examples, where the $x^{(i)} \in \mathbb{R}^n$ are $n$-dimensional vectors and $y^{(i)} \in \{-1, 1\}$. We will model the joint distribution of $(x, y)$ according to:

$p(y=1) = \varphi$
$p(y=-1) = 1-\varphi$
$p(x|y=-1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_{-1})^T \Sigma^{-1} (x-\mu_{-1})\right)$
$p(x|y=1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_{1})^T \Sigma^{-1} (x-\mu_{1})\right)$

Here, the parameters of our model are $\varphi$, $\Sigma$, $\mu_1$, and $\mu_{-1}$.

(Note that while there are two different mean vectors $\mu_1$ and $\mu_{-1}$, there is only one covariance matrix $\Sigma$.)
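To make the generative model concrete, here is a minimal numpy sketch that samples a dataset from it (the dimension, parameter values, and variable names are illustrative assumptions, not part of the problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for the sketch, not given by the problem)
phi = 0.6                        # p(y = 1)
mu_pos = np.array([1.0, 1.0])    # mu_1, the mean for class y = 1
mu_neg = np.array([-1.0, -1.0])  # mu_{-1}, the mean for class y = -1
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])   # the single shared covariance matrix

m = 500
y = np.where(rng.random(m) < phi, 1, -1)           # y = 1 with probability phi
means = np.where(y[:, None] == 1, mu_pos, mu_neg)  # pick mu_y for each example
x = means + rng.multivariate_normal(np.zeros(2), Sigma, size=m)  # x|y ~ N(mu_y, Sigma)
```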

(a)

Suppose we have already fit $\varphi$, $\Sigma$, $\mu_1$, and $\mu_{-1}$, and now we want to make a prediction at some new query point $x$.

Show that the posterior distribution of the label at xx takes the form of a logistic function, and can be written as

$p(y|x;\varphi,\Sigma,\mu_1,\mu_{-1}) = \frac{1}{1+\exp(-y(\theta^T x + \theta_0))}$

To keep the equations simple, define

$\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} = C$
$-\frac{1}{2}(x-\mu_{-1})^T \Sigma^{-1} (x-\mu_{-1}) = A$
$-\frac{1}{2}(x-\mu_{1})^T \Sigma^{-1} (x-\mu_{1}) = B$

Then we have

$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$
$= \frac{p(x|y)\,p(y)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)}$
For $y=1$ we have:
$p(y=1|x) = \frac{p(x|y=1)\,p(y=1)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)}$
$= \frac{C\exp(B)\,\varphi}{C\exp(B)\,\varphi + C\exp(A)(1-\varphi)}$
$= \frac{1}{1+\frac{\exp(A)(1-\varphi)}{\exp(B)\,\varphi}}$
$= \frac{1}{1+\exp(A-B)\exp\!\left(\ln\frac{1-\varphi}{\varphi}\right)}$
$= \frac{1}{1+\exp\!\left(-(B-A+\ln\varphi-\ln(1-\varphi))\right)}$
And for $y=-1$ we have:
$p(y=-1|x) = \frac{p(x|y=-1)\,p(y=-1)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)}$
$= \frac{C\exp(A)(1-\varphi)}{C\exp(A)(1-\varphi) + C\exp(B)\,\varphi}$
$= \frac{1}{1+\exp(B-A+\ln\varphi-\ln(1-\varphi))}$

So for both $y=1$ and $y=-1$ we have

$p(y|x) = \frac{1}{1+\exp(-y(\theta^T x + \theta_0))}$

where, expanding $B-A$ (the quadratic terms in $x$ cancel because both classes share the same $\Sigma$),

$B-A = (\mu_1-\mu_{-1})^T\Sigma^{-1}x - \frac{1}{2}\left(\mu_1^T\Sigma^{-1}\mu_1 - \mu_{-1}^T\Sigma^{-1}\mu_{-1}\right)$

so we can take

$\theta = \Sigma^{-1}(\mu_1-\mu_{-1})$
$\theta_0 = -\frac{1}{2}\left(\mu_1^T\Sigma^{-1}\mu_1 - \mu_{-1}^T\Sigma^{-1}\mu_{-1}\right) + \ln\varphi - \ln(1-\varphi)$

Notice that $\theta^T x$ is a linear function of $x$, while $\theta_0$ collects the constant terms, including the prior term $\ln\varphi - \ln(1-\varphi)$ fit from the training set.
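As a sanity check on this identification of $\theta$ and $\theta_0$, a short sketch (reusing the illustrative parameters and samples from above, plus scipy for the Gaussian density) can confirm numerically that the Bayes posterior equals the sigmoid form:

```python
from scipy.stats import multivariate_normal

# Posterior p(y=1|x) via Bayes' rule, directly from the two Gaussian densities
px_pos = multivariate_normal.pdf(x, mean=mu_pos, cov=Sigma)
px_neg = multivariate_normal.pdf(x, mean=mu_neg, cov=Sigma)
post_bayes = px_pos * phi / (px_pos * phi + px_neg * (1 - phi))

# The same posterior via the logistic form, with theta, theta_0 as derived above
Sigma_inv = np.linalg.inv(Sigma)
theta = Sigma_inv @ (mu_pos - mu_neg)
theta0 = (-0.5 * (mu_pos @ Sigma_inv @ mu_pos - mu_neg @ Sigma_inv @ mu_neg)
          + np.log(phi) - np.log(1 - phi))
post_sigmoid = 1.0 / (1.0 + np.exp(-(x @ theta + theta0)))

assert np.allclose(post_bayes, post_sigmoid)  # the two expressions agree
```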

Exponential family and sigmoid function

From the derivation, we can build the intuition that the sigmoid function is actually a ratio of probabilities. If we have an exponential family distribution $p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))$, then dividing both the numerator and the denominator by the numerator (which is an $\exp(\cdot)$ term) yields a hypothesis in the form of a sigmoid function.
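As a concrete instance, writing the Bernoulli distribution in exponential family form recovers the sigmoid directly:

```latex
% Bernoulli(\varphi) in exponential family form:
p(y;\varphi) = \varphi^{y}(1-\varphi)^{1-y}
             = \exp\!\left( y\ln\frac{\varphi}{1-\varphi} + \ln(1-\varphi) \right)
% so the natural parameter is \eta = \ln\frac{\varphi}{1-\varphi},
% and inverting it gives the sigmoid:
\varphi = \frac{1}{1+e^{-\eta}}
```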

(b)

For this part of the problem only, you may assume $n$ (the dimension of $x$) is 1, so $\Sigma = [\sigma^2]$ is just a real number, and likewise the determinant of $\Sigma$ is given by $|\Sigma| = \sigma^2$. Given the dataset, we claim the maximum likelihood estimates of the parameters are given by

$\varphi = \frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\}$
$\mu_{-1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=-1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=-1\}}$
$\mu_{1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}$
$\Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$

The log-likelihood of the data is

$\ell(\varphi,\Sigma,\mu_1,\mu_{-1}) = \log\prod_{i=1}^m p(x^{(i)},y^{(i)};\varphi,\Sigma,\mu_1,\mu_{-1})$
$= \log\prod_{i=1}^m p(x^{(i)}|y^{(i)};\Sigma,\mu_1,\mu_{-1})\,p(y^{(i)};\varphi)$

By maximizing $\ell$ with respect to the four parameters, prove that the maximum likelihood estimates of $\varphi$, $\Sigma$, $\mu_1$, and $\mu_{-1}$ are indeed as given in the formulas above.

(You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of $\mu_1$ and $\mu_{-1}$ are non-zero.)

Since $n=1$, we have

$p(x|y=-1) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\!\left(-\frac{(x-\mu_{-1})^2}{2\sigma^2}\right) = C\exp(A)$
$p(x|y=1) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\!\left(-\frac{(x-\mu_{1})^2}{2\sigma^2}\right) = C\exp(B)$

We split the $m$ samples into $m_1$ positive samples and $m_{-1}$ negative samples (so $m_1 + m_{-1} = m$), and reindex the sums over each class accordingly.

And the log-likelihood can be written as

$\ell(\varphi,\Sigma,\mu_1,\mu_{-1}) = \log\prod_{i=1}^{m_1} p(x^{(i)}|y^{(i)}=1;\Sigma,\mu_1,\mu_{-1})\,p(y^{(i)}=1;\varphi)\prod_{i=1}^{m_{-1}} p(x^{(i)}|y^{(i)}=-1;\Sigma,\mu_1,\mu_{-1})\,p(y^{(i)}=-1;\varphi)$
$= \log\prod_{i=1}^{m_1} C\exp(B)\,\varphi \prod_{i=1}^{m_{-1}} C\exp(A)(1-\varphi)$
$= \sum_{i=1}^{m_1}\log(C\exp(B)\,\varphi) + \sum_{i=1}^{m_{-1}}\log(C\exp(A)(1-\varphi))$
$= \sum_{i=1}^{m}\log C + \sum_{i=1}^{m_1}B + \sum_{i=1}^{m_{-1}}A + \sum_{i=1}^{m_1}\log\varphi + \sum_{i=1}^{m_{-1}}\log(1-\varphi)$

To maximize $\ell(\varphi,\Sigma,\mu_1,\mu_{-1})$, we set each partial derivative to 0.

$\frac{\partial \ell}{\partial \varphi}$

$\frac{\partial \ell}{\partial \varphi} = \sum_{i=1}^{m_1}\frac{1}{\varphi} + \sum_{i=1}^{m_{-1}}\frac{-1}{1-\varphi}$
$= \frac{m_1}{\varphi} - \frac{m_{-1}}{1-\varphi} := 0$
$m_1(1-\varphi) - m_{-1}\varphi = 0$
$m_1 - m\varphi = 0 \quad (\text{using } m_1 + m_{-1} = m)$
$\varphi = \frac{m_1}{m} = \frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\}$

$\frac{\partial \ell}{\partial \Sigma}$

$\frac{\partial \ell}{\partial \Sigma} = \frac{\partial \ell}{\partial \sigma^2}$
$= \sum_{i=1}^m \frac{\partial \log C}{\partial \sigma^2} + \sum_{i=1}^{m_{-1}}\frac{\partial A}{\partial \sigma^2} + \sum_{i=1}^{m_1}\frac{\partial B}{\partial \sigma^2}$
$= \sum_{i=1}^m -\frac{1}{2}\cdot\frac{2\pi}{2\pi\sigma^2} + \sum_{i=1}^{m_{-1}} -\frac{(x^{(i)}-\mu_{-1})^2}{2}\cdot(-1)(\sigma^2)^{-2} + \sum_{i=1}^{m_1} -\frac{(x^{(i)}-\mu_1)^2}{2}\cdot(-1)(\sigma^2)^{-2}$
$= -\frac{m}{2\sigma^2} + \sum_{i=1}^{m_{-1}}\frac{(x^{(i)}-\mu_{-1})^2}{2\sigma^4} + \sum_{i=1}^{m_1}\frac{(x^{(i)}-\mu_1)^2}{2\sigma^4} := 0$
$m\sigma^2 = \sum_{i=1}^{m_{-1}}(x^{(i)}-\mu_{-1})^2 + \sum_{i=1}^{m_1}(x^{(i)}-\mu_1)^2$
$\sigma^2 = \frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu_{y^{(i)}})^2$

$\frac{\partial \ell}{\partial \mu_1}$

$\frac{\partial \ell}{\partial \mu_1} = \sum_{i=1}^{m_1}\frac{\partial B}{\partial \mu_1}$
$= \sum_{i=1}^{m_1} -\frac{1}{2\sigma^2}\cdot 2(x^{(i)}-\mu_1)\cdot(-1)$
$= \sum_{i=1}^{m_1}\frac{1}{\sigma^2}(x^{(i)}-\mu_1) := 0$
$\sum_{i=1}^{m_1}(x^{(i)}-\mu_1) = 0$
$\sum_{i=1}^{m_1}x^{(i)} = m_1\mu_1$
$\mu_1 = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}$

$\frac{\partial \ell}{\partial \mu_{-1}}$

By the same derivation as for $\mu_1$,

$\mu_{-1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=-1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=-1\}}$
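To close the loop, here is a minimal numpy sketch of these $n=1$ maximum likelihood estimates (function and variable names are illustrative); on data simulated from the model itself, the estimates should approach the true parameters as $m$ grows:

```python
import numpy as np

def fit_gda_1d(x, y):
    """Closed-form MLE for 1-D GDA with labels y in {-1, 1}."""
    pos, neg = (y == 1), (y == -1)
    phi = pos.mean()                               # fraction of positive examples
    mu_pos, mu_neg = x[pos].mean(), x[neg].mean()  # class-conditional means
    mu = np.where(pos, mu_pos, mu_neg)             # mu_{y^{(i)}} for each example
    sigma2 = ((x - mu) ** 2).mean()                # shared (pooled) variance
    return phi, mu_pos, mu_neg, sigma2

# Example: data simulated from the generative model itself
rng = np.random.default_rng(1)
y = np.where(rng.random(10000) < 0.3, 1, -1)
x = np.where(y == 1, 2.0, -1.0) + rng.normal(0.0, 1.5, size=10000)
print(fit_gda_1d(x, y))  # approximately (0.3, 2.0, -1.0, 2.25)
```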