Gaussian Discriminant Analysis

Suppose we are given a dataset $\{(x^{(i)}, y^{(i)});\, i=1,2,\ldots,m\}$ consisting of $m$ independent examples, where the $x^{(i)} \in \mathbb{R}^n$ are $n$-dimensional vectors and $y^{(i)} \in \{-1, 1\}$. We will model the joint distribution of $(x, y)$ according to:

$p(y=1) = \varphi$
$p(y=-1) = 1-\varphi$
$p(x|y=-1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_{-1})^T \Sigma^{-1} (x-\mu_{-1})\right)$
$p(x|y=1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_{1})^T \Sigma^{-1} (x-\mu_{1})\right)$

Here, the parameters of our model are $\varphi$, $\Sigma$, $\mu_1$, and $\mu_{-1}$.

(Note that while there are two different mean vectors $\mu_1$ and $\mu_{-1}$, there is only one covariance matrix $\Sigma$.)
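To make the generative model concrete, here is a minimal numpy sketch that samples a dataset from it (the dimension, parameter values, and variable names are illustrative assumptions, not part of the problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for the sketch, not given by the problem)
phi = 0.6                        # p(y = 1)
mu_pos = np.array([1.0, 1.0])    # mu_1, the mean for class y = 1
mu_neg = np.array([-1.0, -1.0])  # mu_{-1}, the mean for class y = -1
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])   # the single shared covariance matrix

m = 500
y = np.where(rng.random(m) < phi, 1, -1)           # y = 1 with probability phi
means = np.where(y[:, None] == 1, mu_pos, mu_neg)  # pick mu_y for each example
x = means + rng.multivariate_normal(np.zeros(2), Sigma, size=m)  # x|y ~ N(mu_y, Sigma)
```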

(a)

Suppose we have already fit $\varphi$, $\Sigma$, $\mu_1$, and $\mu_{-1}$, and now we want to make a prediction at some new query point $x$.

Show that the posterior distribution of the label at xx takes the form of a logistic function, and can be written as

$p(y|x;\varphi,\Sigma,\mu_1,\mu_{-1}) = \frac{1}{1+\exp(-y(\theta^T x + \theta_0))}$

To keep the equations simple, define

$\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} = C$
$-\frac{1}{2}(x-\mu_{-1})^T \Sigma^{-1} (x-\mu_{-1}) = A$
$-\frac{1}{2}(x-\mu_{1})^T \Sigma^{-1} (x-\mu_{1}) = B$

Then we have

$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$
$= \frac{p(x|y)\,p(y)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)}$
For $y=1$ we have:
$p(y=1|x) = \frac{p(x|y=1)\,p(y=1)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)}$
$= \frac{C\exp(B)\,\varphi}{C\exp(B)\,\varphi + C\exp(A)(1-\varphi)}$
$= \frac{1}{1+\frac{\exp(A)(1-\varphi)}{\exp(B)\,\varphi}}$
$= \frac{1}{1+\exp(A-B)\exp\!\left(\ln\frac{1-\varphi}{\varphi}\right)}$
$= \frac{1}{1+\exp\!\left(-(B-A+\ln\varphi-\ln(1-\varphi))\right)}$
And for $y=-1$ we have:
$p(y=-1|x) = \frac{p(x|y=-1)\,p(y=-1)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)}$
$= \frac{C\exp(A)(1-\varphi)}{C\exp(A)(1-\varphi) + C\exp(B)\,\varphi}$
$= \frac{1}{1+\exp(B-A+\ln\varphi-\ln(1-\varphi))}$

So for both $y=1$ and $y=-1$ we have

$p(y|x) = \frac{1}{1+\exp(-y(\theta^T x + \theta_0))}$

where, expanding $B-A$ (the quadratic terms in $x$ cancel because both classes share the same $\Sigma$),

$B-A = (\mu_1-\mu_{-1})^T\Sigma^{-1}x - \frac{1}{2}\left(\mu_1^T\Sigma^{-1}\mu_1 - \mu_{-1}^T\Sigma^{-1}\mu_{-1}\right)$

so we can take

$\theta = \Sigma^{-1}(\mu_1-\mu_{-1})$
$\theta_0 = -\frac{1}{2}\left(\mu_1^T\Sigma^{-1}\mu_1 - \mu_{-1}^T\Sigma^{-1}\mu_{-1}\right) + \ln\varphi - \ln(1-\varphi)$

Notice that $\theta^T x$ is a linear function of $x$, while $\theta_0$ collects the constant terms, including the prior term $\ln\varphi - \ln(1-\varphi)$ fit from the training set.
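As a sanity check on this identification of $\theta$ and $\theta_0$, a short sketch (reusing the illustrative parameters and samples from above, plus scipy for the Gaussian density) can confirm numerically that the Bayes posterior equals the sigmoid form:

```python
from scipy.stats import multivariate_normal

# Posterior p(y=1|x) via Bayes' rule, directly from the two Gaussian densities
px_pos = multivariate_normal.pdf(x, mean=mu_pos, cov=Sigma)
px_neg = multivariate_normal.pdf(x, mean=mu_neg, cov=Sigma)
post_bayes = px_pos * phi / (px_pos * phi + px_neg * (1 - phi))

# The same posterior via the logistic form, with theta, theta_0 as derived above
Sigma_inv = np.linalg.inv(Sigma)
theta = Sigma_inv @ (mu_pos - mu_neg)
theta0 = (-0.5 * (mu_pos @ Sigma_inv @ mu_pos - mu_neg @ Sigma_inv @ mu_neg)
          + np.log(phi) - np.log(1 - phi))
post_sigmoid = 1.0 / (1.0 + np.exp(-(x @ theta + theta0)))

assert np.allclose(post_bayes, post_sigmoid)  # the two expressions agree
```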

Exponential family and sigmoid function

From the derivation, we can build the intuition that the sigmoid function is actually a ratio of probabilities. If we have an exponential family distribution $p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))$, then dividing both the numerator and the denominator by the numerator (which is an $\exp(\cdot)$ term) yields a hypothesis in the form of a sigmoid function.
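As a concrete instance, writing the Bernoulli distribution in exponential family form recovers the sigmoid directly:

```latex
% Bernoulli(\varphi) in exponential family form:
p(y;\varphi) = \varphi^{y}(1-\varphi)^{1-y}
             = \exp\!\left( y\ln\frac{\varphi}{1-\varphi} + \ln(1-\varphi) \right)
% so the natural parameter is \eta = \ln\frac{\varphi}{1-\varphi},
% and inverting it gives the sigmoid:
\varphi = \frac{1}{1+e^{-\eta}}
```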

(b)

For this part of the problem only, you may assume $n$ (the dimension of $x$) is 1, so $\Sigma = [\sigma^2]$ is just a real number, and likewise the determinant of $\Sigma$ is given by $|\Sigma| = \sigma^2$. Given the dataset, we claim the maximum likelihood estimates of the parameters are given by

$\varphi = \frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\}$
$\mu_{-1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=-1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=-1\}}$
$\mu_{1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}$
$\Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$

The log-likelihood of the data is

$\ell(\varphi,\Sigma,\mu_1,\mu_{-1}) = \log\prod_{i=1}^m p(x^{(i)},y^{(i)};\varphi,\Sigma,\mu_1,\mu_{-1})$
$= \log\prod_{i=1}^m p(x^{(i)}|y^{(i)};\Sigma,\mu_1,\mu_{-1})\,p(y^{(i)};\varphi)$

By maximizing $\ell$ with respect to the four parameters, prove that the maximum likelihood estimates of $\varphi$, $\Sigma$, $\mu_1$, and $\mu_{-1}$ are indeed as given in the formulas above.

(You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of $\mu_1$ and $\mu_{-1}$ are non-zero.)

Since $n=1$, we have

$p(x|y=-1) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\!\left(-\frac{(x-\mu_{-1})^2}{2\sigma^2}\right) = C\exp(A)$
$p(x|y=1) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\!\left(-\frac{(x-\mu_{1})^2}{2\sigma^2}\right) = C\exp(B)$

We split the $m$ samples into $m_1$ positive samples and $m_{-1}$ negative samples (so $m_1 + m_{-1} = m$), and reindex the sums over each class accordingly.

And the log-likelihood can be written as

$\ell(\varphi,\Sigma,\mu_1,\mu_{-1}) = \log\prod_{i=1}^{m_1} p(x^{(i)}|y^{(i)}=1;\Sigma,\mu_1,\mu_{-1})\,p(y^{(i)}=1;\varphi)\prod_{i=1}^{m_{-1}} p(x^{(i)}|y^{(i)}=-1;\Sigma,\mu_1,\mu_{-1})\,p(y^{(i)}=-1;\varphi)$
$= \log\prod_{i=1}^{m_1} C\exp(B)\,\varphi \prod_{i=1}^{m_{-1}} C\exp(A)(1-\varphi)$
$= \sum_{i=1}^{m_1}\log(C\exp(B)\,\varphi) + \sum_{i=1}^{m_{-1}}\log(C\exp(A)(1-\varphi))$
$= \sum_{i=1}^{m}\log C + \sum_{i=1}^{m_1}B + \sum_{i=1}^{m_{-1}}A + \sum_{i=1}^{m_1}\log\varphi + \sum_{i=1}^{m_{-1}}\log(1-\varphi)$

To maximize $\ell(\varphi,\Sigma,\mu_1,\mu_{-1})$, we set each partial derivative to 0.

$\frac{\partial \ell}{\partial \varphi}$

$\frac{\partial \ell}{\partial \varphi} = \sum_{i=1}^{m_1}\frac{1}{\varphi} + \sum_{i=1}^{m_{-1}}\frac{-1}{1-\varphi}$
$= \frac{m_1}{\varphi} - \frac{m_{-1}}{1-\varphi} := 0$
$m_1(1-\varphi) - m_{-1}\varphi = 0$
$m_1 - m\varphi = 0 \quad (\text{using } m_1 + m_{-1} = m)$
$\varphi = \frac{m_1}{m} = \frac{1}{m}\sum_{i=1}^m 1\{y^{(i)}=1\}$

$\frac{\partial \ell}{\partial \Sigma}$

$\frac{\partial \ell}{\partial \Sigma} = \frac{\partial \ell}{\partial \sigma^2}$
$= \sum_{i=1}^m \frac{\partial \log C}{\partial \sigma^2} + \sum_{i=1}^{m_{-1}}\frac{\partial A}{\partial \sigma^2} + \sum_{i=1}^{m_1}\frac{\partial B}{\partial \sigma^2}$
$= \sum_{i=1}^m -\frac{1}{2}\cdot\frac{2\pi}{2\pi\sigma^2} + \sum_{i=1}^{m_{-1}} -\frac{(x^{(i)}-\mu_{-1})^2}{2}\cdot(-1)(\sigma^2)^{-2} + \sum_{i=1}^{m_1} -\frac{(x^{(i)}-\mu_1)^2}{2}\cdot(-1)(\sigma^2)^{-2}$
$= -\frac{m}{2\sigma^2} + \sum_{i=1}^{m_{-1}}\frac{(x^{(i)}-\mu_{-1})^2}{2\sigma^4} + \sum_{i=1}^{m_1}\frac{(x^{(i)}-\mu_1)^2}{2\sigma^4} := 0$
$m\sigma^2 = \sum_{i=1}^{m_{-1}}(x^{(i)}-\mu_{-1})^2 + \sum_{i=1}^{m_1}(x^{(i)}-\mu_1)^2$
$\sigma^2 = \frac{1}{m}\sum_{i=1}^m (x^{(i)}-\mu_{y^{(i)}})^2$

$\frac{\partial \ell}{\partial \mu_1}$

$\frac{\partial \ell}{\partial \mu_1} = \sum_{i=1}^{m_1}\frac{\partial B}{\partial \mu_1}$
$= \sum_{i=1}^{m_1} -\frac{1}{2\sigma^2}\cdot 2(x^{(i)}-\mu_1)\cdot(-1)$
$= \sum_{i=1}^{m_1}\frac{1}{\sigma^2}(x^{(i)}-\mu_1) := 0$
$\sum_{i=1}^{m_1}(x^{(i)}-\mu_1) = 0$
$\sum_{i=1}^{m_1}x^{(i)} = m_1\mu_1$
$\mu_1 = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}$

$\frac{\partial \ell}{\partial \mu_{-1}}$

By the same derivation as for $\mu_1$,

$\mu_{-1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=-1\}\,x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=-1\}}$
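To close the loop, here is a minimal numpy sketch of these $n=1$ maximum likelihood estimates (function and variable names are illustrative); on data simulated from the model itself, the estimates should approach the true parameters as $m$ grows:

```python
import numpy as np

def fit_gda_1d(x, y):
    """Closed-form MLE for 1-D GDA with labels y in {-1, 1}."""
    pos, neg = (y == 1), (y == -1)
    phi = pos.mean()                               # fraction of positive examples
    mu_pos, mu_neg = x[pos].mean(), x[neg].mean()  # class-conditional means
    mu = np.where(pos, mu_pos, mu_neg)             # mu_{y^{(i)}} for each example
    sigma2 = ((x - mu) ** 2).mean()                # shared (pooled) variance
    return phi, mu_pos, mu_neg, sigma2

# Example: data simulated from the generative model itself
rng = np.random.default_rng(1)
y = np.where(rng.random(10000) < 0.3, 1, -1)
x = np.where(y == 1, 2.0, -1.0) + rng.normal(0.0, 1.5, size=10000)
print(fit_gda_1d(x, y))  # approximately (0.3, 2.0, -1.0, 2.25)
```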