Gaussian Discriminant Analysis
Suppose we are given a dataset $\{(x^{(i)}, y^{(i)});\, i = 1, 2, \dots, m\}$ consisting of $m$ independent examples, where the $x^{(i)} \in \mathbb{R}^n$ are $n$-dimensional vectors and $y^{(i)} \in \{-1, 1\}$. We will model the joint distribution of $(x, y)$ according to:
$$p(y=1) = φ$$

$$p(y=-1) = 1-φ$$

$$p(x|y=-1) = \frac{1}{(2π)^{n/2}\,|Σ|^{1/2}} \exp\left(-\frac{1}{2}(x-μ_{-1})^T Σ^{-1} (x-μ_{-1})\right)$$

$$p(x|y=1) = \frac{1}{(2π)^{n/2}\,|Σ|^{1/2}} \exp\left(-\frac{1}{2}(x-μ_{1})^T Σ^{-1} (x-μ_{1})\right)$$

Here, the parameters of our model are $φ$, $Σ$, $μ_{1}$, and $μ_{-1}$.
(Note that while there are two different mean vectors $μ_{1}$ and $μ_{-1}$, there is only one covariance matrix $Σ$.)
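As a quick illustration of this generative model, here is a minimal sketch (not part of the original problem; the helper names and the 2-D toy parameters are my own) of evaluating the class-conditional densities and the joint $p(x,y)$ with NumPy:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density N(x; mu, sigma) for an n-vector x."""
    n = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(sigma) ** 0.5
    # Note: the quadratic form uses the inverse covariance Sigma^{-1}.
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

def joint_density(x, y, phi, sigma, mu_pos, mu_neg):
    """p(x, y) = p(x | y) p(y) under the GDA model, with y in {-1, 1}."""
    if y == 1:
        return gaussian_density(x, mu_pos, sigma) * phi
    return gaussian_density(x, mu_neg, sigma) * (1 - phi)

# Toy parameters (assumed for illustration only).
phi = 0.6
sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
mu_pos, mu_neg = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
print(joint_density(np.array([0.5, 0.0]), 1, phi, sigma, mu_pos, mu_neg))
```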
(a)
Suppose we have already fit $φ$, $Σ$, $μ_{1}$, and $μ_{-1}$, and now we want to make a prediction at some new query point $x$. Show that the posterior distribution of the label at $x$ takes the form of a logistic function, and can be written as

$$p(y|x; φ, Σ, μ_{1}, μ_{-1}) = \frac{1}{1 + \exp(-y(θ^T x + θ_0))}$$

To keep the equations simple, define
$$\frac{1}{(2π)^{n/2}\,|Σ|^{1/2}} = C$$

$$-\frac{1}{2}(x-μ_{-1})^T Σ^{-1} (x-μ_{-1}) = A$$

$$-\frac{1}{2}(x-μ_{1})^T Σ^{-1} (x-μ_{1}) = B$$

Then we have
$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)} = \frac{p(x|y)\,p(y)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)}$$

For $y=1$ we have:
$$p(y=1|x) = \frac{p(x|y=1)\,p(y=1)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)} = \frac{C\exp(B)\,φ}{C\exp(B)\,φ + C\exp(A)(1-φ)}$$

$$= \frac{1}{1 + \frac{\exp(A)(1-φ)}{\exp(B)\,φ}} = \frac{1}{1 + \exp(A-B)\exp\left(\ln\frac{1-φ}{φ}\right)}$$

$$= \frac{1}{1 + \exp\left(-\left(B - A + \ln φ - \ln(1-φ)\right)\right)}$$

And for $y=-1$ we have:
$$p(y=-1|x) = \frac{p(x|y=-1)\,p(y=-1)}{p(x|y=1)\,p(y=1) + p(x|y=-1)\,p(y=-1)} = \frac{C\exp(A)(1-φ)}{C\exp(A)(1-φ) + C\exp(B)\,φ}$$

$$= \frac{1}{1 + \exp\left(B - A + \ln φ - \ln(1-φ)\right)}$$

So for both $y=1$ and $y=-1$ we have
$$p(y|x) = \frac{1}{1 + \exp(-y(θ^T x + θ_0))}$$

Expanding $B-A$ shows that it is an affine function of $x$ (the quadratic terms cancel because both Gaussians share the same $Σ$):

$$B - A = (μ_{1} - μ_{-1})^T Σ^{-1} x + \frac{1}{2}\left(μ_{-1}^T Σ^{-1} μ_{-1} - μ_{1}^T Σ^{-1} μ_{1}\right)$$

so we can take

$$θ = Σ^{-1}(μ_{1} - μ_{-1}), \qquad θ_0 = \frac{1}{2}\left(μ_{-1}^T Σ^{-1} μ_{-1} - μ_{1}^T Σ^{-1} μ_{1}\right) + \ln φ - \ln(1-φ)$$

Notice that $θ^T x$ is linear in $x$, while $θ_0$ collects the constant terms; both are determined by the parameters fit from the training set.
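The algebra above is easy to sanity-check numerically. The sketch below (my own illustration; the parameter values are arbitrary toy choices) compares the direct Bayes-rule posterior against the logistic form with $θ$ and $θ_0$ as derived; the two should agree to machine precision:

```python
import numpy as np

# Toy fitted parameters (assumed for illustration).
phi = 0.3
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu_pos, mu_neg = np.array([1.0, 2.0]), np.array([-1.0, 0.0])
x = np.array([0.7, -0.3])

def density(x, mu):
    """Gaussian density with the shared covariance `sigma`."""
    n = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

# Posterior p(y=1 | x) via Bayes' rule.
num = density(x, mu_pos) * phi
bayes = num / (num + density(x, mu_neg) * (1 - phi))

# Same posterior via the logistic form with theta, theta_0 as derived.
sigma_inv = np.linalg.inv(sigma)
theta = sigma_inv @ (mu_pos - mu_neg)
theta0 = 0.5 * (mu_neg @ sigma_inv @ mu_neg - mu_pos @ sigma_inv @ mu_pos) \
         + np.log(phi) - np.log(1 - phi)
logistic = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

print(bayes, logistic)  # both print the same value
```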
Exponential family and sigmoid function
From the derivation we can build the intuition that the *sigmoid* function is actually a ratio of probabilities: if we have an exponential family distribution $p(y;η) = b(y)\exp(η^T T(y) - a(η))$, then dividing both the numerator and the denominator of the posterior by the numerator, which is an $\exp(\cdot)$ term, yields a hypothesis in the form of a *sigmoid* function.
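As a concrete instance of this intuition (a standard worked example, not part of the original problem), writing the Bernoulli distribution over $y \in \{0,1\}$ in exponential family form makes the sigmoid appear directly:

$$p(y;φ) = φ^{y}(1-φ)^{1-y} = \exp\left(y\ln\frac{φ}{1-φ} + \ln(1-φ)\right)$$

so the natural parameter is $η = \ln\frac{φ}{1-φ}$, and solving for $φ$ gives

$$φ = \frac{1}{1 + e^{-η}}$$

which is exactly the sigmoid obtained by the divide-by-the-numerator trick above.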
(b)
For this part of the problem only, you may assume $n$ (the dimension of $x$) is 1, so $Σ = [\sigma^2]$ is just a real number, and likewise the determinant of $Σ$ is given by $|Σ| = \sigma^2$.
Given the dataset, we claim that the maximum likelihood estimates of the parameters are given by

$$φ = \frac{1}{m} \sum_{i=1}^m 1\{y^{(i)}=1\}$$

$$μ_{-1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=-1\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=-1\}}$$

$$μ_{1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)}=1\}}$$

$$Σ = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - μ_{y^{(i)}})(x^{(i)} - μ_{y^{(i)}})^T$$

The log-likelihood of the data is
$$l(φ,Σ,μ_{1},μ_{-1}) = \log\prod_{i=1}^m p(x^{(i)},y^{(i)};φ,Σ,μ_{1},μ_{-1}) = \log\prod_{i=1}^m p(x^{(i)}|y^{(i)};Σ,μ_{1},μ_{-1})\,p(y^{(i)};φ)$$

By maximizing $l$ with respect to the four parameters, prove that the maximum likelihood estimates of $φ$, $Σ$, $μ_{1}$, and $μ_{-1}$ are indeed as given in the formulas above.
(You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of $μ_{1}$ and $μ_{-1}$ are non-zero.)
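The claimed estimates are just per-class frequencies, per-class means, and a pooled covariance, so they are straightforward to compute directly. Below is a minimal sketch (my own; the `fit_gda` helper name is hypothetical) of the closed-form estimates, written for general $n$:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with labels y in {-1, 1}.

    X: (m, n) design matrix; y: (m,) label vector.
    Returns (phi, mu_neg, mu_pos, sigma).
    """
    m = X.shape[0]
    pos = (y == 1)
    phi = pos.mean()                      # fraction of positive examples
    mu_pos = X[pos].mean(axis=0)          # mean of the positive class
    mu_neg = X[~pos].mean(axis=0)         # mean of the negative class
    # Pooled covariance: center each example by its own class mean.
    D = X - np.where(pos[:, None], mu_pos, mu_neg)
    sigma = (D.T @ D) / m
    return phi, mu_neg, mu_pos, sigma

# Example usage on synthetic data drawn from the model itself.
rng = np.random.default_rng(0)
y = np.where(rng.random(500) < 0.6, 1, -1)
X = rng.normal(0, 1, size=(500, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
print(fit_gda(X, y))
```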
Since $n=1$, we have

$$p(x|y=-1) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x-μ_{-1})^2}{2\sigma^2}\right) = C\exp(A)$$

$$p(x|y=1) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x-μ_{1})^2}{2\sigma^2}\right) = C\exp(B)$$

We split the $m$ samples into $m_1$ positive samples and $m_{-1}$ negative samples.
The log-likelihood can then be written as

$$l(φ,Σ,μ_{1},μ_{-1}) = \log \prod_{i=1}^{m_1} p(x^{(i)}|y^{(i)}=1;Σ,μ_{1},μ_{-1})\,p(y^{(i)}=1;φ) \prod_{i=1}^{m_{-1}} p(x^{(i)}|y^{(i)}=-1;Σ,μ_{1},μ_{-1})\,p(y^{(i)}=-1;φ)$$

$$= \log \prod_{i=1}^{m_1} C\exp(B)\,φ \prod_{i=1}^{m_{-1}} C\exp(A)(1-φ)$$

$$= \sum_{i=1}^{m_1}\log(C\exp(B)\,φ) + \sum_{i=1}^{m_{-1}}\log(C\exp(A)(1-φ))$$

$$= \sum_{i=1}^{m}\log C + \sum_{i=1}^{m_1} B + \sum_{i=1}^{m_{-1}} A + \sum_{i=1}^{m_1}\log φ + \sum_{i=1}^{m_{-1}}\log(1-φ)$$

(here $A$ and $B$ are understood to be evaluated at each $x^{(i)}$). To maximize $l(φ,Σ,μ_{1},μ_{-1})$, we set each partial derivative to 0.
$\frac{\partial l}{\partial φ}$:

$$\frac{\partial l}{\partial φ} = \sum_{i=1}^{m_1} \frac{1}{φ} - \sum_{i=1}^{m_{-1}} \frac{1}{1-φ} = \frac{m_1}{φ} - \frac{m_{-1}}{1-φ} := 0$$

$$m_1(1-φ) - m_{-1}\,φ = 0$$

$$m_1 - m\,φ = 0$$

$$φ = \frac{m_1}{m} = \frac{1}{m} \sum_{i=1}^m 1\{y^{(i)}=1\}$$

$\frac{\partial l}{\partial Σ}$:
$$\frac{\partial l}{\partial Σ} = \frac{\partial l}{\partial \sigma^2} = \sum_{i=1}^m \frac{\partial \log C}{\partial \sigma^2} + \sum_{i=1}^{m_{-1}} \frac{\partial A}{\partial \sigma^2} + \sum_{i=1}^{m_1} \frac{\partial B}{\partial \sigma^2}$$

$$= \sum_{i=1}^m -\frac{1}{2}\cdot\frac{1}{2\pi\sigma^2}\cdot 2\pi + \sum_{i=1}^{m_{-1}} -\frac{(x^{(i)}-μ_{-1})^2}{2}\cdot(-1)\cdot(\sigma^2)^{-2} + \sum_{i=1}^{m_1} -\frac{(x^{(i)}-μ_{1})^2}{2}\cdot(-1)\cdot(\sigma^2)^{-2}$$

$$= \sum_{i=1}^m -\frac{1}{2\sigma^2} + \sum_{i=1}^{m_{-1}} \frac{(x^{(i)}-μ_{-1})^2}{2\sigma^4} + \sum_{i=1}^{m_1} \frac{(x^{(i)}-μ_{1})^2}{2\sigma^4} := 0$$

$$m\,\sigma^2 = \sum_{i=1}^{m_{-1}} (x^{(i)}-μ_{-1})^2 + \sum_{i=1}^{m_1} (x^{(i)}-μ_{1})^2$$

Combining the two class sums into a single sum over all $m$ samples, with $μ_{y^{(i)}}$ denoting the mean of example $i$'s class:

$$\sigma^2 = \frac{\sum_{i=1}^m (x^{(i)}-μ_{y^{(i)}})^2}{m}$$

$\frac{\partial l}{\partial μ_{1}}$:
$$\frac{\partial l}{\partial μ_{1}} = \sum_{i=1}^{m_1} \frac{\partial B}{\partial μ_{1}} = \sum_{i=1}^{m_1} -\frac{1}{2\sigma^2}\cdot 2\,(x^{(i)}-μ_{1})\cdot(-1) = \sum_{i=1}^{m_1} \frac{1}{\sigma^2}(x^{(i)}-μ_{1}) := 0$$

$$\sum_{i=1}^{m_1} (x^{(i)}-μ_{1}) = 0$$

$$\sum_{i=1}^{m_1} x^{(i)} = m_1\,μ_{1}$$

$$μ_{1} = \frac{\sum_{i=1}^{m}1\{y^{(i)}=1\}\,x^{(i)}}{\sum_{i=1}^{m}1\{y^{(i)}=1\}}$$

$\frac{\partial l}{\partial μ_{-1}}$:
By the same argument as for $μ_{1}$,

$$μ_{-1} = \frac{\sum_{i=1}^{m}1\{y^{(i)}=-1\}\,x^{(i)}}{\sum_{i=1}^{m}1\{y^{(i)}=-1\}}$$
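As a final sanity check (my own sketch, not part of the original solution), the closed-form $n=1$ estimates can be compared against a direct numerical maximization of the log-likelihood; running `scipy.optimize.minimize` on the negative log-likelihood should land on essentially the same values:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
y = np.where(rng.random(400) < 0.4, 1, -1)
x = rng.normal(np.where(y == 1, 2.0, -1.0), 1.5)

# Closed-form MLE from the derivation above.
phi_hat = (y == 1).mean()
mu_pos_hat, mu_neg_hat = x[y == 1].mean(), x[y == -1].mean()
var_hat = np.mean((x - np.where(y == 1, mu_pos_hat, mu_neg_hat)) ** 2)

def neg_log_likelihood(params):
    """-log l(phi, mu_pos, mu_neg, sigma^2); variance is optimized in log space."""
    phi, mu_pos, mu_neg, log_var = params
    sd = np.sqrt(np.exp(log_var))
    mu = np.where(y == 1, mu_pos, mu_neg)
    prior = np.where(y == 1, phi, 1 - phi)
    return -np.sum(norm.logpdf(x, mu, sd) + np.log(prior))

res = minimize(neg_log_likelihood, x0=np.array([0.5, 1.0, -0.5, 0.0]),
               bounds=[(1e-6, 1 - 1e-6), (None, None), (None, None), (None, None)])
print(res.x[:3], np.exp(res.x[3]))               # numerical optimum
print(phi_hat, mu_pos_hat, mu_neg_hat, var_hat)  # closed form: should match
```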