Gaussian Discriminant Analysis 
Suppose we are given a dataset $\{(x^{(i)}, y^{(i)});\ i = 1, 2, \dots, m\}$ consisting of $m$ independent examples, where $x^{(i)} \in \mathbb{R}^n$ are $n$-dimensional vectors and $y^{(i)} \in \{-1, 1\}$. We will model the joint distribution of $(x, y)$ according to:
$$p(y=1) = \varphi, \qquad p(y=-1) = 1 - \varphi$$

$$p(x \mid y=-1) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2}(x-\mu_{-1})^T \Sigma^{-1} (x-\mu_{-1}) \right)$$

$$p(x \mid y=1) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2}(x-\mu_{1})^T \Sigma^{-1} (x-\mu_{1}) \right)$$

Here, the parameters of our model are $\varphi$, $\Sigma$, $\mu_1$ and $\mu_{-1}$.
(Note that while there are two different mean vectors $\mu_1$ and $\mu_{-1}$, there is only one covariance matrix $\Sigma$.)
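To make the model concrete, here is a minimal numpy sketch of the densities above (the function names `gaussian_density` and `joint_density` are mine, chosen for illustration):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma) for an n-dimensional x."""
    n = x.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    # np.linalg.solve(Sigma, diff) computes Sigma^{-1} diff without forming the inverse
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def joint_density(x, y, phi, Sigma, mu_pos, mu_neg):
    """p(x, y) = p(x | y) p(y) under the GDA model, with labels y in {-1, 1}."""
    if y == 1:
        return gaussian_density(x, mu_pos, Sigma) * phi
    return gaussian_density(x, mu_neg, Sigma) * (1 - phi)
```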
 (a) 
Suppose we have already fit $\varphi$, $\Sigma$, $\mu_1$ and $\mu_{-1}$, and now we want to make a prediction at some new query point $x$.
Show that the posterior distribution of the label at $x$ takes the form of a logistic function, and can be written as
$$p(y \mid x; \varphi, \Sigma, \mu_1, \mu_{-1}) = \frac{1}{1 + \exp(-y(\theta^T x + \theta_0))}$$

To simplify the equations, define

$$C = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}$$

$$A = -\frac{1}{2}(x-\mu_{-1})^T \Sigma^{-1} (x-\mu_{-1})$$

$$B = -\frac{1}{2}(x-\mu_{1})^T \Sigma^{-1} (x-\mu_{1})$$

Then we have
$$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)} = \frac{p(x \mid y)\,p(y)}{p(x \mid y=1)\,p(y=1) + p(x \mid y=-1)\,p(y=-1)}$$

For $y = 1$ we have:
$$p(y=1 \mid x) = \frac{p(x \mid y=1)\,p(y=1)}{p(x \mid y=1)\,p(y=1) + p(x \mid y=-1)\,p(y=-1)} = \frac{C \exp(B)\,\varphi}{C \exp(B)\,\varphi + C \exp(A)\,(1-\varphi)}$$

$$= \frac{1}{1 + \frac{\exp(A)\,(1-\varphi)}{\exp(B)\,\varphi}} = \frac{1}{1 + \exp(A-B)\exp\!\left(\ln\frac{1-\varphi}{\varphi}\right)} = \frac{1}{1 + \exp\!\big({-}(B - A + \ln\varphi - \ln(1-\varphi))\big)}$$

And for $y = -1$ we have:
$$p(y=-1 \mid x) = \frac{p(x \mid y=-1)\,p(y=-1)}{p(x \mid y=1)\,p(y=1) + p(x \mid y=-1)\,p(y=-1)} = \frac{C \exp(A)\,(1-\varphi)}{C \exp(A)\,(1-\varphi) + C \exp(B)\,\varphi}$$

$$= \frac{1}{1 + \exp\!\big(B - A + \ln\varphi - \ln(1-\varphi)\big)}$$

So for both $y = 1$ and $y = -1$ we have
$$p(y \mid x) = \frac{1}{1 + \exp(-y(\theta^T x + \theta_0))}$$

where

$$\theta^T x + \theta_0 = B - A + \ln\varphi - \ln(1-\varphi)$$

Expanding $B - A$, the quadratic terms $x^T \Sigma^{-1} x$ cancel because both classes share the same $\Sigma$, leaving

$$B - A = (\mu_1 - \mu_{-1})^T \Sigma^{-1} x + \frac{1}{2}\left(\mu_{-1}^T \Sigma^{-1} \mu_{-1} - \mu_1^T \Sigma^{-1} \mu_1\right)$$

Notice that this is linear in $x$, so we can take $\theta = \Sigma^{-1}(\mu_1 - \mu_{-1})$ and absorb the remaining constant, together with $\ln\varphi - \ln(1-\varphi)$, into $\theta_0$; both are determined by the parameters fit from the training set.
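As a sanity check, this equivalence is easy to verify numerically: compute the posterior directly from Bayes' rule and compare it with the sigmoid form, using $\theta$ and $\theta_0$ as read off above. A minimal sketch with arbitrary parameter values (all variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
phi = 0.3
mu_pos, mu_neg = rng.normal(size=n), rng.normal(size=n)
M = rng.normal(size=(n, n))
Sigma = M @ M.T + n * np.eye(n)          # a random symmetric positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)

# theta and theta_0 read off from B - A + ln(phi) - ln(1 - phi)
theta = Sigma_inv @ (mu_pos - mu_neg)
theta0 = 0.5 * (mu_neg @ Sigma_inv @ mu_neg - mu_pos @ Sigma_inv @ mu_pos) \
         + np.log(phi) - np.log(1 - phi)

def unnormalized(x, mu):
    d = x - mu
    return np.exp(-0.5 * d @ Sigma_inv @ d)  # the constant C cancels in the posterior

x = rng.normal(size=n)
bayes = unnormalized(x, mu_pos) * phi / (
    unnormalized(x, mu_pos) * phi + unnormalized(x, mu_neg) * (1 - phi))
sigmoid = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))
assert np.isclose(bayes, sigmoid)            # the two forms agree
```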
 Exponential family and sigmoid function 
From the derivation, we can build the intuition that the sigmoid function is really a ratio of probabilities: if we have an exponential family distribution $p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))$, then dividing both the numerator and the denominator of the posterior by the numerator, which is an $\exp(\cdot)$ term, yields a hypothesis in the form of a sigmoid function.
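For example, writing the Bernoulli distribution (with $y \in \{0, 1\}$) in exponential family form,

$$p(y;\varphi) = \varphi^y (1-\varphi)^{1-y} = \exp\!\left(y \ln\frac{\varphi}{1-\varphi} + \ln(1-\varphi)\right),$$

the natural parameter is $\eta = \ln\frac{\varphi}{1-\varphi}$, and solving for $\varphi$ recovers exactly the sigmoid: $\varphi = \frac{1}{1+e^{-\eta}}$.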
 (b) 
For this part of the problem only, you may assume $n$ (the dimension of $x$) is $1$, so $\Sigma = [\sigma^2]$ is just a real number, and likewise the determinant of $\Sigma$ is given by $|\Sigma| = \sigma^2$.
Given the dataset, we claim the maximum likelihood estimates of the parameters are given by
$$\varphi = \frac{1}{m} \sum_{i=1}^m 1\{y^{(i)} = 1\}$$

$$\mu_{-1} = \frac{\sum_{i=1}^m 1\{y^{(i)} = -1\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)} = -1\}}$$

$$\mu_{1} = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)} = 1\}}$$

$$\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$
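Read as code, these estimators are just per-class averages. A minimal numpy sketch (the function name `fit_gda` is mine):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with labels y in {-1, 1}.

    X: (m, n) array of inputs; y: (m,) array of labels.
    Returns (phi, mu_pos, mu_neg, Sigma) with a single shared covariance.
    """
    m = X.shape[0]
    pos = (y == 1)
    phi = pos.mean()                      # fraction of positive examples
    mu_pos = X[pos].mean(axis=0)          # mean of the positive class
    mu_neg = X[~pos].mean(axis=0)         # mean of the negative class
    # Center each example by its own class mean, then average the outer products.
    D = X - np.where(pos[:, None], mu_pos, mu_neg)
    Sigma = D.T @ D / m
    return phi, mu_pos, mu_neg, Sigma
```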
The log-likelihood of the data is

$$\ell(\varphi, \Sigma, \mu_1, \mu_{-1}) = \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \varphi, \Sigma, \mu_1, \mu_{-1}) = \log \prod_{i=1}^m p(x^{(i)} \mid y^{(i)}; \Sigma, \mu_1, \mu_{-1})\, p(y^{(i)}; \varphi)$$

By maximizing $\ell$ with respect to the four parameters, prove that the maximum likelihood estimates of $\varphi$, $\Sigma$, $\mu_1$ and $\mu_{-1}$ are indeed as given in the formulas above.
(You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of $\mu_1$ and $\mu_{-1}$ are non-zero.)
Since $n = 1$, we have

$$p(x \mid y=-1) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\frac{(x-\mu_{-1})^2}{2\sigma^2} \right) = C \exp(A)$$

$$p(x \mid y=1) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\frac{(x-\mu_{1})^2}{2\sigma^2} \right) = C \exp(B)$$

We split the $m$ samples into $m_1$ positive samples and $m_{-1}$ negative samples. The log-likelihood can then be written as
$$\ell(\varphi, \Sigma, \mu_1, \mu_{-1}) = \log \prod_{i=1}^{m_1} p(x^{(i)} \mid y^{(i)}=1; \Sigma, \mu_1, \mu_{-1})\, p(y^{(i)}=1; \varphi) \prod_{i=1}^{m_{-1}} p(x^{(i)} \mid y^{(i)}=-1; \Sigma, \mu_1, \mu_{-1})\, p(y^{(i)}=-1; \varphi)$$

$$= \log \prod_{i=1}^{m_1} C \exp(B)\,\varphi \prod_{i=1}^{m_{-1}} C \exp(A)\,(1-\varphi)$$

$$= \sum_{i=1}^{m_1} \log(C \exp(B)\,\varphi) + \sum_{i=1}^{m_{-1}} \log(C \exp(A)\,(1-\varphi))$$

$$= \sum_{i=1}^{m} \log C + \sum_{i=1}^{m_1} B + \sum_{i=1}^{m_{-1}} A + \sum_{i=1}^{m_1} \log\varphi + \sum_{i=1}^{m_{-1}} \log(1-\varphi)$$

(Here, with a slight abuse of notation, an index running to $m_1$ or $m_{-1}$ means the sum or product ranges over the positive or negative samples respectively, and $A$, $B$ denote the per-sample exponents.)
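The decomposition is easy to check numerically in the $n = 1$ case. A small sketch with synthetic data (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, mu1, mu_neg1, s2 = 0.4, 2.0, -1.0, 1.5
m = 20
y = np.where(rng.random(m) < phi, 1, -1)
x = np.where(y == 1, mu1, mu_neg1) + rng.normal(scale=np.sqrt(s2), size=m)

logC = -0.5 * np.log(2 * np.pi * s2)
A = -(x - mu_neg1) ** 2 / (2 * s2)   # per-sample exponent for the negative class
B = -(x - mu1) ** 2 / (2 * s2)       # per-sample exponent for the positive class

direct = np.sum(np.where(y == 1, logC + B + np.log(phi),
                         logC + A + np.log(1 - phi)))
expanded = (m * logC + B[y == 1].sum() + A[y == -1].sum()
            + (y == 1).sum() * np.log(phi) + (y == -1).sum() * np.log(1 - phi))
assert np.isclose(direct, expanded)  # the expanded form matches the direct sum
```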
To maximize $\ell(\varphi, \Sigma, \mu_1, \mu_{-1})$, we set each partial derivative to $0$.

$\frac{\partial \ell}{\partial \varphi}$:

$$\frac{\partial \ell}{\partial \varphi} = \sum_{i=1}^{m_1} \frac{1}{\varphi} + \sum_{i=1}^{m_{-1}} \frac{-1}{1-\varphi} = \frac{m_1}{\varphi} - \frac{m_{-1}}{1-\varphi} := 0$$

$$m_1 (1-\varphi) - m_{-1}\,\varphi = 0$$

$$m_1 - m\varphi = 0$$

$$\varphi = \frac{m_1}{m} = \frac{1}{m} \sum_{i=1}^m 1\{y^{(i)} = 1\}$$

$\frac{\partial \ell}{\partial \Sigma}$:
$$\frac{\partial \ell}{\partial \Sigma} = \frac{\partial \ell}{\partial \sigma^2} = \sum_{i=1}^{m} \frac{\partial \log C}{\partial \sigma^2} + \sum_{i=1}^{m_{-1}} \frac{\partial A}{\partial \sigma^2} + \sum_{i=1}^{m_1} \frac{\partial B}{\partial \sigma^2}$$

$$= \sum_{i=1}^{m} \left(-\frac{1}{2} \cdot \frac{2\pi}{2\pi\sigma^2}\right) + \sum_{i=1}^{m_{-1}} \left(-\frac{(x^{(i)}-\mu_{-1})^2}{2}\right)\!\left(-(\sigma^2)^{-2}\right) + \sum_{i=1}^{m_1} \left(-\frac{(x^{(i)}-\mu_{1})^2}{2}\right)\!\left(-(\sigma^2)^{-2}\right)$$

$$= -\sum_{i=1}^{m} \frac{1}{2\sigma^2} + \sum_{i=1}^{m_{-1}} \frac{(x^{(i)}-\mu_{-1})^2}{2\sigma^4} + \sum_{i=1}^{m_1} \frac{(x^{(i)}-\mu_{1})^2}{2\sigma^4} := 0$$

$$\sum_{i=1}^{m} \sigma^2 = \sum_{i=1}^{m_{-1}} (x^{(i)}-\mu_{-1})^2 + \sum_{i=1}^{m_1} (x^{(i)}-\mu_{1})^2$$

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})^2$$

$\frac{\partial \ell}{\partial \mu_1}$:
$$\frac{\partial \ell}{\partial \mu_1} = \sum_{i=1}^{m_1} \frac{\partial B}{\partial \mu_1} = \sum_{i=1}^{m_1} \left(-\frac{1}{2\sigma^2}\right) \cdot 2\,(x^{(i)}-\mu_1) \cdot (-1) = \sum_{i=1}^{m_1} \frac{1}{\sigma^2}\,(x^{(i)}-\mu_1) := 0$$

$$\sum_{i=1}^{m_1} (x^{(i)}-\mu_1) = 0 \quad\Longrightarrow\quad \sum_{i=1}^{m_1} x^{(i)} = m_1\,\mu_1$$

$$\mu_1 = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)} = 1\}}$$

$\frac{\partial \ell}{\partial \mu_{-1}}$:
By the same argument as for $\mu_1$,
$$\mu_{-1} = \frac{\sum_{i=1}^m 1\{y^{(i)} = -1\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)} = -1\}}$$
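To close the loop, the closed-form estimates can be checked numerically: plug them into $\ell$ and confirm that perturbing any parameter decreases the log-likelihood. A minimal self-contained sketch with synthetic $n = 1$ data (the helper `loglik` is my name):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 10_000
y = np.where(rng.random(m) < 0.4, 1, -1)                # true phi = 0.4
x = np.where(y == 1, 2.0, -1.0) + rng.normal(size=m)    # true mu_1 = 2, mu_{-1} = -1, sigma^2 = 1

# Closed-form MLEs for the n = 1 case, exactly as derived above.
phi = (y == 1).mean()
mu1 = x[y == 1].mean()
mu_neg1 = x[y == -1].mean()
s2 = np.mean((x - np.where(y == 1, mu1, mu_neg1)) ** 2)

def loglik(phi, mu1, mu_neg1, s2):
    mu = np.where(y == 1, mu1, mu_neg1)
    prior = np.where(y == 1, phi, 1 - phi)
    return np.sum(-0.5 * np.log(2 * np.pi * s2)
                  - (x - mu) ** 2 / (2 * s2) + np.log(prior))

best = loglik(phi, mu1, mu_neg1, s2)
for eps in (0.01, -0.01):   # perturbing any single parameter lowers the log-likelihood
    assert loglik(phi + eps, mu1, mu_neg1, s2) < best
    assert loglik(phi, mu1 + eps, mu_neg1, s2) < best
    assert loglik(phi, mu1, mu_neg1 + eps, s2) < best
    assert loglik(phi, mu1, mu_neg1, s2 + eps) < best
```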