MAP and EM for MAP Estimation

The EM algorithm that we talked about in class solves a maximum likelihood estimation problem in which we wish to maximize

$$\prod_{i=1}^m p(x^{(i)};\theta) = \prod_{i=1}^m \sum_{z^{(i)}} p(x^{(i)},z^{(i)};\theta)$$

where the $z^{(i)}$'s are latent random variables. Suppose we are working in a Bayesian framework and want to find the MAP estimate of the parameters $\theta$ by maximizing
$$\left( \prod_{i=1}^m p(x^{(i)};\theta) \right) p(\theta) = \left( \prod_{i=1}^m \sum_{z^{(i)}} p(x^{(i)},z^{(i)};\theta) \right) p(\theta)$$

Generalize the EM algorithm to work for MAP estimation. You may assume that $\log p(x,z|\theta)$ and $\log p(\theta)$ are both concave in $\theta$.
MAP Recap

Given a training dataset

$$S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$$

in Maximum Likelihood Estimation, we estimate the parameter by

$$\theta_{ML} = \operatorname{argmax}_{\theta}~P(S|\theta)$$

The intuition is: which $\theta$ is most likely to produce the training dataset?
But in Maximum A Posteriori Estimation, we assume the parameter $\theta$ follows a prior distribution, and we estimate it by

$$\theta_{MAP} = \operatorname{argmax}_{\theta}~P(\theta|S)$$

The intuition is: given the training dataset, which $\theta$ is the most likely?
In MAP estimation, we expand $P(\theta|S)$ according to Bayes' Rule:

$$P(\theta|S) = \frac{P(S|\theta)\,P(\theta)}{P(S)} = \frac{P(\theta) \prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)}{\int_{\theta} P(\theta) \prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\, d\theta}$$

The denominator is just $P(S)$, which does not depend on $\theta$, so we have
$$\theta_{MAP} = \operatorname{argmax}_{\theta} P(\theta|S) = \operatorname{argmax}_{\theta} P(S|\theta)P(\theta) = \operatorname{argmax}_{\theta} P(\theta) \prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)$$

Note that what we maximize is almost the same as in Maximum Likelihood Estimation; the only change is the additional prior distribution on $\theta$.
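To make the ML/MAP contrast concrete, here is a minimal sketch (my own illustration, not from the original notes) for a Bernoulli likelihood with an assumed Beta prior, where both estimates have well-known closed forms:

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads.
x = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
m, heads = len(x), int(x.sum())

# ML: argmax_theta P(S | theta) for a Bernoulli is simply heads / m.
theta_ml = heads / m

# MAP with an assumed Beta(a, b) prior on theta:
# argmax_theta P(S | theta) P(theta) = (heads + a - 1) / (m + a + b - 2).
a, b = 3.0, 3.0  # prior pulling theta toward 0.5
theta_map = (heads + a - 1) / (m + a + b - 2)

print(theta_ml)   # 0.7
print(theta_map)  # 9/14 ≈ 0.643
```

With the Beta(3, 3) prior the MAP estimate is pulled from 0.7 toward 0.5; as $m$ grows, the prior's influence shrinks and the two estimates converge.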
EM for MAP Estimation

When we introduce latent variables into MAP Estimation, we estimate the parameter $\theta$ by maximizing

$$P(\theta) \prod_{i=1}^m P(x^{(i)}|\theta) = P(\theta) \prod_{i=1}^m \sum_{z^{(i)}} P(x^{(i)},z^{(i)}|\theta)$$

where the $z^{(i)}$'s are the latent variables.
The derivation of the EM algorithm is then very straightforward: once we take logs, $\log P(\theta)$ is just an additional term added to the objective. Letting each $Q_i$ be any distribution over $z^{(i)}$, we have

$$\log l(\theta) = \sum_{i=1}^m \log P(x^{(i)}|\theta) + \log P(\theta)$$

$$= \sum_{i=1}^m \log \left[ \sum_{z^{(i)}} P(x^{(i)},z^{(i)}|\theta) \right] + \log P(\theta)$$

$$= \sum_{i=1}^m \log \left[ \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$

$$= \sum_{i=1}^m \log E_{z^{(i)} \sim Q_i} \left[ \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$

Since $\log$ is concave, by Jensen's Inequality,
$$\sum_{i=1}^m \log E_{z^{(i)} \sim Q_i} \left[ \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta) \;\ge\; \sum_{i=1}^m E_{z^{(i)} \sim Q_i} \left[ \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$

$$= \sum_{i=1}^m \left[ \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$

For equality to hold in the Jensen's Inequality step at the current $\theta$, we need the ratio inside the expectation to be a constant:
$$\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} = C$$

Hence

$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)}|\theta)}{C}$$

$$\sum_{z^{(i)}} Q_i(z^{(i)}) = \sum_{z^{(i)}} \frac{P(x^{(i)},z^{(i)}|\theta)}{C}$$

The left side is $1$, since $Q_i(z^{(i)})$ is a distribution over $z^{(i)}$. And the right side is
$$\sum_{z^{(i)}} \frac{P(x^{(i)},z^{(i)}|\theta)}{C} = \frac{P(x^{(i)}|\theta)}{C}$$

And so

$$C = P(x^{(i)}|\theta)$$

$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)}|\theta)}{P(x^{(i)}|\theta)} = P(z^{(i)}|x^{(i)},\theta)$$

So the E-step is exactly the same as in the ordinary EM algorithm:

$$Q_i(z^{(i)}) = P(z^{(i)}|x^{(i)},\theta)$$

And the M-step is to maximize the lower bound, which now includes the prior term:
$$\theta := \operatorname{argmax}_{\theta} \sum_{i=1}^m \left[ \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$
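As a concrete illustration (my own sketch, not part of the original derivation), here is MAP-EM for a two-component 1-D Gaussian mixture, under the assumption of a Dirichlet prior on the mixing weights and flat priors on the means and variances. The E-step is unchanged from ordinary EM; the $\log P(\theta)$ term only shifts the mixing-weight update in the M-step:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two well-separated Gaussians (latent z = component id).
x = np.concatenate([rng.normal(-4, 1, 60), rng.normal(4, 1, 140)])
m = len(x)

# Assumed prior: Dirichlet(alpha) on mixing weights phi; flat priors elsewhere.
alpha = np.array([2.0, 2.0])

phi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma2 = np.array([1.0, 1.0])

def normal_pdf(x, mu, s2):
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

for _ in range(50):
    # E-step (unchanged by the prior): Q_i(z) = P(z | x_i, theta).
    joint = phi * normal_pdf(x[:, None], mu, sigma2)   # shape (m, 2)
    Q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the lower bound plus log P(theta).
    Nk = Q.sum(axis=0)
    # The Dirichlet term shifts the phi update by (alpha - 1):
    phi = (Nk + alpha - 1) / (m + (alpha - 1).sum())
    mu = (Q * x[:, None]).sum(axis=0) / Nk
    sigma2 = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(phi, mu)
```

With flat priors on $\mu$ and $\sigma^2$ those updates are identical to ordinary EM; only the line computing `phi` differs from the maximum-likelihood version, exactly as the derivation above predicts.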