MAP and EM for MAP Estimation

The EM algorithm that we talked about in class solves a maximum likelihood estimation problem in which we wish to maximize

$$\prod_{i=1}^m p(x^{(i)};\theta) = \prod_{i=1}^m \sum_{z^{(i)}} p(x^{(i)},z^{(i)};\theta)$$

where the $z^{(i)}$'s are latent random variables. Suppose we are working in a Bayesian framework and want to find the MAP estimate of the parameters $\theta$ by maximizing
$$\left( \prod_{i=1}^m p(x^{(i)};\theta) \right) p(\theta) = \left( \prod_{i=1}^m \sum_{z^{(i)}} p(x^{(i)},z^{(i)};\theta) \right) p(\theta)$$

Generalize the EM algorithm to work for MAP estimation. You may assume that $\log p(x,z|\theta)$ and $\log p(\theta)$ are both concave in $\theta$.
MAP Recap

Given a training dataset

$$S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$$

in Maximum Likelihood Estimation, we estimate the parameter by

$$\theta_{ML} = \operatorname{argmax}_{\theta}~P(S|\theta)$$

The intuition is: which $\theta$ is most likely to produce the training dataset?
But in Maximum A Posteriori Estimation, we assume the parameter $\theta$ follows a prior distribution, and we estimate it by

$$\theta_{MAP} = \operatorname{argmax}_{\theta}~P(\theta|S)$$

The intuition is: given the training dataset, which $\theta$ is the most likely?
In MAP estimation, we expand $P(\theta|S)$ according to Bayes' Rule:

$$P(\theta|S) = \frac{P(S|\theta)\,P(\theta)}{P(S)} = \frac{P(\theta) \prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)}{\int_{\theta} P(\theta) \prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\, d\theta}$$

The denominator is just $P(S)$, which does not depend on $\theta$, so we have
$$\theta_{MAP} = \operatorname{argmax}_{\theta} P(\theta|S) = \operatorname{argmax}_{\theta} P(S|\theta)P(\theta) = \operatorname{argmax}_{\theta} P(\theta) \prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)$$

Note that what we maximize is almost the same as in Maximum Likelihood Estimation; the only change is the additional prior distribution on $\theta$.
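To make the ML/MAP contrast concrete, here is a minimal sketch (my own illustration, not from the original notes) for a Bernoulli likelihood with an assumed Beta prior, where both estimates have well-known closed forms:

```python
import numpy as np

# Hypothetical data: 10 coin flips, 7 heads.
x = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
m, heads = len(x), int(x.sum())

# ML: argmax_theta P(S | theta) for a Bernoulli is simply heads / m.
theta_ml = heads / m

# MAP with an assumed Beta(a, b) prior on theta:
# argmax_theta P(S | theta) P(theta) = (heads + a - 1) / (m + a + b - 2).
a, b = 3.0, 3.0  # prior pulling theta toward 0.5
theta_map = (heads + a - 1) / (m + a + b - 2)

print(theta_ml)   # 0.7
print(theta_map)  # 9/14 ≈ 0.643
```

With the Beta(3, 3) prior the MAP estimate is pulled from 0.7 toward 0.5; as $m$ grows, the prior's influence shrinks and the two estimates converge.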
EM for MAP Estimation

When we introduce latent variables into MAP Estimation, we estimate the parameter $\theta$ by maximizing

$$P(\theta) \prod_{i=1}^m P(x^{(i)}|\theta) = P(\theta) \prod_{i=1}^m \sum_{z^{(i)}} P(x^{(i)},z^{(i)}|\theta)$$

where the $z^{(i)}$'s are the latent variables.
The derivation of the EM algorithm is then very straightforward: once we take logs, $\log P(\theta)$ is just an additional term added to the objective. Letting each $Q_i$ be any distribution over $z^{(i)}$, we have

$$\log l(\theta) = \sum_{i=1}^m \log P(x^{(i)}|\theta) + \log P(\theta)$$

$$= \sum_{i=1}^m \log \left[ \sum_{z^{(i)}} P(x^{(i)},z^{(i)}|\theta) \right] + \log P(\theta)$$

$$= \sum_{i=1}^m \log \left[ \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$

$$= \sum_{i=1}^m \log E_{z^{(i)} \sim Q_i} \left[ \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$

Since $\log$ is concave, by Jensen's Inequality,
$$\sum_{i=1}^m \log E_{z^{(i)} \sim Q_i} \left[ \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta) \;\ge\; \sum_{i=1}^m E_{z^{(i)} \sim Q_i} \left[ \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$

$$= \sum_{i=1}^m \left[ \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$

For equality to hold in the Jensen's Inequality step at the current $\theta$, we need the ratio inside the expectation to be a constant:
$$\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} = C$$

Hence

$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)}|\theta)}{C}$$

$$\sum_{z^{(i)}} Q_i(z^{(i)}) = \sum_{z^{(i)}} \frac{P(x^{(i)},z^{(i)}|\theta)}{C}$$

The left side is $1$, since $Q_i(z^{(i)})$ is a distribution over $z^{(i)}$. And the right side is
$$\sum_{z^{(i)}} \frac{P(x^{(i)},z^{(i)}|\theta)}{C} = \frac{P(x^{(i)}|\theta)}{C}$$

And so

$$C = P(x^{(i)}|\theta)$$

$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)}|\theta)}{P(x^{(i)}|\theta)} = P(z^{(i)}|x^{(i)},\theta)$$

So the E-step is exactly the same as in the ordinary EM algorithm:

$$Q_i(z^{(i)}) = P(z^{(i)}|x^{(i)},\theta)$$

And the M-step is to maximize the lower bound, which now includes the prior term:
$$\theta := \operatorname{argmax}_{\theta} \sum_{i=1}^m \left[ \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \right] + \log P(\theta)$$
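As a concrete illustration (my own sketch, not part of the original derivation), here is MAP-EM for a two-component 1-D Gaussian mixture, under the assumption of a Dirichlet prior on the mixing weights and flat priors on the means and variances. The E-step is unchanged from ordinary EM; the $\log P(\theta)$ term only shifts the mixing-weight update in the M-step:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two well-separated Gaussians (latent z = component id).
x = np.concatenate([rng.normal(-4, 1, 60), rng.normal(4, 1, 140)])
m = len(x)

# Assumed prior: Dirichlet(alpha) on mixing weights phi; flat priors elsewhere.
alpha = np.array([2.0, 2.0])

phi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma2 = np.array([1.0, 1.0])

def normal_pdf(x, mu, s2):
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

for _ in range(50):
    # E-step (unchanged by the prior): Q_i(z) = P(z | x_i, theta).
    joint = phi * normal_pdf(x[:, None], mu, sigma2)   # shape (m, 2)
    Q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the lower bound plus log P(theta).
    Nk = Q.sum(axis=0)
    # The Dirichlet term shifts the phi update by (alpha - 1):
    phi = (Nk + alpha - 1) / (m + (alpha - 1).sum())
    mu = (Q * x[:, None]).sum(axis=0) / Nk
    sigma2 = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(phi, mu)
```

With flat priors on $\mu$ and $\sigma^2$ those updates are identical to ordinary EM; only the line computing `phi` differs from the maximum-likelihood version, exactly as the derivation above predicts.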