MAP and EM for MAP Estimation

The EM algorithm that we talked about in class was for solving a maximum likelihood estimation problem in which we wish to maximize

$$\prod_{i=1}^m p(x^{(i)};\theta) = \prod_{i=1}^m \sum_{z^{(i)}} p(x^{(i)},z^{(i)};\theta)$$

where the $z^{(i)}$'s are latent random variables. Suppose we are working in a Bayesian framework, and wanted to find the MAP estimate of the parameters $\theta$ by maximizing

$$\left( \prod_{i=1}^m p(x^{(i)};\theta) \right) p(\theta) = \left( \prod_{i=1}^m \sum_{z^{(i)}} p(x^{(i)},z^{(i)};\theta) \right) p(\theta)$$

Generalize the EM algorithm to work for MAP estimation. You may assume that $\log p(x,z|\theta)$ and $\log p(\theta)$ are both concave in $\theta$.
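As a concrete instance (an illustration added here, not part of the problem statement), in a Gaussian mixture model each latent $z^{(i)}$ is the unobserved component assignment, so that

$$p(x^{(i)}, z^{(i)};\theta) = p(x^{(i)}|z^{(i)};\mu,\Sigma)\, p(z^{(i)};\phi), \qquad z^{(i)} \sim \mathrm{Multinomial}(\phi),$$

and summing over $z^{(i)}$ recovers the marginal $p(x^{(i)};\theta)$ above.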


MAP Recap

Given a training dataset

$$S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$$

In Maximum Likelihood Estimation, we estimate the parameter by

$$\theta_{ML} = \arg\max_{\theta}\ P(S|\theta)$$

The intuition is: which $\theta$ is most likely to have produced the training dataset?

But in Maximum A Posteriori Estimation, we assume the parameter $\theta$ follows a prior distribution, and estimate the parameter by

$$\theta_{MAP} = \arg\max_{\theta}\ P(\theta|S)$$

The intuition is: given the training dataset, which $\theta$ is most likely?

In MAP estimation, we rewrite $P(\theta|S)$ using Bayes' rule:

$$P(\theta|S) = \frac{P(S|\theta)\,P(\theta)}{P(S)} = \frac{P(\theta)\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)}{\int_{\theta} P(\theta)\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)\,d\theta}$$

The denominator is just $P(S)$, which does not depend on $\theta$, so

$$\begin{aligned}
\theta_{MAP} &= \arg\max_{\theta} P(\theta|S) \\
&= \arg\max_{\theta} P(S|\theta)\,P(\theta) \\
&= \arg\max_{\theta} P(\theta)\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)
\end{aligned}$$

Note that the objective is almost the same as the one we maximize in Maximum Likelihood Estimation; the only change is the introduction of a prior distribution $P(\theta)$ on $\theta$.
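To make the difference concrete, here is a minimal numerical sketch, assuming a Bernoulli (coin-flip) likelihood and a Beta$(a,b)$ prior on $\theta$ (both choices are just for illustration), where the ML and MAP estimates have closed forms:

```python
import numpy as np

# Illustrative setup: Bernoulli likelihood with an assumed Beta(a, b) prior.
flips = np.array([1, 1, 1, 0, 1])   # observed data: 4 heads, 1 tail
a, b = 2.0, 2.0                     # prior hyperparameters (an assumption)

# ML:  argmax_theta P(S|theta)            -> sample mean
theta_ml = flips.mean()

# MAP: argmax_theta P(S|theta) P(theta)   -> prior acts like pseudo-counts
theta_map = (flips.sum() + a - 1) / (len(flips) + a + b - 2)

print(theta_ml, theta_map)   # 0.8 vs. ~0.714: the prior pulls the estimate toward 0.5
```

The two objectives differ only by the extra $P(\theta)$ factor, exactly as in the equations above.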


EM for MAP Estimation

When we introduce latent variables into MAP estimation, we estimate the parameter $\theta$ by maximizing

$$P(\theta) \prod_{i=1}^m P(x^{(i)}|\theta) = P(\theta) \prod_{i=1}^m \sum_{z^{(i)}} P(x^{(i)},z^{(i)}|\theta)$$

where the $z^{(i)}$'s are the latent variables. The derivation of the EM algorithm is very straightforward: when we take the log, $\log P(\theta)$ is just an additional term added to the objective.

$$\begin{aligned}
\log l(\theta) &= \sum_{i=1}^m \log P(x^{(i)}|\theta) + \log P(\theta) \\
&= \sum_{i=1}^m \log \Big[ \sum_{z^{(i)}} P(x^{(i)},z^{(i)}|\theta) \Big] + \log P(\theta) \\
&= \sum_{i=1}^m \log \Big[ \sum_{z^{(i)}} Q_i(z^{(i)})\, \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \Big] + \log P(\theta) \\
&= \sum_{i=1}^m \log E_{z^{(i)} \sim Q_i}\Big[ \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \Big] + \log P(\theta)
\end{aligned}$$

Since $\log$ is concave, according to Jensen's Inequality,

$$\begin{aligned}
\sum_{i=1}^m \log E_{z^{(i)} \sim Q_i}\Big[ \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \Big] + \log P(\theta)
&\ge \sum_{i=1}^m E_{z^{(i)} \sim Q_i}\Big[ \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \Big] + \log P(\theta) \\
&= \sum_{i=1}^m \Big[ \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \Big] + \log P(\theta)
\end{aligned}$$
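As a quick numerical sanity check of the Jensen step (the numbers below are arbitrary illustrative values), the concave $\log$ indeed satisfies $\log E[X] \ge E[\log X]$:

```python
import numpy as np

Q = np.array([0.3, 0.7])   # Q_i(z): some distribution over two values of z
v = np.array([0.5, 2.0])   # the ratio P(x, z|theta) / Q_i(z) at each z

print(np.log(Q @ v))       # log E[v] ~= 0.438
print(Q @ np.log(v))       # E[log v] ~= 0.277  <=  log E[v], as Jensen guarantees
```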

To make Jensen's Inequality hold with equality at the current $\theta$, we need the ratio inside the expectation to be a constant:

$$\frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} = C$$

Hence

$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)}|\theta)}{C}$$
$$\sum_{z^{(i)}} Q_i(z^{(i)}) = \sum_{z^{(i)}} \frac{P(x^{(i)},z^{(i)}|\theta)}{C}$$

The left-hand side is $1$, since $Q_i(z^{(i)})$ is a distribution over $z^{(i)}$, and the right-hand side is

$$\sum_{z^{(i)}} \frac{P(x^{(i)},z^{(i)}|\theta)}{C} = \frac{P(x^{(i)}|\theta)}{C}$$

And so

$$C = P(x^{(i)}|\theta)$$
$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)}|\theta)}{P(x^{(i)}|\theta)} = P(z^{(i)}|x^{(i)},\theta)$$

The E-step is therefore the same as in standard EM, since the prior term $\log P(\theta)$ does not depend on the $Q_i$'s:

$$Q_i(z^{(i)}) = P(z^{(i)}|x^{(i)},\theta)$$

And the M-step maximizes the lower bound, which now includes the log-prior term; since $\log P(x^{(i)},z^{(i)}|\theta)$ and $\log P(\theta)$ are both assumed concave in $\theta$, this maximization is tractable:

$$\theta := \arg\max_{\theta} \sum_{i=1}^m \Big[ \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)}|\theta)}{Q_i(z^{(i)})} \Big] + \log P(\theta)$$
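Putting the two steps together, here is a minimal sketch of the resulting algorithm. It assumes a specific model that is not part of the derivation above: a 1-D Gaussian mixture with known shared variance, a symmetric Dirichlet$(\alpha)$ prior on the mixing weights, and a flat prior on the means; the function name and all hyperparameters are illustrative.

```python
import numpy as np

def map_em_gmm_1d(x, K=2, alpha=2.0, sigma=1.0, n_iter=50, seed=0):
    """MAP-EM sketch: 1-D Gaussian mixture, known shared variance sigma^2,
    symmetric Dirichlet(alpha) prior on mixing weights pi, flat prior on means.
    Only the pi update differs from plain maximum-likelihood EM."""
    rng = np.random.default_rng(seed)
    m = len(x)
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = rng.choice(x, size=K, replace=False)     # initialize means from the data

    for _ in range(n_iter):
        # E-step: Q_i(z) = P(z | x_i, theta), the posterior responsibilities.
        # The shared Gaussian normalizing constant cancels, so it is omitted.
        log_r = np.log(pi) - 0.5 * ((x[:, None] - mu) / sigma) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)           # shape (m, K)

        # M-step: maximize the lower bound plus log P(theta).
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk           # flat prior -> ML update
        pi = (Nk + alpha - 1) / (m + K * (alpha - 1))    # Dirichlet-MAP update
    return pi, mu

# Toy data: two well-separated clusters.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 300)])
print(map_em_gmm_1d(x, K=2))
```

Note how the prior shows up only in the M-step update for $\pi$; the E-step is unchanged, exactly as derived above.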