
EM algorithm for data clustering

Introduction

Let the analyzed object (a texture in our case) be described by two random variables $\mathbb{X}$ and $\mathbb{Y}$ which are assumed to have the joint probability distribution

\begin{displaymath}
P(x,y\vert\theta) = P(x\vert y,\theta)P(y\vert\theta)\:.
\end{displaymath}

The distribution is known up to its parameter(s) $\theta\in{\Theta}$. It is assumed that we are given a set of samples ${\cal T}_{X} = \{x_1,\ldots,x_l\}$ independently drawn from the distribution
\begin{displaymath}
P(x\vert\theta) = \sum_{y\in{\cal Y}} P(x,y\vert\theta) = \sum_{y\in{\cal Y}}
P(x\vert y,\theta) P(y\vert\theta) \:.
\end{displaymath} (1)

The Expectation-Maximization (EM) algorithm is an optimization procedure which computes the Maximum-Likelihood (ML) estimate of the unknown parameter $\theta\in{\Theta}$ when only incomplete data ${\cal T}_X$ are available ($y$ is unobserved). In other words, the EM algorithm maximizes the likelihood function
\begin{displaymath}
l(\theta\vert{\cal T}_X) = P({\cal T}_X\vert\theta) =
\prod_{i=1}^l \sum_{y\in{\cal Y}} P(x_i\vert y,\theta)P(y\vert\theta)
\end{displaymath} (2)

with respect to the parameter $\theta\in{\Theta}$.
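
In implementations one typically evaluates the logarithm of (2), which is maximized by the same $\theta$. Below is a minimal MATLAB sketch of this evaluation for a GMM; the names X, Mu, Sigma and Py are hypothetical (not toolbox identifiers), and the sketch would be saved as its own function file.

\begin{verbatim}
% Log-likelihood of samples X (n x l) under a GMM with m components.
% Mu (n x m), Sigma (n x n x m) and Py (1 x m) are hypothetical names
% for the means, covariances and prior probabilities P(y|theta).
function L = gmm_loglik(X, Mu, Sigma, Py)
  [n, l] = size(X);
  m = length(Py);
  P = zeros(m, l);                      % P(y,i) = P(x_i|y)P(y)
  for y = 1:m
    D = X - repmat(Mu(:,y), 1, l);      % centered data
    C = Sigma(:,:,y);
    quad = sum(D .* (C \ D), 1);        % (x-mu)'*inv(C)*(x-mu) per sample
    P(y,:) = Py(y) * exp(-0.5*quad) / ((2*pi)^(n/2) * sqrt(det(C)));
  end
  L = sum(log(sum(P, 1)));              % sum_i log sum_y P(x_i,y|theta)
end
\end{verbatim}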

A particular example of the distribution (1) which can be estimated by the EM algorithm is the Gaussian Mixture Model (GMM). In this case the distribution $P(x\vert y,\theta)$ is the multivariate Gaussian distribution

\begin{displaymath}
P(x\vert y,\theta) = \frac{1}{(2\pi)^{\frac{n}{2}}\sqrt{\det(\Sigma_y)}}
\exp\left( -\frac{1}{2}(x-\mu_y)^T\Sigma_y^{-1}(x-\mu_y)\right) \:.
\end{displaymath}

The random variable $\mathbb{X}$ takes values from ${\mathbb{R}}^n$ and the variable $\mathbb{Y}$ from the discrete set $\{1,\ldots,m\}$. The unknown parameter $\theta$ comprises the mean vectors $\mu_y$, $y=1,\ldots,m$, the covariance matrices $\Sigma_y$, $y=1,\ldots,m$, and the values of the discrete distribution $P(y\vert\theta)$. The number $m$ of Gaussian components must be prescribed before the EM algorithm is applied. For instance, cross-validation can be used to select a proper $m$: one maximizes the cross-validation estimate of the likelihood function (2) with respect to $m$.
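
In the toolbox the estimation step itself is provided by emgmm (listed under "Useful functions" below). The following usage sketch assumes that emgmm takes the data matrix and an options structure with the field ncomp, and that the data file holds a structure with a field X; consult help emgmm before relying on this interface.

\begin{verbatim}
% Estimation of a GMM with a fixed number of components m.
% Assumed interface (check 'help emgmm' and 'help pdfgmm'):
%   model = emgmm(X,options) with options.ncomp = m
%   p = pdfgmm(X,model)
load('brodatz2_trn.mat');      % assumed to provide a structure trn with data trn.X
options.ncomp = 4;             % number m of Gaussian components, chosen ad hoc here
model = emgmm(trn.X, options); % ML estimates of mu_y, Sigma_y and P(y|theta)
p = pdfgmm(trn.X, model);      % values of the estimated density P(x|theta)
\end{verbatim}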

The estimated GMM can be used for data clustering. In this case it is assumed that similar data belonging to one cluster (class) are generated by the same Gaussian component. The value of the random variable $\mathbb{Y}$ then identifies the corresponding cluster. The whole clustering procedure involves:

  1. Estimation of the GMM model by the EM algorithm.
  2. Classification of the data into classes based on the estimated model (1). The Bayesian classifier is the natural choice for this step (see the sketch after this list).
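
For step 2, each sample is assigned to the component with the maximal posterior probability $P(y\vert x,\theta) \propto P(x\vert y,\theta)P(y\vert\theta)$; since only the argmax matters, the normalization by $P(x\vert\theta)$ can be skipped. A minimal sketch, reusing the hypothetical names X, Mu, Sigma and Py from the log-likelihood example above:

\begin{verbatim}
% Bayesian classification: assign x_i to the cluster y maximizing
% the unnormalized posterior P(x_i|y,theta)*P(y|theta).
[n, l] = size(X);
m = length(Py);
post = zeros(m, l);
for y = 1:m
  D = X - repmat(Mu(:,y), 1, l);
  C = Sigma(:,:,y);
  quad = sum(D .* (C \ D), 1);
  post(y,:) = Py(y) * exp(-0.5*quad) / ((2*pi)^(n/2) * sqrt(det(C)));
end
[dummy, labels] = max(post, [], 1);   % labels(i) = estimated cluster of x_i
\end{verbatim}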

Figure 3: Expectation-Maximization Algorithm for clustering.

Task assignment

  1. Implement the cross-validation procedure for tuning the number $m$ of Gaussian components in the GMM estimated by the EM algorithm (a possible skeleton is sketched after this list).
  2. Apply the cross-validation procedure to cluster unlabeled data made from the Brodatz textures [brodatz2_trn.mat], i.e., maximize the cross-validation estimate of the likelihood function (2) with respect to the number of components $m$.
  3. Validate the result of clustering by visual inspection.
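
A possible skeleton for task 1, built on the toolbox functions listed below. The signatures of crossval, emgmm and pdfgmm used here are assumptions inferred from their one-line descriptions, as is the content of brodatz2_trn.mat; consult the corresponding help texts before use.

\begin{verbatim}
% Select m maximizing the cross-validation estimate of the likelihood (2).
% Assumed interfaces (check 'help crossval/emgmm/pdfgmm'):
%   [itrn,itst] = crossval(num_data,num_folds)  ... index partitions
%   model = emgmm(X,options)                    ... options.ncomp = m
%   p = pdfgmm(X,model)                         ... values of P(x|theta)
load('brodatz2_trn.mat');              % assumed to provide trn.X (n x l)
X = trn.X;
num_folds = 5;
[itrn, itst] = crossval(size(X,2), num_folds);
best_m = 1; best_L = -inf;
for m = 1:10                           % candidate numbers of components
  L = 0;
  for f = 1:num_folds
    options.ncomp = m;
    model = emgmm(X(:,itrn{f}), options);         % fit on training folds
    L = L + sum(log(pdfgmm(X(:,itst{f}), model)));% test log-likelihood
  end
  if L > best_L, best_L = L; best_m = m; end
end
fprintf('selected number of components m = %d\n', best_m);
\end{verbatim}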

Useful functions

em_exp1 Example on using EM algorithm.
crossval Partitions data for cross-validation.
emgmm EM algorithm for the Gaussian Mixture Model.
pdfgmm Evaluates GMM distribution function.



Vojtech Franc
2004-08-31