Multivariate Gaussian distribution clustering with Expectation Maximization in Python

Expectation Maximization (EM) is a classical algorithm in ML for data clustering. Its theoretical background is based on Bayes theorem and although it’s quite straightforward to follow it might be a bit confusing to implement it in more than 1 dimension.

If you want to understand more about EM I recommend you to check out this and this videos.

EM in a nutshell:

  1. Randomly initialize a mean and variance per class.
  2. Calculate the probability of each data point to belong to those distributions.
  3. Calculate the mean and variance based on those probabilities.
  4. Go to 2. and repeat until convergence.

In the Figure above we can see the initial clusters (top-left), and how the classification is changing in different iterations.

This method considerably relies on the mean and variance initialization. An unlucky initialization will provide with a deficient segmentation.

The code is provided in the Source code section.

Juan Miguel Valverde

"The only way to proof that you understand something is by programming it"

Leave a Reply