Generative Gaussian mixtures

Generative Gaussian mixtures is an inductive algorithm for semi-supervised clustering. Let's suppose we have a labeled dataset (Xl, Yl) containing N samples (drawn from pdata) and an unlabeled dataset Xu containing M >> N samples (drawn from the marginal distribution p(x)). It's not strictly necessary that M >> N, but we want to create a realistic semi-supervised scenario, with only a few labeled samples. Moreover, we are assuming that all unlabeled samples are consistent with pdata. This can seem like a vicious cycle, but without this assumption the procedure does not have a strong mathematical foundation. Our goal is to determine a complete p(x|y) distribution using a generative model. In general, it's possible to use different priors, but here we employ multivariate Gaussians to model our data:
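
Denoting with μi and Σi the mean and covariance matrix of the Gaussian associated with the ith class (a standard parametrization, with one Gaussian per class), the class conditionals can be written as:

\[ p(x \mid y = i, \theta) = \mathcal{N}(x; \mu_i, \Sigma_i) \]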

Thus, our model parameters are means and covariance matrices for all Gaussians. In other contexts, it's possible to use binomial or multinomial distributions. However, the procedure doesn't change; therefore, let's assume that it's possible to approximate p(x|y) with a parametrized distribution p(x|y, θ). We can achieve this goal by minimizing the Kullback-Leibler divergence between the two distributions:
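
Using the standard definition of the divergence (written here for the continuous case), the quantity to minimize is:

\[ D_{KL}\big(p_{data}(x \mid y) \,\|\, p(x \mid y, \theta)\big) = \int p_{data}(x \mid y) \log \frac{p_{data}(x \mid y)}{p(x \mid y, \theta)} \, dx \]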

In Chapter 5, EM Algorithm and Applications, we are going to show that this is equivalent to maximizing the likelihood of the dataset. To obtain the likelihood, it's necessary to define the number of expected Gaussians (which is known from the labeled samples) and a weight vector that represents the marginal probability of a specific Gaussian:
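
For example, calling k the number of Gaussians, the weight vector can be defined as w = (w1, w2, ..., wk) with wi = p(y = i), so that the marginal density becomes a mixture:

\[ p(x \mid \theta, w) = \sum_{i=1}^{k} w_i\, \mathcal{N}(x; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{k} w_i = 1 \]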

Using Bayes' theorem, we get:
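
For the ith Gaussian, the posterior takes the familiar form (the denominator is just the normalizing term p(xj)):

\[ p(y = i \mid x_j, \theta, w) = \frac{w_i\, \mathcal{N}(x_j; \mu_i, \Sigma_i)}{\sum_{t=1}^{k} w_t\, \mathcal{N}(x_j; \mu_t, \Sigma_t)} \propto w_i\, \mathcal{N}(x_j; \mu_i, \Sigma_i) \]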

As we are working with both labeled and unlabeled samples, the previous expression has a double interpretation:

  • For unlabeled samples, it is computed by multiplying the ith Gaussian weight by the probability of xj under the ith Gaussian distribution (and normalizing over all the Gaussians).
  • For labeled samples, it can be represented by a vector p = [0, 0, ..., 1, ..., 0, 0], where 1 is the ith element (the true class). In this way, we force our model to trust the labeled samples while searching for the parameter values that maximize the likelihood on the whole dataset.

With this distinction, we can consider a single log-likelihood function where the term fw(yi|xj) acts as a per-sample weight (the posterior probability for unlabeled samples and the one-hot indicator for labeled ones):
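
A sketch of this function, in the standard EM form (the outer sum runs over the whole dataset, labeled and unlabeled samples alike), is:

\[ \log L(\theta, w \mid X) = \sum_{j} \sum_{i=1}^{k} f_w(y_i \mid x_j) \log\big( w_i\, \mathcal{N}(x_j; \mu_i, \Sigma_i) \big) \]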

It's possible to maximize the log-likelihood using the EM algorithm (see Chapter 5, EM Algorithm and Applications). In this context, we provide the steps directly:

  • p(yi|xj,θ,w) is computed according to the previously explained method
  • The parameters of the Gaussians are updated using these rules:
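
A standard set of update rules consistent with this scheme (the sums run over the whole dataset, and fw(yi|xj) is the per-sample weight defined above) is:

\[ w_i = \frac{1}{N} \sum_{j} f_w(y_i \mid x_j) \]

\[ \mu_i = \frac{\sum_{j} f_w(y_i \mid x_j)\, x_j}{\sum_{j} f_w(y_i \mid x_j)} \qquad \Sigma_i = \frac{\sum_{j} f_w(y_i \mid x_j)\,(x_j - \mu_i)(x_j - \mu_i)^T}{\sum_{j} f_w(y_i \mid x_j)} \]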

N is the total number of samples. The procedure must be iterated until the parameters stop changing, or until the changes fall below a fixed threshold.
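
To make the whole procedure concrete, the following is a minimal NumPy/SciPy sketch of the loop described above. It assumes that Yl contains integer class indices, that the number of Gaussians equals the number of classes, and that labeled samples receive one-hot weights; the function and variable names are illustrative and not taken from any specific library.

import numpy as np
from scipy.stats import multivariate_normal

def fit_semi_supervised_gmm(X_l, Y_l, X_u, n_components, n_iter=100, tol=1e-4):
    # Stack labeled and unlabeled samples; the first N_l rows are labeled
    X = np.vstack([X_l, X_u])
    N, d = X.shape
    N_l = X_l.shape[0]

    # Initialization: uniform weights, random means, identity covariances
    rng = np.random.default_rng(1000)
    w = np.full(n_components, 1.0 / n_components)
    mu = X[rng.choice(N, size=n_components, replace=False)].copy()
    sigma = np.array([np.eye(d) for _ in range(n_components)])

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: weighted densities w_i * N(x_j; mu_i, sigma_i) for every sample
        pdf = np.column_stack([
            w[i] * multivariate_normal.pdf(X, mean=mu[i], cov=sigma[i])
            for i in range(n_components)
        ])
        # Posterior p(y=i | x_j, theta, w) for the unlabeled samples
        f_w = pdf / pdf.sum(axis=1, keepdims=True)
        # Labeled samples: force one-hot weights (the model must trust the labels)
        f_w[:N_l] = 0.0
        f_w[np.arange(N_l), Y_l] = 1.0

        # M-step: update weights, means, and covariance matrices
        Nk = f_w.sum(axis=0)
        w = Nk / N
        mu = (f_w.T @ X) / Nk[:, None]
        for i in range(n_components):
            diff = X - mu[i]
            sigma[i] = (f_w[:, i, None] * diff).T @ diff / Nk[i]
            sigma[i] += 1e-6 * np.eye(d)  # small regularization for invertibility

        # Stop when the log-likelihood change falls below the threshold
        ll = np.log(pdf.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return w, mu, sigma

The one-hot assignment of the labeled rows is what distinguishes this loop from a purely unsupervised Gaussian mixture: those samples always contribute their full weight to the Gaussian of their own class, pulling the corresponding means and covariances toward the labeled evidence.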