Clustering is an unsupervised learning problem in which we try to identify groupings of similar data points, i.e. learn the structure of our data. Today I will introduce K-means, a popular and simple clustering algorithm. Our true motivation will be to use this as a gentle introduction to clustering and the expectation maximization (EM) algorithm. In subsequent posts we will expand on this foundation towards Gaussian mixtures, and finally into latent Dirichlet allocation (LDA).
1. K-means Clustering
Suppose we have $n$ observations $x_1, \dots, x_n$, each a $D$-dimensional vector. We want to partition them into $K$ clusters such that points within a cluster are close to one another and far from points in other clusters.
To indicate cluster assignment, we will define indicator variables $r_{nk}$ such that $r_{nk}=1$ if $x_n$ belongs to cluster $k$, and $r_{nj}=0$ for $j\neq k$.
The simple scheme we employ is based on the distortion, or sum of squared distances, which is a very natural objective function:
\[J = \sum_n\sum_k r_{nk}||x_n-\mu_k||^2\]Here $\mu_k$ is a $D$-dimensional vector representing the center of cluster $k$. We want to choose $r_{nk}$ and $\mu_k$ so as to minimize $J$.
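To make the objective concrete, here is a minimal NumPy sketch of the distortion $J$ (the array names `X`, `mu`, and `r` are my own placeholders for the data, centers, and one-hot assignments):

```python
import numpy as np

def distortion(X, mu, r):
    """Distortion J = sum_n sum_k r_nk ||x_n - mu_k||^2.

    X  : (n, D) data matrix
    mu : (K, D) cluster centers
    r  : (n, K) one-hot assignment matrix
    """
    # Squared distance from every point to every center, shape (n, K).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return float((r * sq_dists).sum())
```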
A simple iterative algorithm to do this is:
- Initialize $\mu_k$
- Minimize $J$ with respect to $r_{nk}$, holding $\mu_k$ fixed (E-step)
- Minimize $J$ with respect to $\mu_k$, holding $r_{nk}$ fixed (M-step)
- Repeat the two minimization steps until convergence
This is our first introduction to the expectation maximization (EM) algorithm!
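To make the alternating structure concrete, here is a rough NumPy sketch of the outer loop. The helpers `e_step` and `m_step` are placeholders that we fill in below, and initializing the centers by sampling $K$ data points is just one common choice:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize mu_k by picking K distinct data points at random (one common choice).
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        r = e_step(X, mu)       # minimize J with respect to r_nk (assignments)
        new_mu = m_step(X, r)   # minimize J with respect to mu_k (centers)
        if np.allclose(new_mu, mu):  # stop once the centers no longer move
            break
        mu = new_mu
    return mu, r
```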
1.1 E-step - $\frac{\partial J}{\partial r_{nk}}$
Minimizing $J$ with respect to $r_{nk}$ is easy: $J$ is linear in $r_{nk}$, and the terms involving different $n$ are independent, so we simply assign each point to its nearest center. We swiftly arrive at
\[r_{nk} = \begin{cases} 1, & \text{if }k=\text{arg min}_j ||x_n-\mu_j||^2 \\ 0, & \text{otherwise} \end{cases}\]Notice that recomputing the distance between every $x_n$ and every $\mu_k$ in each E-step is costly; it dominates the running time of the algorithm.
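In code, this assignment step is just an argmin over squared distances; a minimal sketch continuing the loop above (the one-hot encoding via `np.eye` is only for clarity):

```python
import numpy as np

def e_step(X, mu):
    """Assign each point to its nearest center: r_nk = 1 iff k = argmin_j ||x_n - mu_j||^2."""
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    nearest = sq_dists.argmin(axis=1)                                # (n,)
    return np.eye(mu.shape[0])[nearest]                              # one-hot (n, K) assignments
```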
1.2 M-step - $\frac{\partial J}{\partial \mu_k}$
$\frac{\partial J}{\partial \mu_k}$ is not much more difficult:
\[\frac{\partial J}{\partial \mu_k} = -2\sum_n r_{nk}(x_n-\mu_k) = 0\] \[\mu_k = \frac{\sum_n r_{nk}x_n}{\sum_n r_{nk}}\]which is simply the mean of all points assigned to cluster $k$, hence the name K-means.
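A matching sketch of the M-step, which recomputes each center as the mean of its currently assigned points (it assumes no cluster ends up empty; in practice an empty cluster is usually re-seeded):

```python
import numpy as np

def m_step(X, r):
    """Recompute each center as the mean of the points assigned to it."""
    counts = r.sum(axis=0)              # (K,) number of points in each cluster
    return (r.T @ X) / counts[:, None]  # (K, D) cluster means; assumes counts > 0
```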
2. Properties
Convergence can be decided in many ways: a threshold on the change in $J$, a maximum number of iterations, stopping once the assignments no longer change, etc. Notice that under EM, the objective function can never increase from one iteration to the next, so the algorithm is guaranteed to converge, though only to a local minimum rather than the global optimum. Also notice that the result depends on the initialization of $\mu_k$; with a random initialization, different runs can produce different clusterings.
An alternative to this batch formulation of k-means can be derived using stochastic approximation methods, yielding the sequential update equation:
\[\mu_k^{\text{new}} = \mu_k^{\text{old}} + \eta_n(x_n-\mu_k^{\text{old}})\]Here $\mu_k$ is the center nearest to the newly arrived point $x_n$, and $\eta_n$ is the learning rate parameter; its dependence on $n$ allows it to be annealed over time. This formulation lets us use k-means in online settings.
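A minimal sketch of this sequential variant, under my own assumptions that each incoming point only moves its nearest center and that $\eta_n = 1/\text{count}_k$ serves as the annealed learning rate (which makes each center the running mean of the points it has absorbed):

```python
import numpy as np

def online_kmeans_update(x, mu, counts):
    """Sequential k-means: move the nearest center towards the newly arrived point x."""
    k = ((mu - x) ** 2).sum(axis=1).argmin()  # index of the nearest center
    counts[k] += 1
    eta = 1.0 / counts[k]                     # annealed learning rate eta_n
    mu[k] += eta * (x - mu[k])                # mu_k^new = mu_k^old + eta_n (x_n - mu_k^old)
    return mu, counts
```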
3. Kernelization
So far, our formulation and interpretation of k-means is wholly reliant on squared Euclidean distance. This is not always the best metric for evaluating similarity of points. We can generalize with kernels $k(x_i, x_j) = \phi(x_i)^T\phi(x_j)$:
- E-step
\[r_{nk} = \begin{cases} 1, & \text{if }k=\text{arg min}_j ||\phi(x_n)-\mu_j||^2 \\ 0, & \text{otherwise} \end{cases}\]Notice that we can express $||\phi(x)-\mu_k||^2$ purely in terms of the kernel function, in yet another example of the kernel trick:
\[||\phi(x)-\mu_k||^2 = \phi(x)^T\phi(x) - 2\mu_k^T\phi(x) + \mu_k^T\mu_k\]Substituting the M-step solution $\mu_k = \frac{1}{y_k}\sum_n r_{nk}\phi(x_n)$, where $y_k = \sum_n r_{nk}$ is the number of points assigned to cluster $k$:
\[||\phi(x)-\mu_k||^2 = \frac{1}{y_k^2}\sum_{n,m}r_{nk}r_{mk}k(x_n, x_m) - \frac{2}{y_k}\sum_n r_{nk}k(x_n, x) + k(x, x)\]A code sketch of this kernelized assignment step follows the list.
- M-step
In the kernelized setting we never form $\mu_k$ explicitly, since $\phi$ may map into a very high-dimensional (or infinite-dimensional) feature space; the updated means enter only implicitly, through the kernel expression used in the E-step distances above.
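As promised above, here is a sketch of the kernelized E-step expressed purely through a precomputed Gram matrix (`K_mat[i, j] = k(x_i, x_j)`; the name avoids clashing with the number of clusters). The $k(x, x)$ term is constant across clusters for a fixed point, so it does not affect the argmin, but it is kept for completeness:

```python
import numpy as np

def kernel_e_step(K_mat, r):
    """Reassign points using feature-space distances computed from the Gram matrix.

    K_mat : (n, n) Gram matrix, K_mat[i, j] = k(x_i, x_j)
    r     : (n, K) current one-hot assignments
    """
    y = r.sum(axis=0)                                      # (K,) cluster sizes y_k
    term1 = np.diag(K_mat)[:, None]                        # k(x_i, x_i), constant over clusters
    term2 = 2.0 * (K_mat @ r) / y                          # (2 / y_k) sum_n r_nk k(x_n, x_i)
    term3 = np.einsum('nk,nm,mk->k', r, K_mat, r) / y**2   # (1 / y_k^2) sum_{n,m} r_nk r_mk k(x_n, x_m)
    sq_dists = term1 - term2 + term3                       # (n, K) squared feature-space distances
    nearest = sq_dists.argmin(axis=1)
    return np.eye(r.shape[1])[nearest]
```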