Standard algorithm (naïve k-means)
The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called "the k-means algorithm"; it is also referred to as Lloyd's algorithm, particularly in the computer science community. It is sometimes also referred to as "naive k-means", because there exist much faster alternatives.
Given an initial set of k means $m_1^{(1)},\dots,m_k^{(1)}$ (see below), the algorithm proceeds by alternating between two steps:
- Assignment step: Assign each observation to the cluster with the nearest mean: that with the least squared Euclidean distance. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.)
- $S_i^{(t)} = \big\{\, x_p : \big\| x_p - m_i^{(t)} \big\|^2 \le \big\| x_p - m_j^{(t)} \big\|^2 \ \forall j,\ 1 \le j \le k \,\big\}$, where each $x_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or more of them.
- Update step: Recalculate means (centroids) for observations assigned to each cluster.
- $m_i^{(t+1)} = \frac{1}{\left|S_i^{(t)}\right|} \sum_{x_j \in S_i^{(t)}} x_j$

The algorithm has converged when the assignments no longer change; it is not guaranteed to find the optimum.
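For concreteness, here is a minimal sketch of these two steps in Python with NumPy; the function name `lloyd_kmeans` and its interface are illustrative assumptions, not the API of any particular library:

```python
import numpy as np

def lloyd_kmeans(X, init_means, max_iter=100):
    """Naive k-means (Lloyd's algorithm) on an (n, d) data matrix X,
    starting from a (k, d) array of initial means. Illustrative sketch."""
    means = np.array(init_means, dtype=float)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to the nearest mean
        # (least squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        # Converged when the assignments no longer change.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Update step: recompute each mean as the centroid of its cluster.
        for i in range(len(means)):
            members = X[assignments == i]
            if len(members) > 0:  # an empty cluster keeps its old mean
                means[i] = members.mean(axis=0)
    return means, assignments
```

Ties in the assignment step are broken toward the lowest cluster index, which matches the requirement above that each $x_p$ be assigned to exactly one cluster.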
The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a distance function other than (squared) Euclidean distance may prevent the algorithm from converging. Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow the use of other distance measures.
Initialization methods
Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al., the Random Partition method is generally preferable for algorithms such as the k-harmonic means and fuzzy k-means. For expectation maximization and standard k-means algorithms, the Forgy method of initialization is preferable. A comprehensive study by Celebi et al., however, found that popular initialization methods such as Forgy, Random Partition, and Maximin often perform poorly, whereas Bradley and Fayyad's approach performs "consistently" in "the best group" and k-means++ performs "generally well".
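As a hedged illustration, the two methods might look as follows in Python (the helper names are ours, not standard identifiers; `rng` is a NumPy random generator):

```python
import numpy as np

def forgy_init(X, k, rng):
    """Forgy: randomly choose k distinct observations as the initial means."""
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].astype(float)

def random_partition_init(X, k, rng):
    """Random Partition: assign each observation to a random cluster, then
    use each cluster's centroid as its initial mean. Assumes every cluster
    receives at least one point, which is typical when n >> k."""
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```

Because Random Partition averages a random subset of the whole data set for each cluster, its initial means tend to fall near the overall centroid, consistent with the behavior described above.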
Demonstration of the standard algorithm:
1. k initial "means" (in this case k = 3) are randomly generated within the data domain (shown in color).
2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
The algorithm does not guarantee convergence to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually fast, it is common to run it multiple times with different starting conditions. However, worst-case performance can be slow: certain point sets, even in two dimensions, take exponentially many iterations, that is $2^{\Omega(n)}$, to converge. These point sets do not seem to arise in practice; this is corroborated by the fact that the smoothed running time of k-means is polynomial.
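A sketch of this restart strategy, reusing the hypothetical `lloyd_kmeans` and `forgy_init` helpers from above and scoring each run by its within-cluster sum of squares:

```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run Lloyd's algorithm from several Forgy initializations and keep
    the clustering with the smallest within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    best_cost, best = np.inf, None
    for _ in range(n_restarts):
        means, labels = lloyd_kmeans(X, forgy_init(X, k, rng))
        # Within-cluster sum of squares: each point against its own mean.
        cost = ((X - means[labels]) ** 2).sum()
        if cost < best_cost:
            best_cost, best = cost, (means, labels)
    return best
```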
The "assignment" step is referred to as the "expectation step", while the "update step" is a maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.
Complexity
Finding the optimal solution to the k-means clustering problem for observations in d dimensions is:
- NP-hard in general Euclidean space (of d dimensions) even for two clusters,
- NP-hard for a general number of clusters k even in the plane,
- if k and d (the dimension) are fixed, the problem can be exactly solved in time $O(n^{dk+1})$, where n is the number of entities to be clustered.
Thus, a variety of heuristic algorithms such as Lloyd's algorithm given above are generally used.
The running time of Lloyd's algorithm (and most variants) is $O(nkdi)$, where:
- n is the number of d-dimensional vectors (to be clustered),
- k is the number of clusters,
- i is the number of iterations needed until convergence.
On data that does have a clustering structure, the number of iterations until convergence is often small, and results only improve slightly after the first dozen iterations. Lloyd's algorithm is therefore often considered to be of "linear" complexity in practice, although it is in the worst case superpolynomial when performed until convergence.
- In the worst case, Lloyd's algorithm needs $i = 2^{\Omega(\sqrt{n})}$ iterations, so the worst-case complexity of Lloyd's algorithm is superpolynomial.
- Lloyd's k-means algorithm has polynomial smoothed running time. It is shown[14] that for an arbitrary set of n points in $[0,1]^d$, if each point is independently perturbed by a normal distribution with mean 0 and variance $\sigma^2$, then the expected running time of the k-means algorithm is bounded by $O(n^{34}k^{34}d^{8}\log^{4}(n)/\sigma^{6})$, which is polynomial in n, k, d and $1/\sigma$.
- Better bounds have been proven for simple cases. For example, the running time of the k-means algorithm is bounded by $O(dn^{4}M^{2})$ for n points in an integer lattice $\{1,\dots,M\}^{d}$.
Lloyd's algorithm is the standard approach for this problem. However, it spends a lot of processing time computing the distances between each of the k cluster centers and the n data points. Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naive implementation very inefficient. Some implementations use caching and the triangle inequality in order to create bounds and accelerate Lloyd's algorithm.
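One simple form of this pruning (a sketch of the underlying idea, not a full accelerated variant such as Elkan's algorithm) exploits the triangle inequality: if $\|m_i - m_j\| \ge 2\,\|x - m_i\|$ for a point $x$ currently assigned to center $m_i$, then $m_j$ cannot be closer to $x$, so the distance $\|x - m_j\|$ never has to be computed. The function below is an illustrative assumption, not library code:

```python
import numpy as np

def assign_with_pruning(X, means, assignments):
    """Assignment step that skips candidate centers via the triangle
    inequality: d(x, m_j) >= d(m_i, m_j) - d(x, m_i) >= d(x, m_i)
    whenever d(m_i, m_j) >= 2 * d(x, m_i)."""
    # Pairwise distances between the k centers, computed once per pass.
    center_dists = np.sqrt(((means[:, None, :] - means[None, :, :]) ** 2).sum(axis=2))
    new_assignments = assignments.copy()
    for p, x in enumerate(X):
        i = assignments[p]  # current best center for this point
        d_best = np.sqrt(((x - means[i]) ** 2).sum())
        for j in range(len(means)):
            if j == i or center_dists[i, j] >= 2 * d_best:
                continue  # pruned: m_j is provably no closer than m_i
            d_j = np.sqrt(((x - means[j]) ** 2).sum())
            if d_j < d_best:
                d_best, i = d_j, j
        new_assignments[p] = i
    return new_assignments
```

Points that have settled close to their center after a few iterations prune nearly all candidate centers, which is where the savings described above come from.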