k-means clustering

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.

Machine learningClustering



Initial contribute: 2020-12-17


Method-focused categoriesData-perspectiveIntelligent computation analysis

Detailed Description

English {{currentDetailLanguage}} English

Quoted from: https://en.wikipedia.org/wiki/K-means_clustering

Standard algorithm (naïve k-means)

The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called "the k-means algorithm"; it is also referred to as Lloyd's algorithm, particularly in the computer science community. It is sometimes also referred to as "naive k-means", because there exist much faster alternatives.

Given an initial set of k means m1(1),...,mk(1) (see below), the algorithm proceeds by alternating between two steps:

Assignment step: Assign each observation to the cluster with the nearest mean: that with the least squared Euclidean distance. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.)
where each  is assigned to exactly one , even if it could be assigned to two or more of them.
Update step: Recalculate means (centroids) for observations assigned to each cluster.

The algorithm has converged when the assignments no longer change. The algorithm is not guaranteed to find the optimum.

The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a different distance function other than (squared) Euclidean distance may prevent the algorithm from converging. Various modifications of k-means such as spherical k-means and k-medoids have been proposed to allow using other distance measures.

Initialization methods

Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al., the Random Partition method is generally preferable for algorithms such as the k-harmonic means and fuzzy k-means. For expectation maximization and standard k-means algorithms, the Forgy method of initialization is preferable. A comprehensive study by Celebi et al., however, found that popular initialization methods such as Forgy, Random Partition, and Maximin often perform poorly, whereas Bradley and Fayyad's approach performs "consistently" in "the best group" and k-means++ performs "generally well".

The algorithm does not guarantee convergence to the global optimum. The result may depend on the initial clusters. As the algorithm is usually fast, it is common to run it multiple times with different starting conditions. However, worst-case performance can be slow: in particular certain point sets, even in two dimensions, converge in exponential time, that is 2Ω(n). These point sets do not seem to arise in practice: this is corroborated by the fact that the smoothed running time of k-means is polynomial.

The "assignment" step is referred to as the "expectation step", while the "update step" is a maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.


Finding the optimal solution to the k-means clustering problem for observations in d dimensions is:

  • NP-hard in general Euclidean space (of d dimensions) even for two clusters,
  • NP-hard for a general number of clusters k even in the plane,
  • if k and d (the dimension) are fixed, the problem can be exactly solved in time , where n is the number of entities to be clustered.

Thus, a variety of heuristic algorithms such as Lloyd's algorithm given above are generally used.

The running time of Lloyd's algorithm (and most variants) is , where:

  • n is the number of d-dimensional vectors (to be clustered)
  • k the number of clusters
  • i the number of iterations needed until convergence.

On data that does have a clustering structure, the number of iterations until convergence is often small, and results only improve slightly after the first dozen iterations. Lloyd's algorithm is therefore often considered to be of "linear" complexity in practice, although it is in the worst case superpolynomial when performed until convergence.

  • In the worst-case, Lloyd's algorithm needs  iterations, so that the worst-case complexity of Lloyd's algorithm is superpolynomial.
  • Lloyd's k-means algorithm has polynomial smoothed running time. It is shown that[14] for arbitrary set of n points in , if each point is independently perturbed by a normal distribution with mean 0 and variance , then the expected running time of k-means algorithm is bounded by , which is a polynomial in nkd and .
  • Better bounds are proven for simple cases. For example, it is shown that the running time of k-means algorithm is bounded by  for n points in an integer lattice .

Lloyd's algorithm is the standard approach for this problem. However, it spends a lot of processing time computing the distances between each of the k cluster centers and the n data points. Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naive implementation very inefficient. Some implementations use caching and the triangle inequality in order to create bounds and accelerate Lloyd's algorithm.



Zhen Qian (2020). k-means clustering, Model Item, OpenGMS, https://geomodeling.njnu.edu.cn/modelItem/9f2e0f7c-a684-4b18-9326-6af76cb1eeb6


Initial contribute : 2020-12-17


QR Code


{{curRelation.author.join('; ')}}



Drop the file here, orclick to upload.
Select From My Space
+ add


Cancel Submit
{{htmlJSON.Cancel}} {{htmlJSON.Submit}}
{{htmlJSON.Localizations}} + {{htmlJSON.Add}}
{{ item.label }} {{ item.value }}
{{htmlJSON.Cancel}} {{htmlJSON.Submit}}
名称 别名 {{tag}} +
系列名 版本号 目的 修改内容 创建/修改日期 作者
摘要 详细描述
{{tag}} + 添加关键字
* 时间参考系
* 空间参考系类型 * 空间参考系名称

起始日期 终止日期 进展 开发者
* 是否开源 * 访问方式 * 使用方式 开源协议 * 传输方式 * 获取地址 * 发布日期 * 发布者

编号 目的 修改内容 创建/修改日期 作者

时间分辨率 时间尺度 时间步长 时间范围 空间维度 格网类型 空间分辨率 空间尺度 空间范围
{{tag}} +
* 类型

* 名称 * 描述
示例描述 * 名称 * 类型 * 值/链接 上传

{{htmlJSON.Cancel}} {{htmlJSON.Submit}}
Title Author Date Journal Volume(Issue) Pages Links Doi Operation
{{htmlJSON.Cancel}} {{htmlJSON.Submit}}
{{htmlJSON.Add}} {{htmlJSON.Cancel}}


Authors:  {{articleUploading.authors[0]}}, {{articleUploading.authors[1]}}, {{articleUploading.authors[2]}}, et al.

Journal:   {{articleUploading.journal}}

Date:   {{articleUploading.date}}

Page range:   {{articleUploading.pageRange}}

Link:   {{articleUploading.link}}

DOI:   {{articleUploading.doi}}

Yes, this is it Cancel

The article {{articleUploading.title}} has been uploaded yet.

{{htmlJSON.Cancel}} {{htmlJSON.Confirm}}