Standard algorithm (naïve k-means)
The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called "the k-means algorithm"; it is also referred to as Lloyd's algorithm, particularly in the computer science community. It is sometimes also referred to as "naive k-means", because there exist much faster alternatives.
Given an initial set of k means $m_1^{(1)},\dots,m_k^{(1)}$ (see below), the algorithm proceeds by alternating between two steps:
- Assignment step: Assign each observation to the cluster with the nearest mean: that with the least squared Euclidean distance. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.)
- $S_i^{(t)} = \big\{\, x_p : \big\| x_p - m_i^{(t)} \big\|^2 \le \big\| x_p - m_j^{(t)} \big\|^2 \ \forall j,\ 1 \le j \le k \,\big\}$, where each $x_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or more of them.
- Update step: Recalculate means (centroids) for observations assigned to each cluster.
- $m_i^{(t+1)} = \frac{1}{\left|S_i^{(t)}\right|} \sum_{x_j \in S_i^{(t)}} x_j$

The algorithm has converged when the assignments no longer change; it is not guaranteed to find the optimum.
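For concreteness, here is a minimal sketch of these two steps in Python with NumPy; the function name `lloyd_kmeans` and its interface are illustrative assumptions, not the API of any particular library:

```python
import numpy as np

def lloyd_kmeans(X, init_means, max_iter=100):
    """Naive k-means (Lloyd's algorithm) on an (n, d) data matrix X,
    starting from a (k, d) array of initial means. Illustrative sketch."""
    means = np.array(init_means, dtype=float)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to the nearest mean
        # (least squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assignments = dists.argmin(axis=1)
        # Converged when the assignments no longer change.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Update step: recompute each mean as the centroid of its cluster.
        for i in range(len(means)):
            members = X[assignments == i]
            if len(members) > 0:  # an empty cluster keeps its old mean
                means[i] = members.mean(axis=0)
    return means, assignments
```

Ties in the assignment step are broken toward the lowest cluster index, which matches the requirement above that each $x_p$ be assigned to exactly one cluster.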
The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a distance function other than (squared) Euclidean distance may prevent the algorithm from converging. Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow the use of other distance measures.
Initialization methods
Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al., the Random Partition method is generally preferable for algorithms such as the k-harmonic means and fuzzy k-means. For expectation maximization and standard k-means algorithms, the Forgy method of initialization is preferable. A comprehensive study by Celebi et al., however, found that popular initialization methods such as Forgy, Random Partition, and Maximin often perform poorly, whereas Bradley and Fayyad's approach performs "consistently" in "the best group" and k-means++ performs "generally well".
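As a hedged illustration, the two methods might look as follows in Python (the helper names are ours, not standard identifiers; `rng` is a NumPy random generator):

```python
import numpy as np

def forgy_init(X, k, rng):
    """Forgy: randomly choose k distinct observations as the initial means."""
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].astype(float)

def random_partition_init(X, k, rng):
    """Random Partition: assign each observation to a random cluster, then
    use each cluster's centroid as its initial mean. Assumes every cluster
    receives at least one point, which is typical when n >> k."""
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```

Because Random Partition averages a random subset of the whole data set for each cluster, its initial means tend to fall near the overall centroid, consistent with the behavior described above.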
Demonstration of the standard algorithm:
1. k initial "means" (in this case k = 3) are randomly generated within the data domain (shown in color).
2. k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
The algorithm does not guarantee convergence to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually fast, it is common to run it multiple times with different starting conditions. However, worst-case performance can be slow: certain point sets, even in two dimensions, take exponentially many iterations, that is $2^{\Omega(n)}$, to converge. These point sets do not seem to arise in practice; this is corroborated by the fact that the smoothed running time of k-means is polynomial.
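A sketch of this restart strategy, reusing the hypothetical `lloyd_kmeans` and `forgy_init` helpers from above and scoring each run by its within-cluster sum of squares:

```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run Lloyd's algorithm from several Forgy initializations and keep
    the clustering with the smallest within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    best_cost, best = np.inf, None
    for _ in range(n_restarts):
        means, labels = lloyd_kmeans(X, forgy_init(X, k, rng))
        # Within-cluster sum of squares: each point against its own mean.
        cost = ((X - means[labels]) ** 2).sum()
        if cost < best_cost:
            best_cost, best = cost, (means, labels)
    return best
```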
The "assignment" step is referred to as the "expectation step", while the "update step" is a maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.
Complexity
Finding the optimal solution to the k-means clustering problem for observations in d dimensions is:
- NP-hard in general Euclidean space (of d dimensions) even for two clusters,
- NP-hard for a general number of clusters k even in the plane,
- if k and d (the dimension) are fixed, the problem can be exactly solved in time $O(n^{dk+1})$, where n is the number of entities to be clustered.
Thus, a variety of heuristic algorithms such as Lloyd's algorithm given above are generally used.
The running time of Lloyd's algorithm (and most variants) is $O(nkdi)$, where:
- n is the number of d-dimensional vectors (to be clustered),
- k is the number of clusters,
- i is the number of iterations needed until convergence.
On data that does have a clustering structure, the number of iterations until convergence is often small, and results only improve slightly after the first dozen iterations. Lloyd's algorithm is therefore often considered to be of "linear" complexity in practice, although it is in the worst case superpolynomial when performed until convergence.
- In the worst case, Lloyd's algorithm needs $i = 2^{\Omega(\sqrt{n})}$ iterations, so the worst-case complexity of Lloyd's algorithm is superpolynomial.
- Lloyd's k-means algorithm has polynomial smoothed running time. It is shown[14] that for an arbitrary set of n points in $[0,1]^d$, if each point is independently perturbed by a normal distribution with mean 0 and variance $\sigma^2$, then the expected running time of the k-means algorithm is bounded by $O(n^{34}k^{34}d^{8}\log^{4}(n)/\sigma^{6})$, which is polynomial in n, k, d and $1/\sigma$.
- Better bounds have been proven for simple cases. For example, the running time of the k-means algorithm is bounded by $O(dn^{4}M^{2})$ for n points in an integer lattice $\{1,\dots,M\}^{d}$.
Lloyd's algorithm is the standard approach for this problem. However, it spends a lot of processing time computing the distances between each of the k cluster centers and the n data points. Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naive implementation very inefficient. Some implementations use caching and the triangle inequality in order to create bounds and accelerate Lloyd's algorithm.
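One simple form of this pruning (a sketch of the underlying idea, not a full accelerated variant such as Elkan's algorithm) exploits the triangle inequality: if $\|m_i - m_j\| \ge 2\,\|x - m_i\|$ for a point $x$ currently assigned to center $m_i$, then $m_j$ cannot be closer to $x$, so the distance $\|x - m_j\|$ never has to be computed. The function below is an illustrative assumption, not library code:

```python
import numpy as np

def assign_with_pruning(X, means, assignments):
    """Assignment step that skips candidate centers via the triangle
    inequality: d(x, m_j) >= d(m_i, m_j) - d(x, m_i) >= d(x, m_i)
    whenever d(m_i, m_j) >= 2 * d(x, m_i)."""
    # Pairwise distances between the k centers, computed once per pass.
    center_dists = np.sqrt(((means[:, None, :] - means[None, :, :]) ** 2).sum(axis=2))
    new_assignments = assignments.copy()
    for p, x in enumerate(X):
        i = assignments[p]  # current best center for this point
        d_best = np.sqrt(((x - means[i]) ** 2).sum())
        for j in range(len(means)):
            if j == i or center_dists[i, j] >= 2 * d_best:
                continue  # pruned: m_j is provably no closer than m_i
            d_j = np.sqrt(((x - means[j]) ** 2).sum())
            if d_j < d_best:
                d_best, i = d_j, j
        new_assignments[p] = i
    return new_assignments
```

Points that have settled close to their center after a few iterations prune nearly all candidate centers, which is where the savings described above come from.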