## Sklearn - KMeans

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares (see below). The number of clusters must be specified in advance. The algorithm scales well to large numbers of samples and has been used across a wide range of application areas in many different fields.


#### Classification(s)

Method-focused categories > Data-perspective > Geoinformation analysis

#### Detailed Description



The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean $\mu_j$ of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X, although they live in the same space.

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=1}^{N} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$
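As a rough check of this criterion, the sketch below recomputes the inertia by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. The synthetic data, the number of clusters, and the seeds are arbitrary assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data; sizes and seeds are arbitrary choices for illustration.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from every sample to every centroid: shape (300, 3).
sq_dist = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Each sample contributes its squared distance to the nearest centroid.
manual_inertia = sq_dist.min(axis=1).sum()

print(manual_inertia, km.inertia_)
```

The two printed values should agree up to floating-point rounding.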

Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:

1. Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.
2. Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.
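The suggestion to reduce dimensionality before clustering can be sketched as a small pipeline. The digits dataset, the choice of 10 principal components, and the 10 clusters are assumptions made for illustration, not recommendations from the text above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Handwritten digits: 1797 samples with 64 features each.
X, _ = load_digits(return_X_y=True)

# Project onto 10 principal components, then cluster in the reduced space.
pipeline = make_pipeline(
    PCA(n_components=10, random_state=0),
    KMeans(n_clusters=10, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

print(labels.shape)
```

Distances are then computed in 10 dimensions instead of 64, which is where the speed-up comes from.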

K-means is often referred to as Lloyd’s algorithm. In basic terms, the algorithm has three steps. The first step chooses the initial centroids, with the most basic method being to choose k samples from the dataset X. After initialization, K-means loops between the two other steps. The assignment step assigns each sample to its nearest centroid. The update step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is then computed, and the algorithm repeats these last two steps until this value is less than a threshold. In other words, it repeats until the centroids do not move significantly.
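The three steps can be written out in a few lines of NumPy. This is a minimal sketch for illustration, not scikit-learn's implementation: the function name `lloyd_kmeans`, its parameters, and the handling of empty clusters (the old centroid is kept) are all assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, tol=1e-4, max_iter=300, seed=0):
    """Hypothetical helper: plain Lloyd iteration, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples from X as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iter):
        # Assignment step: label each sample with its nearest centroid.
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)

        # Update step: move each centroid to the mean of its assigned samples.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        # Stop once the centroids no longer move significantly.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Labels come from the last assignment step, made just before the final update.
    return centroids, labels

# Two well-separated synthetic blobs (arbitrary illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
print(centroids)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds careful initialization, multiple restarts, and a parallel implementation.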

K-means is equivalent to the expectation-maximization algorithm with a small, all-equal, diagonal covariance matrix.

The algorithm can also be understood through the concept of Voronoi diagrams. First the Voronoi diagram of the points is calculated using the current centroids. Each segment in the Voronoi diagram becomes a separate cluster. Secondly, the centroids are updated to the mean of each segment. The algorithm then repeats this until a stopping criterion is fulfilled. Usually, the algorithm stops when the relative decrease in the objective function between iterations is less than the given tolerance value. This is not the case in this implementation: iteration stops when centroids move less than the tolerance.

Given enough time, K-means will always converge; however, this may be to a local minimum. The result is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids. One method to help address this issue is the k-means++ initialization scheme, which has been implemented in scikit-learn (use the init='k-means++' parameter). This initializes the centroids to be (generally) distant from each other, leading to provably better results than random initialization, as shown in the reference.
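A small sketch of the two initialization options, each combined with several restarts through `n_init`. The dataset and seeds are arbitrary assumptions, and on data this easy both variants may well reach the same solution.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Arbitrary synthetic data for illustration.
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# n_init=10 runs the algorithm from 10 different initializations
# and keeps the run with the lowest inertia.
km_pp = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X)

print("k-means++ inertia:", km_pp.inertia_, "iterations:", km_pp.n_iter_)
print("random    inertia:", km_rand.inertia_, "iterations:", km_rand.n_iter_)
```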

K-means++ can also be called independently to select seeds for other clustering algorithms; see sklearn.cluster.kmeans_plusplus for details and example usage.
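A minimal sketch of calling the seeding routine on its own, assuming a scikit-learn version that provides `kmeans_plusplus`. The data and the number of seeds are illustrative choices.

```python
from sklearn.cluster import kmeans_plusplus
from sklearn.datasets import make_blobs

# Arbitrary synthetic data for illustration.
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Returns the chosen seed points and their row indices in X.
centers, indices = kmeans_plusplus(X, n_clusters=4, random_state=0)

print(centers)
print(indices)
```

The returned `centers` are rows of `X`, so they can be passed as starting points to another clustering algorithm.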

The algorithm supports sample weights, which can be given by the parameter sample_weight. This makes it possible to assign more weight to some samples when computing cluster centers and values of inertia. For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset X.
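The duplicate-sample equivalence can be checked on a toy example. The four one-dimensional points and the fixed initial centroids below are assumptions chosen so that both fits start from the same state and are directly comparable.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [1.0], [10.0], [11.0]])
init = np.array([[0.0], [10.0]])  # fixed starting centroids for determinism

# Fit with the first sample weighted twice as heavily as the others.
weights = np.array([2.0, 1.0, 1.0, 1.0])
km_weighted = KMeans(n_clusters=2, init=init, n_init=1).fit(X, sample_weight=weights)

# Fit on a dataset in which the first sample literally appears twice.
X_dup = np.vstack([X[:1], X])
km_dup = KMeans(n_clusters=2, init=init, n_init=1).fit(X_dup)

print(km_weighted.cluster_centers_.ravel())
print(km_dup.cluster_centers_.ravel())
```

Both fits place the first centroid at the weighted mean (2·0 + 1) / 3 = 1/3 and the second at 10.5.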

K-means can be used for vector quantization. This is achieved using the transform method of a trained model of KMeans.
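A sketch of what this looks like in code: `transform` maps each sample to its distances from the learned centroids, and the nearest centroid serves as the code word. The data and the codebook size of 4 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Arbitrary synthetic data for illustration.
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Distance from each sample to each centroid: shape (200, 4).
distances = km.transform(X)

# Index of the nearest centroid, i.e. the code assigned to each sample.
codes = km.predict(X)

# Quantized data: every sample replaced by its code word (the centroid).
X_quantized = km.cluster_centers_[codes]

print(distances.shape, codes[:10], X_quantized.shape)
```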

# 1. Low-level parallelism

KMeans benefits from OpenMP-based parallelism through Cython. Small chunks of data (256 samples) are processed in parallel, which in addition yields a low memory footprint. For more details on how to control the number of threads, refer to the scikit-learn Parallelism notes.
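One way to cap the number of OpenMP threads from within Python is the `threadpoolctl` package, which scikit-learn depends on. Using it here is an assumption of this sketch and is not described in the text above. Setting the `OMP_NUM_THREADS` environment variable before starting Python is a common alternative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits

# Arbitrary synthetic data for illustration.
X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)

# Limit OpenMP-backed code to at most 2 threads inside this block.
with threadpool_limits(limits=2, user_api="openmp"):
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)
```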

Examples:

1. Demonstration of k-means assumptions: Demonstrating when k-means performs intuitively and when it does not
2. A demo of K-Means clustering on the handwritten digits data: Clustering handwritten digits

References:

“k-means++: The advantages of careful seeding” Arthur, David, and Sergei Vassilvitskii, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics (2007)

# 2. Mini Batch K-Means

MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than those of the standard algorithm.

The algorithm iterates between two major steps, similar to vanilla k-means. In the first step, b samples are drawn randomly from the dataset, to form a mini-batch. These are then assigned to the nearest centroid. In the second step, the centroids are updated. In contrast to k-means, this is done on a per-sample basis. For each sample in the mini-batch, the assigned centroid is updated by taking the streaming average of the sample and all previous samples assigned to that centroid. This has the effect of decreasing the rate of change for a centroid over time. These steps are performed until convergence or a predetermined number of iterations is reached.
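The per-sample streaming update can be sketched in NumPy as follows. This is an illustrative sketch and not scikit-learn's implementation: the function name `minibatch_kmeans`, its parameters, and the fixed iteration count used in place of a convergence test are assumptions.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=32, n_iter=100, seed=0):
    """Hypothetical helper: mini-batch k-means with per-sample streaming updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of samples assigned to each centroid so far

    for _ in range(n_iter):
        # Step 1: draw a mini-batch and assign its samples to the nearest centroid.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        sq_dist = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = sq_dist.argmin(axis=1)

        # Step 2: update the assigned centroids one sample at a time.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centroid learning rate, shrinks over time
            centroids[j] = (1.0 - eta) * centroids[j] + eta * x  # streaming average
    return centroids

# Two well-separated synthetic blobs (arbitrary illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(10, 0.5, (100, 2))])
centroids = minibatch_kmeans(X, k=2)
print(centroids)
```

Because `eta` is the reciprocal of the running count, each centroid equals the mean of all samples assigned to it so far, which is why its rate of change decreases over time.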

MiniBatchKMeans converges faster than KMeans, but the quality of the results is reduced. In practice this difference in quality can be quite small, as shown in the example and cited reference.
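A quick way to see the trade-off is to fit both estimators on the same data and compare their inertia. The dataset, batch size, and seeds below are arbitrary assumptions, and the size of the gap will vary from dataset to dataset.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Arbitrary synthetic data for illustration.
X, _ = make_blobs(n_samples=5000, centers=3, cluster_std=0.7, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(
    n_clusters=3, batch_size=256, n_init=10, random_state=0
).fit(X)

print(f"KMeans inertia:          {km.inertia_:.1f}")
print(f"MiniBatchKMeans inertia: {mbk.inertia_:.1f}")
```

Timing the two `fit` calls (for example with `time.perf_counter`) shows the speed side of the comparison on larger datasets.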

Examples:

1. Comparison of the K-Means and MiniBatchKMeans clustering algorithms: Comparison of KMeans and MiniBatchKMeans
2. Clustering text documents using k-means: Document clustering using sparse MiniBatchKMeans
3. Online learning of a dictionary of parts of faces

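References:

“Web Scale K-Means clustering” D. Sculley, Proceedings of the 19th international conference on World wide web (2010)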

#### How to Cite

Jie Song (2021). Sklearn - KMeans, Model Item, OpenGMS, https://geomodeling.njnu.edu.cn/modelItem/c2669bbe-13fe-41c9-b8fc-4b4c2976a97f

Last modified by Jie Song on 2021-01-12.