Sklearn - KMeans

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a wide range of application areas in many different fields.

SklearnClusteringKMeans

Contributor(s)

Initial contribution: 2021-01-07

Classification(s)

Method-focused categories

Data-perspective

Geoinformation analysis

Detailed Description

Quoted from: https://scikit-learn.org/stable/modules/clustering.html#k-means

The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μj of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X, although they live in the same space.

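The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=1}^{N} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$$

where x_i denotes the i-th sample in X.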
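As a rough check of this criterion, the sketch below recomputes the within-cluster sum-of-squares by hand and compares it with the inertia_ attribute of a fitted KMeans model. The synthetic data and all variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): three Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from every sample to every centroid, shape (150, 3).
sq_dist = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Each sample contributes its squared distance to the nearest centroid.
manual_inertia = sq_dist.min(axis=1).sum()

print(manual_inertia, km.inertia_)
```

The two printed values should agree up to floating-point error.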

Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:

Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.
Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.
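One way to apply the dimensionality-reduction suggestion is to chain PCA and KMeans in a pipeline. The following is a minimal sketch; the choice of the bundled digits dataset, 10 components and 10 clusters is an assumption for illustration, not a recommendation from the source.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# 1797 samples with 64 features each (8x8 pixel images).
X, _ = load_digits(return_X_y=True)

# Reduce to 10 dimensions first, then cluster in the reduced space.
pipeline = make_pipeline(
    PCA(n_components=10, random_state=0),
    KMeans(n_clusters=10, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# The centroids live in the 10-dimensional PCA space, not the original one.
centroids = pipeline.named_steps["kmeans"].cluster_centers_
print(labels.shape, centroids.shape)
```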

K-means is often referred to as Lloyd’s algorithm. In basic terms, the algorithm has three steps. The first step chooses the initial centroids, with the most basic method being to choose k samples from the dataset X. After initialization, K-means consists of looping between the two other steps. The first of these assigns each sample to its nearest centroid. The second creates new centroids by taking the mean value of all of the samples assigned to each previous centroid. The difference between the old and the new centroids is computed, and the algorithm repeats these last two steps until this value is less than a threshold. In other words, it repeats until the centroids do not move significantly.
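The three steps can be sketched in plain NumPy as follows. This is an illustrative toy version, not the scikit-learn implementation; the function name lloyd_kmeans, its parameters and the handling of empty clusters are all assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, init=None, tol=1e-4, max_iter=300, seed=0):
    """Toy Lloyd iteration; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: use the given centroids, or pick k distinct samples from X.
    if init is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centroids = np.array(init, dtype=float)
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid.
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids have (almost) stopped moving.
        shift = ((new_centroids - centroids) ** 2).sum()
        centroids = new_centroids
        if shift <= tol:
            break
    # Final assignment so the labels match the returned centroids.
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, sq_dist.argmin(axis=1)

# Two well-separated synthetic blobs of 50 samples each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
print(centroids)
```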

K-means is equivalent to the expectation-maximization algorithm with a small, all-equal, diagonal covariance matrix.

The algorithm can also be understood through the concept of Voronoi diagrams. First the Voronoi diagram of the points is calculated using the current centroids. Each segment in the Voronoi diagram becomes a separate cluster. Secondly, the centroids are updated to the mean of each segment. The algorithm then repeats this until a stopping criterion is fulfilled. Usually, the algorithm stops when the relative decrease in the objective function between iterations is less than the given tolerance value. This is not the case in this implementation: iteration stops when centroids move less than the tolerance.

Given enough time, K-means will always converge; however, this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids. One method to help address this issue is the k-means++ initialization scheme, which has been implemented in scikit-learn (use the init='k-means++' parameter). This initializes the centroids to be (generally) distant from each other, leading to provably better results than random initialization, as shown in the reference.
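A brief sketch of how the initialization scheme and the number of restarts can be set through the init and n_init parameters of KMeans. The blob layout is a made-up example, and the inertia of the single random start may or may not be worse on any given run.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Four well-separated synthetic blobs (assumed layout, for illustration).
blob_centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, y_true = make_blobs(n_samples=400, centers=blob_centers,
                       cluster_std=0.5, random_state=0)

# k-means++ seeding, repeated 10 times; the run with the lowest inertia is kept.
km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

# A single run from uniformly random seeds, for comparison.
km_rand = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)

print("k-means++ inertia:", km_pp.inertia_)
print("random init inertia:", km_rand.inertia_)
print("agreement with true blobs:", adjusted_rand_score(y_true, km_pp.labels_))
```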

K-means++ can also be called independently to select seeds for other clustering algorithms; see sklearn.cluster.kmeans_plusplus for details and example usage.
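As a small illustration, the seeds returned by kmeans_plusplus can be handed to another estimator. Using them as means_init for a GaussianMixture is an assumption chosen for this sketch, not something the source prescribes.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

# Returns the chosen seed points and their row indices in X.
seeds, indices = kmeans_plusplus(X, n_clusters=4, random_state=0)

# Use the seeds as starting means for a different clustering model.
gm = GaussianMixture(n_components=4, means_init=seeds, random_state=0).fit(X)
print(seeds)
print(gm.means_)
```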

The algorithm supports sample weights, which can be given by a parameter sample_weight. This allows assigning more weight to some samples when computing cluster centers and values of inertia. For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset X.
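The duplicate-sample equivalence can be checked on a tiny dataset. In this sketch the four one-dimensional points and the fixed initial centroids are assumptions, chosen so that both fits start from the same state.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [1.0], [10.0], [11.0]])
init = np.array([[0.0], [10.0]])  # fixed starting centroids for both fits

# Fit 1: give the sample at 1.0 a weight of 2.
weights = np.array([1.0, 2.0, 1.0, 1.0])
km_weighted = KMeans(n_clusters=2, init=init, n_init=1).fit(X, sample_weight=weights)

# Fit 2: duplicate the sample at 1.0 instead of weighting it.
X_dup = np.vstack([X, [[1.0]]])
km_dup = KMeans(n_clusters=2, init=init, n_init=1).fit(X_dup)

# Left cluster centre: (0 + 2 * 1) / 3 = 2/3 in both cases.
print(km_weighted.cluster_centers_.ravel())
print(km_dup.cluster_centers_.ravel())
```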

K-means can be used for vector quantization. This is achieved using the transform method of a trained KMeans model.
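A minimal sketch of this usage, on assumed synthetic data: transform returns the distance from each sample to every centroid, and replacing each sample by its nearest centroid yields the quantized version of the data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [0, 8]],
                  cluster_std=0.6, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance from each sample to each of the 3 centroids, shape (300, 3).
distances = km.transform(X)

# Code book index of each sample (index of the nearest centroid).
codes = km.predict(X)

# Quantized data: every sample is replaced by its code book vector.
X_quantized = km.cluster_centers_[codes]
print(distances.shape, X_quantized.shape)
```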

1. Low-level parallelism

KMeans benefits from OpenMP-based parallelism through Cython. Small chunks of data (256 samples) are processed in parallel, which in addition yields a low memory footprint. For more details on how to control the number of threads, refer to the scikit-learn Parallelism notes.
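One possible way to cap the number of OpenMP threads for a single fit is the threadpoolctl package, which is installed alongside scikit-learn. This is a sketch under that assumption; the limit of 2 threads is arbitrary, and setting the OMP_NUM_THREADS environment variable before starting Python is an alternative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# Restrict OpenMP-backed code to at most 2 threads inside this block only.
with threadpool_limits(limits=2, user_api="openmp"):
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_.shape)
```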

Examples:

Demonstration of k-means assumptions: Demonstrating when k-means performs intuitively and when it does not

A demo of K-Means clustering on the handwritten digits data: Clustering handwritten digits

References:

“k-means++: The advantages of careful seeding” Arthur, David, and Sergei Vassilvitskii, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics (2007)

2. Mini Batch K-Means

The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

The algorithm iterates between two major steps, similar to vanilla k-means. In the first step, b samples are drawn randomly from the dataset, to form a mini-batch. These are then assigned to the nearest centroid. In the second step, the centroids are updated. In contrast to k-means, this is done on a per-sample basis. For each sample in the mini-batch, the assigned centroid is updated by taking the streaming average of the sample and all previous samples assigned to that centroid. This has the effect of decreasing the rate of change for a centroid over time. These steps are performed until convergence or a predetermined number of iterations is reached.
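The two steps can be sketched in NumPy as below. This is a simplified illustration of the per-sample streaming-average update, not the scikit-learn implementation; the function name minibatch_kmeans, its parameters and the fixed iteration count (instead of a convergence test) are assumptions.

```python
import numpy as np

def minibatch_kmeans(X, k, init=None, batch_size=32, n_iter=100, seed=0):
    """Toy mini-batch k-means with per-sample streaming-average updates."""
    rng = np.random.default_rng(seed)
    if init is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        centroids = np.array(init, dtype=float)
    counts = np.zeros(k)  # number of samples seen so far by each centroid
    for _ in range(n_iter):
        # Step 1: draw a mini-batch and assign it to the nearest centroids.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        sq_dist = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = sq_dist.argmin(axis=1)
        # Step 2: update the assigned centroid once per sample.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as the count grows
            centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids

# Two well-separated synthetic blobs of 100 samples each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(10.0, 0.5, size=(100, 2))])
print(minibatch_kmeans(X, k=2))
```

Because the step size is the reciprocal of the per-centroid count, each centroid is the running mean of the samples assigned to it so far, which is why its rate of change decreases over time.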

MiniBatchKMeans converges faster than KMeans, but the quality of the results is reduced. In practice this difference in quality can be quite small, as shown in the example and cited reference.
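A short sketch for comparing the two estimators on the same data. The synthetic blobs, batch size and n_init values are assumptions for illustration, and the relative inertia will vary with the data.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

blob_centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=3000, centers=blob_centers,
                  cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=3,
                      random_state=0).fit(X)

print("KMeans inertia:", km.inertia_)
print("MiniBatchKMeans inertia:", mbk.inertia_)
# Label agreement between the two partitions (1.0 means identical clusters).
print("agreement:", adjusted_rand_score(km.labels_, mbk.labels_))
```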

Examples:

Comparison of the K-Means and MiniBatchKMeans clustering algorithms: Comparison of KMeans and MiniBatchKMeans

Clustering text documents using k-means: Document clustering using sparse MiniBatchKMeans

Online learning of a dictionary of parts of faces

References:

“Web Scale K-Means clustering” D. Sculley, Proceedings of the 19th international conference on World wide web (2010)

Jie Song (2021). Sklearn - KMeans, Model Item, OpenGMS, https://geomodeling.njnu.edu.cn/modelItem/c2669bbe-13fe-41c9-b8fc-4b4c2976a97f

Copyright and Disclaimer

All copyrights of a material (model, data, article, etc.) in OpenGMS belong fully to its author, developer, or designer (or whoever is otherwise named as the owner). OpenGMS takes every care to avoid copyright infringement; contributors should use materials from other sources carefully and cite them properly.

History

Last modified by: Jie Song
Last modified: 2021-01-12
Number of modifications: 2
