Sklearn - DBSCAN

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.

SklearnclusteringDBSCAN
339

Authorship

Authorship is unclear, you can claim the item.

Classification(s)

Method-focused categoriesData-perspectiveGeoinformation analysis

Model Description

English {{currentDetailLanguage}} English

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.

More formally, we define a core sample as being a sample in the dataset such that there exist min_samples other samples within a distance of eps, which are defined as neighbors of the core sample. This tells us that the core sample is in a dense area of the vector space. A cluster is a set of core samples that can be built by recursively taking a core sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster.

Any core sample is part of a cluster, by definition. Any sample that is not a core sample, and is at least eps in distance from any core sample, is considered an outlier by the algorithm.

While the parameter min_samples primarily controls how tolerant the algorithm is towards noise (on noisy and large data sets it may be desirable to increase this parameter), the parameter eps is crucial to choose appropriately for the data set and distance function and usually cannot be left at the default value. It controls the local neighborhood of the points. When chosen too small, most data will not be clustered at all (and labeled as -1 for “noise”). When chosen too large, it causes close clusters to be merged into one cluster, and eventually the entire data set to be returned as a single cluster. Some heuristics for choosing this parameter have been discussed in the literature, for example based on a knee in the nearest neighbor distances plot (as discussed in the references below).

In the figure below, the color indicates cluster membership, with large circles indicating core samples found by the algorithm. Smaller circles are non-core samples that are still part of a cluster. Moreover, the outliers are indicated by black points below.

Examples:

Implementation

The DBSCAN algorithm is deterministic, always generating the same clusters when given the same data in the same order. However, the results can differ when data is provided in a different order. First, even though the core samples will always be assigned to the same clusters, the labels of those clusters will depend on the order in which those samples are encountered in the data. Second and more importantly, the clusters to which non-core samples are assigned can differ depending on the data order. This would happen when a non-core sample has a distance lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned to whichever cluster is generated first in a pass through the data, and so the results will depend on the data ordering.

The current implementation uses ball trees and kd-trees to determine the neighborhood of points, which avoids calculating the full distance matrix (as was done in scikit-learn versions before 0.14). The possibility to use custom metrics is retained; for details, see NearestNeighbors.

Memory consumption for large sample sizes

This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g., with sparse matrices). This matrix will consume n2 floats. A couple of mechanisms for getting around this are:

1. Use OPTICS clustering in conjunction with the extract_dbscan method. OPTICS clustering also calculates the full pairwise matrix, but only keeps one row in memory at a time (memory complexity n).
2. A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precomputed in a memory-efficient way and dbscan can be run over this with metric='precomputed'. See sklearn.neighbors.NearestNeighbors.radius_neighbors_graph.
3. The dataset can be compressed, either by removing exact duplicates if these occur in your data, or by using BIRCH. Then you only have a relatively small number of representatives for a large number of points. You can then provide a sample_weight when fitting DBSCAN.

References:

1. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” Ester, M., H. P. Kriegel, J. Sander, and X. Xu, In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
2. “DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). In ACM Transactions on Database Systems (TODS), 42(3), 19.

Model Domain
{{domain}}

Principles
{{principle}}
Incorporated Models
{{incorporatedModel}}
Model part of larger framework: {{metadata.design.framework}}
Incorporated Models
{{process}}

Inputs
{{input}}
Outputs
{{output}}

There is no usage info about this model. You can click to add overview.

How to Cite

songjie (2021). Sklearn - DBSCAN, Model Item, OpenGMS, https://geomodeling.njnu.edu.cn/modelItem/ed99f987-a5b5-4c32-98fd-e617170209aa

Comment(s)

{{comment.date}}
{{comment.content}}
{{subComm.date}}
{{subComm.content}}

Authorship

Authorship is unclear, you can claim the item.

History

Last modifier :
songjie
Last modify time :
2021-01-12
Modify times :
View History

QR Code

• {{curRelation.name}}
{{curRelation.name}}

{{curRelation.overview}}
{{curRelation.author.join('; ')}}
{{curRelation.journal}}

You can link related {{typeName}} from repository to this model item, or you can create a new {{typeName.toLowerCase()}}.

Related Items
{{ props.row.description }}
{{ props.row.description }}
Related Items
{{props.row.name}}

You can link resource from repository to this model item, or you can create a new {{typeName.toLowerCase()}}.

{{ props.row.description }}
{{ props.row.description }}
Drop the file here, orclick to upload.
File size should not exceed 10m.
Select From My Space

These authorship information will be submitted to the contributor to review.

Cancel Submit
Model Classifications
Cancel Submit
{{ item.label }} {{ item.value }}
{{props.row.localName}}
Model Name :
Cancel Submit
Name:
Version:
Model Type:
Model Domain:
Scale:
Purpose:
Principles:
Incorporated models:

Model part of

larger framework

Process:
Information:
Initialization:
Hardware Requirements:
Software Requirements:
Inputs:
Outputs:
Cancel Submit
Title Author Date Journal Volume(Issue) Pages Links Doi Operation
Cancel Submit

Yes, this is it Cancel

OK
Cancel Confirm
Model Classifications 1
Model Classifications 2
Title Author Date Journal Volume(Issue) Pages Links Doi Operation

NEW

Name:
Affiliation:
Email:
Homepage:

Yes, this is it Cancel

Confirm
path
:
/{{path.label}}
search results of '{{searchContentShown}}'

No content to show

{{item.label}}

.

{{item.suffix}}

.{{item.suffix}}

{{item.fileName}}.{{item.suffix}}

Copy
Delete
Rename
/{{path.label}}
Change
/{{path.label}}
Select File
Cancel Confirm
path
:
/{{path.label}}
/..
Cancel Confirm
{{ node.label }}
You have select  {{multipleSelection.length+multipleSelectionMyData.length}} data .
• My Uploaded Data
• Output Data
• {{item.computableName}}@{{formatDate(item.runTime)}}
{{scope.row.type}}
{{ scope.row.tag }}