Below are quoted from: Liu, Li-xiong, et al. "rDenStream, a clustering algorithm over an evolving data stream." 2009 international conference on information engineering and computer science. IEEE, 2009.
For mining new pattern from evolving data streams, most algorithms are inherited from DenStream framework which is realized via a sliding window. So at the early stage of a pattern emerges, its knowledge points can be easily mistaken as outliers and dropped. In most cases, these points can be ignored, but in some special applications which need to quickly and precisely master the emergence rule of some patterns, these points must play their rules. Based on DenStream, this paper proposes a three-step clustering algorithm, rDenStream, which presents the concept of outlier retrospect. In rDenStream clustering, dropped micro-clusters are stored on outside memory temporarily, and will be given new chance to attend clustering to improve the clustering accuracy. Experiments modeled the arrival of data stream in Poisson process, and the results over standard data set showed its advantage over other methods in the early phase of new pattern discovery.
Three phases are implemented as follows:
-
Select the time window with proper granularity, points of a data stream are divided into several disjoint subsets according to the arriving time. Points in the same time window are clustered into two kinds of clusters: potential-micro-cluster and outlier-micro-cluster. Only the p-micro-cluster is the input of the next phase, while other micro-clusters are stored in historical micro-cluster buffer by using the outside memory such as magnetic disc.
-
Macro-clustering phase puts all results from each time window into a new set. This set includes all the learning results from subsets but has much few points compared with the original data.
-
In the last phase, new clusters are used to form a classifier, which will relearn the historical micro-clusters according to emerging rule of the coming data, and we call this phase as retrospect. During the retrospect, the pseudo outlier-micro-clusters misjudged in the first two stages can be modified to improve the clustering accuracy. Considering the “long tail” phenomenon, this algorithm gives a chance to let misjudged points to be learned again so it can enhance theclustering robustness.