SUBCLU

From testwiki
Jump to navigation Jump to search

Template:One source SUBCLU is an algorithm for clustering high-dimensional data by Karin Kailing, Hans-Peter Kriegel and Peer Kröger.[1] It is a subspace clustering algorithm that builds on the density-based clustering algorithm DBSCAN. SUBCLU can find clusters in axis-parallel subspaces, and uses a bottom-up, greedy strategy to remain efficient.

Approach

SUBCLU uses a monotonicity criteria: if a cluster is found in a subspace S, then each subspace TS also contains a cluster. However, a cluster CDB in subspace S is not necessarily a cluster in TS, since clusters are required to be maximal, and more objects might be contained in the cluster in T that contains C. However, a density-connected set in a subspace S is also a density-connected set in TS.

This downward-closure property is utilized by SUBCLU in a way similar to the Apriori algorithm: first, all 1-dimensional subspaces are clustered. All clusters in a higher-dimensional subspace will be subsets of the clusters detected in this first clustering. SUBCLU hence recursively produces k+1-dimensional candidate subspaces by combining k-dimensional subspaces with clusters sharing k1 attributes. After pruning irrelevant candidates, DBSCAN is applied to the candidate subspace to find out if it still contains clusters. If it does, the candidate subspace is used for the next combination of subspaces. In order to improve the runtime of DBSCAN, only the points known to belong to clusters in one k-dimensional subspace (which is chosen to contain as little clusters as possible) are considered. Due to the downward-closure property, other point cannot be part of a k+1-dimensional cluster anyway.

Pseudocode

SUBCLU takes two parameters, ϵ and MinPts, which serve the same role as in DBSCAN. In a first step, DBSCAN is used to find 1D-clusters in each subspace spanned by a single attribute:

𝚂𝚄𝙱𝙲𝙻𝚄(DB,eps,MinPts)

S1:=
C1:=
𝚏𝚘𝚛𝚎𝚊𝚌𝚑aAttributes
C{a}=𝙳𝙱𝚂𝙲𝙰𝙽(DB,{a},eps,MinPts)
𝚒𝚏(C{a})
S1:=S1{a}
C1:=C1C{a}
𝚎𝚗𝚍𝚒𝚏
𝚎𝚗𝚍𝚏𝚘𝚛
// In a second step, k+1-dimensional clusters are built from k-dimensional ones:
k:=1
𝚠𝚑𝚒𝚕𝚎(Ck)
𝙲𝚊𝚗𝚍𝚂k+1:=𝙶𝚎𝚗𝚎𝚛𝚊𝚝𝚎𝙲𝚊𝚗𝚍𝚒𝚍𝚊𝚝𝚎𝚂𝚞𝚋𝚜𝚙𝚊𝚌𝚎𝚜(Sk)
𝚏𝚘𝚛𝚎𝚊𝚌𝚑cand𝙲𝚊𝚗𝚍𝚂k+1
𝚋𝚎𝚜𝚝𝚂𝚞𝚋𝚜𝚙𝚊𝚌𝚎:=minsSkscandCiCs|Ci|
Ccand:=
𝚏𝚘𝚛𝚎𝚊𝚌𝚑𝚌𝚕𝚞𝚜𝚝𝚎𝚛clC𝚋𝚎𝚜𝚝𝚂𝚞𝚋𝚜𝚙𝚊𝚌𝚎
Ccand:=Ccand𝙳𝙱𝚂𝙲𝙰𝙽(cl,cand,eps,MinPts)
𝚒𝚏(Ccand)
Sk+1:=Sk+1cand
Ck+1:=Ck+1Ccand
𝚎𝚗𝚍𝚒𝚏
𝚎𝚗𝚍𝚏𝚘𝚛
𝚎𝚗𝚍𝚏𝚘𝚛
k:=k+1
𝚎𝚗𝚍𝚠𝚑𝚒𝚕𝚎

𝚎𝚗𝚍

The set Sk contains all the k-dimensional subspaces that are known to contain clusters. The set Ck contains the sets of clusters found in the subspaces. The bestSubspace is chosen to minimize the runs of DBSCAN (and the number of points that need to be considered in each run) for finding the clusters in the candidate subspaces.

Candidate subspaces are generated much alike the Apriori algorithm generates the frequent itemset candidates: Pairs of the k-dimensional subspaces are compared, and if they differ in one attribute only, they form a k+1-dimensional candidate. However, a number of irrelevant candidates are found as well; they contain a k-dimensional subspace that does not contain a cluster. Hence, these candidates are removed in a second step:

𝙶𝚎𝚗𝚎𝚛𝚊𝚝𝚎𝙲𝚊𝚗𝚍𝚒𝚍𝚊𝚝𝚎𝚂𝚞𝚋𝚜𝚙𝚊𝚌𝚎𝚜(Sk)

𝙲𝚊𝚗𝚍𝚂k+1:=
𝚏𝚘𝚛𝚎𝚊𝚌𝚑s1Sk
𝚏𝚘𝚛𝚎𝚊𝚌𝚑s2Sk
𝚒𝚏(s1𝚊𝚗𝚍s2𝚍𝚒𝚏𝚏𝚎𝚛𝚒𝚗𝚎𝚡𝚊𝚌𝚝𝚕𝚢𝚘𝚗𝚎𝚊𝚝𝚝𝚛𝚒𝚋𝚞𝚝𝚎)
𝙲𝚊𝚗𝚍𝚂k+1:=𝙲𝚊𝚗𝚍𝚂k+1{s1s2}
𝚎𝚗𝚍𝚒𝚏
𝚎𝚗𝚍𝚏𝚘𝚛
𝚎𝚗𝚍𝚏𝚘𝚛
// Pruning of irrelevant candidate subspaces
𝚏𝚘𝚛𝚎𝚊𝚌𝚑cand𝙲𝚊𝚗𝚍𝚂k+1
𝚏𝚘𝚛𝚎𝚊𝚌𝚑k-𝚎𝚕𝚎𝚖𝚎𝚗𝚝scand
𝚒𝚏(s∉Sk)
𝙲𝚊𝚗𝚍𝚂k+1=𝙲𝚊𝚗𝚍𝚂k+1{cand}
𝚎𝚗𝚍𝚒𝚏
𝚎𝚗𝚍𝚏𝚘𝚛
𝚎𝚗𝚍𝚏𝚘𝚛

𝚎𝚗𝚍

Availability

An example implementation of SUBCLU is available in the ELKI framework.

References

  1. Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data. In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004.