Clustering using the K-Means++ algorithm (SPMF documentation)

This example explains how to run the K-Means++ algorithm using the SPMF open-source data mining library.

How to run this example?

What is K-Means++?

K-Means is one of the most famous clustering algorithm. It is used to automatically separate a set of instances (vectors of double values) into groups of instances (clusters) according to their similarity. K-Means is used to automatically group similar instances together into clusters.

This is an implementation of an improved version of K-Means, called K-Means++. The difference between K-Means and K-Means++ lies in how the clusters are initialized in the first stages of the algorithm. While K-Means perform a random initialization, K-Means++ aims to select cluster centers that are as far as possible from each other, which often gives better results.

In this implementaiton, the user can also choose between various distance functions to assess the similarity between vectors. SPMF offers the Euclidian distance, correlation distance, cosine distance, Manathan distance and Jaccard distance.

What is the input?

K-Means++ takes as input a set of instances having a name and containing one or more double values, a parameter K (a positive integer >=1) indicating the number of clusters to be created, and a distance function.

The input file format of K-Means++ is a text file containing several instances.

The first lines (optional) specify the name of the attributes used for describing the instances. In this example, two attributes will be used, named X and Y. But note that more than two attributes could be used. Each attribute is specified on a separated line by the keyword "@ATTRIBUTEDEF=", followed by the attribute name

Then, each instance is described by two lines. The first line (which is optional) contains the string "@NAME=" followed by the name of the instance. Then, the second line provides a list of double values separated by single spaces.

An example of input is provided in the file "inputDBScan2.txt" of the SPMF distribution. It contains 31 instances, each described by two attribute called X and Y.

@ATTRIBUTEDEF=X
@ATTRIBUTEDEF=Y
@NAME=Instance1
1 1
@NAME=Instance2
0 1
@NAME=Instance3
1 0
@NAME=Instance4
11 12
@NAME=Instance5
11 13
@NAME=Instance6
13 13
@NAME=Instance7
12 8.5
@NAME=Instance8
13 8
@NAME=Instance9
13 9
@NAME=Instance10
13 7
@NAME=Instance11
11 7
@NAME=Instance12
8 2
@NAME=Instance13
9 2
@NAME=Instance14
10 1
@NAME=Instance15
7 13
@NAME=Instance16
5 9
@NAME=Instance17
16 16
@NAME=Instance18
11.5 8
@NAME=Instance20
13 10
@NAME=Instance21
12 13
@NAME=Instance21
14 12.5
@NAME=Instance22
14.5 11.5
@NAME=Instance23
15 10.5
@NAME=Instance24
15 9.5
@NAME=Instance25
12 9.5
@NAME=Instance26
10.5 11
@NAME=Instance27
10 10.5
@NAME=Instance28
9 3
@NAME=Instance29
9 4
@NAME=Instance30
9 5

For example, the first instance is named "Instance1", and it is a vector with two values: 1 and 1 for the attributes X and Y, respectively.

This input file represents a set of 2D points. But note that, it is possible to use more than two attributes to describe instances. To give a better idea of what the input file looks like, here is a visual representation:

The K-Means++ algorithm will group the instances according to their similarity. To do this, it is also necessary to specify the distance function to be used for comparing the instances. The distance function can be the Euclidian distance, correlation distance, cosine distance, Manathan distance and Jaccard distance. In the command line or GUI of SPMF, the distance function is specified by using one of these keywords: "euclidian", "correlation", "cosine", "manathan" and "jaccard" as parameter. In this example, the euclidian distance is used.

What is the output?

K-Means++ groups instances in clusters according to their similarity. In SPMF, the similarity is defined according to the distance function chosen by the user such as the Euclidian distance. K-Means++ returns K clusters or less.

Note that running K-Means++ with the same data does not always generate the same result because K-Means++ initializes clusters randomly.

By running K-Means++ on the previous input file and K=3, we can obtain the following output file containing 3 clusters:

@ATTRIBUTEDEF=X
@ATTRIBUTEDEF=Y
[Instance12 8.0 2.0][Instance13 9.0 2.0][Instance14 10.0 1.0][Instance28 9.0 3.0][Instance29 9.0 4.0][Instance30 9.0 5.0]
[Instance4 11.0 12.0][Instance5 11.0 13.0][Instance6 13.0 13.0][Instance7 12.0 8.5][Instance8 13.0 8.0][Instance9 13.0 9.0][Instance11 11.0 7.0][Instance15 7.0 13.0][Instance16 5.0 9.0][Instance17 16.0 16.0][Instance18 11.5 8.0][Instance20 13.0 10.0][Instance21 12.0 13.0][Instance21 14.0 12.5][Instance22 14.5 11.5][Instance23 15.0 10.5][Instance24 15.0 9.5][Instance25 12.0 9.5][Instance26 10.5 11.0][Instance27 10.0 10.5][Instance10 13.0 7.0]
[Instance1 1.0 1.0][Instance2 0.0 1.0][Instance3 1.0 0.0]

The output file format is defined as follows. The first few lines indicate the attribute names. Each attribute is specified on a separated line with the keyword "ATTRIBUTEDEF=" followed by the attribute name (a string). Then, the list of clusters is indicated. Each cluster is specified on a separated line, listing the instances contained in the cluster. An instance is a name followed by a list of double values separated by " " and between the "[" and "]" characters.

The clusters found by the algorithm can be viewed visually using the "Cluster Viewer" provided in SPMF. If you are using the graphical interface of SPMF, click the checkbox "Cluster Viewer" before pressing the "Run Algorithm" button. The result will be displayed in the Cluster Viewer.

As it can be seen in this example, the result somewhat make sense, as points close to each other are in the same cluster.

Applying K-Means++ to time series

Note that the K-Means++ algorithm implementation in SPMF can also be applied to time series database such as the file contextSAX.txt in the SPMF distribution. To apply K-Means++ to time series, it is necessary to set the "separator" parameter of the K-Means++ algorithm to "," since time series files separate values by "," instead of spaces.

Where can I get more information about K-Means++?

K-Means was proposed by MacQueen in 1967. K-Means is one of the most famous data mining algorithm.

K-Means++ was proposed in this paper:

Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. Proceedings of the Annuual ACM-SIAM Symposium on Discrete
Algorithms, 8, 1027-1035. 10.1145/1283383.1283494.