Calculate Statistics for a file with double vectors (instances) (SPMF documentation)
This example explains how to calculate statistics for a file with double vectors (instances) using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Calculate_stats_for_double_vector_instance_file" algorithm, (2) choose the input file inputDBScan2.txt, (3) set the separator parameter to " ", and (4) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command in a folder containing spmf.jar and the input file inputDBScan.txt:
java -jar spmf.jar run Calculate_stats_for_double_vector_instance_file inputDBScan.txt no_output_file " "
- If you are using the source code version of SPMF, launch the file "MainTestDoubleVectorDBStats.java" in the package ca.pfv.SPMF.tests.
What is this tool?
This tool generates statistics about a file containing double vectors, a type of data used, for example, by clustering algorithms.
What is the input?
The input is a text file containing several instances.
The first lines (optional) specify the names of the attributes used to describe the instances. In this example, two attributes are used, named X and Y, but more than two attributes could be used. Each attribute is specified on a separate line by the keyword "@ATTRIBUTEDEF=", followed by the attribute name.
Then, each instance is described by two lines. The first line (which is optional) contains the string "@NAME=" followed by the name of the instance. Then, the second line provides a list of double values separated by single spaces.
An example of input is provided in the file "inputDBScan.txt" of the SPMF distribution. It contains 5 instances, each described by two attributes called X and Y.
@ATTRIBUTEDEF=X
@ATTRIBUTEDEF=Y
@NAME=Instance1
1 1
@NAME=Instance2
0 1
@NAME=Instance3
1 0
@NAME=Instance4
11 12
@NAME=Instance5
11 13
For example, the first instance is named "Instance1", and it is a vector with two values: 1 and 1 for the attributes X and Y, respectively.
This input file represents a set of 2D points. Note, however, that more than two attributes can be used to describe instances.
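The input format described above can be read with a few lines of Java. The following is only a minimal sketch of such a reader, not the code used by SPMF: the class InstanceFileSketch, its nested Parsed class and the parse method are hypothetical names introduced for illustration, and the sketch assumes that the file contains only the three kinds of lines described in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of SPMF): reads the instance format described above. */
final class InstanceFileSketch {

    /** Parsed content: attribute names, instance names, and one double[] per instance. */
    static final class Parsed {
        final List<String> attributeNames = new ArrayList<>();
        final List<String> instanceNames = new ArrayList<>();
        final List<double[]> vectors = new ArrayList<>();
    }

    /** Parses the lines of an instance file; values are split on the given separator. */
    static Parsed parse(List<String> lines, String separator) {
        Parsed result = new Parsed();
        String pendingName = null; // name announced by the most recent @NAME= line, if any
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            if (line.startsWith("@ATTRIBUTEDEF=")) {
                result.attributeNames.add(line.substring("@ATTRIBUTEDEF=".length()));
            } else if (line.startsWith("@NAME=")) {
                pendingName = line.substring("@NAME=".length());
            } else {
                // Any other line is treated as a vector of double values.
                String[] tokens = line.split(Pattern.quote(separator));
                double[] vector = new double[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    vector[i] = Double.parseDouble(tokens[i]);
                }
                result.vectors.add(vector);
                result.instanceNames.add(pendingName); // null when the instance is unnamed
                pendingName = null;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "@ATTRIBUTEDEF=X", "@ATTRIBUTEDEF=Y",
                "@NAME=Instance1", "1 1",
                "@NAME=Instance2", "0 1");
        Parsed parsed = parse(lines, " ");
        System.out.println("Attributes: " + parsed.attributeNames);
        System.out.println("Number of instances: " + parsed.vectors.size());
    }
}
```

Because the "@NAME=" line is optional, the sketch stores null as the name of an instance that has none, so the list of names always stays aligned with the list of vectors.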
What is the output?
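The output is a set of statistics about the instances in the input file: the number of instances, the number of attributes, and the minimum, maximum, average and median value of each attribute. The report has the following format (the values shown are from a run on a file containing 10 instances):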
=========== DOUBLE VECTOR DB STATS ============
Number of instances: 10
Number of attributes: 2
Statistics for attribute 1:
Min value: 0.0
Max value: 89.0
Average value: 29.0
Median value: 11.5
Statistics for attribute 2:
Min value: 0.0
Max value: 89.0
Average value: 29.1
Median value: 13.0
=================================================
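To make the meaning of the four per-attribute statistics concrete, the following sketch computes them for the five-instance example given in the input section. It is an illustration only and not SPMF's implementation: the class DoubleVectorStatsSketch and the method statsForAttribute are hypothetical names, and the sketch assumes that the median of an even number of values is the mean of the two middle values, which may differ from the definition used by the tool.

```java
import java.util.Arrays;

/** Hypothetical sketch (not SPMF's code): per-attribute statistics for double vectors. */
final class DoubleVectorStatsSketch {

    /** Returns {min, max, average, median} for the attribute at the given 0-based index. */
    static double[] statsForAttribute(double[][] instances, int attribute) {
        if (instances.length == 0) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        // Collect the values of this attribute over all instances and sort them.
        double[] column = new double[instances.length];
        for (int i = 0; i < instances.length; i++) {
            column[i] = instances[i][attribute];
        }
        Arrays.sort(column);

        double sum = 0;
        for (double value : column) {
            sum += value;
        }
        int n = column.length;
        // Assumption: for an even count, the median is the mean of the two middle values.
        double median = (n % 2 == 1)
                ? column[n / 2]
                : (column[n / 2 - 1] + column[n / 2]) / 2.0;
        return new double[] { column[0], column[n - 1], sum / n, median };
    }

    public static void main(String[] args) {
        // The five instances of the example input file (attributes X and Y).
        double[][] instances = { {1, 1}, {0, 1}, {1, 0}, {11, 12}, {11, 13} };
        System.out.println("Number of instances: " + instances.length);
        for (int a = 0; a < instances[0].length; a++) {
            double[] s = statsForAttribute(instances, a);
            System.out.println("Statistics for attribute " + (a + 1) + ":");
            System.out.println("Min value: " + s[0]);
            System.out.println("Max value: " + s[1]);
            System.out.println("Average value: " + s[2]);
            System.out.println("Median value: " + s[3]);
        }
    }
}
```

Sorting a copy of each column keeps the code short and gives the minimum, maximum and median directly from the sorted array; for very large files, a single pass for the minimum, maximum and sum would avoid the cost of sorting when the median is not needed.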