Calculate Statistics for a Graph Database (SPMF documentation)

This example explains how to calculate statistics for a graph database using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_graph_database" algorithm, (2) choose the input file contextTKG.txt (3) click "Run algorithm".

What is this tool?

This tool is a tool for generating statistics about a graph database. This tool can be used to learn about the characteristics of a graph database. Graph databases can be used with several graph mining algorithms such as TKG and GSpan.

What is the input?

The input is a set of labeled connected graphs.

To explain the input more clearly, some definitions are first introduced. A labeled graph is a set of vertices and edges, having some labels. Let’s me illustrate this idea with an example. Consider the following graph:

graph

This graph contains four vertices (depicted as yellow circles). These vertices have labels such as “10” and “11”.  These labels provide information about the vertices. For example, imagine that this graph is a  chemical molecule. The label 10 and 11 could represent the chemical elements of Hydrogen and Oxygen, respectively. Labels do not need to be unique. In other words, the same labels may be used to describe several vertices in the same graph. For example, if the above graph represents a chemical molecule, the labels “10” and “11” could be used for all vertices representing Oxygen and Hydrogen, respectively.

Now, besides vertices, a graph also contains edges. The edges are the lines between the vertices, here represented by thick black lines.  Edges also have some labels. In this example, four labels are used, which are 20, 21, 22 and 23.  These labels represents different types of relationships between vertices. Edge labels do not need to be unique. A graph is a connected graph if by following the edges, it is possible to go from any vertex to any other vertices. 

The algorithm takes a graph database as input.  For example, let’s consider the following graph database containing three graphs, which is provided in the file contextTKG.txt of the SPMF distribution.

graph database

This database contains three graphs, respectively called Graph 1, 2 and 3.

What is the output?

The output is statistics about the graph database. For example, if we use the tool on the previous graph database given as example, we get the following statistics:

============ GRAPH DATABASE STATS ==========

Number of graphs : 3

File /D:/workspace/SPMF_2019_for_release/bin/ca/pfv/spmf/tools/dataset_stats/contextTKG.txt

Number of distinct vertex labels: 2

Number of distinct edge labels: 4

Largest vertex label: 11

Largest edge label: 23

Average number of vertices per graph : 3.3333333333333335 standard deviation: 0.9428090415820634 variance: 0.8888888888888888

Average number of edges per graph : 2.6666666666666665 standard deviation: 1.247219128924647 variance: 1.5555555555555554

Input file format

The input file format is defined as follows. It is a text file which contains one or more graphs.  A graph is defined by a few lines of text that follow the following format:

For the above example, the input file is defined as follows:

t # 0
v 0 10
v 1 11
v 2 10
v 3 11
e 0 1 20
e 1 2 23
e 1 3 22

t # 1
v 4 10
v 5 11
e 4 5 20

t # 2
v 6 10
v 7 10
v 8 11
v 9 11
e 6 7 21
e 7 8 23
e 7 9 20
e 8 9 22