[en] Algorithms description

Julien Barnier

2023-03-27

The goal of this document is to describe the algorithms used in the rainette package and their implementation. This implementation has been based upon two main resources :

Simple clustering algorithm

Document-term matrix

The simple clustering method is a divisive hierarchical clustering applied to a document-term matrix. This matrix is binary weighted, so only the absence or presence of a term in a document is taken into account.

Here is a sample matrix :

##      return of the obra dinn
## doc1      1  1   1    1    1
## doc2      0  0   1    1    1
## doc3      1  1   1    0    0
## doc4      0  1   0    0    0

Note that if documents are segments computed by split_segments(), segments from the same source may be merged together before clustering computation if some of them are too short and don’t reach the minimum size given by the min_segment_size argument of rainette().

Maximal χ² splitting

In the newt step, the goal is to split the document-term matrix into two groups of documents such as these are as “different” as possible. For the Reinert method, the “best” split is the one which maximizes the χ² value of the grouped array.

For example, using the previous matrix, if we group together doc1 with doc2 and doc3 with doc4, we get the following grouped array :

##             return of the obra dinn
## doc1 + doc2      1  1   2    2    2
## doc3 + doc4      1  2   1    0    0

We can compute the χ² statistics of this array, which can be seen has an association coefficient between the two document groups regarding their term distributions : if the χ² is high, the two groups are different, if it is low they can be considered as similar.

In such a simple case, we could compute every possible groupings and determine which one corresponds to the maximal χ² value, but with real data the complexity and needed computations rise very rapidly.

The Reinert method therefore proceeds the following way :

At the end we get two groups of documents and two corresponding document-term matrices.

Terms selection

The following step is to filter out some terms from each matrices before the next iteration :

Divisive hierarchical clustering

The previous steps allow to split a documents corpus into two groups. To get a hierarchy of groups, we first split the whole initial corpus, then continue by splitting the biggest resulting cluster, ie the one with the most documents (unless the cluster document-term matrix is too small to compute a correspondance analysis, or if its size is smaller than the value of the min_split_members argument of rainette()).

By repeating this k - 1 times, we get a divisive hierarchical clustering of k clusters, and the corresponding dendrogram (the height of each dendrogram branch is given by the χ² value of the corresponding split).

Double clustering algorithm

The double clustering only applies to a corpus which has been split into segments. The goal is to make clusters more robust by crossing the results of two single clusterings computed with different minimal segment sizes.

Doing a double clustering implies therefore implies that we first do two simple clusterings with distinct min_segment_size values. We then get two different dendrograms determining different clusters, here numbered from 1 to 6 :

Cross-clusters computation

The first step of the algorithm is to cross all clusters from both dendrograms together :

Each cluster of the first simple clustering is crossed with each cluster of the second one, even if both cluster do not belong to the same k level. The new resulting cluster is called a “crossed cluster” :

Crossed cluster Size Association χ²
11 25 41.2
12 102 30.1
13 59 87.3
14 41 94.0
15 0 -
16 13 6.2
21 0 -
66 5 3.1

Once these crossed clusters have been computed, two types of analysis are possible.

“full” analysis

The first type, called a “full” analysis (full = TRUE argument of rainette2()), retains for the following steps all the crossed clusters with a size greater than zero (if these have not already been filtered out at the previous step). In the previous example, we would keep :

Crossed cluster Size Association χ²
11 25 41.2
12 102 30.1
13 59 87.3
14 41 94.0
16 13 6.2
66 5 3.1

“classical” analysis

The second type, called a “classical” analysis (full = FALSE argument of rainette2()), only keeps for the following steps the crossed clusters whose clusters are mutually most associated. Thus, if we consider the crossed cluster crossing cluster x and cluster y, we only keep this crossed cluster if y is the cluster with the highest association χ² with x, and if at the same time x is the cluster with the highest association χ² with y.

In this case the number of crossed clusters kept for the following steps is much lower :

Crossed cluster Size Association χ²
14 41 94.0
33 18 70.1
46 21 58.2
65 25 61.0

Determining optimal partitions

The goal of the next step is to identify, starting from the set of crossed clusters defined previously, every possible partitions of our corpus in 2, 3, 4… crossed clusters. More precisely, we try to identify each set of 2, 3, 4… crossed clusters which have no common elements.

We start by computing the size 2 partitions, ie every set of 2 crossed clusters without common elements. For each partition we compute is total size and the sum of its association χ².

Partition Total size Sum of association χ²
11 23 47 62.3
14 35 36 51.0
24 16 29 38.2
12 53 143 68.7

We only keep among these partitions the ones that are considered the “best” ones :

Partition Total size Sum of association χ²
12 53 143 68.7
34 26 86 98.0

We do the same for the size 3 partitions : we identify every set of 3 non overlapping crossed clusters, and keep the “best” ones.

Partition Total size Sum of association χ²
12 53 25
189 91.3
34 26 53 113 108.1

And we repeat the operation for 4 crossed clusters, etc.

Partition Total size Sum of association χ²
34265315
223 114.7

The operation is repeated until we reach the value of the max_k argument passed to rainette2(), or until there is no possible partition of the size k.

Result

At the end we get, for each k value from 2 to max_k, a selection of the “best” crossed clusters partitions, either according to the association χ² criterion or according to the total size criterion. These crossed clusters form a new set of clusters which are potentially more “robust” than the ones computed from the two simple clusterings.

After this operation, a potentially high number of segments may not belong to any cluster anymore. rainette allows to reassign them to the new clusters with a fast k-nearest neighbours method, but this may not be recommended as we would then loose the clusters robustness acquired with the double clustering.

Differences with the “Reinert method”

The way to determine the optimal partitions did not seem completely clear to us in the articles cited in references, so it is not really possible to compare with the rainette implementation. The “classical” method described in this document seems to be close from the one suggested by Max Reinert : after computing the crossed clusters, we only keep the ones where the two crossed clusters are both the most mutually associated.

One important difference is the fact that once the best partition of crossed clusters has been determined, Max Reinert suggests to use these new groups as starting points to reassign the documents to new clusters with a k-means type method. This is not implemented in rainette : the rainette2_complete_groups() allows to reassign documents without cluster using a k-nearest neighbours method, but this may not be recommended if you want to keep the “robustness” of the clusters computed with the double clustering.

References