genericROM.BasicAlgorithms.Clustering module

class ClusteringToolbox(clusteringAlgo=None)[source]

Bases: object

Class for clustering problems.

clusteringAlgo

Object containing a clustering algorithm. All clustering algorithms available in Scikit-Learn can be used. If defined by the user, the clustering algorithm must follow Scikit-Learn’s API for clustering.

clusters

clusters[k] is an array containing the indices of points belonging to cluster k.

Type:

dict

ClusterRenumbering(clusterIdPermutation)[source]

Changes the numbering of clusters.

Parameters:

clusterIdPermutation (list) – List of integers such that clusterIdPermutation[k] is the new index for cluster k.
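The renumbering rule above can be illustrated with a short NumPy sketch. It is not the library's implementation: the helper name renumber_labels is hypothetical, and it assumes that cluster ids are the integers 0..nClusters-1.

```python
import numpy as np

def renumber_labels(labels, cluster_id_permutation):
    """Map each old cluster id k to cluster_id_permutation[k] (hypothetical helper)."""
    permutation = np.asarray(cluster_id_permutation)
    # Fancy indexing applies the permutation to every sample at once.
    return permutation[np.asarray(labels)]

labels = np.array([0, 0, 1, 2, 1])
# Old cluster 0 becomes 2, old cluster 1 becomes 0, old cluster 2 becomes 1.
new_labels = renumber_labels(labels, [2, 0, 1])
```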

GetClusteringAlgo()[source]
GetClusters()[source]
GetLabels()[source]
GetNumberOfClusters()[source]
ReadClusteringResults(resultsFile)[source]

Reads clustering results from a text file.

Parameters:

resultsFile (str) – Name of the txt file containing the clustering results.

SetClusteringAlgo(clusteringAlgo)[source]
WriteClusteringResults(outputFileName)[source]

Writes clustering results in a text file.

Parameters:

outputFileName (str) – Name of the txt file in which clustering results are written.

fit(X, **kwargs)[source]

Computes clusters.

Parameters:

X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).
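For algorithms that work on a distance matrix, X has to be built beforehand. The following sketch shows one way to do this, assuming the Euclidean distance is the dissimilarity of interest; the helper name pairwise_euclidean is an assumption made for illustration and is not part of this module.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the (n_samples, n_samples) matrix of Euclidean distances (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Broadcasting builds all pairwise differences, shape (n, n, n_features).
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(X)  # symmetric, zero diagonal, shape (3, 3)
```

The resulting D would then be passed in place of X.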

fit_predict(X, returnLabels=False, **kwargs)[source]

Computes clusters and predicts cluster index for each sample.

Parameters:
  • X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).

  • returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns:

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type:

1D array of length n_samples

predict(X, returnLabels=False)[source]

Predicts the cluster index for each sample in X.

Parameters:
  • X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).

  • returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns:

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type:

1D array of length n_samples

predictTest(X, returnLabels=False)[source]

Predicts the cluster index for each sample in X, where X contains new unseen data.

Parameters:
  • X (array of shape (n_samples, n_features)) – New, unseen instances to assign to clusters. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_clusters).

  • returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns:

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type:

1D array of length n_samples

GetAdjacentClustersFromLabelsVector(labels, localNbSnapshots=None)[source]

Returns a dictionary whose keys are cluster indices and whose values are the indices of the adjacent clusters, where adjacency is inferred from the data used in the clustering (through the labels).

Parameters:
  • labels (1D array of integers) – labels[j] = k if example j belongs to cluster k.

  • localNbSnapshots (1D array or list of integers) – localNbSnapshots[j] is the size of the j-th group of values for which adjacency is well-defined.

Returns:

  • adjacentClusters (dict) – adjacentClusters[k] is an array containing the indices of the clusters adjacent to cluster k.

  • snapshotsOfAdjacentClusters (dict) – snapshotsOfAdjacentClusters[k] is an array containing the indices of points belonging to cluster k and its adjacent clusters.
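The documentation does not spell out how adjacency is derived from the labels. The sketch below shows one plausible reading, stated as an assumption: samples are ordered snapshots, localNbSnapshots splits them into consecutive groups, and two clusters are adjacent when two consecutive snapshots of the same group carry their labels. The helper name and the choice to exclude k from its own adjacency list are also assumptions.

```python
import numpy as np

def adjacent_clusters_from_labels(labels, local_nb_snapshots=None):
    """Infer cluster adjacency from label changes between consecutive samples.

    Hypothetical helper: the adjacency rule is an assumption, not the library's code.
    """
    labels = np.asarray(labels)
    if local_nb_snapshots is None:
        local_nb_snapshots = [len(labels)]  # treat all samples as a single group
    adjacency = {int(k): set() for k in np.unique(labels)}
    start = 0
    for size in local_nb_snapshots:
        group = labels[start:start + size]
        # A label change between neighbours marks the two clusters as adjacent.
        for a, b in zip(group[:-1], group[1:]):
            if a != b:
                adjacency[int(a)].add(int(b))
                adjacency[int(b)].add(int(a))
        start += size
    return {k: np.array(sorted(v), dtype=int) for k, v in adjacency.items()}

labels = np.array([0, 0, 1, 1, 2, 2, 0, 0])
adjacent = adjacent_clusters_from_labels(labels, local_nb_snapshots=[4, 4])
```

With two groups of four snapshots, the transition from sample 3 to sample 4 crosses a group boundary and is therefore ignored under this reading.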

GetClustersFromLabelsVector(labels)[source]

Returns a dictionary containing clustering results.

Parameters:

labels (1D array of integers) – labels[j] = k if example j belongs to cluster k.

Returns:

clusters – clusters[k] is an array containing the indices of points belonging to cluster k.

Return type:

dict
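The conversion from a labels vector to a clusters dictionary can be sketched in a few lines of NumPy. This is an illustrative sketch under the definitions given above, not the library's source; the helper name clusters_from_labels is hypothetical.

```python
import numpy as np

def clusters_from_labels(labels):
    """Build clusters[k] = indices of the samples labelled k (hypothetical helper)."""
    labels = np.asarray(labels)
    return {int(k): np.flatnonzero(labels == k) for k in np.unique(labels)}

labels = np.array([1, 0, 1, 2, 0])
clusters = clusters_from_labels(labels)
# clusters[0] holds samples 1 and 4, clusters[1] samples 0 and 2, clusters[2] sample 3.
```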

GetLabelsVectorFromClusters(clusters)[source]

Returns the labels vector corresponding to a clusters dictionary.

Parameters:

clusters (dict) – clusters[k] is an array containing the indices of points belonging to cluster k.

Returns:

labels – labels[j] = k if example j belongs to cluster k.

Return type:

1D array of integers
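The inverse conversion can be sketched the same way. The helper name labels_from_clusters is hypothetical, and the sketch assumes that the clusters form a partition of the sample indices 0..n_samples-1.

```python
import numpy as np

def labels_from_clusters(clusters):
    """Build labels[j] = k for every j in clusters[k] (hypothetical helper).

    Assumes the clusters partition the indices 0..n_samples-1.
    """
    n_samples = sum(len(indices) for indices in clusters.values())
    labels = np.full(n_samples, -1, dtype=int)  # -1 would flag an unassigned sample
    for k, indices in clusters.items():
        labels[np.asarray(indices, dtype=int)] = k
    return labels

clusters = {0: [1, 4], 1: [0, 2], 2: [3]}
labels = labels_from_clusters(clusters)
```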

class KMedoids(nClusters, nIter=100, init='k-meds++', algo='PAM', squaredDist=False, runs=10)[source]

Bases: object

Class for k-medoids clustering.

nClusters

Number of clusters.

Type:

int

nIter

Maximum number of iterations.

Type:

int, default 100

init

Medoids initialization method. Random selection if ‘random’. If ‘k-meds++’, we use the method described in the following article: Hae-Sang Park, Chi-Hyuck Jun, “A simple and fast algorithm for K-medoids clustering”, 2009. If ‘multipleRuns’, the clustering algorithm is run self.runs times with random initial medoids. The best solution in terms of the cost function is returned.

Type:

str, ‘k-meds++’, ‘random’ or ‘multipleRuns’, default ‘k-meds++’

medoids

Array of integers containing the indices of the medoids.

Type:

1D array of length nClusters

algo

Algorithm for k-medoids. Park & Jun’s algorithm is simpler and faster but explores a smaller search space than PAM (Partitioning around medoids).

Type:

‘ParkJun’ or ‘PAM’, default ‘PAM’

squaredDist

Whether the cost function and the medoid update rule use squared dissimilarities.

Type:

boolean, default False.
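To make the role of squared dissimilarities concrete, here is a sketch of the usual k-medoids cost: the sum over samples of the dissimilarity to the nearest medoid, optionally squared. The function name and its signature are assumptions for illustration and do not reproduce EvalCostFunction.

```python
import numpy as np

def kmedoids_cost(dist_matrix, medoids, squared=False):
    """Sum of (optionally squared) dissimilarities to the nearest medoid (hypothetical helper)."""
    d = np.asarray(dist_matrix, dtype=float)[:, medoids]  # shape (n_samples, n_medoids)
    if squared:
        d = d ** 2
    return d.min(axis=1).sum()

D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])
cost_plain = kmedoids_cost(D, [1])                  # 1 + 0 + 3
cost_squared = kmedoids_cost(D, [1], squared=True)  # 1 + 0 + 9
```

Squaring gives more weight to distant samples, which is why the two settings can select different medoids.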

runs

Number of times the clustering algorithm is run when using init=’multipleRuns’.

Type:

integer, default 10.

EvalCostFunction(medoids, distMatrix, isAlreadySquared=True, **kwargs)[source]
GetMedoids()[source]
InitializeMedoids(distanceMatrix)[source]

Initial medoids selection. Method described in Hae-Sang Park, Chi-Hyuck Jun, “A simple and fast algorithm for K-medoids clustering”, 2009.

Parameters:

distanceMatrix (2D array of shape (n_samples,n_samples))
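As a sketch of the initialization described in Park and Jun's article: each sample j receives the score v_j = sum_i d_ij / sum_l d_il, and the nClusters samples with the smallest scores are taken as initial medoids. The code below is an illustration of that rule, not the library's source; the function name is hypothetical, and it assumes every row of the distance matrix has a nonzero sum.

```python
import numpy as np

def park_jun_initial_medoids(dist_matrix, n_clusters):
    """Select initial medoids with the Park & Jun (2009) score (hypothetical helper).

    Assumes each row of dist_matrix has a nonzero sum.
    """
    d = np.asarray(dist_matrix, dtype=float)
    # v[j] = sum_i d[i, j] / sum_l d[i, l]: small values indicate central samples.
    v = (d / d.sum(axis=1, keepdims=True)).sum(axis=0)
    return np.argsort(v, kind="stable")[:n_clusters]

x = np.array([0.0, 1.0, 2.0, 10.0])
D = np.abs(x[:, None] - x[None, :])
initial_medoids = park_jun_initial_medoids(D, n_clusters=2)
```

Note that this rule favours central samples, so the selected medoids may lie close to one another; the subsequent iterations are what spread them out.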

SetCostFunction(costFunction)[source]
fit(distMatrix, printCostFct=False, verbose=False)[source]
fit_PAM(distMatrix, printCostFct=False, verbose=False)[source]

Implementation of Partitioning Around Medoids (PAM) algorithm for k-medoids.

Parameters:
  • distMatrix (2D array of shape (n_samples,n_samples))

  • printCostFct (boolean)
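The swap phase of PAM can be sketched as a steepest-descent search: at each iteration, every (medoid, non-medoid) exchange is evaluated and the one that lowers the cost the most is applied, until no exchange helps. This is a minimal textbook-style sketch with an unsquared cost; the function name and signature are assumptions, and fit_PAM may differ in its details.

```python
import numpy as np

def pam_swap(dist_matrix, initial_medoids, max_iter=100):
    """Greedy PAM swap phase on a precomputed distance matrix (hypothetical helper)."""
    d = np.asarray(dist_matrix, dtype=float)
    medoids = [int(m) for m in initial_medoids]
    cost = d[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        best_cost, best_swap = cost, None
        for position in range(len(medoids)):
            for candidate in range(d.shape[0]):
                if candidate in medoids:
                    continue
                # Cost obtained if the medoid at `position` is replaced by `candidate`.
                trial = medoids.copy()
                trial[position] = candidate
                trial_cost = d[:, trial].min(axis=1).sum()
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, (position, candidate)
        if best_swap is None:
            break  # no exchange improves the cost: local optimum reached
        medoids[best_swap[0]] = best_swap[1]
        cost = best_cost
    return np.array(medoids), cost

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(x[:, None] - x[None, :])
medoids, cost = pam_swap(D, initial_medoids=[0, 1])
```

Each iteration of this naive version costs on the order of nClusters × n_samples cost evaluations, which illustrates why PAM is slower than the Voronoi iteration.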

fit_ParkJun(distMatrix, printCostFct=False, verbose=False)[source]

Implementation of k-medoids clustering based on the Voronoi iteration approach (Park and Jun 2009). This code is a slightly modified version of the code presented in: “NumPy/SciPy recipes for data science: k-Medoids clustering”, C. Bauckhage.

Parameters:
  • distMatrix (2D array of shape (n_samples,n_samples))

  • printCostFct (boolean)
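The Voronoi iteration alternates two steps, much like k-means: assign each sample to its nearest medoid, then replace each medoid by the cluster member that minimizes the total dissimilarity to the other members. The sketch below illustrates that scheme with unsquared dissimilarities; it is not the code of fit_ParkJun, and the function name and signature are assumptions.

```python
import numpy as np

def voronoi_kmedoids(dist_matrix, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing.

    Hypothetical helper illustrating the Voronoi iteration.
    """
    d = np.asarray(dist_matrix, dtype=float)
    medoids = np.array(initial_medoids, dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest medoid for every sample.
        labels = d[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(len(medoids)):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue  # keep the previous medoid for an empty cluster
            # Update step: member with the smallest summed distance to its cluster.
            within = d[np.ix_(members, members)].sum(axis=0)
            new_medoids[k] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = d[:, medoids].argmin(axis=1)
    return medoids, labels

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(x[:, None] - x[None, :])
medoids, labels = voronoi_kmedoids(D, initial_medoids=[0, 3])
```

Because a medoid can only move within its current cluster, this scheme explores fewer configurations than the swap-based PAM search.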

fit_predict(distMatrix, printCostFct=False, verbose=False)[source]

Computes clusters and gives cluster assignments.

Parameters:

distMatrix (2D array of shape (n_samples,n_samples))

Returns:

labels – Array of integers containing the index of the cluster each sample belongs to.

Return type:

1D array of length n_samples

predict(distMatrix)[source]

Gives cluster assignments for training data.

Parameters:

distMatrix (2D array of shape (n_samples,n_samples))

Returns:

labels – Array of integers containing the index of the cluster each sample belongs to.

Return type:

1D array of length n_samples

predictTest(distMatrix)[source]

Gives cluster assignments for test data.

Parameters:

distMatrix (2D array of shape (n_samples,n_clusters)) – such that distMatrix[i,k] is the distance between the i-th test example and the k-th medoid.

Returns:

labels – Array of integers containing the index of the cluster each sample belongs to.

Return type:

1D array of length n_samples
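Given the (n_samples, n_clusters) matrix described above, assigning test data reduces to a row-wise argmin. The sketch below builds such a matrix from hypothetical training points and medoid indices, assuming Euclidean distances; none of these names come from the library.

```python
import numpy as np

def assign_to_nearest_medoid(dist_to_medoids):
    """Return, for each test sample, the index of its nearest medoid (hypothetical helper)."""
    return np.asarray(dist_to_medoids).argmin(axis=1)

train = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
medoid_ids = np.array([0, 2])  # hypothetical medoids found during training
test_points = np.array([[0.5, 0.0], [9.0, 1.0]])

# dist_to_medoids[i, k] = distance between test sample i and the k-th medoid.
dist_to_medoids = np.linalg.norm(
    test_points[:, None, :] - train[medoid_ids][None, :, :], axis=-1
)
labels = assign_to_nearest_medoid(dist_to_medoids)
```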