genericROM.BasicAlgorithms.Clustering

class ClusteringToolbox(clusteringAlgo=None)[source]

Bases: object

Class for clustering problems.

clusteringAlgo

Object containing a clustering algorithm. All clustering algorithms available in Scikit-Learn can be used. If defined by the user, the clustering algorithm must follow Scikit-Learn’s API for clustering.

clusters

clusters[k] is an array containing the indices of points belonging to cluster k.

Type

dict

ClusterRenumbering(clusterIdPermutation)[source]

Renumbers the clusters.

Parameters

clusterIdPermutation (list) – List of integers such that clusterIdPermutation[k] is the new index for cluster k.
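
The convention above can be illustrated on a plain labels vector. This is a minimal NumPy sketch that does not call the toolbox itself; the helper name `renumber_labels` is hypothetical and only shows how a permutation list maps old cluster ids to new ones.

```python
import numpy as np

def renumber_labels(labels, cluster_id_permutation):
    """Map every old cluster id k to cluster_id_permutation[k]."""
    permutation = np.asarray(cluster_id_permutation)
    return permutation[np.asarray(labels)]

labels = np.array([0, 0, 1, 2, 1])
# Cluster 0 becomes 2, cluster 1 becomes 0, cluster 2 becomes 1.
new_labels = renumber_labels(labels, [2, 0, 1])
print(new_labels)  # [2 2 0 1 0]
```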

GetClusteringAlgo()[source]
GetClusters()[source]
GetLabels()[source]
GetNumberOfClusters()[source]
ReadClusteringResults(resultsFile)[source]

Reads clustering results from a text file.

Parameters

resultsFile (str) – Name of the txt file containing the clustering results.

SetClusteringAlgo(clusteringAlgo)[source]
WriteClusteringResults(outputFileName)[source]

Writes clustering results in a text file.

Parameters

outputFileName (str) – Name of the txt file in which clustering results are written.

fit(X, **kwargs)[source]

Computes clusters.

Parameters

X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).
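
For algorithms that expect a distance matrix, X has to be built beforehand. The sketch below assumes a Euclidean metric, which the documentation does not prescribe; the helper name `pairwise_euclidean` is hypothetical.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the (n_samples, n_samples) matrix of Euclidean distances."""
    X = np.asarray(X, dtype=float)
    # Broadcasting gives every difference X[i] - X[j] in one array.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(X)
print(D.shape)  # (3, 3)
```

The resulting matrix is symmetric with a zero diagonal, and it is this array that would be passed as X in the distance-matrix case.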

fit_predict(X, returnLabels=False, **kwargs)[source]

Computes clusters and predicts cluster index for each sample.

Parameters
  • X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).

  • returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type

1D array of length n_samples

predict(X, returnLabels=False)[source]

Predicts the cluster index for each sample in X.

Parameters
  • X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).

  • returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type

1D array of length n_samples

predictTest(X, returnLabels=False)[source]

Predicts the cluster index for each sample in X, where X contains new unseen data.

Parameters
  • X (array of shape (n_samples, n_features)) – New, unseen instances to assign to clusters. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_clusters).

  • returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type

1D array of length n_samples

GetAdjacentClustersFromLabelsVector(labels, localNbSnapshots=None)[source]

Returns, for each cluster, the clusters adjacent to it, together with the points belonging to the cluster and its adjacent clusters. Adjacency is inferred from the labels of the data used in the clustering.

Parameters
  • labels (1D array of integers) – labels[j] = k if example j belongs to cluster k.

  • localNbSnapshots (1D array or list of integers) – localNbSnapshots[j] is the size of the j-th group of values for which adjacency is well-defined.

Returns

  • adjacentClusters (dict) – adjacentClusters[k] is an array containing the indices of the clusters adjacent to cluster k.

  • snapshotsOfAdjacentClusters (dict) – snapshotsOfAdjacentClusters[k] is an array containing the indices of points belonging to cluster k and its adjacent clusters.
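
The documentation does not define adjacency precisely. The sketch below rests on an assumption: samples are ordered (for example in time) inside each group, and two clusters are adjacent when two consecutive samples of the same group carry different labels. The function name and this rule are illustrative, not the library's confirmed behaviour.

```python
import numpy as np

def adjacent_clusters_from_labels(labels, local_nb_snapshots=None):
    """Assumed rule: clusters are adjacent if consecutive samples of one group differ."""
    labels = np.asarray(labels)
    if local_nb_snapshots is None:
        local_nb_snapshots = [labels.size]  # treat all samples as a single group
    cluster_ids = [int(k) for k in np.unique(labels)]
    neighbours = {k: set() for k in cluster_ids}
    start = 0
    for size in local_nb_snapshots:
        group = labels[start:start + size]
        # Compare each sample with its successor inside the group only.
        for a, b in zip(group[:-1], group[1:]):
            if a != b:
                neighbours[int(a)].add(int(b))
                neighbours[int(b)].add(int(a))
        start += size
    adjacent = {k: np.array(sorted(v), dtype=int) for k, v in neighbours.items()}
    # Points of cluster k together with the points of its adjacent clusters.
    snapshots = {
        k: np.flatnonzero(np.isin(labels, [k, *adjacent[k]]))
        for k in cluster_ids
    }
    return adjacent, snapshots

labels = np.array([0, 0, 1, 1, 2, 2, 0])
adjacent, snapshots = adjacent_clusters_from_labels(labels, [4, 3])
print(adjacent[0])   # [1 2]
print(snapshots[1])  # [0 1 2 3 6]
```

With two groups of sizes 4 and 3, the transition between samples 3 and 4 is ignored because it crosses a group boundary, so clusters 1 and 2 are not reported as adjacent.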

GetClustersFromLabelsVector(labels)[source]

Returns a dictionary containing clustering results.

Parameters

labels (1D array of integers) – labels[j] = k if example j belongs to cluster k.

Returns

clusters – clusters[k] is an array containing the indices of points belonging to cluster k.

Return type

dict
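
The mapping from a labels vector to a clusters dictionary can be sketched in a few lines of NumPy. The function below is a hypothetical stand-in written for illustration, not the library's code.

```python
import numpy as np

def clusters_from_labels(labels):
    """Return {k: indices of the samples whose label is k}."""
    labels = np.asarray(labels)
    return {int(k): np.flatnonzero(labels == k) for k in np.unique(labels)}

clusters = clusters_from_labels([1, 0, 1, 2, 0])
print(clusters[0])  # [1 4]
print(clusters[1])  # [0 2]
print(clusters[2])  # [3]
```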

GetLabelsVectorFromClusters(clusters)[source]

Returns the labels vector corresponding to a clusters dictionary.

Parameters

clusters (dict) – clusters[k] is an array containing the indices of points belonging to cluster k.

Returns

labels – labels[j] = k if example j belongs to cluster k.

Return type

1D array of integers
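
The inverse conversion can be sketched as follows, assuming the clusters form a partition of the indices 0, ..., n_samples - 1. The function name is hypothetical.

```python
import numpy as np

def labels_from_clusters(clusters):
    """Return labels such that labels[j] = k when j is listed in clusters[k]."""
    n_samples = sum(len(indices) for indices in clusters.values())
    labels = np.full(n_samples, -1, dtype=int)  # -1 marks "not assigned yet"
    for k, indices in clusters.items():
        labels[np.asarray(indices, dtype=int)] = k
    return labels

labels = labels_from_clusters({0: [1, 4], 1: [0, 2], 2: [3]})
print(labels)  # [1 0 1 2 0]
```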

class KMedoids(nClusters, nIter=100, init='k-meds++', algo='PAM', squaredDist=False, runs=10)[source]

Bases: object

Class for k-medoids clustering.

nClusters

Number of clusters.

Type

int

nIter

Maximum number of iterations.

Type

int, default 100

init

Medoids initialization method. Random selection if ‘random’. If ‘k-meds++’, we use the method described in the following article: Hae-Sang Park, Chi-Hyuck Jun, “A simple and fast algorithm for K-medoids clustering”, 2009. If ‘multipleRuns’, the clustering algorithm is run self.runs times with random initial medoids. The best solution in terms of the cost function is returned.

Type

str, ‘k-meds++’, ‘random’ or ‘multipleRuns’, default ‘k-meds++’

medoids

Array of integers containing the ids of the medoids.

Type

1D array of length nClusters

algo

Algorithm for k-medoids. Park & Jun’s algorithm is simpler and faster but explores a smaller search space than PAM (Partitioning around medoids).

Type

‘ParkJun’ or ‘PAM’, default ‘PAM’

squaredDist

Whether the cost function and the medoid update rule use squared dissimilarities.

Type

boolean, default False.
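
To make the role of squared dissimilarities concrete, here is a minimal sketch of a k-medoids cost function. It assumes the usual definition, the sum over all samples of the dissimilarity to the closest medoid; the documentation does not spell out the formula used by EvalCostFunction, and the name `kmedoids_cost` is hypothetical.

```python
import numpy as np

def kmedoids_cost(medoids, dist_matrix, squared=False):
    """Sum over samples of the (optionally squared) distance to the closest medoid."""
    d = np.asarray(dist_matrix, dtype=float)[:, medoids]
    if squared:
        d = d ** 2
    return float(d.min(axis=1).sum())

D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])
cost_plain = kmedoids_cost([0, 2], D)
cost_squared = kmedoids_cost([1], D, squared=True)
print(cost_plain)    # 1.0
print(cost_squared)  # 10.0
```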

runs

Number of times the clustering algorithm is run when using init=’multipleRuns’.

Type

integer, default 10.

EvalCostFunction(medoids, distMatrix, isAlreadySquared=True, **kwargs)[source]
GetMedoids()[source]
InitializeMedoids(distanceMatrix)[source]

Initial medoids selection. Method described in Hae-Sang Park, Chi-Hyuck Jun, “A simple and fast algorithm for K-medoids clustering”, 2009.

Parameters

distanceMatrix (2D array of shape (n_samples,n_samples)) –
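
The initialization rule of Park and Jun can be sketched as follows. Each sample j receives the score v_j = sum_i d_ij / sum_l d_il, and the samples with the smallest scores become the initial medoids. This is an illustrative reading of the article, not the library's code; the function name and the explicit `n_clusters` argument are assumptions, and every row of the matrix is assumed to have a nonzero sum.

```python
import numpy as np

def park_jun_initial_medoids(dist_matrix, n_clusters):
    """Pick the n_clusters samples with the smallest normalized distance score."""
    d = np.asarray(dist_matrix, dtype=float)
    row_sums = d.sum(axis=1, keepdims=True)   # sum_l d_il for every row i
    v = (d / row_sums).sum(axis=0)            # v_j = sum_i d_ij / sum_l d_il
    return np.argsort(v, kind="stable")[:n_clusters]

x = np.array([0.0, 1.0, 2.0, 10.0])
D = np.abs(x[:, None] - x[None, :])
medoids = park_jun_initial_medoids(D, 2)
print(medoids)  # [1 2]
```

The score favours centrally located samples, so the initial medoids tend to sit close together; the subsequent iterations are what spread them across clusters.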

SetCostFunction(costFunction)[source]
fit(distMatrix, printCostFct=False, verbose=False)[source]
fit_PAM(distMatrix, printCostFct=False, verbose=False)[source]

Implementation of Partitioning Around Medoids (PAM) algorithm for k-medoids.

Parameters
  • distMatrix (2D array of shape (n_samples,n_samples)) –

  • printCostFct (boolean) –
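
As a rough illustration of the swap phase of PAM, the sketch below repeatedly applies the single medoid/non-medoid swap that lowers the cost the most, and stops when no swap improves it. It is a simplified steepest-descent variant written for this page; the function name, arguments and stopping rule are assumptions and may differ from fit_PAM.

```python
import numpy as np

def pam_swap(dist_matrix, initial_medoids, n_iter=100):
    """Greedy swap phase: apply the best improving swap until none is left."""
    d = np.asarray(dist_matrix, dtype=float)
    medoids = list(initial_medoids)
    cost = d[:, medoids].min(axis=1).sum()
    for _ in range(n_iter):
        best_cost, best_swap = cost, None
        for pos in range(len(medoids)):
            for candidate in range(d.shape[0]):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[pos] = candidate
                trial_cost = d[:, trial].min(axis=1).sum()
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, (pos, candidate)
        if best_swap is None:
            break  # local optimum: no swap lowers the cost
        medoids[best_swap[0]] = best_swap[1]
        cost = best_cost
    return np.array(medoids), float(cost)

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(x[:, None] - x[None, :])
medoids, cost = pam_swap(D, [0, 1])
print(sorted(medoids.tolist()), cost)  # [1, 4] 4.0
```

Every candidate swap is evaluated at each iteration, which is why this approach explores a larger search space than the Voronoi iteration at a higher computational price.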

fit_ParkJun(distMatrix, printCostFct=False, verbose=False)[source]

Implementation of k-medoids clustering based on the Voronoi iteration approach (Park and Jun 2009). This code is a slightly modified version of the code presented in: “NumPy/SciPy recipes for data science: k-Medoids clustering”, C. Bauckhage.

Parameters
  • distMatrix (2D array of shape (n_samples,n_samples)) –

  • printCostFct (boolean) –
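
The Voronoi iteration alternates two steps: assign each sample to its closest medoid, then replace each medoid by the cluster member that minimizes the summed distance to the other members. The sketch below follows that idea in the spirit of Bauckhage's recipe; it is not the library's code, and the function name, the explicit `initial_medoids` argument and the handling of empty clusters are assumptions.

```python
import numpy as np

def voronoi_kmedoids(dist_matrix, initial_medoids, n_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    d = np.asarray(dist_matrix, dtype=float)
    medoids = np.array(initial_medoids, dtype=int)
    for _ in range(n_iter):
        # Assignment step: closest medoid for every sample.
        labels = np.argmin(d[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for k in range(medoids.size):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue  # keep the previous medoid for an empty cluster
            # Update step: member with the smallest summed distance to its cluster.
            within = d[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(d[:, medoids], axis=1)
    return medoids, labels

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(x[:, None] - x[None, :])
medoids, labels = voronoi_kmedoids(D, [0, 1])
print(medoids)  # [1 4]
print(labels)   # [0 0 0 1 1 1]
```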

fit_predict(distMatrix, printCostFct=False, verbose=False)[source]

Computes clusters and gives cluster assignments.

Parameters

distMatrix (2D array of shape (n_samples,n_samples)) –

Returns

labels – Array of integers containing the index of the cluster each sample belongs to.

Return type

1D array of length n_samples

predict(distMatrix)[source]

Gives cluster assignments for training data.

Parameters

distMatrix (2D array of shape (n_samples,n_samples)) –

Returns

labels – Array of integers containing the index of the cluster each sample belongs to.

Return type

1D array of length n_samples

predictTest(distMatrix)[source]

Gives cluster assignments for test data.

Parameters

distMatrix (2D array of shape (n_samples,n_clusters)) – such that distMatrix[i,k] is the distance between the i-th test example and the k-th medoid.

Returns

labels – Array of integers containing the index of the cluster each sample belongs to.

Return type

1D array of length n_samples
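
Given a matrix of distances between test samples and medoids, the natural assignment rule is to pick the closest medoid for each sample. The sketch below assumes that rule; the function name is hypothetical and the code does not call the class.

```python
import numpy as np

def assign_to_closest_medoid(dist_to_medoids):
    """Return, for each row i, the index k of the medoid with the smallest distance."""
    return np.argmin(np.asarray(dist_to_medoids, dtype=float), axis=1)

# Three test samples, two medoids.
dist_to_medoids = np.array([[0.2, 1.5],
                            [2.0, 0.1],
                            [0.7, 0.9]])
test_labels = assign_to_closest_medoid(dist_to_medoids)
print(test_labels)  # [0 1 0]
```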