genericROM.BasicAlgorithms.Clustering

class ClusteringToolbox(clusteringAlgo=None)[source]

Bases: object

Class for clustering problems.

clusteringAlgo: Object containing a clustering algorithm. All clustering algorithms available in Scikit-Learn can be used. If defined by the user, the clustering algorithm must follow Scikit-Learn’s API for clustering.

clusters

clusters[k] is an array containing the indices of points belonging to cluster k.

Type: dict

ClusterRenumbering(clusterIdPermutation)[source]

Changes the numbering of clusters.

Parameters: clusterIdPermutation (list) – List of integers such that clusterIdPermutation[k] is the new index for cluster k.
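The effect of such a permutation on a clusters dictionary can be sketched as follows. The function renumber_clusters is a hypothetical stand-alone helper written for illustration, not the method's actual implementation.

```python
import numpy as np


def renumber_clusters(clusters, cluster_id_permutation):
    """Return a new dict in which old cluster k is stored under key cluster_id_permutation[k]."""
    return {cluster_id_permutation[k]: np.asarray(idx) for k, idx in clusters.items()}


clusters = {0: np.array([0, 3]), 1: np.array([1]), 2: np.array([2, 4])}
# Old cluster 0 becomes cluster 2, old 1 becomes 0, old 2 becomes 1.
renumbered = renumber_clusters(clusters, [2, 0, 1])
```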

GetClusteringAlgo()[source]

GetClusters()[source]

GetLabels()[source]

GetNumberOfClusters()[source]

ReadClusteringResults(resultsFile)[source]

Reads clustering results from a text file.

Parameters: resultsFile (str) – Name of the txt file containing the clustering results.

SetClusteringAlgo(clusteringAlgo)[source]

WriteClusteringResults(outputFileName)[source]

Writes clustering results in a text file.

Parameters: outputFileName (str) – Name of the txt file in which clustering results are written.

fit(X, **kwargs)[source]

Computes clusters.

Parameters: X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).
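When the algorithm expects a distance matrix, X has to be built beforehand. A minimal sketch using the Euclidean distance is given below; the helper pairwise_euclidean is hypothetical, and any other dissimilarity could be used instead.

```python
import numpy as np


def pairwise_euclidean(X):
    """Return the (n_samples, n_samples) matrix of Euclidean distances between rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(X)  # symmetric, zero diagonal
```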

fit_predict(X, returnLabels=False, **kwargs)[source]

Computes clusters and predicts cluster index for each sample.

Parameters

X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).
returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type

1D array of length n_samples

predict(X, returnLabels=False)[source]

Predicts the cluster index for each sample in X.

Parameters

X (array of shape (n_samples, n_features)) – Training instances to cluster. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_samples).
returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type

1D array of length n_samples

predictTest(X, returnLabels=False)[source]

Predicts the cluster index for each sample in X, where X contains new unseen data.

Parameters

X (array of shape (n_samples, n_features)) – New instances to assign to clusters. Note: if your clustering algorithm works on a distance matrix, then X is the distance matrix of shape (n_samples, n_clusters).
returnLabels (boolean) – If True, returns labels. If False, it only updates the object’s attributes (self.clusters).

Returns

labels – Array of integers containing the index of the cluster each sample belongs to. Returned only if returnLabels is True.

Return type

1D array of length n_samples

GetAdjacentClustersFromLabelsVector(labels, localNbSnapshots=None)[source]

Returns a dictionary whose keys are cluster numbers and whose values are the numbers of the adjacent clusters, as deduced from the labels of the data used in the clustering.

Parameters

labels (1D array of integers) – labels[j] = k if example j belongs to cluster k.
localNbSnapshots (1D array or list of integers) – localNbSnapshots[j] is the size of the j-th group of values for which adjacency is well-defined.

Returns

adjacentClusters (dict) – adjacentClusters[k] is an array containing the indices of the clusters adjacent to cluster k.
snapshotsOfAdjacentClusters (dict) – snapshotsOfAdjacentClusters[k] is an array containing the indices of points belonging to cluster k and its adjacent clusters.
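The definition of adjacency is not spelled out here. The sketch below rests on an assumption: two clusters are treated as adjacent when two consecutive samples of the same group carry their two labels. The function adjacent_clusters_sketch is hypothetical and may differ from the actual implementation.

```python
import numpy as np


def adjacent_clusters_sketch(labels, local_nb_snapshots=None):
    """Assumed rule: clusters a and b are adjacent if consecutive samples in a group are labelled a, b."""
    labels = np.asarray(labels)
    if local_nb_snapshots is None:
        local_nb_snapshots = [len(labels)]  # a single group covering all samples
    neighbours = {int(k): set() for k in np.unique(labels)}
    start = 0
    for size in local_nb_snapshots:
        group = labels[start:start + size]
        # Pairs never straddle two groups.
        for a, b in zip(group[:-1], group[1:]):
            if a != b:
                neighbours[int(a)].add(int(b))
                neighbours[int(b)].add(int(a))
        start += size
    adjacent = {k: np.array(sorted(v), dtype=int) for k, v in neighbours.items()}
    # Indices of the samples lying in cluster k or in one of its adjacent clusters.
    snapshots = {
        k: np.flatnonzero(np.isin(labels, [k, *adjacent[k]])) for k in adjacent
    }
    return adjacent, snapshots


labels = [0, 0, 1, 1, 2, 2, 0]
adjacent, snapshots = adjacent_clusters_sketch(labels, [4, 3])
```

With two groups of sizes 4 and 3, the transition between samples 3 and 4 is ignored, so clusters 1 and 2 are not reported as adjacent in this example.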

GetClustersFromLabelsVector(labels)[source]

Returns a dictionary containing clustering results.

Parameters: labels (1D array of integers) – labels[j] = k if example j belongs to cluster k.
Returns: clusters – clusters[k] is an array containing the indices of points belonging to cluster k.
Return type: dict
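The conversion described above can be sketched with NumPy as follows. The function clusters_from_labels is a hypothetical stand-in written for illustration.

```python
import numpy as np


def clusters_from_labels(labels):
    """Map each cluster id k to the array of sample indices j with labels[j] == k."""
    labels = np.asarray(labels)
    return {int(k): np.flatnonzero(labels == k) for k in np.unique(labels)}


clusters = clusters_from_labels([0, 2, 0, 1, 2])
```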

GetLabelsVectorFromClusters(clusters)[source]

Returns the labels vector corresponding to a clusters dictionary.

Parameters: clusters (dict) – clusters[k] is an array containing the indices of points belonging to cluster k.
Returns: labels – labels[j] = k if example j belongs to cluster k.
Return type: 1D array of integers
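A minimal sketch of this inverse conversion is shown below. It assumes that the clusters form a partition of the indices 0 to n_samples - 1; the function labels_from_clusters is hypothetical.

```python
import numpy as np


def labels_from_clusters(clusters):
    """Build labels such that labels[j] == k for every index j listed in clusters[k]."""
    # Assumption: every sample index appears in exactly one cluster.
    n_samples = sum(len(idx) for idx in clusters.values())
    labels = np.full(n_samples, -1, dtype=int)
    for k, idx in clusters.items():
        labels[np.asarray(idx, dtype=int)] = k
    return labels


labels = labels_from_clusters({0: [0, 2], 1: [3], 2: [1, 4]})
```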

class KMedoids(nClusters, nIter=100, init='k-meds++', algo='PAM', squaredDist=False, runs=10)[source]

Bases: object

Class for k-medoids clustering.

nClusters

Number of clusters.

Type: int

nIter

Maximum number of iterations.

Type: int, default 100

init

Medoids initialization method. Random selection if ‘random’. If ‘k-meds++’, we use the method described in the following article: Hae-Sang Park, Chi-Hyuck Jun, “A simple and fast algorithm for K-medoids clustering”, 2009. If ‘multipleRuns’, the clustering algorithm is run self.runs times with random initial medoids. The best solution in terms of the cost function is returned.

Type: str, ‘k-meds++’, ‘random’ or ‘multipleRuns’, default ‘k-meds++’

medoids

Array of integers containing the ids of the medoids.

Type: 1D array of length nClusters

algo

Algorithm for k-medoids. Park & Jun’s algorithm is simpler and faster but explores a smaller search space than PAM (Partitioning around medoids).

Type: ‘ParkJun’ or ‘PAM’, default ‘PAM’

squaredDist

Whether the cost function and the medoid update rule use squared dissimilarities.

Type: boolean, default False.
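To make the role of squared dissimilarities concrete, the sketch below evaluates a common k-medoids cost, the sum over samples of the dissimilarity to the nearest medoid. This form of the cost is an assumption, since the actual cost function is not given here, and kmedoids_cost is a hypothetical helper.

```python
import numpy as np


def kmedoids_cost(dist_matrix, medoids, squared=False):
    """Sum, over all samples, of the (optionally squared) distance to the nearest medoid."""
    d = np.asarray(dist_matrix, dtype=float)[:, medoids]  # shape (n_samples, n_medoids)
    if squared:
        d = d ** 2
    return d.min(axis=1).sum()


D = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])
cost = kmedoids_cost(D, [0, 2])                   # sample 1 is at distance 2 from medoid 0
cost_sq = kmedoids_cost(D, [0, 2], squared=True)  # the same term now contributes 2**2
```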

runs

Number of times the clustering algorithm is run when using init=’multipleRuns’.

Type: integer, default 10.

EvalCostFunction(medoids, distMatrix, isAlreadySquared=True, **kwargs)[source]

GetMedoids()[source]

InitializeMedoids(distanceMatrix)[source]

Initial medoids selection. Method described in Hae-Sang Park, Chi-Hyuck Jun, “A simple and fast algorithm for K-medoids clustering”, 2009.

Parameters: distanceMatrix (2D array of shape (n_samples,n_samples)) –
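The initialization rule of Park and Jun can be sketched as follows: each sample j receives the score v_j = sum_i d_ij / (sum_l d_il), and the samples with the smallest scores are taken as initial medoids. This is a sketch of the published rule under the assumption that no row of the distance matrix sums to zero; park_jun_init is a hypothetical helper, not the code of this method.

```python
import numpy as np


def park_jun_init(dist_matrix, n_clusters):
    """Return the indices of the n_clusters samples with the smallest Park-Jun score."""
    D = np.asarray(dist_matrix, dtype=float)
    # v[j] = sum_i D[i, j] / sum_l D[i, l]  (each row is normalized by its own sum)
    v = (D / D.sum(axis=1, keepdims=True)).sum(axis=0)
    return np.argsort(v, kind="stable")[:n_clusters]


points = np.array([0.0, 1.0, 2.0, 10.0])
D = np.abs(points[:, None] - points[None, :])
medoids = park_jun_init(D, 2)  # the two most central samples
```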

SetCostFunction(costFunction)[source]

fit(distMatrix, printCostFct=False, verbose=False)[source]

fit_PAM(distMatrix, printCostFct=False, verbose=False)[source]

Implementation of Partitioning Around Medoids (PAM) algorithm for k-medoids.

Parameters

distMatrix (2D array of shape (n_samples,n_samples)) –
printCostFct (boolean) –
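For orientation, the swap phase of PAM can be sketched as below: at each iteration, every (medoid, non-medoid) swap is evaluated and the one that lowers the cost the most is applied, until no swap improves it. The function pam_swap_sketch is hypothetical, uses unsquared distances, and is not claimed to match this method's implementation details.

```python
import numpy as np


def pam_swap_sketch(dist_matrix, init_medoids, n_iter=100):
    """Greedy best-swap search over medoid sets; returns (medoids, cost)."""
    D = np.asarray(dist_matrix, dtype=float)
    medoids = list(init_medoids)
    cost = D[:, medoids].min(axis=1).sum()
    for _ in range(n_iter):
        best_cost, best_swap = cost, None
        for pos in range(len(medoids)):
            for candidate in range(D.shape[0]):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[pos] = candidate
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, (pos, candidate)
        if best_swap is None:
            break  # local optimum: no single swap lowers the cost
        medoids[best_swap[0]] = best_swap[1]
        cost = best_cost
    return np.array(medoids), cost


points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(points[:, None] - points[None, :])
medoids, cost = pam_swap_sketch(D, [0, 1])
```

Each iteration costs on the order of nClusters × n_samples cost evaluations, which is why PAM explores more candidates than the Voronoi iteration at a higher price.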

fit_ParkJun(distMatrix, printCostFct=False, verbose=False)[source]

Implementation of k-medoids clustering based on the Voronoi iteration approach (Park and Jun 2009). This code is a slightly modified version of the code presented in: “NumPy/SciPy recipes for data science: k-Medoids clustering”, C. Bauckhage.

Parameters

distMatrix (2D array of shape (n_samples,n_samples)) –
printCostFct (boolean) –
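The Voronoi iteration alternates two steps: assign each sample to its nearest medoid, then replace each medoid by the cluster member that minimizes the sum of distances to the other members. A minimal sketch follows; voronoi_kmedoids_sketch is a hypothetical helper using unsquared distances, and it is not the code of this method.

```python
import numpy as np


def voronoi_kmedoids_sketch(dist_matrix, init_medoids, n_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    D = np.asarray(dist_matrix, dtype=float)
    medoids = np.array(init_medoids)
    for _ in range(n_iter):
        # Assignment step: nearest medoid for every sample.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(len(medoids)):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue  # keep the previous medoid for an empty cluster
            # Update step: member with the smallest total distance to its cluster.
            within = D[np.ix_(members, members)].sum(axis=0)
            new_medoids[k] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels


points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(points[:, None] - points[None, :])
medoids, labels = voronoi_kmedoids_sketch(D, [0, 3])
```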

fit_predict(distMatrix, printCostFct=False, verbose=False)[source]

Computes clusters and gives cluster assignments.

Parameters: distMatrix (2D array of shape (n_samples,n_samples)) –
Returns: labels – Array of integers containing the index of the cluster each sample belongs to.
Return type: 1D array of length n_samples

predict(distMatrix)[source]

Gives cluster assignments for training data.

Parameters: distMatrix (2D array of shape (n_samples,n_samples)) –
Returns: labels – Array of integers containing the index of the cluster each sample belongs to.
Return type: 1D array of length n_samples

predictTest(distMatrix)[source]

Gives cluster assignments for test data.

Parameters: distMatrix (2D array of shape (n_samples,n_clusters)) – such that distMatrix[i,k] is the distance between the i-th test example and the k-th medoid.
Returns: labels – Array of integers containing the index of the cluster each sample belongs to.
Return type: 1D array of length n_samples
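Given such a matrix of distances to the medoids, a natural assignment rule is to pick the nearest medoid for each test example. The sketch below assumes this is the rule used; assign_to_nearest_medoid is a hypothetical helper.

```python
import numpy as np


def assign_to_nearest_medoid(dist_to_medoids):
    """labels[i] is the column index k minimizing dist_to_medoids[i, k]."""
    return np.asarray(dist_to_medoids).argmin(axis=1)


# Three test examples, two medoids.
dist_to_medoids = np.array([[0.2, 3.1],
                            [2.7, 0.4],
                            [1.0, 1.5]])
labels = assign_to_nearest_medoid(dist_to_medoids)
```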