You can install this package with pip using the following command:
```
pip install git+https://github.com/ndgigliotti/cluster-optimizer.git@main
```
This project provides a simple, Scikit-Learn-compatible hyperparameter optimization tool for clustering. It's intended for situations where predicting clusters for new data points is a low priority. Many clustering algorithms in Scikit-Learn are transductive, meaning they are not designed to be applied to new observations. Even if you use an inductive clustering algorithm like K-Means, you may have no need to predict clusters for new observations, or prediction may simply be a lower priority than finding the best clusters in the data.
Since Scikit-Learn's `GridSearchCV` uses cross-validation and is designed to optimize inductive machine learning models, an alternative tool is necessary.
The `ClusterOptimizer` class is a hyperparameter search tool for optimizing clustering algorithms. It simply fits one model per hyperparameter combination and selects the best. It's a spin-off of `GridSearchCV`, and the implementation is derived from Scikit-Learn. The key differences are that it doesn't use cross-validation and that it's designed to work with special clustering scorers. It's not always necessary to provide a target variable, since clustering metrics such as silhouette, Calinski-Harabasz, and Davies-Bouldin are designed for unsupervised clustering.
The interface is largely the same as `GridSearchCV`. One minor difference is that the search results are stored in the `results_` attribute, rather than `cv_results_`.
You can use `ClusterOptimizer` by passing the string name of a Scikit-Learn clustering metric, e.g. `'silhouette'`, `'calinski_harabasz'`, or `'rand_score'` (the `'_score'` suffix is optional). You can also create a special scorer for transductive clustering by calling `scorer.make_scorer` on any score function with the signature `score_func(labels_true, labels_fit)` or `score_func(X, labels_fit)` (see the sketch after the list below).
The following metric names are recognized (the `'_score'` suffix is always optional):
- 'silhouette_score'
- 'silhouette_score_euclidean'
- 'silhouette_score_cosine'
- 'davies_bouldin_score'
- 'calinski_harabasz_score'
- 'mutual_info_score'
- 'normalized_mutual_info_score'
- 'adjusted_mutual_info_score'
- 'rand_score'
- 'adjusted_rand_score'
- 'completeness_score'
- 'fowlkes_mallows_score'
- 'homogeneity_score'
- 'v_measure_score'
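Here's a hedged sketch of the custom-scorer route mentioned above, using a `score_func(X, labels_fit)`-style function; the `cluster_optimizer.scorer` module path and the exact `make_scorer` behavior are assumptions based on the description above:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed import paths -- the scorer module may live elsewhere in this package.
from cluster_optimizer import ClusterOptimizer
from cluster_optimizer.scorer import make_scorer


def negative_noise_ratio(X, labels_fit):
    """score_func(X, labels_fit): penalize the fraction of noise points (label -1)."""
    return -np.mean(np.asarray(labels_fit) == -1)


X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

opt = ClusterOptimizer(
    DBSCAN(),
    param_grid={"eps": [0.3, 0.5, 0.8], "min_samples": [3, 5]},
    scoring=make_scorer(negative_noise_ratio),
)
opt.fit(X)
```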
It's important to consider your dataset and goals before comparing clustering algorithms in a grid search. Just because one algorithm gets a higher score than another does not necessarily make it a better choice. Different clustering algorithms have different benefits, drawbacks, and use cases.
- Write automated tests.
- Develop alternative to `BaseSearchCV`.
- Add multi-metric compatibility.
- Remove noise "cluster" and impose noise limit.
- Update docstrings taken from Scikit-Learn.
- Add more search types (e.g. randomized).
Most of the credit goes to the developers of Scikit-Learn for the engineering behind the search estimators. It's not very hard to spam a bunch of models with different hyperparameters, but it's hard to do it in a robust way with a friendly interface and wide compatibility.