You can install this package with pip using the following command:
```
pip install git+https://github.com/ndgigliotti/cluster-optimizer.git@main
```
This project provides a simple, Scikit-Learn-compatible hyperparameter optimization tool for clustering. It's intended for situations where predicting clusters for new data points is a low priority. Many clustering algorithms in Scikit-Learn are transductive, meaning they are not designed to be applied to new observations. Even if you use an inductive clustering algorithm like K-Means, you may have no need to predict clusters for new observations, or prediction may simply be a lower priority than finding the best clusters in the data.
Since Scikit-Learn's `GridSearchCV` uses cross-validation and is designed to optimize inductive machine learning models, an alternative tool is necessary.
The `ClusterOptimizer` class is a hyperparameter search tool for optimizing clustering algorithms. It simply fits one model per hyperparameter combination and selects the best. It's a spin-off of `GridSearchCV`, and the implementation is derived from Scikit-Learn. The key differences are that it doesn't use cross-validation and that it's designed to work with special clustering scorers. It's not always necessary to provide a target variable, since clustering metrics such as silhouette, Calinski-Harabasz, and Davies-Bouldin are designed for unsupervised clustering.
The interface is largely the same as `GridSearchCV`. One minor difference is that the search results are stored in the `results_` attribute, rather than `cv_results_`.
You can use `ClusterOptimizer` by passing the string name of a Scikit-Learn clustering metric, e.g. `'silhouette'`, `'calinski_harabasz'`, or `'rand_score'` (the `'_score'` suffix is optional). You can also create a special scorer for transductive clustering by calling `scorer.make_scorer` on any score function with the signature `score_func(labels_true, labels_fit)` or `score_func(X, labels_fit)` (see the sketch after the list below).
The following metric names are recognized (the `'_score'` suffix is always optional):
- 'silhouette_score'
- 'silhouette_score_euclidean'
- 'silhouette_score_cosine'
- 'davies_bouldin_score'
- 'calinski_harabasz_score'
- 'mutual_info_score'
- 'normalized_mutual_info_score'
- 'adjusted_mutual_info_score'
- 'rand_score'
- 'adjusted_rand_score'
- 'completeness_score'
- 'fowlkes_mallows_score'
- 'homogeneity_score'
- 'v_measure_score'
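Here's a hedged sketch of the custom-scorer route mentioned above, using a `score_func(X, labels_fit)`-style function; the `cluster_optimizer.scorer` module path and the exact `make_scorer` behavior are assumptions based on the description above:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed import paths -- the scorer module may live elsewhere in this package.
from cluster_optimizer import ClusterOptimizer
from cluster_optimizer.scorer import make_scorer


def negative_noise_ratio(X, labels_fit):
    """score_func(X, labels_fit): penalize the fraction of noise points (label -1)."""
    return -np.mean(np.asarray(labels_fit) == -1)


X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

opt = ClusterOptimizer(
    DBSCAN(),
    param_grid={"eps": [0.3, 0.5, 0.8], "min_samples": [3, 5]},
    scoring=make_scorer(negative_noise_ratio),
)
opt.fit(X)
```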
It's important to consider your dataset and goals before comparing clustering algorithms in a grid search. Just because one algorithm gets a higher score than another does not necessarily make it a better choice. Different clustering algorithms have different benefits, drawbacks, and use cases.
- Write automated tests.
- Develop alternative to `BaseSearchCV`.
- Add multi-metric compatibility.
- Remove noise "cluster" and impose noise limit.
- Update docstrings taken from Scikit-Learn.
- Add more search types (e.g. randomized).
Most of the credit goes to the developers of Scikit-Learn for the engineering behind the search estimators. It's not very hard to spam a bunch of models with different hyperparameters, but it's hard to do it in a robust way with a friendly interface and wide compatibility.