Replies: 1 comment
Hi, your analysis of the `sklearn.cluster.KMeans` implementation is thorough and accurate. The KMeans implementation in scikit-learn does indeed use a variant of the k-means++ initialization method. This variant, as you have pointed out, samples several candidate centers in each iteration and then greedily chooses the one that decreases the potential function the most. The discrepancy between the theoretical guarantees of the original k-means++ algorithm and the "greedy k-means++" variant used in scikit-learn's KMeans implementation is a valid point. I would recommend raising this issue with the scikit-learn developers.
I believe that the class `sklearn.cluster.KMeans` with the default setting `init='k-means++'` does not implement the original k-means++ algorithm as described in *k-means++: The Advantages of Careful Seeding* (Arthur and Vassilvitskii, 2007). Rather, it implements a close variant called "greedy k-means++" (also described in that paper) that usually gives better results (so it is a good decision), but lacks the theoretical guarantee mentioned on https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html of being $\mathcal{O}(\log k)$-optimal. "Greedy k-means++" was described in the above paper in two sentences stating that, despite its generally better performance, the optimality proofs for k-means++ are not valid for "greedy k-means++".
Why do I think that `sklearn.cluster.KMeans` actually implements "greedy k-means++"? When calling `sklearn.cluster.KMeans` with `init='k-means++'`, the function `_kmeans_plusplus` is called (scikit-learn/sklearn/cluster/_kmeans.py, lines 951 to 957 in f3f51f9).
This function has a parameter `n_local_trials` with default `None`, which is the value always used by `sklearn.cluster.KMeans`, since it does not provide a value for `n_local_trials` when calling `_kmeans_plusplus()` (scikit-learn/sklearn/cluster/_kmeans.py, line 154 in f3f51f9).
The parameter `n_local_trials` is described in scikit-learn/sklearn/cluster/_kmeans.py, lines 81 to 85 in f3f51f9, and is used in the greedy candidate selection in lines 192 to 197 in f3f51f9.
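To make the mechanism concrete, here is a rough pure-NumPy sketch of the seeding loop. This is my own paraphrase for illustration, not the actual scikit-learn code; only the default `n_local_trials = 2 + int(np.log(n_clusters))` is taken from the sklearn source quoted above:

```python
import numpy as np

def greedy_kmeans_pp(X, n_clusters, rng, n_local_trials=None):
    """Sketch of greedy k-means++ seeding (illustration, not sklearn's code).

    For each new center, sample n_local_trials candidates with probability
    proportional to the squared distance to the nearest existing center
    (D^2 sampling), then greedily keep the candidate that reduces the
    potential (sum of squared distances) the most.
    """
    n_samples = X.shape[0]
    if n_local_trials is None:
        # default used by scikit-learn's _kmeans_plusplus
        n_local_trials = 2 + int(np.log(n_clusters))

    centers = [X[rng.integers(n_samples)]]  # first center: uniform at random
    closest_sq = ((X - centers[0]) ** 2).sum(axis=1)

    for _ in range(1, n_clusters):
        # D^2 sampling of several candidate centers
        probs = closest_sq / closest_sq.sum()
        candidates = rng.choice(n_samples, size=n_local_trials, p=probs)

        # greedy step: keep the candidate that minimizes the new potential
        best_pot, best_idx, best_sq = None, None, None
        for idx in candidates:
            d_sq = ((X - X[idx]) ** 2).sum(axis=1)
            new_sq = np.minimum(closest_sq, d_sq)
            pot = new_sq.sum()
            if best_pot is None or pot < best_pot:
                best_pot, best_idx, best_sq = pot, idx, new_sq

        centers.append(X[best_idx])
        closest_sq = best_sq

    return np.array(centers)
```

Setting `n_local_trials=1` makes the greedy step a no-op and recovers the original k-means++ sampling.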
Thus, for `n_local_trials > 1` the algorithm used corresponds to the greedy variant of Arthur and Vassilvitskii, now commonly called "greedy k-means++".

According to the recent paper *A Nearly Tight Analysis of Greedy k-means++* (Grunau, Özüdoğru, Rozhoň, Tětek), this algorithm is only $\mathcal{O}(\ell^3 \log^3 k)$-optimal, whereby $\ell$ denotes `n_local_trials`. With the sklearn definition `n_local_trials = 2 + int(np.log(n_clusters))` one can replace $\ell$ with $\log k$ in the above $\mathcal{O}$-notation and thus arrives at $\mathcal{O}(\log^6 k)$, a clearly weaker guarantee than $\mathcal{O}(\log k)$.

There is also an earlier paper, *Noisy, Greedy and Not So Greedy k-means++* (Bhattacharya, Eube, Röglin, Schmidt; ESA 2020), proving a lower bound of $\Omega(\ell \log k)$ for "greedy k-means++", which also shows that "greedy k-means++" cannot be $\mathcal{O}(\log k)$-optimal.
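As a quick sanity check of the substitution $\ell \approx \log k$, one can evaluate the sklearn default for a few values of `n_clusters`:

```python
import numpy as np

# sklearn's default number of local trials grows logarithmically in k
defaults = {k: 2 + int(np.log(k)) for k in (10, 100, 1000, 10000)}
for k, n_local_trials in defaults.items():
    print(f"k={k}: n_local_trials={n_local_trials}")
# k=10: 4, k=100: 6, k=1000: 8, k=10000: 11
```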
Therefore the statement from https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html claiming $\mathcal{O}(\log k)$-optimality is not correct.
BTW: one can get the original k-means++ algorithm with the $\mathcal{O}(\log k)$ guarantee via the function `kmeans_plusplus`, which is separately available in sklearn since version 0.24 and exposes the parameter `n_local_trials` (scikit-learn/sklearn/cluster/_kmeans.py, lines 58 to 60 in f3f51f9). One just has to set `n_local_trials=1`.
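For illustration, a usage sketch (assuming scikit-learn >= 0.24): seed with the non-greedy k-means++ via `kmeans_plusplus(..., n_local_trials=1)` and pass the resulting centers to `KMeans` as an explicit `init` array:

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

X = np.random.default_rng(0).normal(size=(200, 2))

# original (non-greedy) k-means++ seeding: one candidate per new center
centers, indices = kmeans_plusplus(X, n_clusters=4, random_state=0,
                                   n_local_trials=1)

# use the seeds as a fixed initialization for Lloyd's algorithm;
# n_init=1 because the initialization is already determined
km = KMeans(n_clusters=4, init=centers, n_init=1).fit(X)
print(km.cluster_centers_.shape)  # (4, 2)
```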
What do I recommend to do here? At a minimum, the documentation should state that `init='k-means++'` actually uses the greedy variant of k-means++ (which lacks the $\mathcal{O}(\log k)$ guarantee). One could also add a parameter `n_local_trials` to the `__init__` function of `KMeans`. Unfortunately this parameter would be used only if `init='k-means++'` and not if `init='random'`.
Other/better proposals are welcome. I believe, however, that `sklearn.cluster.KMeans`, probably the most popular and also a very efficient implementation of k-means++, should accurately describe what it implements, one reason being to reduce confusion about this widely-used algorithm.