Replies: 1 comment
Hi, your analysis of the `sklearn.cluster.KMeans` implementation is thorough and accurate. The KMeans implementation in scikit-learn does indeed use a variant of the k-means++ initialization method. This variant, as you have pointed out, samples several candidate centers in each iteration and then greedily chooses the one that decreases the potential function the most. The discrepancy between the theoretical guarantees of the original k-means++ algorithm and the "greedy k-means++" variant used in scikit-learn's KMeans implementation is a valid point. I would recommend raising this issue with the scikit-learn developers.
I believe that the class `sklearn.cluster.KMeans` with the default setting `init='k-means++'` does not implement the original k-means++ algorithm as described in *k-means++: The Advantages of Careful Seeding* (Arthur and Vassilvitskii, 2007). Rather, it implements a close variant called "greedy k-means++" (also described in that paper) that usually gives better results (so it is a good decision), but lacks the theoretical guarantee mentioned on https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html of being $\mathcal{O}(\log k)$-optimal. "Greedy k-means++" was described in the above paper in two sentences stating that, despite its generally better performance, the optimality proofs for k-means++ are not valid for "greedy k-means++".
Why do I think that `sklearn.cluster.KMeans` actually implements "greedy k-means++"? When calling `sklearn.cluster.KMeans` with `init='k-means++'`, the function `_kmeans_plusplus` is called (scikit-learn/sklearn/cluster/_kmeans.py, lines 951 to 957 in f3f51f9).
This function has a parameter `n_local_trials` with default `None`, which is the value always used by `sklearn.cluster.KMeans`, since it does not provide a value for `n_local_trials` when calling `_kmeans_plusplus()` (scikit-learn/sklearn/cluster/_kmeans.py, line 154 in f3f51f9).
The parameter `n_local_trials` is described in scikit-learn/sklearn/cluster/_kmeans.py, lines 81 to 85 in f3f51f9, and is used in the greedy candidate selection in lines 192 to 197 in f3f51f9.
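To make the mechanism concrete, here is a rough pure-NumPy sketch of the seeding loop. This is my own paraphrase for illustration, not the actual scikit-learn code; only the default `n_local_trials = 2 + int(np.log(n_clusters))` is taken from the sklearn source quoted above:

```python
import numpy as np

def greedy_kmeans_pp(X, n_clusters, rng, n_local_trials=None):
    """Sketch of greedy k-means++ seeding (illustration, not sklearn's code).

    For each new center, sample n_local_trials candidates with probability
    proportional to the squared distance to the nearest existing center
    (D^2 sampling), then greedily keep the candidate that reduces the
    potential (sum of squared distances) the most.
    """
    n_samples = X.shape[0]
    if n_local_trials is None:
        # default used by scikit-learn's _kmeans_plusplus
        n_local_trials = 2 + int(np.log(n_clusters))

    centers = [X[rng.integers(n_samples)]]  # first center: uniform at random
    closest_sq = ((X - centers[0]) ** 2).sum(axis=1)

    for _ in range(1, n_clusters):
        # D^2 sampling of several candidate centers
        probs = closest_sq / closest_sq.sum()
        candidates = rng.choice(n_samples, size=n_local_trials, p=probs)

        # greedy step: keep the candidate that minimizes the new potential
        best_pot, best_idx, best_sq = None, None, None
        for idx in candidates:
            d_sq = ((X - X[idx]) ** 2).sum(axis=1)
            new_sq = np.minimum(closest_sq, d_sq)
            pot = new_sq.sum()
            if best_pot is None or pot < best_pot:
                best_pot, best_idx, best_sq = pot, idx, new_sq

        centers.append(X[best_idx])
        closest_sq = best_sq

    return np.array(centers)
```

Setting `n_local_trials=1` makes the greedy step a no-op and recovers the original k-means++ sampling.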
Thus, for `n_local_trials > 1` the algorithm used corresponds to the greedy variant of Arthur and Vassilvitskii, now commonly called "greedy k-means++".

According to the recent paper *A Nearly Tight Analysis of Greedy k-means++* (Grunau, Özüdoğru, Rozhoň, Tětek), this algorithm is only $\mathcal{O}(\ell^3 \log^3 k)$-optimal, whereby $\ell$ denotes `n_local_trials`. With the sklearn definition `n_local_trials = 2 + int(np.log(n_clusters))` one can replace $\ell$ with $\log k$ in the above $\mathcal{O}$-notation and thus arrives at $\mathcal{O}(\log^6 k)$, a clearly weaker guarantee than $\mathcal{O}(\log k)$.

There is also an earlier paper, *Noisy, Greedy and Not So Greedy k-means++* (Bhattacharya, Eube, Röglin, Schmidt; ESA 2020), proving a lower bound of $\Omega(\ell \log k)$ for "greedy k-means++", which also shows that "greedy k-means++" cannot be $\mathcal{O}(\log k)$-optimal.
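As a quick sanity check of the substitution $\ell \approx \log k$, one can evaluate the sklearn default for a few values of `n_clusters`:

```python
import numpy as np

# sklearn's default number of local trials grows logarithmically in k
defaults = {k: 2 + int(np.log(k)) for k in (10, 100, 1000, 10000)}
for k, n_local_trials in defaults.items():
    print(f"k={k}: n_local_trials={n_local_trials}")
# k=10: 4, k=100: 6, k=1000: 8, k=10000: 11
```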
Therefore the statement from https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html claiming $\mathcal{O}(\log k)$-optimality is not correct.
BTW: one can get the original k-means++ algorithm with the $\mathcal{O}(\log k)$ guarantee via the function `kmeans_plusplus`, which is separately available in sklearn since version 0.24 and exposes the parameter `n_local_trials` (scikit-learn/sklearn/cluster/_kmeans.py, lines 58 to 60 in f3f51f9). One just has to set `n_local_trials=1`.
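For illustration, a usage sketch (assuming scikit-learn >= 0.24): seed with the non-greedy k-means++ via `kmeans_plusplus(..., n_local_trials=1)` and pass the resulting centers to `KMeans` as an explicit `init` array:

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

X = np.random.default_rng(0).normal(size=(200, 2))

# original (non-greedy) k-means++ seeding: one candidate per new center
centers, indices = kmeans_plusplus(X, n_clusters=4, random_state=0,
                                   n_local_trials=1)

# use the seeds as a fixed initialization for Lloyd's algorithm;
# n_init=1 because the initialization is already determined
km = KMeans(n_clusters=4, init=centers, n_init=1).fit(X)
print(km.cluster_centers_.shape)  # (4, 2)
```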
What do I recommend to do here? At a minimum, the documentation should state that `init='k-means++'` actually uses the greedy variant of k-means++ (which lacks the $\mathcal{O}(\log k)$ guarantee). One could also add a parameter `n_local_trials` to the `__init__` function of `KMeans`. Unfortunately this parameter would be used only if `init='k-means++'` and not if `init='random'`.
Other/better proposals are welcome. I believe, however, that `sklearn.cluster.KMeans`, probably the most popular and also a very efficient implementation of k-means++, should accurately describe what it implements, one reason being to reduce confusion about this widely-used algorithm.