A paper published at SIGMOD2024

Created

January 11, 2024

Tags

PaperMachine Learning

Updated

March 29, 2024

We are excited to announce that our paper “Efficient algorithm for K-multiple-means” has been accepted to ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD2024).

Yasuhiro Fujiwara, Atsutoshi Kumagai, Yasutoshi Ida, Masahiro Nakano, Makoto Nakatsuji, Akisato Kimura, “Efficient algorithm for K-multiple-means,” ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD), June 2024.

K-Multiple-Means is an extension of K-means for the clustering of multiple means used in many applications, such as image segmentation, load balancing, and blind-source separation. Since K-means uses only one mean to represent each cluster, it fails to capture non-spherical cluster structures of data points. However, since K-Multiple-Means represents the cluster by computing multiple means and grouping them into specified 𝑐 clusters, it can effectively capture the non-spherical clusters of the data points.

To obtain the clusters, K-Multiple-Means updates a similarity matrix of a bipartite graph between the data points and the multiple means by iteratively computing the leading $c$ singular vectors of the matrix. K-Multiple-Means, however, incurs a high computation cost for large-scale data due to the iterative SVD computations.

Our proposal, F-KMM, increases the efficiency of K-Multiple-Means by computing the singular vectors from a smaller similarity matrix between the multiple means obtained from the similarity matrix of the bipartite graph. To compute the similarity matrix of the bipartite graph efficiently, we skip unnecessary distance computations and estimate lower bounding distances between the data points and the multiple means. Theoretically, the proposed approach guarantees the same clustering results as K-Multiple-Means since it can exactly compute the singular vectors from the similarity matrix between the multiple means.

Experiments show that our approach is several orders of magnitude faster than previous clustering approaches that use multiple means.

The paper has already been published as a part of Proceedings of the ACM on Management of Data at https://dl.acm.org/doi/10.1145/3639273 and will be presented at SIGMOD2024 in June.