Let's say you have data about a collection of items (such as web pages, or insects, or baseball players, or whatever). You might want to organize this data into a number of groups such that the items within any group are "similar" (while items that belong to different groups are "dissimilar"). The hope is that these groupings reveal hidden patterns in the data that can be used for some purpose. The problem of partitioning data into groups (or clusters) is called the clustering problem, and an algorithm that creates such clusters is called a clustering algorithm.
The k-means clustering problem asks for a partition of the data into k clusters. I will present a simple I/O-efficient algorithm for k-means clustering. The algorithm is based on the standard divide, conquer, and combine technique taught in CSC 16 and CSC 155 (as in merge sort, for example), and it has a straightforward implementation with no complicated data structures to maintain.
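To make the divide, conquer, and combine idea concrete, here is a minimal sketch (not the talk's exact algorithm; the block size, weighting scheme, and base-case clustering routine are all assumptions): split the data, cluster each half into k weighted centers, then cluster the combined weighted centers into k final centers.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_kmeans(points, weights, k, iters=20):
    """Simple weighted Lloyd's iteration, used here as the base case
    and as the 'combine' step (an assumption for this sketch)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment: attach each weighted point to its nearest center.
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            groups[i].append((p, w))
        # Update: move each center to the weighted mean of its group.
        for i, g in enumerate(groups):
            if g:
                total = sum(w for _, w in g)
                centers[i] = tuple(sum(p[d] * w for p, w in g) / total
                                   for d in range(len(g[0][0])))
    return centers

def dc_kmeans(points, k, block=256):
    """Divide: recursively cluster each half into k weighted centers.
    Combine: cluster the pooled centers into k final centers."""
    if len(points) <= block:
        return weighted_kmeans(points, [1.0] * len(points), k)
    mid = len(points) // 2
    left = dc_kmeans(points[:mid], k, block)
    right = dc_kmeans(points[mid:], k, block)
    # Weight each sub-center by the size of the half it summarizes
    # (a simplification; a real algorithm would carry exact cluster sizes).
    centers = left + right
    weights = [mid / k] * k + [(len(points) - mid) / k] * k
    return weighted_kmeans(centers, weights, k)
```

Because each level of the recursion only touches its own slice of the data, the points themselves are read once and in order; only the small sets of weighted centers flow upward, which is what makes this style of algorithm I/O-friendly.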
Our experiments show that this algorithm produces cluster centers that are, on average, more accurate than those produced by the well-known iterative k-means (Lloyd's) algorithm. For small values of k and large databases, the running time of our algorithm is usually lower than that of Lloyd's algorithm. Each item in the database is examined only once, and in sequential order; the algorithm is therefore also efficient in terms of I/O access.
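For contrast, here is a compact sketch of the standard Lloyd's algorithm (textbook version, not from the talk): every iteration makes a full pass over the entire dataset, so a large on-disk database is re-read many times, which is exactly the cost a single-pass algorithm avoids.

```python
def lloyd_kmeans(points, k, iters=20):
    """Standard iterative k-means (Lloyd's algorithm).
    Each iteration rescans all of `points`; on disk-resident data,
    that means repeated sequential I/O over the whole database."""
    centers = list(points[:k])  # naive init; k-means++ is common in practice
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # Update step: move each center to the mean of its group.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g
                   else centers[i]
                   for i, g in enumerate(groups)]
    return centers
```

With k iterations capped at a constant, the work is O(iters * n * k) distance computations, but the I/O cost is iters full scans of the data rather than one.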
The talk is based on joint work with G. Jagannathan and R. Mathur.