# A Simple Algorithm for I/O-efficient k-means Clustering

### Speaker: Krish Pillaipakkamnatt

#### NOTE: This talk should be accessible to most undergraduate and
graduate students

Let's say you have data about a collection of items (such as web pages,
or insects, or baseball players, or whatever). You might want to
organize this data into a number of groups such that the items within
any group are "similar" (while items that belong to different groups
are "dissimilar"). The hope is that these groupings reveal hidden
patterns in the data that can be used for some purpose. The problem of
partitioning data into groups (or *clusters*) is called the clustering
problem, and an algorithm that creates such clusters is called a
clustering algorithm.

The *k-means clustering problem* requires the partitioning of
the data into "k" clusters. I will present a simple I/O-efficient
algorithm for k-means clustering. This algorithm is based on the
standard divide, conquer and combine technique that is taught in CSC
16, and CSC 155 (as in Merge Sort, for example), and has a
straightforward implementation with no complicated data structures to
maintain.

Our experiments show that this algorithm produces cluster centers that
are, on average, more accurate than the ones produced by the well-known
iterative k-means (Lloyd's) algorithm. For small values of k and large
databases, the running times for our algorithm is usually lower than
Lloyd's algorithm. Each item in the database is examined only once,
and in sequential order. The algorithm is, therefore, also efficient
in terms in terms of I/O access.

The talk is based on joint work with G. Jagannathan and R. Mathur.