Anticipatory Clustering

Authors: M. Goller, M. Schrefl
Paper: Goll04a (2004)
Citation: M. H. Hamza (Ed.): Proceedings of the IASTED International Conference on Databases and Applications (DBA 2004) as part of the 22nd IASTED International Multi-Conference on Applied Informatics, Innsbruck, Austria, February 17-19, 2004, ACTA Press, ISBN 0-88986-383-0, pp. 145-150, 2004.
Resources: Copy (In order to obtain the copy please send an email with subject Goll04a to dke.win@jku.at)

Abstract

Clustering is a computationally intensive data mining task, especially in large databases. Previous work shows that using aggregated representations of the original data reduces the cost of computation. However, constructing these aggregated representations is itself time-consuming. This article shows that clustering with aggregated data should be done in two separate steps: a time-consuming preparation step and a clustering step that requires only a fraction of the time the first step does.

The parameters of a specific clustering task are unknown at the time the data are aggregated. Hence, the aggregated data must be stored in a way that does not exclude any future parameter settings. It is even unknown whether a clustering will ever take place. Hence, the preparation must require only few additional resources for anticipatory clustering to be profitable.

This article shows that a task-independent representation can be computed as a by-product of other regular tasks, such as the Extract-Transform-Load cycle in data warehouses, which are often used in combination with data mining.
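
The two-step idea can be illustrated with a minimal sketch. It is not the representation or algorithm from the paper, which the abstract does not specify. As an assumption, the sketch summarizes each grid cell by a point count and a linear sum (in the style of BIRCH clustering features) and then runs a weighted k-means on those summaries. The function names `aggregate` and `weighted_kmeans`, the grid scheme, and the `cell_size` parameter are all hypothetical.

```python
import math


def aggregate(points, cell_size):
    """Preparation step (assumed scheme): one pass over the raw data.

    Each grid cell keeps only a count and a linear sum, so the result
    does not depend on the parameters of any later clustering task.
    """
    summaries = {}
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        if key not in summaries:
            summaries[key] = [0, [0.0] * len(p)]
        entry = summaries[key]
        entry[0] += 1
        entry[1] = [a + b for a, b in zip(entry[1], p)]
    return [(n, ls) for n, ls in summaries.values()]


def weighted_kmeans(summaries, k, iterations=20):
    """Clustering step: reads only the summaries; k is chosen here."""
    if k > len(summaries):
        raise ValueError("k exceeds the number of summaries")
    centroids = [[x / n for x in ls] for n, ls in summaries]
    centers = [list(c) for c in centroids[:k]]  # naive deterministic start
    dim = len(centers[0])
    for _ in range(iterations):
        counts = [0] * k
        sums = [[0.0] * dim for _ in range(k)]
        for (n, ls), c in zip(summaries, centroids):
            # Assign the whole cell to its nearest center.
            j = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(c, centers[i])),
            )
            counts[j] += n
            sums[j] = [a + b for a, b in zip(sums[j], ls)]
        # Recompute each center from the counts and linear sums alone.
        centers = [
            [x / counts[j] for x in sums[j]] if counts[j] else centers[j]
            for j in range(k)
        ]
    return centers


points = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (10.1, 10.2), (10.3, 10.1)]
summaries = aggregate(points, cell_size=1.0)   # done once, e.g. during loading
centers = weighted_kmeans(summaries, k=2)      # done later, per task
```

In this sketch the raw points are read only in `aggregate`, which could run alongside a regular load job, while `weighted_kmeans` can be repeated with different values of `k` without touching the original data.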