DESIGNING CLUSTERING PROCESS WITH REUSABLE COMPONENTS
Kathrin Kirchner
Faculty of Economics and Business Administration, Friedrich Schiller University of Jena
Boris Delibašić
Faculty of Organizational Sciences
Milan Vukićević
Faculty of Organizational Sciences
Keywords:
Clustering, data mining, patterns, CRISP-DM
Abstract
A typical data mining process, as described e.g. in the CRISP-DM approach, consists of several phases, starting with business and data understanding and proceeding with preprocessing, modeling, and evaluation. For each of these phases, several generic tasks are described that have to be carried out. In practice, however, it is difficult to decide which specialized task best solves a generic task. There are at least three reasons for this. First, a large number of specialized tasks have been proposed in the literature and are available in data mining software. Second, many of these tasks are encapsulated in algorithms and cannot be used independently of the algorithm. Third, specialized tasks (reusable components, RCs) are not well organized, i.e. it is not easy to select the appropriate RC for a generic task (sub-problem). In this paper, we propose a white-box modeling methodology that supports the design of the data mining process. Our paper concentrates on clustering algorithms only. Thus, we propose RCs for commonly appearing sub-problems in clustering, as well as pre- and post-processing RCs.