PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Clustering of Large Datasets (Razvrščanje velikih podatkovij)
Vladimir Batagelj, Nataša Kejžar, Barbara Japelj-Pavešič and Simona Korenjak-Černe
In: Statistični dnevi, 9-11 Nov 2009, Radenci, Slovenia.


In the last two decades, developments in IT have enabled users to store large datasets on an ordinary PC, which raised the problem of how to analyze such datasets. One answer is clustering methods, in which the units are partitioned into a smaller number of coherent groups (clusters). Classical clustering methods face two problems: hierarchical methods are limited to a small number of units, and nonhierarchical methods are mostly limited to units described by numerical variables and represent each cluster by a single value (usually the cluster center). In this paper, we present our adaptations of clustering methods to data described by discrete distributions. Such a description is more informative and also enables us to cluster very large datasets. Because the standard k-means and Ward's clustering methods are both based on the squared Euclidean distance as the error function, in some cases they do not give the 'expected' results. To reveal the 'expected' structure in the data, we developed new clustering methods based on relative error functions. Applications of the new methods to concrete datasets are also presented.
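
As an illustration only, the following minimal sketch shows what describing units by discrete distributions and clustering them with the standard squared Euclidean error might look like. It is not the authors' method: the function names, the toy data, the naive initialisation, and the choice of the cluster mean as representative are all assumptions made for this example. The relative error functions mentioned in the abstract are not specified here and are therefore not implemented.

```python
from collections import Counter


def to_distribution(values, categories):
    """Describe a unit (a group of raw values) by relative frequencies
    over a fixed list of categories. Hypothetical helper."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def squared_euclidean(p, q):
    """Squared Euclidean distance between two distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_distributions(units, k, max_iter=100):
    """k-means style clustering of units given as discrete distributions.

    Assumption: leaders start as the first k units (naive initialisation).
    Each leader is updated to the component-wise mean of its members,
    which minimises the squared Euclidean error within the cluster.
    """
    leaders = [list(u) for u in units[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign every unit to the nearest leader.
        new_assignment = [
            min(range(k), key=lambda j: squared_euclidean(u, leaders[j]))
            for u in units
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the leaders are too
        assignment = new_assignment
        # Recompute each leader as the mean distribution of its cluster.
        for j in range(k):
            members = [u for u, a in zip(units, assignment) if a == j]
            if members:
                leaders[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, leaders


# Toy data: each unit is a small group of categorical observations.
categories = ["a", "b", "c"]
raw_units = [
    ["a", "a", "a", "b"],
    ["a", "a", "b", "b"],
    ["c", "c", "c", "b"],
    ["c", "c", "b", "b"],
]
units = [to_distribution(u, categories) for u in raw_units]
assignment, leaders = kmeans_distributions(units, k=2)
print(assignment)
print(leaders)
```

Because each unit is reduced to a short vector of relative frequencies, a very large raw dataset can be summarised before clustering, which is the practical advantage the abstract points to. Swapping in a different error function would also require a matching rule for updating the leaders, since the component-wise mean is optimal only for the squared Euclidean error.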

EPrint Type: Conference or Workshop Item (Talk)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
ID Code: 8183
Deposited By: Boris Horvat
Deposited On: 21 February 2012