Monothetic cluster analysis with extensions to circular and functional data
Date
2019
Publisher
Montana State University - Bozeman, College of Letters & Science
Abstract
Monothetic clustering is a divisive clustering method that uses a hierarchical, recursive partitioning of multivariate responses based on binary decision rules, each built from an individual response variable. This technique is helpful for applications where both the rules that define the groups of observations and the assignment of new subjects to clusters are important. Based on the ideas of classification and regression trees, a monothetic clustering algorithm was implemented in R to allow further exploration and modification. A common problem in clustering is deciding whether a cluster structure is present and, if it is, how many clusters are 'enough'. Some well-established techniques are reviewed, and new methods based on cross-validation and on permutation-based hypothesis tests at each split are suggested. Monothetic clustering is of interest in a variety of situations, including data sets with circular variables, which are not measured on a linear scale. A method for monothetic clustering and visualization of clusters with circular variables was developed that could also be used in other classification and regression tree situations. Clustering is also of interest for data sets where the responses can be transformed into functional data, which have unique properties that need exploring. Partitioning Using Local Subregions (PULS), a clustering technique inspired by monothetic clustering and designed to overcome some of its disadvantages in clustering functional data, is discussed. In this algorithm, clusters are formed by aggregating the information from several variables or time intervals. In both monothetic clustering and PULS, the set of feasible splitting variables can be limited so that new observations can be assigned to clusters without observing all variables or times.
R packages implementing these methods have been developed for others to use and test and to support the proposed research, and a detailed vignette explains how to use all of the functions developed here.
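To illustrate the idea of a single monothetic split described in the abstract, the following is a minimal sketch in Python, not the R implementation from the thesis. It assumes the split criterion is the total within-cluster sum of squared Euclidean distances to the cluster mean (often called inertia), which the abstract does not specify. The function names `inertia`, `best_split`, and `assign`, and the `allowed_vars` parameter, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[float]
Split = Tuple[int, float, float]  # (variable index, cut value, inertia after split)


def inertia(rows: List[Row]) -> float:
    """Sum of squared Euclidean distances from each row to the cluster mean."""
    if not rows:
        return 0.0
    n_vars = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_vars)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(n_vars))


def best_split(rows: List[Row],
               allowed_vars: Optional[Sequence[int]] = None) -> Optional[Split]:
    """Search every allowed variable and every midpoint between consecutive
    distinct values for the binary rule that minimizes the summed inertia
    of the two resulting clusters (assumed criterion)."""
    n_vars = len(rows[0])
    candidates = range(n_vars) if allowed_vars is None else allowed_vars
    best: Optional[Split] = None
    for j in candidates:
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [r for r in rows if r[j] < cut]
            right = [r for r in rows if r[j] >= cut]
            total = inertia(left) + inertia(right)
            if best is None or total < best[2]:
                best = (j, cut, total)
    return best


def assign(row: Row, split: Split) -> int:
    """Place a new observation using only the single splitting variable."""
    j, cut, _ = split
    return 0 if row[j] < cut else 1


# Made-up toy data: variable 0 separates two groups, variable 1 is noise.
data = [(1.0, 0.2), (1.2, 0.5), (0.9, 0.3),
        (5.0, 0.4), (5.1, 0.1), (4.8, 0.6)]
split = best_split(data)
restricted = best_split(data, allowed_vars=[1])  # only variable 1 may be used
```

Because each rule involves one variable, `assign` needs only that variable's value for a new observation. Passing `allowed_vars` mimics restricting the feasible splitting variables so that new observations can be clustered without measuring every variable. Applying `best_split` recursively to the resulting clusters would give the hierarchical partition; deciding when to stop is a separate question that this sketch does not address.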