Applications and diagnostics for dimension reduction of multivariate data
Date
2022
Publisher
Montana State University - Bozeman, College of Letters & Science
Abstract
Working with high-dimensional data involves a variety of statistical challenges. This dissertation presents a suite of tools and methods for dimension reduction, drawing on latent-variable models, techniques for mapping high-dimensional data, clustering, and modeling of multivariate responses across a variety of use cases.

First, we propose and develop a method for classifying institutions of higher education and compare it with the current standard for university classification: the Carnegie Classification. We present a classification tool based on Structural Equation Models that accommodates correlated indices better than the PCA-based methodology underlying the Carnegie Classification. Additionally, we create a Shiny-based web application that allows users to assess sensitivity to changes in the underlying characteristics of each institution.

Second, we develop a novel methodology that extends Cook's Distance, a diagnostic for identifying influential points in regression, to a new application on high-dimensional mapping tools. We highlight a PERMANOVA-based method for quantifying the change in the shape of the resulting ordination when individual points are included or excluded, similar in style to Cook's Distance for regression. We present a set of simulation studies with several mapping techniques and highlight where the method works well (Classical Multidimensional Scaling) and where it appears to work less effectively (t-distributed Stochastic Neighbor Embedding). Additionally, we examine several real data sets and assess the efficacy of the diagnostic on those data sets.

Finally, we introduce a new method for feature selection in a specific type of divisive clustering called monothetic clustering. Utilizing a penalized matrix decomposition to re-weight the input data to the monothetic clustering algorithm reduces the influence of noise features, allowing the method to make better splits based on single features at a time and leading to improved cluster results. We present a method for tuning both the number of clusters, K, and the degree of sparsity, s, as well as simulation studies that highlight the efficacy of noise reduction in monothetic clustering solutions.
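The leave-one-out influence idea for ordinations can be illustrated with a small sketch. Note that the dissertation's diagnostic is PERMANOVA-based; this sketch instead measures the leave-one-out change with a Procrustes discrepancy, which is a simpler stand-in for the same "refit without point i and compare shapes" logic. All function names (`classical_mds`, `procrustes_ss`, `ordination_influence`) are illustrative, not the dissertation's implementation.

```python
import numpy as np

def classical_mds(D, k=2):
    # Classical (Torgerson) MDS: double-center the squared distance
    # matrix and embed via the top-k eigenpairs.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]        # take the k largest
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def procrustes_ss(X, Y):
    # Procrustes discrepancy in [0, 1]: residual sum of squares after
    # optimally translating, rotating, and scaling Y onto X.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc)
    Yc = Yc / np.linalg.norm(Yc)
    s = np.linalg.svd(Xc.T @ Yc, compute_uv=False)
    return 1.0 - s.sum() ** 2

def ordination_influence(D, k=2):
    # Influence of point i: discrepancy between the full ordination
    # (restricted to the remaining points) and the ordination refit
    # with point i deleted -- a Cook's-Distance-style leave-one-out scan.
    n = D.shape[0]
    full = classical_mds(D, k)
    stats = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        sub = classical_mds(D[np.ix_(keep, keep)], k)
        stats[i] = procrustes_ss(full[keep], sub)
    return stats
```

Large values of the statistic flag points whose removal reshapes the low-dimensional map, the ordination analogue of an influential observation in regression.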
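The re-weighting step in the final chapter rests on a penalized matrix decomposition, which shrinks the loadings of noise features toward zero. A minimal rank-1 sketch of that idea, using soft-thresholded alternating updates with an L1 constraint on the feature vector, is below. The function names, the constraint parameterization (L1 bound `s >= 1` on a unit-L2 feature vector), and the use of the resulting absolute loadings as feature weights are our assumptions for illustration, not the dissertation's implementation.

```python
import numpy as np

def soft_threshold(a, c):
    # Elementwise soft-thresholding operator.
    return np.sign(a) * np.maximum(np.abs(a) - c, 0.0)

def sparse_pc_weights(X, s, n_iter=50):
    # Rank-1 penalized matrix decomposition: alternate between the sample
    # vector u and the feature vector v, enforcing ||v||_1 <= s (with
    # ||v||_2 = 1, so s must be >= 1) via soft-thresholding. Returns
    # absolute loadings, usable as nonnegative feature weights.
    Xc = X - X.mean(axis=0)
    p = Xc.shape[1]
    v = np.full(p, 1.0 / np.sqrt(p))
    for _ in range(n_iter):
        u = Xc @ v
        u /= np.linalg.norm(u)
        a = Xc.T @ u
        # Binary search for the smallest threshold meeting the L1 constraint.
        lo, hi = 0.0, np.abs(a).max()
        for _ in range(50):
            mid = 0.5 * (lo + hi)
            vt = soft_threshold(a, mid)
            nrm = np.linalg.norm(vt)
            if nrm > 0 and np.abs(vt).sum() / nrm <= s:
                hi = mid   # feasible: try a smaller threshold
            else:
                lo = mid   # infeasible (or all-zero): raise the threshold
        v = soft_threshold(a, hi)
        v /= np.linalg.norm(v)
    return np.abs(v)
```

Multiplying each column of the input by its weight before clustering suppresses noise features, so a monothetic split on a single feature is more likely to land on an informative one; the sparsity level `s` would then be tuned alongside the number of clusters K.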