Statistics in the presence of cost : cost-considerate variable selection and MCMC convergence diagnostics
Lerch, Michael David
The overarching objective of this research is to recognize and address the cost-benefit trade-off inherent in much of statistics. We identify two places where researchers face such a trade-off: variable selection and Markov chain Monte Carlo (MCMC) sampling. An easily identifiable source of cost in science is taking measurements. Researchers measure variables to estimate another quantity based on a model. When building a model, researchers may have access to a large number of candidate variables and may consider using only a subset of them, so that future uses of the model require measuring only this subset rather than all variables. Researchers have an incentive to proceed in this manner if some variables are prohibitively expensive to measure for future uses of the model. In this research, we present a new algorithm for cost-considerate variable selection in linear modeling for this setting. Since overfitting may be a danger when many variables are at the disposal of the researcher, we build on the LARS and Lasso algorithms to perform cost-based variable selection in concert with model regularization.
In MCMC sampling for Bayesian statistics, the cost-benefit trade-off is unavoidable. Researchers sampling from a posterior distribution must run a sampler for some number of iterations before stopping it and making inference from the finite number of samples drawn. Here the cost to be reduced is the time spent running the sampler, weighed against the fact that the longer the sampler runs, the better the convergence. Time may not be as tangible a cost as a dollar figure, but a longer wait to perform analyses incurs the cost of running a computer and any negative effects of the delay while the researcher waits for the sampler to finish. In this research, we introduce new convergence assessment tools in the form of a diagnostic and a plot. Unlike commonly used convergence diagnostics, these new tools focus explicitly on posterior quantiles and probabilities, which are common inferential objectives in Bayesian statistics. Additionally, we introduce equivalence testing to the convergence assessment domain by using it as the framework of the diagnostic.
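To make the idea of equivalence testing for convergence concrete, the following is a minimal sketch and not the diagnostic developed in the thesis. It assumes two chains, estimates a posterior probability P(theta <= threshold) from each, and applies a two one-sided tests (TOST) style check: the chains are declared equivalent only if a confidence interval for the difference lies entirely inside a margin (-delta, delta). The function names, the batch-means standard error, the normal approximation, and the default values of `delta` and `alpha` are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def ar1_chain(n, rho=0.5, mu=0.0, seed=0):
    """Simulate an AR(1) sequence as a stand-in for correlated MCMC output (hypothetical helper)."""
    rng = random.Random(seed)
    x, chain = mu, []
    for _ in range(n):
        x = mu + rho * (x - mu) + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def batch_means(values, n_batches=20):
    """Return (estimate, standard error) for the mean of a correlated sequence via batch means."""
    size = len(values) // n_batches
    batches = [fmean(values[i * size:(i + 1) * size]) for i in range(n_batches)]
    return fmean(batches), math.sqrt(variance(batches) / n_batches)


def tost_equivalent(chain_a, chain_b, threshold, delta=0.05, alpha=0.05):
    """TOST-style check that two chains agree on P(theta <= threshold) to within delta.

    Returns (equivalent, (lower, upper)), where (lower, upper) is the
    (1 - 2 * alpha) normal-approximation interval for the difference.
    """
    ind_a = [1.0 if x <= threshold else 0.0 for x in chain_a]
    ind_b = [1.0 if x <= threshold else 0.0 for x in chain_b]
    p_a, se_a = batch_means(ind_a)
    p_b, se_b = batch_means(ind_b)
    diff = p_a - p_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    z = NormalDist().inv_cdf(1.0 - alpha)
    lower, upper = diff - z * se, diff + z * se
    # Equivalence is concluded only when the whole interval sits inside the margin.
    return (-delta < lower and upper < delta), (lower, upper)
```

Note the reversal relative to a conventional test: the burden of proof is on showing agreement, so a short or poorly mixing run, which yields a wide interval, fails the check rather than passing by default. For example, `tost_equivalent(ar1_chain(20000, seed=1), ar1_chain(20000, seed=2), threshold=0.0)` compares two independent runs targeting the same distribution.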