Theses and Dissertations at Montana State University (MSU)
Permanent URI for this community: https://scholarworks.montana.edu/handle/1/732
137 results
Search Results
Item Exploratory study on the effectiveness of type-level complexity metrics (Montana State University - Bozeman, College of Engineering, 2018) Smith, Killian; Chairperson, Graduate Committee: Clemente Izurieta
The research presented in this thesis analyzes the feasibility of using information collected at the type level of object oriented software systems as a metric for software complexity, using the number of recorded faults as the response variable. In other words, we ask the question: Do popular industrial language type systems encode enough of the model logic to provide useful information about software quality? A longitudinal case study was performed on five open source Java projects of varying sizes and domains to obtain empirical evidence supporting the proposed type level metrics. It is shown that the type level metrics Unique Morphisms and Logic per Line of Code are more strongly correlated with the number of reported faults than the popular metrics Cyclomatic Complexity and Instability, and performed comparably to Afferent Coupling, Control per Line of Code, and Depth of Inheritance Tree. However, the type level metrics did not perform as well as Efferent Coupling. In addition to looking at metrics at single points in time, successive changes in metrics between software versions were analyzed. There was insufficient evidence to suggest that the metrics reviewed in this case study provided predictive capabilities with regard to the number of faults in the system.
This work is an exploratory study; reducing the threats to external validity requires further research on a wider variety of domains and languages.
Item On the usability of continuous time bayesian networks: improving scalability and expressiveness (Montana State University - Bozeman, College of Engineering, 2017) Perreault, Logan Jared; Chairperson, Graduate Committee: John Sheppard
The Continuous Time Bayesian Network (CTBN) is a model capable of compactly representing the behavior of discrete state systems that evolve in continuous time. This is achieved by factoring a Continuous Time Markov Process using the structure of a directed graph. Although CTBNs have proven themselves useful in a variety of applications, adoption of the model for use in real-world problems can be difficult. We believe this is due in part to limitations relating to scalability as well as representational power and ease of use. This dissertation attempts to address these issues. First, we improve the expressiveness of CTBNs by providing procedures that support the representation of non-exponential parametric distributions. We also propose the Continuous Time Decision Network (CTDN) as a framework for representing decision problems using CTBNs. This new model supports optimization of a utility value as a function of a set of possible decisions. Next, we address the issue of scalability by providing two distinct methods for compactly representing CTBNs by taking advantage of similarities in the model parameters. These compact representations are able to mitigate the exponential growth in parameters that CTBNs exhibit, allowing for the representation of more complex processes. We then introduce another approach to managing CTBN model complexity by introducing the concept of disjunctive interaction for CTBNs.
Disjunctive interaction has been used in Bayesian networks to provide significant reductions in the number of parameters, and we have adapted this concept to provide the same benefits within the CTBN framework. Finally, we demonstrate how CTBNs can be applied to the real-world task of system prognostics and diagnostics. We show how models can be built and parameterized directly using information that is readily available for diagnostic models. We then apply these model construction techniques to build a CTBN describing a vehicle system. The vehicle model makes use of some of the newly introduced algorithms and techniques, including the CTDN framework and disjunctive interaction. This extended application not only demonstrates the utility of the novel contributions presented in this work, but also serves as a template for applying CTBNs to other real-world problems.
Item Factored evolutionary algorithms: cooperative coevolutionary optimization with overlap (Montana State University - Bozeman, College of Engineering, 2017) Strasser, Shane Tyler; Chairperson, Graduate Committee: John Sheppard
Factored Evolutionary Algorithms (FEA) define a relatively new class of evolutionary-based optimization algorithms that have been successfully applied to various problems, such as training neural networks and performing abductive inference in graphical models. FEA is unique in that it factors the function being optimized by creating subpopulations that optimize over a subset of dimensions of the function. However, unlike other optimization techniques that subdivide optimization problems, FEA encourages subpopulations to overlap with one another, allowing subpopulations to compete and share information. Although FEA has been shown to be very effective at function optimization, there is still little understanding with respect to its general characteristics. In this dissertation, we present seven results exploring the theoretical and empirical properties of FEA.
First, we present a formal definition of FEA and demonstrate its relationships to other multiple population algorithms. Second, we demonstrate that FEA's success is independent of the underlying optimization algorithm by evaluating the performance of FEA using a wide variety of evolutionary- and swarm-based algorithms over single-population and non-overlapping versions. Third, we demonstrate that for a given problem, there is an optimal way to generate groups of overlapping subpopulations derived using the Markov blanket in Bayesian networks. Fourth, we establish that a class of optimization functions like NK landscapes can be mapped directly to probabilistic graphical models. Additionally, we demonstrate that factor architectures derived from Markov blankets maintain better diversity of individuals in their population. Fifth, we present a new discrete Particle Swarm Optimization (PSO) algorithm and compare its performance to competing approaches. In addition, we analyze the performance of FEA versions of discrete PSO and discover that FEA masks the poor performance of search algorithms. We show what conditions are necessary for FEA to converge and scenarios where FEA may become stuck in suboptimal regions in the search space. Finally, we explore the performance of FEA on unitation functions and discover several instances where FEA struggles to outperform single-population algorithms. These results allow us to determine which situations are appropriate for FEA when solving real-world problems.
Item New methods in computation of reaction fluxes from metabolomics data (Montana State University - Bozeman, College of Engineering, 2018) Salinas Duron, Daniel; Chairperson, Graduate Committee: Brendan Mumey
Changes in cellular metabolism can be deduced from how they affect the measurable metabolites in cell samples. We provide methods to compute metabolic reaction rates from changes in measurable metabolites over time.
The methods provided are intended to overcome technical challenges, such as the inapplicability of a steady state assumption, heterogeneity of samples from different donors, and the lack of targeted metabolomics data. Solutions to these challenges involve identifying metabolites constrained even under non-steady state, using components analysis to find the donor consensus, and using an integer linear program to solve a set cover variant designed to generate targeted data from untargeted data. The methods are applied to data derived from diseased articular cells. The results show that the reaction rates inferred from the incomplete data are biologically relevant, and that the minimal pathways captured ancillary processes that alternative approaches ignored. We conclude that, although the resulting rates and pathways are not conclusive, they provide useful guidance on experiments to pursue. On the experimental side, our findings have led us to believe that osteoarthritic chondrocytes respond to compression by initiating protein synthesis, opening the possibility of physical therapy as a stimulus for cartilage regeneration.
Item Using semi-supervised learning for predicting metamorphic relations (Montana State University - Bozeman, College of Engineering, 2018) Hardin, Bonnie Elizabeth; Chairperson, Graduate Committee: Upulee Kanewala
Software testing is difficult to automate, especially in programs that face the oracle problem, where an oracle does not exist, or is too hard to develop. Metamorphic testing is a solution to this problem. Metamorphic testing uses metamorphic relations to determine if tests pass or fail. A large amount of time is needed for a domain expert to determine which metamorphic relations can be used to test a given program. Metamorphic relation prediction removes the need for such an expert. We propose a method using semi-supervised learning algorithms to detect which metamorphic relations are applicable to a given code base.
Semi-supervised learning is useful in this problem domain as most programs do not have pre-defined metamorphic relations. These programs are considered unlabeled data in a semi-supervised algorithm. We compare two semi-supervised models with a supervised model, and show that the addition of unlabeled data improves the classification accuracy of the metamorphic relation prediction model.
Item Efficient machine learning using partitioned restricted Boltzmann machines (Montana State University - Bozeman, College of Engineering, 2016) Tosun, Hasari; Chairperson, Graduate Committee: John Sheppard
Restricted Boltzmann Machines (RBM) are energy-based models that are used as generative learning models as well as crucial components of Deep Belief Networks (DBN). The most successful training method to date for RBMs is Contrastive Divergence. However, Contrastive Divergence is inefficient when the number of features is very high and the mixing rate of the Gibbs chain is slow. We develop a new training method that partitions a single RBM into multiple overlapping atomic RBMs. Each partition (RBM) is trained on a section of the input vector. Because it is partitioned into smaller RBMs, all available data can be used for training, and individual RBMs can be trained in parallel. Moreover, as the number of dimensions increases, the number of partitions can be increased to reduce runtime computational resource requirements significantly. All other recently developed methods for training RBMs suffer from some serious disadvantage under bounded computational resources; one is forced to either use a subsample of the whole data, run fewer iterations (early stop criterion), or both. Our Partitioned-RBM method provides an innovative scheme to overcome this shortcoming. By analyzing the role of spatial locality in Deep Belief Networks (DBN), we show that spatially local information becomes diffused as the network becomes deeper.
We demonstrate that deep learning based on partitioning of Restricted Boltzmann Machines (RBMs) is capable of retaining spatially local information. As a result, in addition to computational improvement, reconstruction and classification accuracy of the model are also improved using our Partitioned-RBM training method.
Item Anomaly detection through spatio-temporal data mining, with application to near real-time outlying sensor identification (Montana State University - Bozeman, College of Engineering, 2017) Galarus, Douglas Edward; Chairperson, Graduate Committee: John Paxton; Rafal A. Angryk (co-chair)
There is a need for robust solutions to the challenges of near real-time spatio-temporal outlier and anomaly detection. In our dissertation, we define and demonstrate quality measures for evaluation and comparison of overlapping, real-time, spatio-temporal data providers and for assessment and optimization of data acquisition, system operation and data redistribution. Our measures are tested on real-world data and applications, and our results show the need and potential to develop our own mechanisms for outlier and anomaly detection. We then develop a representative, near real-time solution for the identification of outlying sensors that far outperforms state-of-the-art methods in terms of accuracy and is computationally efficient. When applied to a real-world, meteorological data set, we identify numerous problematic sites that otherwise have not been flagged as bad. We identify sites for which metadata is incorrect. We identify observations that have been mislabeled by provider quality control processes. We also demonstrate that our method outperforms enhanced versions of state-of-the-art methods for assessment of accuracy using comparable or less computation time. There are many quality-related problems with real data sets and, in the absence of an approach like ours, these problems may have largely gone unidentified.
Our approach is novel for the simple but effective way that it accounts for spatial and temporal variation, and in that it addresses more than just accuracy. Collectively these contributions form an overarching data-mining framework and example that can be used and extended for data-mining method development, model building and evaluation of spatio-temporal outlier and anomaly detection processes.
Item Internet measurements and application layer optimizations for faster web communications (Montana State University - Bozeman, College of Engineering, 2017) Goel, Utkarsh; Chairperson, Graduate Committee: Mike Wittie
The evolution of Web technologies enables interactive Web communications and makes the Web ecosystem more complex. To ensure timely delivery of Web content, the Web Performance Community (WPC) -- comprised of browser vendors, content providers, content delivery networks (CDNs), and network regulators -- develops new protocols and optimization techniques. However, new protocols suffer from insufficiently wide adoption and the optimization techniques often require ISP support. To cope with these challenges, I present several measurement techniques through which WPC could better understand the current state of Web performance. I also present several application-layer optimizations that enable applications to control how content is delivered in different networks. This work summarizes several best practices, which have been extensively evaluated on production infrastructure, to which the WPC could and should transition to achieve faster Web communications.
Item Learning spectral filters for single- and multi-label classification of musical instruments (Montana State University - Bozeman, College of Engineering, 2015) Donnelly, Patrick Joseph; Chairperson, Graduate Committee: John Sheppard
Musical instrument recognition is an important research task in music information retrieval.
While many studies have explored the recognition of individual instruments, the field has only recently begun to explore the more difficult multi-label classification problem of identifying the musical instruments present in mixtures. This dissertation presents a novel method for feature extraction in multi-label instrument classification and makes important contributions to the domain of instrument classification and to the research area of multi-label classification. In this work, we consider the largest collection of instrument samples in the literature. We examine 13 musical instruments common to four datasets. We consider multiple performers, multiple dynamic levels, and all possible musical pitches within the range of the instruments. To the area of multi-label classification, we introduce a binary-relevance feature extraction scheme to couple with the common binary-relevance classification paradigm, allowing selection of features unique to each class label. We present a data-driven approach to learning areas of spectral prominence for each instrument and use these locations to guide our binary-relevance feature extraction. We use this approach to estimate source separation of our polyphonic mixtures. We contribute the largest study of single- and multi-label classification in musical instrument literature and demonstrate that our results track with or improve upon the results of comparable approaches. In our solo instrument classification experiments, we provide the seminal use of Bayesian classifiers in the domain and demonstrate the utility of conditional dependencies between frequency- and time-based features for the instrument classification problem. For multi-label instrument classification, we explore the question of dataset bias in a cross-validation study controlled for dataset independence. Additionally, we present a comprehensive cross-dataset study and demonstrate the generalizability of our approach. 
We consider the difficulty of the multi-label problem with regard to label density and cardinality and present experiments with a reduced label set, comparable to many studies in the literature, and demonstrate the efficacy of our system on this easier problem. Furthermore, we provide a comprehensive set of multi-label evaluation measures.
Item Finding disjoint dense clubs in an undirected graph (Montana State University - Bozeman, College of Engineering, 2016) Zou, Peng; Chairperson, Graduate Committee: Binhai Zhu
For over a decade, software such as Twitter, Facebook and WeChat has changed people's lives by making it easy to create social groups and networks. They give people a new convenient 'world' where we can share everything that happens around us, and social networks have grown enormously in recent years. In essence, social networks are full of data and have become an indispensable part of our life. Trust is an important feature of the relationship between two users in a social network. With the development of social networks, the trust among its members has become a big issue. In a social network, the trust among its members usually cannot be carried over many users. In the corresponding social network modeled as a graph, a user is denoted by a vertex and an edge between two vertices means that these two users communicate frequently (above some threshold) and trust each other. An online social community usually corresponds to a dense region in such a graph. A complex social network is usually composed of several groups/communities (the regions with a lot of edges), and this characterization of community structure means the appearance of densely connected groups of vertices, with only sparse connections between groups. For analyzing the structure of social networks and the relationship between users, it is important to find disjoint groups/communities with a small diameter and with a decent size, formally called dense clubs in this thesis.
We focus on handling this NP-complete problem in this thesis. First, from the parameterized computational complexity point of view, we show that this problem does not admit a polynomial kernel (implying that reduction rules are unlikely to yield a practically small problem size). Then, we focus on the dual version of the problem, i.e., deleting 'd' vertices to obtain some disjoint dense clubs. We show that this dual problem admits a simple FPT algorithm using a bounded search tree method (the running time is still too high for practical datasets). Finally, we combine a simple reduction rule together with some heuristic methods to obtain a practical solution (verified by extensive testing on practical datasets). Empirical results show that this heuristic algorithm is very sensitive to all parameters. This algorithm is suitable for graphs which have a mixture of dense and sparse regions. These graphs are very common in the real world.
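As an illustration of the graph model described in the abstract above, the following minimal sketch checks whether a group of users looks like a community with "a small diameter and a decent size". It is not the thesis's algorithm or its formal definition of a dense club: the function names, the adjacency-dictionary representation, and the thresholds max_diameter, min_size and min_density are all assumptions made for this example.

```python
from collections import deque
from itertools import combinations


def induced_diameter(adj, group):
    """Diameter of the subgraph induced by `group`; infinity if it is disconnected."""
    group = set(group)
    worst = 0
    for source in group:
        # Breadth-first search that only walks through vertices inside the group.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v in group and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(group):
            return float("inf")  # some group member is unreachable
        worst = max(worst, max(dist.values()))
    return worst


def edge_density(adj, group):
    """Fraction of possible edges inside `group` that are actually present."""
    group = set(group)
    n = len(group)
    if n < 2:
        return 0.0
    present = sum(1 for u, v in combinations(group, 2) if v in adj.get(u, ()))
    return present / (n * (n - 1) / 2)


def looks_like_dense_club(adj, group, max_diameter=2, min_size=3, min_density=0.5):
    """Hypothetical test: big enough, small diameter, enough internal edges."""
    group = set(group)
    return (
        len(group) >= min_size
        and induced_diameter(adj, group) <= max_diameter
        and edge_density(adj, group) >= min_density
    )


# Toy undirected trust graph: a triangle a-b-c with a path c-d-e attached.
trust = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
print(looks_like_dense_club(trust, {"a", "b", "c"}))
print(looks_like_dense_club(trust, {"a", "b", "c", "d", "e"}))
```

Checking one candidate group is cheap; the hard part the thesis addresses is choosing several disjoint groups at once, which this sketch does not attempt.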