Theses and Dissertations at Montana State University (MSU)

Permanent URI for this collection: https://scholarworks.montana.edu/handle/1/733


Search Results

Now showing 1 - 10 of 12
  • Item
    An evaluation of graph representation of programs for malware detection and categorization using graph-based machine learning methods
    (Montana State University - Bozeman, College of Engineering, 2023) Pearsall, Reese Andersen; Chairperson, Graduate Committee: Clemente Izurieta
    With both new and reused malware used in cyberattacks every day, there is a dire need for the ability to detect and categorize malware before damage can be done. Previous research has shown that graph-based machine learning algorithms can learn on graph representations of programs, such as a control flow graph, to better distinguish between malicious and benign programs and detect malware. Although many types of graph representations of programs exist, these representations have not been compared to see whether one performs better than the rest. This thesis provides a comparison between different graph representations of programs for both malware detection and categorization using graph-based machine learning methods. Four different graphs are evaluated: the control flow graph generated via disassembly, the control flow graph generated via symbolic execution, the function call graph, and the data dependency graph. This thesis also describes a pipeline for creating a classifier for malware detection and categorization. Graphs are generated using the binary analysis tool angr, and their embeddings are calculated using the Graph2Vec graph embedding algorithm. The embeddings are plotted and clustered using K-means. A classifier is then built by assigning labels to clusters and to the points within each cluster. We collected 2500 malicious executables and 2500 benign executables, and each of the four graph types is generated for each executable. Each graph type is fed into its own pipeline. A classifier for each of the four graph types is built, and classification metrics (e.g., F1 score) are calculated. The results show that control flow graphs generated from symbolic execution had the highest F1 score of the four graph representations. Using the symbolic-execution control flow graph pipeline, the classifier was able to most accurately categorize trojan malware.
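    A minimal sketch of the pipeline described above, using angr for CFG construction, the karateclub implementation of Graph2Vec, and scikit-learn's K-means. The binary paths, labels, embedding dimension, and cluster count are illustrative assumptions, not the thesis's actual configuration.

```python
# Hedged sketch: angr CFG -> Graph2Vec embedding -> K-means clustering.
import networkx as nx
import numpy as np
import angr                          # binary analysis framework
from karateclub import Graph2Vec    # graph embedding algorithm
from sklearn.cluster import KMeans

def binary_to_cfg(path):
    """Build a control flow graph via static disassembly (CFGFast).
    proj.analyses.CFGEmulated() would give the symbolic-execution variant."""
    proj = angr.Project(path, auto_load_libs=False)
    cfg = proj.analyses.CFGFast()
    # Graph2Vec expects consecutive integer node labels starting at 0.
    return nx.convert_node_labels_to_integers(cfg.graph.to_undirected())

# Placeholder paths/labels; replace with real executables (1 = malicious, 0 = benign).
paths, labels = ["a.exe", "b.exe"], [1, 0]
graphs = [binary_to_cfg(p) for p in paths]

embedder = Graph2Vec(dimensions=128)
embedder.fit(graphs)
X = embedder.get_embedding()

km = KMeans(n_clusters=2, n_init=10).fit(X)
# Assign each cluster the majority label of its members, then label points by cluster.
cluster_label = {c: int(np.round(np.mean([l for l, a in zip(labels, km.labels_) if a == c])))
                 for c in set(km.labels_)}
preds = [cluster_label[c] for c in km.labels_]
```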
  • Item
    Improving the confidence of machine learning models through improved software testing approaches
    (Montana State University - Bozeman, College of Engineering, 2022) ur Rehman, Faqeer; Chairperson, Graduate Committee: Clemente Izurieta; This is a manuscript style paper that includes co-authored chapters.
    Machine learning is gaining popularity in transforming and improving a number of different domains, e.g., self-driving cars, natural language processing, healthcare, manufacturing, retail, banking, and cybersecurity. However, because machine learning algorithms are computationally complex, verifying their correctness becomes challenging when the oracle is either unavailable or available but too expensive to apply. Software Engineering for Machine Learning (SE4ML) is an emerging research area that focuses on applying SE best practices and methods for better development, testing, operation, and maintenance of ML models. The focus of this work is on the testing aspect of ML applications, adapting traditional software testing approaches to improve confidence in them. First, a statistical metamorphic testing technique is proposed to test Neural Network (NN)-based classifiers in a non-deterministic environment. Furthermore, a Metamorphic Relation (MR) minimization algorithm is proposed for the program under test, saving computational costs and organizational testing resources. Second, an MR is proposed to address a data generation/labeling problem; that is, enhancing the effectiveness of test inputs by extending the prioritized test set with new tests without incurring additional labeling costs. Further, the prioritized test inputs are leveraged to propose a statistical hypothesis testing approach (for detection) and a machine learning-based approach (for prediction) of faulty behavior in two other machine learning classifiers, i.e., NN-based Intrusion Detection Systems. Finally, to test unsupervised ML models, the metamorphic testing approach is utilized to make several insightful contributions: i) a broader set of 22 MRs for assessing the behavior of clustering algorithms under test, ii) a detailed analysis showing how the proposed MRs can be used to target both the verification and validation aspects of testing the programs under investigation, and iii) evidence that verifying an MR using multiple criteria is more beneficial than relying on a single criterion (i.e., the clusters assigned). Thus, the work presented here provides a significant contribution to addressing the gaps found in the field and enhances the body of knowledge in the emergent SE4ML field.
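    As a hedged illustration of statistical metamorphic testing of a non-deterministic NN classifier, the sketch below uses an example MR (permuting training-set order should leave expected accuracy unchanged), repeated training runs, and a two-sample t-test. The MR, model, and synthetic data are assumptions for illustration, not the thesis's own relations.

```python
# Hedged sketch of statistical metamorphic testing for a stochastic NN classifier.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, random_state=0)
Xtr, ytr, Xte, yte = X[:300], y[:300], X[300:], y[300:]

def accuracy(Xt, yt, seed):
    """One non-deterministic training run; the seed varies initialization."""
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=seed)
    return clf.fit(Xt, yt).score(Xte, yte)

rng = np.random.default_rng(0)
source = [accuracy(Xtr, ytr, s) for s in range(10)]
perm = rng.permutation(len(ytr))          # follow-up input: permuted training order
followup = [accuracy(Xtr[perm], ytr[perm], s) for s in range(10, 20)]

# If the two accuracy distributions differ significantly, the MR is violated.
t, p = ttest_ind(source, followup)
print(f"p-value = {p:.3f};", "MR violated" if p < 0.05 else "MR holds")
```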
  • Item
    Automated techniques for prioritization of metamorphic relations for effective metamorphic testing
    (Montana State University - Bozeman, College of Engineering, 2022) Srinivasan, Madhusudan; Chairperson, Graduate Committee: John Paxton and Upulee Kanewala (co-chair)
    An oracle is a mechanism for deciding whether the outputs of a program for the executed test cases are correct. In many situations, the oracle is unavailable or too difficult to implement. Metamorphic testing (MT) is a testing approach that uses metamorphic relations (MRs), properties of the software under test expressed as relations among the inputs and outputs of multiple executions, to help verify the correctness of a program. Typically, MRs vary in their ability to detect faults in the program under test, and some MRs tend to detect the same set of faults. In this work, we aim to prioritize MRs to improve the efficiency and effectiveness of MT. We present five MR prioritization approaches: (1) Fault-based, (2) Coverage-based, (3) Statement Centrality-based, (4) Variable-based, and (5) Data Diversity-based. To evaluate these MR prioritization approaches, we conducted experiments on complex open-source software systems and machine learning programs. Our results suggest that the proposed MR prioritization approaches outperform the current practice of executing the source and follow-up test cases of the MRs in random order. Further, our results show that the Statement Centrality-based and Variable-based approaches outperform the Code Coverage-based and random-based approaches. The proposed approaches also show a 21% higher rate of fault detection over random-based prioritization. For machine learning programs, the proposed Data Diversity-based MR prioritization approach increases fault detection effectiveness by up to 40% compared to the Code Coverage-based approach and reduces the time taken to detect a fault by 29% compared to random execution of MRs. Further, all the proposed approaches reduce the number of MRs that need to be executed. Overall, our work saves time and cost during the metamorphic testing process.
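    A small sketch of what fault-based MR prioritization could look like: a greedy "additional fault coverage" ordering over an MR-by-fault kill matrix. The matrix below is fabricated for illustration; in practice it would come from mutation-testing runs.

```python
# Hedged sketch of fault-based MR prioritization via greedy additional coverage.
def prioritize_mrs(kill_matrix):
    """kill_matrix[mr] = set of fault ids detected by that MR."""
    remaining = dict(kill_matrix)
    undetected = set().union(*kill_matrix.values())
    order = []
    while remaining and undetected:
        # Pick the MR that detects the most still-undetected faults.
        best = max(remaining, key=lambda mr: len(remaining[mr] & undetected))
        order.append(best)
        undetected -= remaining.pop(best)
    order.extend(remaining)          # MRs adding no new faults go last
    return order

kills = {"MR1": {1, 2, 3}, "MR2": {2, 3}, "MR3": {4}, "MR4": {2}}
print(prioritize_mrs(kills))         # -> ['MR1', 'MR3', 'MR2', 'MR4']
```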
  • Item
    Machine learning for pangenomics
    (Montana State University - Bozeman, College of Engineering, 2021) Manuweera, Buwani Sakya; Chairperson, Graduate Committee: Brendan Mumey; This is a manuscript style paper that includes co-authored chapters.
    Finding genotype-phenotype associations is an important task in biology. Most existing reference-based methods introduce biases because they use a single genome from one individual as the reference sequence, and these biases can limit the inferred genotype-phenotype associations. Advances in sequencing techniques have enabled access to a large number of sequenced genomes from multiple organisms across different species. These can be used to create a pangenome, which represents a collection of genetic information from multiple organisms. Using a pangenome can effectively reduce these limitations because it does not require a single reference. Recently, machine learning techniques have emerged as effective methods for problems involving genomic and pangenomic data. Kernel methods are used as part of machine learning models to compute similarities between instances; kernels can map a given set of data into a different feature space that helps separate the data into its corresponding classes. In this work, we develop supervised machine learning models using a set of features gathered from pangenomic graphs, and the effectiveness of those features is evaluated in predicting yeast phenotypes. We first evaluated the effectiveness of the features using a traditional supervised machine learning model and then compared it to models using novel custom kernels that incorporate information from the pangenomic graph structure. Experimental results on yeast phenotypes indicate that the developed machine learning models that use reference-free features and novel kernels outperform models based on traditional reference-based features. This work has implications for bioinformaticians and computational biologists working with pangenomes, as well as computer scientists developing predictive models for genomic data.
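    The sketch below shows how a custom (precomputed) kernel plugs into a supervised model in scikit-learn, standing in for the thesis's pangenome-graph kernels. The RBF-style similarity function and the synthetic strain/phenotype data are assumptions.

```python
# Hedged sketch: a custom kernel supplied to an SVM as a precomputed Gram matrix.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))        # rows: strains; cols: pangenome-graph features (toy)
y = rng.integers(0, 2, size=60)      # phenotype labels (toy)

def custom_kernel(A, B, gamma=0.1):
    """Similarity between feature vectors; a graph-aware kernel would go here."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = custom_kernel(X, X)
clf = SVC(kernel="precomputed").fit(K, y)
# Predicting for new strains needs the kernel between new rows and training rows.
preds = clf.predict(custom_kernel(X[:5], X))
```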
  • Item
    Automated clinical transcription for behavioral health clinicians
    (Montana State University - Bozeman, College of Engineering, 2022) Kazi, Nazmul Hasan; Chairperson, Graduate Committee: Brendan Mumey; This is a manuscript style paper that includes co-authored chapters.
    Mental health disorders are among the most common yet most expensive healthcare conditions in the world, and more than half of all patients go untreated for reasons such as lack of access to resources and clinicians. At the same time, providers rely on Electronic Health Records (EHRs) to compile and share clinical notes, a key component of clinical practice, but time-consuming data entry is considered one of the primary downsides of EHRs. Many practitioners spend more time on EHR documentation than on direct patient care, which adds to patient dissatisfaction and clinician burnout. In this work, we explore the feasibility of developing an end-to-end clinical transcription tool that fully automates the documentation process for behavioral health clinicians. We divide the task into several sub-tasks and primarily focus on the following: 1) extraction and classification of important information from patient-provider conversations, and 2) generation of clinical notes from the extracted information. We develop a dataset of 65 transcripts from simulated provider-patient conversations. We then fine-tune a transformer language model that shows promising results on personalized data extraction (F1 = 0.94), with room for improvement in the classification (F1 = 0.18) of extracted information into EHR categories. Furthermore, we develop a rule-based natural language generation module that formalizes all types of extracted information and synthesizes them into clinical notes. The overall pipeline shows the potential to automatically generate draft clinical notes and reduce documentation time for behavioral health clinicians by 70-80%. The findings of this work have implications for behavioral health care providers as well as developers of machine learning and natural language processing applications.
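    A minimal sketch of the rule-based generation step: extracted spans, already classified into EHR categories, are synthesized into a draft note. The category names and templates are hypothetical, not the thesis's actual schema.

```python
# Hedged sketch of template-driven note generation from extracted information.
from collections import defaultdict

TEMPLATES = {                        # hypothetical EHR categories and templates
    "chief_complaint": "Chief complaint: {}.",
    "symptom": "Patient reports {}.",
    "medication": "Current medication: {}.",
    "plan": "Plan: {}.",
}

def generate_note(extractions):
    """extractions: list of (category, text) pairs from the extraction model."""
    grouped = defaultdict(list)
    for category, text in extractions:
        grouped[category].append(text)
    lines = []
    for category in TEMPLATES:       # fixed section order for the draft note
        for text in grouped.get(category, []):
            lines.append(TEMPLATES[category].format(text))
    return "\n".join(lines)

print(generate_note([
    ("symptom", "trouble sleeping for the past two weeks"),
    ("chief_complaint", "persistent low mood"),
    ("plan", "weekly CBT sessions"),
]))
```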
  • Item
    Towards reduced-cost hyperspectral and multispectral image classification
    (Montana State University - Bozeman, College of Engineering, 2021) Morales Luna, Giorgio L.; Chairperson, Graduate Committee: John Sheppard
    In recent years, Hyperspectral Imaging (HSI) systems have become a powerful source of reliable data in applications such as remote sensing, agriculture, and biomedicine. However, the abundant spectral and spatial information of hyperspectral images makes them highly complex, which creates the need for specialized machine learning algorithms to process and classify them. To that end, this thesis makes several contributions. We present a low-cost convolutional neural network designed for hyperspectral image classification called Hyper3DNet. Its architecture consists of two parts: a series of densely connected 3-D convolutions used as a feature extractor, and a series of 2-D separable convolutions used as a spatial encoder. We show that this design involves fewer trainable parameters than other approaches, without detriment to performance. Furthermore, having observed that hyperspectral images benefit from methods that reduce the number of spectral bands while retaining the most useful information for a specific application, we present two novel hyperspectral dimensionality reduction techniques. First, we propose a filter-based method called Inter-Band Redundancy Analysis (IBRA), based on a collinearity analysis between a band and its neighbors. This analysis helps remove redundant bands and dramatically reduces the search space. Second, we apply a wrapper-based approach called Greedy Spectral Selection (GSS) to the results of IBRA, selecting bands based on their information entropy values and training a compact convolutional neural network to evaluate the performance of the current selection. We also propose a feature extraction framework consisting of two main steps: first, it reduces the total number of bands using IBRA; then, it can use any feature extraction method to obtain the desired number of feature channels. Finally, we use the original hyperspectral data cube to simulate the process of using actual filters in a multispectral imager. Experimental results show that our proposed Hyper3DNet architecture, in conjunction with our dimensionality reduction techniques, yields better classification results than the compared methods, producing results more suitable for a multispectral sensor design.
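    A simplified, hedged stand-in for the IBRA + GSS idea: drop bands that are nearly collinear with an already-kept band (measured here by a pairwise variance inflation factor), then rank the survivors by Shannon entropy. The CNN evaluation loop of GSS is omitted, and all thresholds are assumptions.

```python
# Hedged sketch: collinearity-based band pruning followed by entropy ranking.
import numpy as np

def select_bands(cube, vif_threshold=12.0, n_select=5, n_bins=64):
    """cube: (n_pixels, n_bands) array of reflectance values."""
    kept = []
    for b in range(cube.shape[1]):
        redundant = False
        for k in kept:
            r = np.corrcoef(cube[:, b], cube[:, k])[0, 1]
            vif = 1.0 / max(1.0 - r * r, 1e-9)   # pairwise variance inflation
            if vif > vif_threshold:
                redundant = True
                break
        if not redundant:
            kept.append(b)

    def entropy(v):
        """Shannon entropy of a band's histogram, in bits."""
        hist, _ = np.histogram(v, bins=n_bins)
        p = hist / hist.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    kept.sort(key=lambda b: entropy(cube[:, b]), reverse=True)
    return kept[:n_select]

cube = np.random.default_rng(0).normal(size=(1000, 50))  # toy hyperspectral data
print(select_bands(cube))
```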
  • Item
    Extracting abstract spatio-temporal features of weather phenomena for autoencoder transfer learning
    (Montana State University - Bozeman, College of Engineering, 2020) McAllister, Richard Arthur; Chairperson, Graduate Committee: John Sheppard
    In this dissertation we develop ways to discover encodings within autoencoders that can be used to exchange information among neural network models. We begin by verifying that autoencoders can be used to make predictions in the meteorological domain, specifically for wind vector determination. We use unsupervised pre-training of stacked autoencoders to construct multilayer perceptrons for this task. We then discuss the role of our approach as an important step in positioning Empirical Weather Prediction as a viable alternative to Numerical Weather Prediction. We continue by exploring the spatial extensibility of the previously developed models, observing that different areas of the atmosphere may be influenced by unique forces. We use stacked autoencoders to generalize across an area of the atmosphere, extending the application of networks trained in one area to the surrounding areas. As a prelude to exploring transfer learning, we demonstrate that a stacked autoencoder is capable of capturing knowledge common to these dataspaces. Following this, we observe that in extremely large dataspaces a single neural network covering the space may not be effective, while generating large numbers of deep neural networks is not feasible. Using functional data analysis and spatial statistics, we analyze deep networks trained from stacked autoencoders in a spatiotemporal application area to determine the extent to which knowledge can be transferred to similar regions. Our results indicate a high likelihood that spatial correlation can be exploited if it can be identified prior to training. We then observe that artificial neural networks, being essentially black-box processes, would benefit from effective methods for preserving knowledge across successive generations of training. We develop an approach that preserves the knowledge encoded in the hidden layers of several ANNs and collects it in networks that more effectively make predictions over subdivisions of the entire dataspace, and we show that this method has an accuracy advantage over the single-network approach. Finally, we extend the previously developed methodology with a non-parametric method for determining transferable encoded knowledge, and we analyze new datasets, focusing on the ability of models trained in this fashion to transfer to other storms.
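    A hedged Keras sketch of greedy layer-wise autoencoder pretraining followed by supervised fine-tuning, in the spirit of the approach described above. The layer widths, epochs, and the synthetic wind-field data are assumptions.

```python
# Hedged sketch: unsupervised stacked-autoencoder pretraining, then an MLP head.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model, Input

def pretrain_layer(x, width):
    """Train one autoencoder layer on x; return the encoder half."""
    inp = Input(shape=(x.shape[1],))
    code = layers.Dense(width, activation="relu")(inp)
    recon = layers.Dense(x.shape[1])(code)
    ae = Model(inp, recon)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(x, x, epochs=5, batch_size=32, verbose=0)
    return Model(inp, code)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 32)).astype("float32")   # e.g. local atmospheric features (toy)
y = rng.normal(size=(500, 2)).astype("float32")    # e.g. wind vector targets (toy)

# Greedy layer-wise pretraining: each layer encodes the previous layer's code.
enc1 = pretrain_layer(x, 16)
enc2 = pretrain_layer(enc1.predict(x, verbose=0), 8)

# Stack the pretrained encoders, add a regression head, and fine-tune.
inp = Input(shape=(32,))
h = enc2(enc1(inp))
out = layers.Dense(2)(h)
mlp = Model(inp, out)
mlp.compile(optimizer="adam", loss="mse")
mlp.fit(x, y, epochs=10, batch_size=32, verbose=0)
```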
  • Item
    Large-scale automated human protein-phenotype relation extraction from biomedical literature
    (Montana State University - Bozeman, College of Engineering, 2020) Pourreza Shahri, Morteza; Chairperson, Graduate Committee: Indika Kahanda
    Identifying protein-phenotype relations is of paramount importance for applications such as uncovering rare and complex diseases. The Human Phenotype Ontology (HPO) is a recently introduced standardized vocabulary for describing disease-related phenotypic abnormalities in humans. While the official HPO knowledge base maintains known associations between human proteins and HPO terms, it is widely believed to be incomplete. However, due to the exponential growth of biomedical literature, timely manual curation is infeasible, creating the need for efficient and accurate computational tools for automated curation. In this work, we present HPcurator, a novel two-step framework for extracting relations between proteins and HPO terms from biomedical literature. First, we implement ProPheno, a comprehensive online dataset composed of human protein-phenotype co-mentions extracted from the entire set of biomedical articles. Subsequently, we show that these co-mentions are useful as a complementary source of input for a different, but highly related, task: automated protein-phenotype prediction. Next, we develop a supervised machine learning model called PPPred, which, to the best of our knowledge, is the first predictive model that can classify the validity of a given sentence-level protein-phenotype co-mention. Using a gold standard dataset composed of manually curated sentence co-mentions, we demonstrate that PPPred significantly outperforms several baseline methods. Finally, we propose SSEnet, a novel deep semi-supervised ensemble framework for relation extraction that combines deep learning, semi-supervised learning, and ensemble learning. This framework is motivated by the fact that while manual annotation of co-mentions is prohibitively expensive, we have access to millions of unlabeled co-mentions. We develop a prototype of HPcurator by instantiating SSEnet with ProPheno, self-learning, pre-trained language models, and convolutional and recurrent neural networks. This system can output a ranked list of relevant sentences for a user-input protein-phenotype pair. Our experimental results indicate that the system provides state-of-the-art performance in human protein-HPO term relation extraction. The findings and insights gained from this work have implications for biocurators, biologists, and the computer science community involved in developing biomedical text mining tools.
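    As a rough illustration of sentence-level co-mention classification, the sketch below trains a TF-IDF + logistic-regression baseline, a deliberately simple stand-in for the PPPred model described above. The sentences and labels are fabricated examples.

```python
# Hedged sketch: classifying whether a sentence asserts a protein-phenotype relation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each sentence co-mentions a protein and a phenotype; the label says whether
# the sentence actually asserts a relation between them (1) or not (0).
sentences = [
    "Mutations in BRCA1 are associated with increased cancer susceptibility.",
    "BRCA1 was used as a loading control; seizures were recorded separately.",
    "Loss of MECP2 function causes the developmental regression seen in Rett syndrome.",
    "The cohort excluded patients with ataxia; TP53 status was not assessed.",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["FBN1 variants underlie the skeletal abnormalities observed."]))
```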
  • Item
    Convolutional neural networks for multi- and hyper-spectral image classification
    (Montana State University - Bozeman, College of Engineering, 2019) Senecal, Jacob John; Chairperson, Graduate Committee: John Sheppard
    While a great deal of research has been directed towards developing neural network architectures for classifying RGB images, there is a relative dearth of research on neural network architectures designed specifically for multi-spectral and hyper-spectral imagery. The additional spectral information contained in a multi-spectral or hyper-spectral image can be valuable for land management, agriculture and forestry, disaster control, humanitarian relief operations, and environmental monitoring. However, the massive amounts of data generated by a multi-spectral or hyper-spectral instrument make processing this data a challenge. Machine learning and computer vision techniques can automate the analysis of these rich data sources. With these benefits in mind, we have adapted recent developments in small, efficient convolutional neural networks (CNNs) to create a small CNN architecture capable of being trained from scratch to classify 10-band multi-spectral images, using far fewer parameters than popular deep architectures such as ResNet or DenseNet. We show that this network provides higher classification accuracy and greater sample efficiency than the same network using RGB images. We also show that it is possible to employ a transfer learning approach and use a network pre-trained on multi-spectral satellite imagery to increase accuracy on a second, much smaller multi-spectral dataset, even though the satellite imagery was captured from a very different perspective (high-altitude, overhead vs. ground-based at close stand-off distance). These results demonstrate that it is possible to train our small network architectures on small multi-spectral datasets and still achieve high classification accuracy. This is significant because labeled hyper-spectral and multi-spectral datasets are generally much smaller than their RGB counterparts. Finally, we approximate a Bayesian version of our CNN architecture using a recent technique known as Monte Carlo dropout. By keeping dropout in place during test time, we can perform a Monte Carlo procedure using multiple forward passes of the network to generate a distribution of network outputs, which can be used as a measure of uncertainty in the network's predictions. Large variance in the network output corresponds to high uncertainty and vice versa. We show that a network capable of working with multi-spectral imagery significantly reduces the uncertainty associated with class predictions compared to using RGB images. This analysis reveals that the benefits of an architecture that works effectively with multi-spectral or hyper-spectral imagery extend beyond higher classification accuracy: such imagery allows us to be more confident in the predictions a deep neural network is making.
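    The Monte Carlo dropout procedure described above can be sketched in Keras by calling a model with training=True so dropout stays active at test time; the spread of the resulting predictions serves as an uncertainty estimate. The tiny CNN, the 10-band input shape, and the class count are illustrative assumptions.

```python
# Hedged sketch: Monte Carlo dropout as an uncertainty estimate.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 10)),        # 10-band multi-spectral patch (assumed)
    layers.Conv2D(16, 3, activation="relu"),
    layers.Dropout(0.5),
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),   # 5 land-cover classes (assumed)
])

x = np.random.default_rng(0).normal(size=(1, 32, 32, 10)).astype("float32")

# Multiple stochastic forward passes with dropout enabled at inference.
passes = np.stack([model(x, training=True).numpy()[0] for _ in range(50)])
mean_probs = passes.mean(axis=0)             # averaged class probabilities
uncertainty = passes.var(axis=0)             # high variance = low confidence
print(mean_probs.argmax(), uncertainty)
```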
  • Item
    Improving a precision agriculture on-farm experimentation workflow through machine learning
    (Montana State University - Bozeman, College of Engineering, 2019) Peerlinck, Amy; Chairperson, Graduate Committee: John Sheppard
    Reducing environmental impact while simultaneously improving the net return of crops is one of the key goals of Precision Agriculture (PA). To this end, an on-farm experimentation workflow was created that focuses on reducing the applied nitrogen (N) rate through variable rate application (VRA). The first step in the process, after gathering initial data from the farmers, creates experimental, randomly stratified N prescription maps. One of the main concerns farmers raise about these maps is the large jumps in N rate between consecutive cells. To address this, we successfully develop and apply a Genetic Algorithm that minimizes rate jumps while maintaining stratification across yield and protein bins. The ultimate goal of the on-farm experiments is to determine the final N rate to be applied. This is accomplished by optimizing a net return function based on yield and protein prediction. Currently, these predictions are often made with simple linear and non-linear regression models. Our work introduces six different machine learning (ML) models to improve this task: a single-layer feed-forward neural network (FFNN), a stacked auto-encoder (SAE), three different AdaBoost ensembles, and a bagging ensemble. The AdaBoost and bagging methods each use a single-layer FFNN as the weak model. Furthermore, a simple spatial analysis is performed to create spatial data sets that better represent the inherent spatial nature of the field data. These methods are applied to yield and protein data from four actual fields. The spatial data is shown to improve accuracy for most yield models. It does not perform as well on the protein data, possibly because these data sets are small, resulting in sparse data and potential overfitting of the models. When comparing the predictive models, the deep network performed better than the shallow network, and the ensemble methods outperformed both the SAE and a single FFNN. Of the four ensemble methods, bagging had the most consistent performance across the yield and protein data sets. Overall, spatial bagging using FFNNs as the weak learner has the best performance for both yield and protein prediction.
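    A toy sketch of a genetic algorithm in the spirit of the prescription-map step: the fitness function penalizes nitrogen-rate jumps between consecutive cells, with a heavy penalty when a cell's rate leaves its stratum. The rates, strata, and GA parameters are all assumptions.

```python
# Hedged sketch: GA minimizing rate jumps subject to stratification constraints.
import random

random.seed(0)
RATES = [0, 40, 80, 120]                 # candidate N rates (assumed units)
N_CELLS = 30
# Each cell's stratum = the set of rates the stratification allows for it (toy).
strata = [set(random.sample(RATES, 2)) for _ in range(N_CELLS)]

def fitness(chrom):
    jumps = sum(abs(a - b) for a, b in zip(chrom, chrom[1:]))
    violations = sum(rate not in strata[i] for i, rate in enumerate(chrom))
    return jumps + 1000 * violations     # heavy penalty preserves stratification

def mutate(chrom):
    c = chrom[:]
    c[random.randrange(N_CELLS)] = random.choice(RATES)
    return c

def crossover(a, b):
    cut = random.randrange(1, N_CELLS)   # single-point crossover
    return a[:cut] + b[cut:]

pop = [[random.choice(RATES) for _ in range(N_CELLS)] for _ in range(50)]
for _ in range(200):
    pop.sort(key=fitness)
    survivors = pop[:25]                 # truncation selection
    pop = survivors + [mutate(crossover(random.choice(survivors),
                                        random.choice(survivors)))
                       for _ in range(25)]
print(fitness(min(pop, key=fitness)))    # best prescription map's penalty score
```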