College of Engineering

Permanent URI for this community: https://scholarworks.montana.edu/handle/1/27

The College of Engineering at Montana State University will serve the State of Montana and the nation by fostering lifelong learning, integrating learning and discovery, developing and sharing technical expertise, and empowering students to be tomorrow's leaders.

Search Results

Now showing 1 - 4 of 4
  • MetaCDP: Metamorphic Testing for Quality Assurance of Containerized Data Pipelines
    (IEEE, 2024-06) ur Rehman, Faqeer; Umbreen, Sidrah; Rehman, Mudasser
    In the ever-evolving world of technology, companies are investing heavily in building and deploying state-of-the-art Machine Learning (ML) based systems. However, such systems heavily rely on the availability of high-quality data, which is often prepared/generated by the Extract Transform Load (ETL) data pipelines; thus, they are critical components of an end-to-end ML system. A low-performing model (trained on buggy data) running in a production environment can cause both financial and reputational losses for the organization. Therefore, it is of paramount significance to perform the quality assurance of underlying data pipelines from multiple perspectives. However, the computational complexity, continuous change in data, and the integration of multiple components make it challenging to test them effectively, ultimately causing such solutions to suffer from the Oracle problem. In this research paper, we propose MetaCDP, a Metamorphic Testing approach that can be used by both researchers and practitioners for quality assurance of modern Containerized Data Pipelines. We propose 10 Metamorphic Relations (MRs) that target the robustness and correctness of the data pipeline under test, which plays a crucial role in providing high-quality data for developing a clustering-based anomaly detection model. To show the applicability of the proposed approach, we tested a data pipeline (from the E-commerce domain) and uncovered several erroneous behaviors. We also present the nature of issues identified by the proposed MRs, which can better help/guide software engineers and researchers to use best coding practices for maintaining and improving the quality of their data pipelines.
  • Risk mapping of wildlife–vehicle collisions across the state of Montana, USA: a machine-learning approach for imbalanced data along rural roads
    (Oxford University Press, 2024-05) Bell, Matthew; Wang, Yiyi; Ament, Rob
    Wildlife–vehicle collisions (WVCs) with large animals are estimated to cost the USA over 8 billion USD in property damage, tens of thousands of human injuries and nearly 200 human fatalities each year. Most WVCs occur on rural roads and are not collected evenly among road segments, leading to imbalanced data. There are a disproportionate number of analysis units that have zero WVC cases when investigating large geographic areas for collision risk. Analysis units with zero WVCs can reduce prediction accuracy and weaken the coefficient estimates of statistical learning models. This study demonstrates that the use of the synthetic minority over-sampling technique (SMOTE) to handle imbalanced WVC data in combination with statistical and machine-learning models improves the ability to determine seasonal WVC risk across the rural highway network in Montana, USA. An array of regularized variables describing landscape, road and traffic were used to develop negative binomial and random forest models to infer WVC rates per 100 million vehicle miles travelled. The random forest model is found to work particularly well with SMOTE-augmented data to improve the prediction accuracy of seasonal WVC risk. SMOTE-augmented data are found to improve accuracy when predicting crash risk across fine-grained grids while retaining the characteristics of the original dataset. The analyses suggest that SMOTE augmentation mitigates data imbalance that is encountered in seasonally divided WVC data. This research provides the basis for future risk-mapping models and can potentially be used to address the low rates of WVCs and other crash types along rural roads.
  • Comparison of Supervised Learning and Changepoint Detection for Insect Detection in Lidar Data
    (MDPI AG, 2023-12) Vannoy, Trevor C.; Sweeney, Nathaniel B.; Shaw, Joseph A.; Whitaker, Bradley M.
    Concerns about decreases in insect population and biodiversity, in addition to the need for monitoring insects in agriculture and disease control, have led to an increased need for automated, non-invasive monitoring techniques. To this end, entomological lidar systems have been developed and successfully used for detecting and classifying insects. However, the data produced by these lidar systems create several problems from a data analysis standpoint: the data can contain millions of observations, very few observations contain insects, and the background environment is non-stationary. This study compares the insect-detection performance of various supervised machine learning and unsupervised changepoint detection algorithms and provides commentary on the relative strengths of each method. We found that the supervised methods generally perform better than the changepoint detection methods, at the cost of needing labeled data. The supervised learning method with the highest Matthews correlation coefficient score on the testing set correctly identified 99.5% of the insect-containing images and 83.7% of the non-insect images; similarly, the best changepoint detection method correctly identified 83.2% of the insect-containing images and 84.2% of the non-insect images. Our results show that both types of methods can reduce the need for manual data analysis.
  • Metamorphic Testing For Machine Learning: Applicability, Challenges, and Research Opportunities
    (IEEE, 2023-07) Rehman, Faqeer Ur; Srinivasan, Madhusudan
    The wide adoption and growth of Machine Learning (ML) have driven tremendous advances in a number of fields, e.g., manufacturing, transportation, bio-informatics, and self-driving cars. Its ability to extract patterns from a large set of data and then use this knowledge to make future predictions is beyond the human imagination. However, the complex calculations performed internally make these systems suffer from the oracle problem; thus, it is hard to test them to identify bugs and enhance their quality. An application that is not properly tested can have disastrous consequences in the production environment. Metamorphic Testing (MT) has been widely accepted by researchers to address the oracle problem in testing both supervised and unsupervised ML-based systems. However, MT has several limitations (when used for testing ML) that the existing literature does not capture in a centralized place. Applying MT to test ML-based critical systems without prior knowledge/understanding of those limitations can cost organizations time and resources. In this study, we highlight those limitations to help both researchers and practitioners be aware of them for better testing of ML applications. This paper makes the following contributions: i) providing insights into various challenges faced in testing ML-based solutions, ii) highlighting a number of key challenges faced when applying MT to test ML applications, and iii) presenting potential future research opportunities/directions for the research community to address them.
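
Two of the abstracts above rely on metamorphic testing to work around the oracle problem. As a minimal sketch of the idea, the Python below checks two metamorphic relations against a toy data-pipeline step. The transform `total_per_customer`, the sample rows, and both relations are hypothetical illustrations; they are not taken from either paper and are not among the 10 MRs proposed for MetaCDP.

```python
import random


def total_per_customer(rows):
    """Hypothetical transform step: sum order amounts (in cents) per customer."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


def mr_permutation(transform, rows, seed=0):
    """MR: shuffling the input rows must not change the output."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return transform(rows) == transform(shuffled)


def mr_scaling(transform, rows, factor=3):
    """MR: scaling every amount by `factor` must scale every total by `factor`."""
    source_output = transform(rows)
    follow_up_output = transform([(c, a * factor) for c, a in rows])
    expected = {c: t * factor for c, t in source_output.items()}
    return follow_up_output == expected


# Illustrative input only; no expected totals are needed to run the checks.
rows = [("alice", 1200), ("bob", 450), ("alice", 300), ("carol", 999)]

if __name__ == "__main__":
    print("permutation MR holds:", mr_permutation(total_per_customer, rows))
    print("scaling MR holds:", mr_scaling(total_per_customer, rows))
```

Neither check needs to know the correct totals in advance. Each one only compares the output of a source run with the output of a follow-up run on transformed input, which is what makes the approach usable when no test oracle is available.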