Scholarly Work - Computer Science

Permanent URI for this collection: https://scholarworks.montana.edu/handle/1/3034

Search Results

Now showing 1 - 10 of 55
  • Item
    Green Infrastructure Microbial Community Response to Simulated Pulse Precipitation Events in the Semi-Arid Western United States
    (MDPI AG, 2024-07) Hastings, Yvette D.; Smith, Rose M.; Mann, Kyra A.; Brewer, Simon; Goel, Ramesh; Jack Hinners, Sarah; Follstad Shah, Jennifer
    Processes driving nutrient retention in stormwater green infrastructure (SGI) are not well quantified in water-limited biomes. We examined the role of plant diversity and physiochemistry as drivers of microbial community physiology and soil N dynamics post precipitation pulses in a semi-arid region experiencing drought. We conducted our study in bioswales receiving experimental water additions and a montane meadow intercepting natural rainfall. Pulses of water generally elevated soil moisture and pH, stimulated ecoenzyme activity (EEA), and increased the concentration of organic matter, proteins, and N pools in both bioswale and meadow soils. Microbial community growth was static, and N assimilation into biomass was limited across pulse events. Unvegetated plots had greater soil moisture than vegetated plots at the bioswale site, yet we detected no clear effect of plant diversity on microbial C:N ratios, EEAs, organic matter content, and N pools. Differences in soil N concentrations in bioswales and the meadow were most directly correlated to changes in organic matter content mediated by ecoenzyme expression and the balance of C, N, and P resources available to microbial communities. Our results add to growing evidence that SGI ecological function is largely comparable to neighboring natural vegetated systems, particularly when soil media and water availability are similar.
  • Item
    The longest letter-duplicated subsequence and related problems
    (Springer Science and Business Media LLC, 2024-07) Lai, Wenfeng; Liyanage, Adiesha; Zhu, Binhai; Zou, Peng
    Motivated by computing duplication patterns in sequences, a new problem called the longest letter-duplicated subsequence (LLDS) is proposed. Given a sequence S of length n over an alphabet Σ, a letter-duplicated subsequence is a subsequence of S of the form x_1^{d_1} x_2^{d_2} ⋯ x_k^{d_k} with x_i ∈ Σ, x_j ≠ x_{j+1}, and d_i ≥ 2 for all i in [k] and j in [k − 1]. A linear-time algorithm for computing a longest letter-duplicated subsequence (LLDS) of S can be easily obtained. In this paper, we focus on two variants of this problem: (1) the ‘all-appearance’ version, i.e., all letters in Σ must appear in the solution, and (2) the weighted version. For the former, we obtain dichotomous results: we prove that, when each letter appears in S at least 4 times, the problem and a relaxed version on feasibility testing (FT) are both NP-hard. The reduction is from (3+, 1, 2−)-SAT, where all 3-clauses (i.e., clauses containing 3 literals) are monotone (i.e., contain only positive literals) and all 2-clauses contain only negative literals. We then show that when each letter appears in S at most 3 times, the problem admits an O(n) time algorithm. Finally, we consider the weighted version, where the weight of a block x_i^{d_i} (d_i ≥ 2) could be any positive function that might not grow with d_i. We give a non-trivial O(n²) time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of S whose weight is maximized.
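    A minimal dynamic-programming sketch of one linear-time approach to the unconstrained LLDS is given below. It is our own illustration under the block definition above, not the authors' algorithm; as written it runs in O(n·|Σ|) time, and tracking only the two largest `best` values would bring it to O(n).

    ```python
    def llds_length(s: str) -> int:
        """Length of a longest letter-duplicated subsequence of s.

        An LD-subsequence is x_1^{d_1} ... x_k^{d_k} with x_j != x_{j+1}
        and every d_i >= 2.  Sketch only: O(n * |alphabet|) as written.
        """
        NEG = float("-inf")
        best = {}     # best[c]: longest LD-subsequence ending in a complete block of c
        pending = {}  # pending[c]: longest prefix whose final block of c has size 1 so far

        for c in s:
            b_c, p_c = best.get(c, NEG), pending.get(c, NEG)
            # longest complete LD-subsequence ending in a letter other than c
            best_other = max([0] + [v for k, v in best.items() if k != c])
            # extend a complete block of c, or complete a pending block of c
            best[c] = max(b_c + 1, p_c + 1)
            # open a new (still incomplete) block of c after some complete prefix
            pending[c] = max(p_c, best_other + 1)

        return max([0] + [v for v in best.values() if v != NEG])

    assert llds_length("aabbaa") == 6   # the whole string qualifies
    assert llds_length("abab") == 2     # best is "aa" or "bb"
    ```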
  • Item
    A Comprehensive Study of Walmart Sales Predictions Using Time Series Analysis
    (Sciencedomain International, 2024-06) C., Cyril Neba; F., Gerard Shu; Nsuh, Gillian; A., Philip Amouda; F., Adrian Neba; Webnda, F.; Ikpe, Victory; Orelaja, Adeyinka; Sylla, Nabintou Anissia
    This article presents a comprehensive study of sales predictions using time series analysis, focusing on a case study of Walmart sales data. The aim of this study is to evaluate the effectiveness of various time series forecasting techniques in predicting weekly sales data for Walmart stores. Leveraging a dataset from Kaggle comprising weekly sales data from various Walmart stores around the United States, this study explores the effectiveness of time series analysis in forecasting future sales trends. Various time series analysis techniques, including Auto Regressive Integrated Moving Average (ARIMA), Seasonal Auto Regressive Integrated Moving Average (SARIMA), Prophet, Exponential Smoothing, and Gaussian Processes, are applied to model and forecast Walmart sales data. By comparing the performance of these models, the study seeks to identify the most accurate and reliable methods for forecasting retail sales, thereby providing valuable insights for improving sales predictions in the retail sector. The study includes an extensive exploratory data analysis (EDA) phase to preprocess the data, detect outliers, and visualize sales trends over time. Additionally, the article discusses the partitioning of data into training and testing sets for model evaluation. Performance metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are utilized to compare the accuracy of different time series models. The results indicate that Gaussian Processes outperform other models in terms of accuracy, with an RMSE of 34,116.09 and an MAE of 25,495.72, significantly lower than the other models evaluated. For comparison, ARIMA and SARIMA models both yielded an RMSE of 555,502.2 and an MAE of 462,767.3, while the Prophet model showed an RMSE of 567,509.2 and an MAE of 474,990.8. Exponential Smoothing also performed well with an RMSE of 555,081.7 and an MAE of 464,110.5. These findings suggest the potential of Gaussian Processes for accurate sales forecasting. However, the study also highlights the strengths and weaknesses of each forecasting methodology, emphasizing the need for further research to refine existing techniques and explore novel modeling approaches. Overall, this study contributes to the understanding of time series analysis in retail sales forecasting and provides insights for improving future forecasting endeavors.
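    The abstract does not give the exact model configurations, so as a hedged illustration of the workflow it describes (weekly sales series, train/test split, one candidate model, RMSE/MAE scoring), a sketch with assumed column names and an assumed ARIMA order might look like:

    ```python
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Hypothetical layout: one store's weekly sales indexed by date (column names assumed).
    sales = (pd.read_csv("walmart_sales.csv", parse_dates=["Date"])
               .query("Store == 1")
               .set_index("Date")["Weekly_Sales"]
               .asfreq("W-FRI"))          # assumes a complete weekly (Friday) index

    # Hold out the last 52 weeks for evaluation.
    train, test = sales.iloc[:-52], sales.iloc[-52:]

    # One candidate from the paper's list; the (p, d, q) order here is an assumption.
    fit = ARIMA(train, order=(1, 1, 1)).fit()
    forecast = fit.forecast(steps=len(test))

    rmse = np.sqrt(np.mean((test.values - forecast.values) ** 2))
    mae = np.mean(np.abs(test.values - forecast.values))
    print(f"ARIMA(1,1,1)  RMSE={rmse:,.2f}  MAE={mae:,.2f}")
    ```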
  • Item
    Persistent and lagged effects of fire on stream solutes linked to intermittent precipitation in arid lands
    (Springer Science and Business Media LLC, 2024) Lowman, Heili; Blaszczak, Joanna; Cale, Ashely; Dong, Xiaoli; Earl, Stevan; Grabow, Julia; Grimm, Nancy B.; Harms, Tamara K.; Reinhold, Ann Marie; Summers, Betsy; Webster, Alex J.
    Increased occurrence, size, and intensity of fire result in significant but variable changes to hydrology and material retention in watersheds with concomitant effects on stream biogeochemistry. In arid regions, seasonal and episodic precipitation results in intermittency in flows connecting watersheds to recipient streams that can delay the effects of fire on stream chemistry. We investigated how the spatial extent of fire within watersheds interacts with variability in amount and timing of precipitation to influence stream chemistry of three forested, montane watersheds in a monsoonal climate and four coastal, chaparral watersheds in a Mediterranean climate. We applied state-space models to estimate effects of precipitation, fire, and their interaction on stream chemistry up to five years following fire using 15 + years of monthly observations. Precipitation alone diluted specific conductance and flushed nitrate and phosphate to Mediterranean streams. Fire had positive and negative effects on specific conductance in both climates, whereas ammonium and nitrate concentrations increased following fire in Mediterranean streams. Fire and precipitation had positive interactive effects on specific conductance in monsoonal streams and on ammonium in Mediterranean streams. In most cases, the effects of fire and its interaction with precipitation persisted or were lagged 2–5 years. These results suggest that precipitation influences the timing and intensity of the effects of fire on stream solute dynamics in aridland watersheds, but these responses vary by climate, solute, and watershed characteristics. Time series models were applied to data from long-term monitoring that included observations before and after fire, yielding estimated effects of fire on aridland stream chemistry. This statistical approach captured effects of local-scale temporal variation, including delayed responses to fire, and may be used to reduce uncertainty in predicted responses of water quality under changing fire and precipitation regimes of arid lands.
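    The state-space formulation itself is not specified in the abstract; as a rough, hedged sketch of the general idea (a regression on precipitation, fire extent, and their interaction with autocorrelated errors, estimated via the Kalman filter), with invented column names and no lag structure, one could write:

    ```python
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Hypothetical monthly record for one stream; all column names are illustrative.
    df = pd.read_csv("stream_chemistry.csv", parse_dates=["month"]).set_index("month")

    # Drivers from the abstract: precipitation, burned area, and their interaction.
    exog = pd.DataFrame({
        "precip": df["precip_mm"],
        "fire": df["burned_fraction"],
        "fire_x_precip": df["precip_mm"] * df["burned_fraction"],
    })

    # Regression with AR(1) errors, fit in a state-space framework (not the authors' model).
    res = SARIMAX(df["nitrate"], exog=exog, order=(1, 0, 0)).fit(disp=False)
    print(res.summary())
    ```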
  • Item
    Univariate Skeleton Prediction in Multivariate Systems Using Transformers
    (Springer Nature, 2024-08) Morales, Giorgio; Sheppard, John W.
    Symbolic regression (SR) methods attempt to learn mathematical expressions that approximate the behavior of an observed system. However, when dealing with multivariate systems, they often fail to identify the functional form that explains the relationship between each variable and the system’s response. To begin to address this, we propose an explainable neural SR method that generates univariate symbolic skeletons that aim to explain how each variable influences the system’s response. By analyzing multiple sets of data generated artificially, where one input variable varies while others are fixed, relationships are modeled separately for each input variable. The response of such artificial data sets is estimated using a regression neural network (NN). Finally, the multiple sets of input–response pairs are processed by a pre-trained Multi-Set Transformer that solves a problem we termed Multi-Set Skeleton Prediction and outputs a univariate symbolic skeleton. Thus, such skeletons represent explanations of the function approximated by the regression NN. Experimental results demonstrate that this method learns skeleton expressions matching the underlying functions and outperforms two GP-based and two neural SR methods.
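    The abstract's data-generation step (sweep one input variable while the remaining inputs stay fixed, then let the trained regression NN supply the responses) can be sketched as below; `regression_nn`, the variable ranges, and the set sizes are placeholders rather than the authors' released code:

    ```python
    import numpy as np

    def univariate_probe_sets(regression_nn, x_ranges, var_index, n_sets=10, n_points=100, seed=0):
        """Build (x, y) sets for one input variable: the variable at `var_index`
        sweeps its range while the remaining inputs are frozen at random values,
        and the trained regression NN supplies the responses."""
        rng = np.random.default_rng(seed)
        lows, highs = np.array(x_ranges).T
        sets = []
        for _ in range(n_sets):
            fixed = rng.uniform(lows, highs)          # one random setting of the passive variables
            sweep = np.linspace(lows[var_index], highs[var_index], n_points)
            X = np.tile(fixed, (n_points, 1))
            X[:, var_index] = sweep                   # only the active variable changes
            y = regression_nn(X)                      # placeholder: any fitted regressor's predict fn
            sets.append((sweep, np.asarray(y).ravel()))
        return sets                                   # passed to the Multi-Set Transformer as one multi-set
    ```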
  • Item
    Advancing Retail Predictions: Integrating Diverse Machine Learning Models for Accurate Walmart Sales Forecasting
    (Sciencedomain International, 2024-06) C., Cyril Neba; F., Gerard Shu; Nsuh, Gillian; A., Philip Amouda; F., Adrian Neba; Webnda, F.; Ikpe, Victory; Orelaja, Adeyinka; Sylla, Nabintou Anissia
    In the rapidly evolving landscape of retail analytics, the accurate prediction of sales figures holds paramount importance for informed decision-making and operational optimization. Leveraging diverse machine learning methodologies, this study aims to enhance the precision of Walmart sales forecasting, utilizing a comprehensive dataset sourced from Kaggle. Exploratory data analysis reveals intricate patterns and temporal dependencies within the data, prompting the adoption of advanced predictive modeling techniques. Through the implementation of linear regression, ensemble methods such as Random Forest, Gradient Boosting Machines (GBM), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), this research endeavors to identify the most effective approach for predicting Walmart sales. Comparative analysis of model performance showcases the superiority of advanced machine learning algorithms over traditional linear models. The results indicate that XGBoost emerges as the optimal predictor for sales forecasting, boasting the lowest Mean Absolute Error (MAE) of 1226.471, Root Mean Squared Error (RMSE) of 1700.981, and an exceptionally high R-squared value of 0.9999900, indicating near-perfect predictive accuracy. This model's performance significantly surpasses that of simpler models such as linear regression, which yielded an MAE of 35632.510 and an RMSE of 80153.858. Insights from bias and fairness measurements underscore the effectiveness of advanced models in mitigating bias and delivering equitable predictions across temporal segments. Our analysis revealed varying levels of bias across different models. Linear Regression, Multiple Regression, and GLM exhibited moderate bias, suggesting some systematic errors in predictions. Decision Tree showed slightly higher bias, while Random Forest demonstrated a unique scenario of negative bias, implying systematic underestimation of predictions. However, models like GBM, XGBoost, and LGB displayed biases closer to zero, indicating more accurate predictions with minimal systematic errors. Notably, the XGBoost model demonstrated the lowest bias, with an MAE of -7.548432 (Table 4), reflecting its superior ability to minimize prediction errors across different conditions. Additionally, fairness analysis revealed that XGBoost maintained robust performance in both holiday and non-holiday periods, with an MAE of 84273.385 for holidays and 1757.721 for non-holidays. Insights from the fairness measurements revealed that Linear Regression, Multiple Regression, and GLM showed consistent predictive performance across both subgroups. Meanwhile, Decision Tree performed similarly for holiday predictions but exhibited better accuracy for non-holiday sales, whereas Random Forest, XGBoost, GBM, and LGB models displayed lower MAE values for the non-holiday subgroup, indicating potential fairness issues in predicting holiday sales. The study also highlights the importance of model selection and the impact of advanced machine learning techniques on achieving high predictive accuracy and fairness. Ensemble methods like Random Forest and GBM also showed strong performance, with Random Forest achieving an MAE of 12238.782 and an RMSE of 19814.965, and GBM achieving an MAE of 10839.822 and an RMSE of 1700.981. This research emphasizes the significance of leveraging sophisticated analytics tools to navigate the complexities of retail operations and drive strategic decision-making.
By utilizing advanced machine learning models, retailers can achieve more accurate sales forecasts, ultimately leading to better inventory management and enhanced operational efficiency. The study reaffirms the transformative potential of data-driven approaches in driving business growth and innovation in the retail sector.
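    The features and hyperparameters behind these numbers are not given in the abstract; a minimal sketch of the kind of comparison it reports (XGBoost versus linear regression on a held-out split, scored by MAE and RMSE), with assumed column names, an already-numeric feature frame, and default-ish settings, is:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    # Hypothetical feature frame; real column names and feature engineering are assumed.
    df = pd.read_csv("walmart_features.csv")
    X = df.drop(columns=["Weekly_Sales"])
    y = df["Weekly_Sales"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    for name, model in [("Linear Regression", LinearRegression()),
                        ("XGBoost", XGBRegressor(n_estimators=500, random_state=42))]:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        mae = mean_absolute_error(y_test, pred)
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        print(f"{name}: MAE={mae:,.2f}  RMSE={rmse:,.2f}")
    ```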
  • Item
    Width Helps and Hinders Splitting Flows
    (Association for Computing Machinery, 2024-01) Cáceres, Manuel; Cairo, Massimo; Grigorjew, Andreas; Khan, Shahbaz; Mumey, Brendan; Rizzi, Romeo; Tomescu, Alexandru I.; Williams, Lucia
    Minimum flow decomposition (MFD) is the NP-hard problem of finding a smallest decomposition of a network flow/circulation X on a directed graph G into weighted source-to-sink paths whose weighted sum equals X. We show that, for acyclic graphs, considering the width of the graph (the minimum number of paths needed to cover all of its edges) yields advances in our understanding of its approximability. For the version of the problem that uses only non-negative weights, we identify and characterise a new class of width-stable graphs, for which a popular heuristic is an O(log Val(X))-approximation (Val(X) being the total flow of X), and strengthen its worst-case approximation ratio from Ω(√m) to Ω(m/log m) for sparse graphs, where m is the number of edges in the graph. We also study a new problem on graphs with cycles, Minimum Cost Circulation Decomposition (MCCD), and show that it generalises MFD through a simple reduction. For the version allowing also negative weights, we give a (⌈log ‖X‖⌉ + 1)-approximation (‖X‖ being the maximum absolute value of X on any edge) using a power-of-two approach, combined with parity-fixing arguments and a decomposition of unitary circulations (‖X‖ ≤ 1), using a generalised notion of width for this problem. Finally, we disprove a conjecture about the linear independence of minimum (non-negative) flow decompositions posed by Kloster et al. [2018], but show that its useful implication (polynomial-time assignments of weights to a given set of paths to decompose a flow) holds for the negative version.
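    In the MFD literature, the "popular heuristic" analyzed here is usually the greedy-width rule: repeatedly peel off a source-to-sink path of maximum bottleneck flow. A small sketch for the non-negative, acyclic case (the graph encoding and function names are ours, not the paper's) is:

    ```python
    from collections import defaultdict

    def greedy_width_decompose(flow, source, sink):
        """Greedy-width heuristic for flow decomposition on a DAG.

        `flow` maps (u, v) edges to non-negative values satisfying conservation;
        the routine repeatedly extracts the source-to-sink path whose bottleneck
        (minimum edge value) is largest and subtracts it."""
        paths = []
        while True:
            # widest-path relaxation: best[v] = largest bottleneck reachable from source
            best, parent = defaultdict(float), {}
            best[source] = float("inf")
            changed = True
            while changed:                      # Bellman-Ford-style sweep; fine for a sketch
                changed = False
                for (u, v), f in flow.items():
                    if f > 0 and min(best[u], f) > best[v]:
                        best[v], parent[v] = min(best[u], f), u
                        changed = True
            if best[sink] <= 0:
                break
            # recover the widest path and subtract its bottleneck weight
            path, v = [sink], sink
            while v != source:
                v = parent[v]
                path.append(v)
            path.reverse()
            w = best[sink]
            for u, v in zip(path, path[1:]):
                flow[(u, v)] -= w
            paths.append((w, path))
        return paths

    # Example: a flow of value 3 on a small DAG splits into two weighted paths.
    f = {("s", "a"): 2, ("s", "b"): 1, ("a", "t"): 2, ("b", "t"): 1}
    print(greedy_width_decompose(f, "s", "t"))  # [(2, ['s','a','t']), (1, ['s','b','t'])]
    ```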
  • Item
    Hyperspectral Band Selection for Multispectral Image Classification with Convolutional Networks
    (IEEE, 2021-07) Morales, Giorgio; Sheppard, John; Logan, Riley; Shaw, Joseph
    In recent years, Hyperspectral Imaging (HSI) has become a powerful source for reliable data in applications such as remote sensing, agriculture, and biomedicine. However, hyperspectral images are highly data-dense and often benefit from methods to reduce the number of spectral bands while retaining the most useful information for a specific application. We propose a novel band selection method to select a reduced set of wavelengths, obtained from an HSI system in the context of image classification. Our approach consists of two main steps: the first utilizes a filter-based approach to find relevant spectral bands based on a collinearity analysis between a band and its neighbors. This analysis helps to remove redundant bands and dramatically reduces the search space. The second step applies a wrapper-based approach to select bands from the reduced set based on their information entropy values, and trains a compact Convolutional Neural Network (CNN) to evaluate the performance of the current selection. We present classification results obtained from our method and compare them to other feature selection methods on two hyperspectral image datasets. Additionally, we use the original hyperspectral data cube to simulate the process of using actual filters in a multispectral imager. We show that our method produces more suitable results for a multispectral sensor design.
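    A much-simplified sketch of the two steps described above (a neighbor-collinearity filter followed by an entropy ranking of the surviving bands) is given below; the correlation measure, threshold, and histogram settings are our assumptions, and the compact-CNN wrapper evaluation is omitted:

    ```python
    import numpy as np

    def filter_collinear_bands(cube, corr_threshold=0.98):
        """Drop bands nearly collinear with the previously kept neighbor.
        `cube` has shape (height, width, n_bands)."""
        flat = cube.reshape(-1, cube.shape[-1]).astype(float)
        kept = [0]
        for b in range(1, flat.shape[1]):
            r = np.corrcoef(flat[:, kept[-1]], flat[:, b])[0, 1]
            if abs(r) < corr_threshold:          # keep only bands that add new information
                kept.append(b)
        return kept

    def band_entropy(cube, band, bins=256):
        """Shannon entropy of one band's intensity histogram."""
        hist, _ = np.histogram(cube[..., band], bins=bins, density=True)
        p = hist[hist > 0]
        p = p / p.sum()
        return float(-(p * np.log2(p)).sum())

    # Rank the filtered bands by entropy; a compact CNN would then score candidate subsets.
    cube = np.random.rand(64, 64, 150)           # stand-in for a real HSI cube
    candidates = filter_collinear_bands(cube)
    ranked = sorted(candidates, key=lambda b: band_entropy(cube, b), reverse=True)
    print(ranked[:5])
    ```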
  • Item
    Counterfactual Explanations of Neural Network-Generated Response Curves
    (IEEE, 2023-06) Morales, Giorgio; Sheppard, John
    Response curves exhibit the magnitude of the response of a sensitive system to a varying stimulus. However, the response of such systems may be sensitive to multiple stimuli (i.e., input features) that are not necessarily independent. As a consequence, the shape of response curves generated for a selected input feature (referred to as the “active feature”) might depend on the values of the other input features (referred to as “passive features”). In this work we consider the case of systems whose response is approximated using regression neural networks. We propose to use counterfactual explanations (CFEs) to identify the features with the highest relevance to the shape of response curves generated by neural network black boxes. CFEs are generated by a genetic algorithm-based approach that solves a multi-objective optimization problem. In particular, given a response curve generated for an active feature, a CFE finds the minimum combination of passive features that need to be modified to alter the shape of the response curve. We tested our method on a synthetic dataset with 1-D inputs and two crop yield prediction datasets with 2-D inputs. The relevance ranking of features and feature combinations obtained on the synthetic dataset coincided with the analysis of the equation that was used to generate the problem. Results obtained on the yield prediction datasets revealed that the impact of passive features on fertilizer responsivity depends on the terrain characteristics of each field.
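    The notion of a response curve can be illustrated with a toy sketch: sweep the active feature over a grid while the passive features stay at a reference point, and read off the network's predictions. The naive single-feature probe below is only for illustration; the paper searches for counterfactuals with a multi-objective genetic algorithm rather than exhaustively:

    ```python
    import numpy as np

    def response_curve(predict, x_ref, active, grid):
        """Predictions as the active feature sweeps `grid`, other features fixed at x_ref."""
        X = np.tile(x_ref, (len(grid), 1))
        X[:, active] = grid
        return predict(X)

    def most_influential_passive(predict, x_ref, active, grid, deltas):
        """Naive probe: which single passive-feature change most reshapes the curve?
        `deltas[j]` is a hypothetical perturbation for feature j."""
        base = response_curve(predict, x_ref, active, grid)
        scores = {}
        for j, d in enumerate(deltas):
            if j == active:
                continue
            x_alt = x_ref.copy()
            x_alt[j] += d
            alt = response_curve(predict, x_alt, active, grid)
            scores[j] = float(np.linalg.norm(alt - base))   # how much the curve's shape moved
        return max(scores, key=scores.get), scores
    ```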
  • Item
    Metamorphic Testing For Machine Learning: Applicability, Challenges, and Research Opportunities
    (IEEE, 2023-07) Rehman, Faqeer Ur; Srinivasan, Madhusudan
    The wide adoption and growth of Machine Learning (ML) have driven tremendous advances in a number of fields, such as manufacturing, transportation, bio-informatics, and self-driving cars. Its ability to extract patterns from large sets of data and use that knowledge to make future predictions is remarkable. However, the complex calculations these systems perform internally make them suffer from the oracle problem, which makes it hard to test them to identify bugs and improve their quality. An application that is not properly tested can have disastrous consequences in the production environment. Metamorphic Testing (MT) has been widely accepted by researchers to address the oracle problem in testing both supervised and unsupervised ML-based systems. However, MT has several limitations when used for testing ML that the existing literature does not capture in a centralized place. Applying MT to test critical ML-based systems without prior knowledge and understanding of those limitations can cost organizations wasted time and resources. In this study, we highlight those limitations to help both researchers and practitioners be aware of them for better testing of ML applications. This paper makes the following contributions: i) providing insights into various challenges faced in testing ML-based solutions, ii) highlighting a number of key challenges faced when applying MT to test ML applications, and iii) presenting potential future research opportunities and directions for the research community to address them.
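    As a concrete illustration of how MT sidesteps the oracle problem, the sketch below checks one simple metamorphic relation for a regression learner (shifting every training target by a constant must shift every prediction by that constant); scikit-learn is used only as an example system under test, and this relation is our illustration rather than one from the paper:

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def train_and_predict(X_train, y_train, X_test):
        """The ML 'system under test'; no oracle tells us its exact correct outputs."""
        return LinearRegression().fit(X_train, y_train).predict(X_test)

    rng = np.random.default_rng(0)
    X_train, X_test = rng.normal(size=(200, 5)), rng.normal(size=(20, 5))
    y_train = X_train @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

    # Source test case.
    source_out = train_and_predict(X_train, y_train, X_test)

    # Follow-up test case (metamorphic relation): shifting every training target by a
    # constant c must shift every prediction by the same c for a least-squares model.
    c = 100.0
    followup_out = train_and_predict(X_train, y_train + c, X_test)

    # The relation must hold even though neither output can be checked in isolation.
    assert np.allclose(followup_out, source_out + c), "Metamorphic relation violated"
    ```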