Publications by Colleges and Departments (MSU - Bozeman)
Permanent URI for this community: https://scholarworks.montana.edu/handle/1/3
Item: MetaCDP: Metamorphic Testing for Quality Assurance of Containerized Data Pipelines (IEEE, 2024-06)
ur Rehman, Faqeer; Umbreen, Sidrah; Rehman, Mudasser
In the ever-evolving world of technology, companies are investing heavily in building and deploying state-of-the-art Machine Learning (ML) based systems. Such systems, however, rely heavily on the availability of high-quality data, which is often prepared or generated by Extract-Transform-Load (ETL) data pipelines; these pipelines are therefore critical components of an end-to-end ML system. A low-performing model (trained on buggy data) running in production can cause both financial and reputational losses for an organization, so it is of paramount importance to perform quality assurance of the underlying data pipelines from multiple perspectives. However, computational complexity, continuously changing data, and the integration of multiple components make such pipelines challenging to test effectively, ultimately causing them to suffer from the oracle problem. In this paper, we propose MetaCDP, a Metamorphic Testing approach that both researchers and practitioners can use for quality assurance of modern containerized data pipelines. We propose 10 Metamorphic Relations (MRs) targeting the robustness and correctness of the data pipeline under test, which plays a crucial role in providing high-quality data for developing a clustering-based anomaly detection model. To show the applicability of the proposed approach, we tested a data pipeline from the e-commerce domain and uncovered several erroneous behaviors.
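As a toy illustration of the metamorphic-testing idea just described (a hypothetical relation and transform, not one of MetaCDP's ten MRs): when no oracle exists for a pipeline step's exact output, one can still assert that permuting the input rows of an order-insensitive transform must not change its output multiset.

```python
# A toy illustration of a metamorphic relation (MR) for a data-pipeline
# transform. The transform and relation here are hypothetical examples,
# not MetaCDP's actual MRs.
import random

def transform(rows):
    # Toy ETL step: keep positive values and scale them.
    return [2 * r for r in rows if r > 0]

def check_permutation_mr(rows):
    source_output = transform(rows)
    follow_up_input = rows[:]
    random.shuffle(follow_up_input)
    follow_up_output = transform(follow_up_input)
    # MR: outputs must agree as multisets (compared here via sorting),
    # even though no "expected output" oracle exists.
    return sorted(source_output) == sorted(follow_up_output)

print(check_permutation_mr([3, -1, 4, 0, 5]))  # → True
```

A violation of such a relation flags a bug (e.g., an order-dependent aggregation) without ever specifying the correct output.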
We also present the nature of the issues identified by the proposed MRs, which can help guide software engineers and researchers toward best coding practices for maintaining and improving the quality of their data pipelines.

Item: Green Infrastructure Microbial Community Response to Simulated Pulse Precipitation Events in the Semi-Arid Western United States (MDPI AG, 2024-07)
Hastings, Yvette D.; Smith, Rose M.; Mann, Kyra A.; Brewer, Simon; Goel, Ramesh; Jack Hinners, Sarah; Follstad Shah, Jennifer
Processes driving nutrient retention in stormwater green infrastructure (SGI) are not well quantified in water-limited biomes. We examined the role of plant diversity and physiochemistry as drivers of microbial community physiology and soil N dynamics following precipitation pulses in a semi-arid region experiencing drought. We conducted our study in bioswales receiving experimental water additions and in a montane meadow intercepting natural rainfall. Pulses of water generally elevated soil moisture and pH, stimulated ecoenzyme activity (EEA), and increased the concentrations of organic matter, proteins, and N pools in both bioswale and meadow soils. Microbial community growth was static, and N assimilation into biomass was limited across pulse events. Unvegetated plots had greater soil moisture than vegetated plots at the bioswale site, yet we detected no clear effect of plant diversity on microbial C:N ratios, EEAs, organic matter content, or N pools. Differences in soil N concentrations in the bioswales and the meadow were most directly correlated with changes in organic matter content, mediated by ecoenzyme expression and the balance of C, N, and P resources available to microbial communities.
Our results add to growing evidence that SGI ecological function is largely comparable to that of neighboring natural vegetated systems, particularly when soil media and water availability are similar.

Item: The longest letter-duplicated subsequence and related problems (Springer Science and Business Media LLC, 2024-07)
Lai, Wenfeng; Liyanage, Adiesha; Zhu, Binhai; Zou, Peng
Motivated by computing duplication patterns in sequences, a new problem called the longest letter-duplicated subsequence (LLDS) is proposed. Given a sequence S of length n over an alphabet Σ, a letter-duplicated subsequence is a subsequence of S of the form x_1^{d_1} x_2^{d_2} ... x_k^{d_k}, with x_i ∈ Σ, x_j ≠ x_{j+1}, and d_i ≥ 2 for all i in [k] and j in [k−1]. A linear-time algorithm for computing a longest letter-duplicated subsequence (LLDS) of S can be easily obtained. In this paper, we focus on two variants of this problem: (1) the "all-appearance" version, i.e., all letters in Σ must appear in the solution, and (2) the weighted version. For the former, we obtain dichotomous results: we prove that, when each letter appears in S at least 4 times, the problem and a relaxed version on feasibility testing (FT) are both NP-hard. The reduction is from (3+, 1, 2−)-SAT, where all 3-clauses (i.e., those containing 3 literals) are monotone (i.e., contain only positive literals) and all 2-clauses contain only negative literals. We then show that when each letter appears in S at most 3 times, the problem admits an O(n) time algorithm. Finally, we consider the weighted version, where the weight of a block x_i^{d_i} (d_i ≥ 2) can be any positive function that need not grow with d_i.
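To make the LLDS definition concrete, here is a brute-force reference implementation (exponential time, for tiny strings only; this is not the paper's linear-time or dynamic-programming algorithm, just a direct transcription of the definition):

```python
# Brute-force reference implementation of the LLDS definition, useful
# only for checking tiny examples; the paper's algorithms are far more
# efficient.
from itertools import combinations

def is_letter_duplicated(t):
    # Decompose t into maximal runs; t is letter-duplicated iff every
    # run has length >= 2 (adjacent runs differ by construction).
    runs = []
    for ch in t:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return bool(runs) and all(d >= 2 for _, d in runs)

def llds_bruteforce(s):
    # Try subsequence lengths from longest down to 2.
    n = len(s)
    for r in range(n, 1, -1):
        for idx in combinations(range(n), r):
            if is_letter_duplicated("".join(s[i] for i in idx)):
                return r
    return 0

print(llds_bruteforce("banana"))  # → 3 (the subsequence "aaa")
```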
We give a non-trivial O(n²) time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of S whose weight is maximized.

Item: A Comprehensive Study of Walmart Sales Predictions Using Time Series Analysis (Sciencedomain International, 2024-06)
C., Cyril Neba; F., Gerard Shu; Nsuh, Gillian; A., Philip Amouda; F., Adrian Neba; Webnda, F.; Ikpe, Victory; Orelaja, Adeyinka; Sylla, Nabintou Anissia
This article presents a comprehensive study of sales prediction using time series analysis, focusing on a case study of Walmart sales data. The aim is to evaluate the effectiveness of various time series forecasting techniques in predicting weekly sales for Walmart stores. Leveraging a Kaggle dataset of weekly sales from Walmart stores across the United States, the study explores the effectiveness of time series analysis in forecasting future sales trends. Several techniques, including the Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), Prophet, Exponential Smoothing, and Gaussian Processes, are applied to model and forecast Walmart sales data. By comparing the performance of these models, the study seeks to identify the most accurate and reliable methods for forecasting retail sales, thereby providing valuable insights for improving sales predictions in the retail sector. The study includes an extensive exploratory data analysis (EDA) phase to preprocess the data, detect outliers, and visualize sales trends over time. The article also discusses partitioning the data into training and testing sets for model evaluation. Performance metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are used to compare the accuracy of the different time series models.
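The two evaluation metrics named above have simple definitions; a minimal sketch with illustrative numbers (not the study's data):

```python
# Minimal definitions of RMSE and MAE, the two metrics used to compare
# the forecasting models. The sample values below are illustrative.
import math

def rmse(actual, predicted):
    # Root Mean Squared Error: penalizes large errors quadratically.
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def mae(actual, predicted):
    # Mean Absolute Error: average error magnitude, in the data's units.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 120.0, 90.0]
predicted = [110.0, 115.0, 95.0]
print(round(rmse(actual, predicted), 3))  # → 7.071
print(round(mae(actual, predicted), 3))   # → 6.667
```

Because RMSE squares each error, it is more sensitive to occasional large misses than MAE, which is why studies typically report both.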
The results indicate that Gaussian Processes outperform the other models in accuracy, with an RMSE of 34,116.09 and an MAE of 25,495.72, both far lower than those of the other models evaluated. For comparison, the ARIMA and SARIMA models both yielded an RMSE of 555,502.2 and an MAE of 462,767.3, while the Prophet model showed an RMSE of 567,509.2 and an MAE of 474,990.8. Exponential Smoothing performed similarly, with an RMSE of 555,081.7 and an MAE of 464,110.5. These findings suggest the potential of Gaussian Processes for accurate sales forecasting. However, the study also highlights the strengths and weaknesses of each forecasting methodology, emphasizing the need for further research to refine existing techniques and explore novel modeling approaches. Overall, this study contributes to the understanding of time series analysis in retail sales forecasting and provides insights for improving future forecasting endeavors.

Item: Persistent and lagged effects of fire on stream solutes linked to intermittent precipitation in arid lands (Springer Science and Business Media LLC, 2024)
Lowman, Heili; Blaszczak, Joanna; Cale, Ashely; Dong, Xiaoli; Earl, Stevan; Grabow, Julia; Grimm, Nancy B.; Harms, Tamara K.; Reinhold, Ann Marie; Summers, Betsy; Webster, Alex J.
Increased occurrence, size, and intensity of fire result in significant but variable changes to hydrology and material retention in watersheds, with concomitant effects on stream biogeochemistry. In arid regions, seasonal and episodic precipitation results in intermittent flows connecting watersheds to recipient streams, which can delay the effects of fire on stream chemistry. We investigated how the spatial extent of fire within watersheds interacts with variability in the amount and timing of precipitation to influence the stream chemistry of three forested, montane watersheds in a monsoonal climate and four coastal, chaparral watersheds in a Mediterranean climate.
We applied state-space models to estimate the effects of precipitation, fire, and their interaction on stream chemistry for up to five years following fire, using 15+ years of monthly observations. Precipitation alone diluted specific conductance and flushed nitrate and phosphate into Mediterranean streams. Fire had both positive and negative effects on specific conductance in both climates, whereas ammonium and nitrate concentrations increased following fire in Mediterranean streams. Fire and precipitation had positive interactive effects on specific conductance in monsoonal streams and on ammonium in Mediterranean streams. In most cases, the effects of fire and its interaction with precipitation persisted or were lagged by 2–5 years. These results suggest that precipitation influences the timing and intensity of the effects of fire on stream solute dynamics in aridland watersheds, but that these responses vary by climate, solute, and watershed characteristics. Time series models applied to long-term monitoring data that included observations before and after fire yielded estimated effects of fire on aridland stream chemistry. This statistical approach captured effects of local-scale temporal variation, including delayed responses to fire, and may be used to reduce uncertainty in predicted responses of water quality under the changing fire and precipitation regimes of arid lands.

Item: Univariate Skeleton Prediction in Multivariate Systems Using Transformers (Springer Nature, 2024-08)
Morales, Giorgio; Sheppard, John W.
Symbolic regression (SR) methods attempt to learn mathematical expressions that approximate the behavior of an observed system. However, when dealing with multivariate systems, they often fail to identify the functional form that explains the relationship between each variable and the system's response.
To begin to address this, we propose an explainable neural SR method that generates univariate symbolic skeletons aiming to explain how each variable influences the system's response. By analyzing multiple sets of data generated artificially, where one input variable varies while the others are held fixed, relationships are modeled separately for each input variable. The response of such artificial data sets is estimated using a regression neural network (NN). Finally, the multiple sets of input–response pairs are processed by a pre-trained Multi-Set Transformer that solves a problem we term Multi-Set Skeleton Prediction and outputs a univariate symbolic skeleton. These skeletons thus represent explanations of the function approximated by the regression NN. Experimental results demonstrate that the method learns skeleton expressions matching the underlying functions and outperforms two GP-based and two neural SR methods.

Item: Advancing Retail Predictions: Integrating Diverse Machine Learning Models for Accurate Walmart Sales Forecasting (Sciencedomain International, 2024-06)
C., Cyril Neba; F., Gerard Shu; Nsuh, Gillian; A., Philip Amouda; F., Adrian Neba; Webnda, F.; Ikpe, Victory; Orelaja, Adeyinka; Sylla, Nabintou Anissia
In the rapidly evolving landscape of retail analytics, accurate prediction of sales figures holds paramount importance for informed decision-making and operational optimization. Leveraging diverse machine learning methodologies, this study aims to enhance the precision of Walmart sales forecasting, using a comprehensive dataset sourced from Kaggle. Exploratory data analysis reveals intricate patterns and temporal dependencies within the data, prompting the adoption of advanced predictive modeling techniques.
Through the implementation of linear regression and ensemble methods such as Random Forest, Gradient Boosting Machines (GBM), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), this research endeavors to identify the most effective approach for predicting Walmart sales. Comparative analysis of model performance showcases the superiority of advanced machine learning algorithms over traditional linear models. The results indicate that XGBoost emerges as the optimal predictor for sales forecasting, with the lowest Mean Absolute Error (MAE) of 1226.471, a Root Mean Squared Error (RMSE) of 1700.981, and an exceptionally high R-squared value of 0.9999900, indicating near-perfect predictive accuracy. This performance significantly surpasses that of simpler models such as linear regression, which yielded an MAE of 35632.510 and an RMSE of 80153.858. Insights from bias and fairness measurements underscore the effectiveness of advanced models in mitigating bias and delivering equitable predictions across temporal segments. Our analysis revealed varying levels of bias across the models. Linear Regression, Multiple Regression, and GLM exhibited moderate bias, suggesting some systematic errors in predictions. Decision Tree showed slightly higher bias, while Random Forest demonstrated negative bias, implying systematic underestimation of predictions. By contrast, GBM, XGBoost, and LightGBM displayed biases closer to zero, indicating more accurate predictions with minimal systematic error. Notably, the XGBoost model demonstrated the lowest bias, −7.548432 (Table 4), reflecting its superior ability to minimize prediction errors across different conditions. Additionally, fairness analysis showed that XGBoost maintained robust performance in both holiday and non-holiday periods, with an MAE of 84273.385 for holidays and 1757.721 for non-holidays.
Insights from the fairness measurements revealed that Linear Regression, Multiple Regression, and GLM showed consistent predictive performance across both subgroups. Decision Tree performed similarly for holiday predictions but was more accurate for non-holiday sales, whereas the Random Forest, XGBoost, GBM, and LightGBM models displayed lower MAE values for the non-holiday subgroup, indicating potential fairness issues in predicting holiday sales. The study also highlights the importance of model selection and the impact of advanced machine learning techniques on achieving high predictive accuracy and fairness. Ensemble methods such as Random Forest and GBM also performed strongly, with Random Forest achieving an MAE of 12238.782 and an RMSE of 19814.965, and GBM achieving an MAE of 10839.822 and an RMSE of 1700.981. This research emphasizes the significance of leveraging sophisticated analytics tools to navigate the complexities of retail operations and drive strategic decision-making. By utilizing advanced machine learning models, retailers can achieve more accurate sales forecasts, ultimately leading to better inventory management and enhanced operational efficiency. The study reaffirms the transformative potential of data-driven approaches in driving business growth and innovation in the retail sector.

Item: Sampling Bounds for Topological Descriptors (Undergraduate Scholars Program, 2024-04)
Fasy, Brittany; Millman, David; Micka, Samuel; Padula, Luke; Makarchuk, Maksym
Increasingly, topological descriptors such as the Euler characteristic curve and persistence diagrams are used to represent complex data. Recent studies suggest that a carefully selected set of these descriptors can encode geometric and topological information about shapes in d-dimensional space.
In practical applications, epsilon-nets are employed to sample data, presenting two extremes: oversampling, where epsilon is small enough to ensure a comprehensive representation but may lead to computational inefficiency, and undersampling, where epsilon lacks a grounded rationale, offering faster computation but risking an incomplete shape description without theoretical guarantees. This research investigates the phenomena of oversampling and undersampling, examining their prevalence across synthetic and real-world datasets. It experimentally verifies excessive oversampling in theory-guided approaches and examines the implications of undersampling, shedding light on the behavior and consequences of both extremes. We establish lower bounds on the number of descriptors required for exact encodings and explore the trade-offs associated with undersampling, contributing insights into the potential information loss and its impact on the overall shape representation.
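For readers unfamiliar with the descriptors involved, one of them, the Euler characteristic curve, can be computed directly from a filtered complex. A minimal sketch, assuming the complex is encoded as (dimension, birth-value) pairs (an illustrative encoding, not the paper's construction):

```python
# Sketch of an Euler characteristic curve (ECC) for a filtered
# simplicial complex. Each simplex is a (dimension, birth_value) pair;
# chi(t) sums (-1)^dim over simplices born at or before threshold t.
def euler_characteristic_curve(simplices, thresholds):
    return [
        sum((-1) ** dim for dim, birth in simplices if birth <= t)
        for t in thresholds
    ]

# A filtered triangle: 3 vertices enter at t=0, 3 edges at t=1,
# and the 2-dimensional face at t=2.
triangle = [(0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 1), (2, 2)]
print(euler_characteristic_curve(triangle, [0, 1, 2]))  # → [3, 0, 1]
```

At t=0 the three isolated vertices give χ=3; the edges close a loop at t=1 (χ=0); filling the face at t=2 restores χ=1, the Euler characteristic of a disk.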