Computer Science

Permanent URI for this communityhttps://scholarworks.montana.edu/handle/1/31

The Computer Science Department at Montana State University supports the Mission of the College of Engineering and the University through its teaching, research, and service activities. The Department educates undergraduate and graduate students in the principles and practices of computer science, preparing them for computing careers and for a lifetime of learning.

Browse

Search Results

Now showing 1 - 5 of 5
  • Thumbnail Image
    Item
    Improving RNA Assembly via Safety and Completeness in Flow Decompositions
    (Mary Ann Liebert Inc, 2022-12) Khan, Shahbaz; Kortelainen, Milla; Cáceres, Manuel; Williams, Lucia; Tomescu, Alexandru I.
    Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking, transportation planning, to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width. On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage ( = 50% more) compared with the other safe algorithms. On the other hand, the greedy-width algorithm although reporting a better coverage, it also reports a significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms (by = 20%) greedy-width on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has a superior time (4 - 5x) and space performance (1.2 - 2.2x), resulting in a better and more practical approach for bioinformatic applications of flow decomposition.
  • Thumbnail Image
    Item
    Efficient Minimum Flow Decomposition via Integer Linear Programming
    (Mary Ann Liebert Inc, 2022-11) Dias, Fernando H.C.; Williams, Lucia; Mumey, Brendan; Tomescu, Alexandru I.
    Minimum flow decomposition (MFD) is an NP-hard problem asking to decompose a network flow into a minimum set of paths (together with associated weights). Variants of it are powerful models in multiassembly problems in Bioinformatics, such as RNA assembly. Owing to its hardness, practical multiassembly tools either use heuristics or solve simpler, polynomial time-solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Here, we provide the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP). Key to our approach is an encoding of all the exponentially many solution paths using only a quadratic number of variables. We also extend our ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors. On both simulated and real-flow splicing graphs, our approach solves any instance in <13 seconds. We hope that our formulations can lie at the core of future practical RNA assembly tools. Our implementations are freely available on Github.
  • Thumbnail Image
    Item
    Flow Decomposition with Subpath Constraints
    (Institute of Electrical and Electronics Engineers, 2022-01) Williams, Lucia; Tomescu, Alexandru I. loan; Mumey, Brendan
    Flow network decomposition is a natural model for problems where we are given a flow network arising from superimposing a set of weighted paths and would like to recover the underlying data, i.e.,decompose the flow into the original paths and their weights. Thus, variations on flow decomposition are often used as subroutines in multiassembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form of subpaths, and many tools incorporate these heuristically. But despite acknowledging their utility in practice, previous work has not formally addressed the effect of subpath constraints on the accuracy of flow network decomposition approaches. We formalize the flow decomposition with subpath constraints problem, give the first algorithms for it, and study its usefulness for recovering ground truth decompositions. For finding a minimum decomposition, we propose both a heuristic and an FPT algorithm. Experiments on RNA transcript datasets show that for instances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically.
  • Thumbnail Image
    Item
    Maximal Perfect Haplotype Blocks with Wildcards
    (2020-05) Williams, Lucia; Mumey, Brendan
    Recent work provides the first method to measure the relative fitness of genomic variants within a population that scales to large numbers of genomes. A key component of the computation involves finding maximal perfect haplotype blocks from a set of genomic samples for which SNPs (single-nucleotide polymorphisms) have been called. Often, owing to low read coverage and imperfect assemblies, some of the SNP calls can be missing from some of the samples. In this work, we consider the problem of finding maximal perfect haplotype blocks where some missing values may be present. Missing values are treated as wildcards, and the definition of maximal perfect haplotype blocks is extended in a natural way. We provide an output-linear time algorithm to identify all such blocks and demonstrate the algorithm on a large population SNP dataset. Our software is publicly available.
  • Thumbnail Image
    Item
    Reconstructing embedded graphs from persistence diagrams
    (2020-10) Belton, Robin Lynne; Fasy, Brittany T.; Mertz, Rostik; Micka, Samuel; Millman, David L.; Salinas, Daniel; Schenfisch, Anna; Schupbach, Jordan; Williams, Lucia
    The persistence diagram (PD) is an increasingly popular topological descriptor. By encoding the size and prominence of topological features at varying scales, the PD provides important geometric and topological information about a space. Recent work has shown that well-chosen (finite) sets of PDs can differentiate between geometric simplicial complexes, providing a method for representing complex shapes using a finite set of descriptors. A related inverse problem is the following: given a set of PDs (or an oracle we can query for persistence diagrams), what is underlying geometric simplicial complex? In this paper, we present an algorithm for reconstructing embedded graphs in Rd (plane graphs in R2) with n vertices from n2 −n+d+1 directional (augmented) PDs. Additionally, we empirically validate the correctness and time-complexity of our algorithm in R2 on randomly generated plane graphs using our implementation, and explain the numerical limitations of implementing our algorithm.
Copyright (c) 2002-2022, LYRASIS. All rights reserved.