Browsing by Author "Mumey, Brendan"
Now showing 1 - 7 of 7
Item Efficient Minimum Flow Decomposition via Integer Linear Programming (Mary Ann Liebert Inc, 2022-11)
Dias, Fernando H.C.; Williams, Lucia; Mumey, Brendan; Tomescu, Alexandru I.
Minimum flow decomposition (MFD) is an NP-hard problem asking to decompose a network flow into a minimum set of paths (together with associated weights). Variants of it are powerful models in multiassembly problems in Bioinformatics, such as RNA assembly. Owing to its hardness, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Here, we provide the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP). Key to our approach is an encoding of all the exponentially many solution paths using only a quadratic number of variables. We also extend our ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors. On both simulated and real-flow splicing graphs, our approach solves any instance in <13 seconds. We hope that our formulations can lie at the core of future practical RNA assembly tools. Our implementations are freely available on GitHub.

Item Flow Decomposition with Subpath Constraints (Institute of Electrical and Electronics Engineers, 2022-01)
Williams, Lucia; Tomescu, Alexandru I. Ioan; Mumey, Brendan
Flow network decomposition is a natural model for problems where we are given a flow network arising from superimposing a set of weighted paths and would like to recover the underlying data, i.e., decompose the flow into the original paths and their weights. Thus, variations on flow decomposition are often used as subroutines in multiassembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form of subpaths, and many tools incorporate these heuristically.
But despite acknowledging their utility in practice, previous work has not formally addressed the effect of subpath constraints on the accuracy of flow network decomposition approaches. We formalize the flow decomposition with subpath constraints problem, give the first algorithms for it, and study its usefulness for recovering ground truth decompositions. For finding a minimum decomposition, we propose both a heuristic and an FPT algorithm. Experiments on RNA transcript datasets show that for instances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically.

Item Genetic dissection of natural variation in oilseed traits of camelina by whole-genome resequencing and QTL mapping (Wiley, 2021-06)
Li, Huang; Hu, Xiao; Lovell, John T.; Grabowski, Paul P.; Mamidi, Sujan; Chen, Cindy; Amirebrahimi, Mojgan; Kahanda, Indika; Mumey, Brendan; Barry, Kerrie; Kudrna, David; Schmutz, Jeremy; Lachowiec, Jennifer; Lu, Chaofu
Camelina [Camelina sativa (L.) Crantz] is an oilseed crop in the Brassicaceae family that is currently being developed as a source of bioenergy and healthy fatty acids. To facilitate modern breeding efforts through marker-assisted selection and biotechnology, we evaluated genetic variation among a worldwide collection of 222 camelina accessions. We performed whole-genome resequencing to obtain single nucleotide polymorphism (SNP) markers and to analyze genomic diversity. We also conducted phenotypic field evaluations in two consecutive seasons for variations in key agronomic traits related to oilseed production such as seed size, oil content (OC), fatty acid composition, and flowering time. We determined the population structure of the camelina accessions using 161,301 SNPs.
Further, we identified quantitative trait loci (QTL) and candidate genes controlling the above field-evaluated traits by genome-wide association studies (GWAS) complemented with linkage mapping using a recombinant inbred line (RIL) population. Characterization of the natural variation at the genome and phenotypic levels provides valuable resources to camelina genetic studies and crop improvement. The QTL and candidate genes should assist in breeding of advanced camelina varieties that can be integrated into the cropping systems for the production of high yield of oils of desired fatty acid composition.

Item Hijacking a rapid and scalable metagenomic method reveals subgenome dynamics and evolution in polyploid plants (Wiley, 2024-04)
Reynolds, Gillian; Mumey, Brendan; Strnadova-Neeley, Veronika; Lachowiec, Jennifer
Premise. The genomes of polyploid plants archive the evolutionary events leading to their present forms. However, plant polyploid genomes present numerous hurdles to the genome comparison algorithms for classification of polyploid types and exploring genome dynamics. Methods. Here, the problem of intra- and inter-genome comparison for examining polyploid genomes is reframed as a metagenomic problem, enabling the use of the rapid and scalable MinHashing approach. To determine how types of polyploidy are described by this metagenomic approach, plant genomes were examined from across the polyploid spectrum for both k-mer composition and frequency with a range of k-mer sizes. In this approach, no subgenome-specific k-mers are identified; rather, whole-chromosome k-mer subspaces were utilized. Results. Given chromosome-scale genome assemblies with sufficient subgenome-specific repetitive element content, literature-verified subgenomic and genomic evolutionary relationships were revealed, including distinguishing auto- from allopolyploidy and putative progenitor genome assignment.
The sequences responsible were the rapidly evolving landscape of transposable elements. An investigation into the MinHashing parameters revealed that the downsampled k-mer space (genomic signatures) produced excellent approximations of sequence similarity. Furthermore, the clustering approach used for comparison of the genomic signatures is scrutinized to ensure applicability of the metagenomics-based method. Discussion. The easily implementable and highly computationally efficient MinHashing-based sequence comparison strategy enables comparative subgenomics and genomics for large and complex polyploid plant genomes. Such comparisons provide evidence for polyploidy-type subgenomic assignments. In cases where subgenome-specific repeat signal may not be adequate given a chromosome's global k-mer profile, alternative methods that are more specific but more computationally complex outperform this approach.

Item Maximal Perfect Haplotype Blocks with Wildcards (2020-05)
Williams, Lucia; Mumey, Brendan
Recent work provides the first method to measure the relative fitness of genomic variants within a population that scales to large numbers of genomes. A key component of the computation involves finding maximal perfect haplotype blocks from a set of genomic samples for which SNPs (single-nucleotide polymorphisms) have been called. Often, owing to low read coverage and imperfect assemblies, some of the SNP calls can be missing from some of the samples. In this work, we consider the problem of finding maximal perfect haplotype blocks where some missing values may be present. Missing values are treated as wildcards, and the definition of maximal perfect haplotype blocks is extended in a natural way. We provide an output-linear time algorithm to identify all such blocks and demonstrate the algorithm on a large population SNP dataset.
Our software is publicly available.

Item Safety in multi-assembly via paths appearing in all path covers of a DAG (Institute of Electrical and Electronics Engineers, 2021-01)
Caceres, Manuel; Mumey, Brendan; Husic, Edin; Rizzi, Romeo; Cairo, Massimo; Sahlin, Kristoffer; Tomescu, Alexandru I. Ioan
A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths) that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporates practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases.
With their increased length relative to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.

Item Width Helps and Hinders Splitting Flows (Association for Computing Machinery, 2024-01)
Cáceres, Manuel; Cairo, Massimo; Grigorjew, Andreas; Khan, Shahbaz; Mumey, Brendan; Rizzi, Romeo; Tomescu, Alexandru I.; Williams, Lucia
Minimum flow decomposition (MFD) is the NP-hard problem of finding a smallest decomposition of a network flow/circulation X on a directed graph G into weighted source-to-sink paths whose weighted sum equals X. We show that, for acyclic graphs, considering the width of the graph (the minimum number of paths needed to cover all of its edges) yields advances in our understanding of its approximability. For the version of the problem that uses only non-negative weights, we identify and characterise a new class of width-stable graphs, for which a popular heuristic is an O(log Val(X))-approximation (Val(X) being the total flow of X), and strengthen its worst-case approximation ratio from Ω(√m) to Ω(m/log m) for sparse graphs, where m is the number of edges in the graph. We also study a new problem on graphs with cycles, Minimum Cost Circulation Decomposition (MCCD), and show that it generalises MFD through a simple reduction. For the version allowing also negative weights, we give a (⌈log ‖X‖⌉ + 1)-approximation (‖X‖ being the maximum absolute value of X on any edge) using a power-of-two approach, combined with parity fixing arguments and a decomposition of unitary circulations (‖X‖ ≤ 1), using a generalised notion of width for this problem. Finally, we disprove a conjecture about the linear independence of minimum (non-negative) flow decompositions posed by Kloster et al. [2018], but show that its useful implication (polynomial-time assignments of weights to a given set of paths to decompose a flow) holds for the negative version.
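
Several items in this listing concern decomposing a flow on a directed acyclic graph into weighted source-to-sink paths. As a rough illustration of what such a decomposition looks like, the Python sketch below repeatedly peels off the path with the largest bottleneck. This is a simple greedy heuristic chosen for illustration only; it is not the ILP formulation, the FPT algorithm, or the approximation analysis of the listed papers, and it is not guaranteed to return a minimum decomposition. The function names and the edge-dictionary input format are assumptions made for this example.

```python
from collections import defaultdict


def widest_path(residual, source, sink):
    """Return (path, bottleneck) for a max-bottleneck source-to-sink path in a DAG."""
    succ = defaultdict(list)
    for (u, v), f in residual.items():
        if f > 0:
            succ[u].append(v)

    best = {}  # node -> (best bottleneck from node to sink, next node on that path)

    def visit(u):
        if u == sink:
            return float("inf")
        if u in best:
            return best[u][0]
        best[u] = (0, None)
        for v in succ[u]:
            w = min(residual[(u, v)], visit(v))
            if w > best[u][0]:
                best[u] = (w, v)
        return best[u][0]

    weight = visit(source)
    if weight == 0:
        return [], 0
    path = [source]
    while path[-1] != sink:
        path.append(best[path[-1]][1])
    return path, weight


def greedy_decompose(flow, source, sink):
    """Greedily split a DAG flow {(u, v): value} into (path, weight) pairs."""
    residual = dict(flow)  # work on a copy so the input is left untouched
    paths = []
    while True:
        path, weight = widest_path(residual, source, sink)
        if weight == 0:
            break
        for edge in zip(path, path[1:]):
            residual[edge] -= weight
        paths.append((path, weight))
    return paths
```

Calling `greedy_decompose({("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 2, ("a", "b"): 3, ("b", "t"): 6}, "s", "t")` yields a list of weighted paths whose superposition reproduces the input flow; an exact method is needed when the number of paths must be provably minimal.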