Flow decomposition algorithms for multiassembly problems
Williams, Lucia Gean
Current genetic sequencing technologies allow fast and cheap measurement of short substrings of a genetic sequence, called reads, which must be assembled to recover the full unknown sequence. In some cases, such as when assembling RNA transcripts or the genomes of a mixture of species taken in a single sample, the reads come from multiple sequences. In these cases, we would like to recover all of the distinct unknown sequences and their relative abundances, a task we call multiassembly. A common model underlying many multiassembly approaches is flow decomposition, which decomposes a flow network into a set of paths and weights that parsimoniously explains the flow. In this dissertation, we formalize two new variations on flow decomposition to better model the information available when performing multiassembly from reads. The first, inexact flow decomposition, allows for some uncertainty in the flow measurements. The second, flow decomposition with subpath constraints, incorporates additional information that may be provided by longer reads. We give algorithms to solve these problems and demonstrate their usefulness for RNA assembly on a simulated dataset. Additionally, we give the first polynomial-size integer linear programming (ILP) formulation for minimum flow decomposition (MFD) and show that it can be adapted to encode both of the variants mentioned above. An implementation of this formulation using the ILP solver CPLEX runs faster than existing exact MFD solvers on RNA sequencing datasets.
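To illustrate what a flow decomposition looks like, the following is a minimal Python sketch of a greedy heuristic that repeatedly peels off the source-to-sink path with the largest bottleneck. It is not the dissertation's algorithm or its ILP formulation. The function names, the edge-dictionary input format, and the toy network are assumptions made for illustration. Because minimum flow decomposition is NP-hard, this heuristic always returns a valid decomposition but not necessarily one with the fewest paths.

```python
import heapq


def widest_path(residual, source, sink):
    """Return (path, bottleneck) for the source-to-sink path whose smallest
    remaining edge flow is largest, or (None, 0) if no such path exists."""
    # Build adjacency lists from edges that still carry flow.
    adj = {}
    for (u, v), f in residual.items():
        if f > 0:
            adj.setdefault(u, []).append((v, f))

    best = {source: float("inf")}  # best known bottleneck to each node
    pred = {}                      # predecessor on that best path
    heap = [(-best[source], source)]
    while heap:
        neg_w, u = heapq.heappop(heap)
        w = -neg_w
        if w < best.get(u, 0):
            continue  # stale heap entry
        for v, f in adj.get(u, []):
            cand = min(w, f)
            if cand > best.get(v, 0):
                best[v] = cand
                pred[v] = u
                heapq.heappush(heap, (-cand, v))

    if sink not in best:
        return None, 0
    path = [sink]
    while path[-1] != source:
        path.append(pred[path[-1]])
    path.reverse()
    return path, best[sink]


def greedy_width_decomposition(flow, source, sink):
    """Decompose a flow into weighted paths (hypothetical helper).

    flow maps (u, v) edges to positive flow values that satisfy
    conservation at every node other than source and sink.
    """
    residual = dict(flow)
    paths = []
    while True:
        path, weight = widest_path(residual, source, sink)
        if path is None:
            return paths
        # Subtract the path's weight from every edge it uses.
        for u, v in zip(path, path[1:]):
            residual[(u, v)] -= weight
        paths.append((path, weight))


# Toy network (assumed for illustration): 6 units leave s and split at a.
toy_flow = {
    ("s", "a"): 6,
    ("a", "b"): 2,
    ("a", "c"): 4,
    ("b", "t"): 2,
    ("c", "t"): 4,
}
decomposition = greedy_width_decomposition(toy_flow, "s", "t")
for path, weight in decomposition:
    print(" -> ".join(path), "weight", weight)
```

In the multiassembly setting, each returned path would correspond to one candidate sequence (for example, an RNA transcript) and its weight to that sequence's estimated abundance.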