This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 1 Flow Decomposition with Subpath Constraints Lucia Williams, Alexandru I. Tomescu, Brendan Mumey F Abstract—Flow network decomposition is a natural model for problems initial paths is available in the form of subpath constraints. where we are given a flow network arising from superimposing a set These are subpaths in the graph that must be followed of weighted paths and would like to recover the underlying data, i.e., by at least one path in the flow decomposition; thus, we decompose the flow into the original paths and their weights. Thus, are looking for flow decompositions with the property that variations on flow decomposition are often used as subroutines in mul- every constraint is a subpath of some decomposition path. tiassembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form We call the resulting problem flow decomposition with subpath of subpaths, and many tools incorporate these heuristically. But despite constraints (FDSC). acknowledging their utility in practice, previous work has not formally In a version of this work presented at WABI 2021, we addressed the effect of subpath constraints on the accuracy of flow left open the question of whether FDSC (not necessarily network decomposition approaches. We formalize the flow decompo- minimal) can be solved in polynomial time. In this updated sition with subpath constraints problem, give the first algorithms for it, work, we give a polynomial time algorithm for FDSC, and and study its usefulness for recovering ground truth decompositions. For present additional discussion on the hardness of two FDSC finding a minimum decomposition, we propose both a heuristic and an variants. FPT algorithm. Experiments on RNA transcript datasets show that for in- stances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions 1.1 Biological setting are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically. Algorithms that solve variations of the flow decomposition problem are at the heart of most RNA transcript assembly Index Terms—Flow decomposition, subpath constraints, RNA-Seq. software, including IsoLasso [1], Traph [2], FlipFlop [3], Scallop [4] and StringTie [5]. More recently, flow decom- position methods were used for another multi-assembly 1 INTRODUCTION problem, namely strain-aware genome assembly, with ap- Flow networks are useful models in many domains, from plications to viral quasispecies assembly [6], [7]. Briefly, transportation planning to computational biology. In some flow decomposition methods for sequence assembly work cases, the flow on a graph arises as the superposition of by using reads and their abundances to first construct a flow some set of weighted paths, such as trips through a road network whose vertices may represent exons (in the case of network, routing of information through a communication an RNA splice graph) or k-mers (in the case of a de Bruijn network, or paths in a graph encoding mixed reads se- graph). Edges in the network are present if there is read quenced from several biological sequences, as in the case evidence that some sequence followed the edge (e.g. two of RNA transcripts through a splice graph. exons are consecutive in some transcript). Furthermore, each In many such applications, we are actually presented edge is weighted by the number of reads that support it. with the inverse problem: given a flow in a graph, we With perfect data, we might expect the weights to directly need to recover the initial paths that made up the flow. provide a flow in the network; however in practice some This problem is also referred to as the flow decomposition adjustment to the weights may be needed to achieve a flow. (FD) problem. In computational biology, this is a common One such method uses a minimum-cost flow approach for subroutine in multiassembly problems, such as RNA tran- this adjustment [2]. Another approach [8] models the input script assembly or viral quasispecies assembly. Prioritizing as an inexact flow network in which edge flows belong to parsimonious solutions proved to be an accurate assembly intervals, that are estimated from the data. In all cases, method, but it can suffer when there are multiple parsimo- we seek a path decomposition for the flow network that nious solution to choose from. As such, in this paper we minimizes the number of paths. consider a natural generalization of the flow decomposition In Kloster et al. [9] it was shown that in the case of problem, by assuming that extra information about the RNA transcripts, most of the time the “true” transcripts also provide a minimum flow decomposition of the splice graph. However, there can often be more than one solution to the • Lucia Williams and Brendan Mumey are with the School of Computing, Montana State University, Bozeman, MT, USA. minimum flow decomposition problem; indeed, Kloster et E-mail: brendan.mumey@montana.edu al. found that, when the number of true transcripts is seven, • Alexandru Tomescu is with the Department of Computer Science, Univer- the minimum flow decomposition found corresponds to the sity of Helsinki, Finland. true paths in only 80% of the instances of that size, with This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 2 lower accuracies as the number of true paths increases. constraints correspond to restricted classes of flows in some In fact, practical methods for RNA assembly methods also DAG [26], [22], and thus the minimum FDSC problem is have a precision of 50%-60% on some human datasets [10], a strict generalization of the MPC problem with subpath [5]. Adding subpath constraints to the flow decomposition constraints. problem may further restrict the solution space, thus im- proving RNA assembly accuracy. 1.3 Contributions In practice, the subpath constraints can be derived from In this work, we initiate the formal study of the FDSC prob- reads overlapping three or more nodes of the flow graph. lem. This is a natural model for multiassembly problems, as Long RNA-Seq reads naturally have this property in many seen by the abundance of methods and tools that incorpo- cases; however, also short reads can exhibit this behavior in rate subpath information for improving RNA and viral qua- the case of short exons. As we review below, other possible sispecies assemblies. However, because finding a minimum sources of such constraints exist in practice as well, such as solution to the FDSC problem is NP-hard, these methods from partial assemblies, or super-reads [5] constructed from and tools have focused on either heuristic approaches or a short reads that can be uniquely extended. polynomial-time solvable particular version of the problem Finally, most of the RNA assembly tools cited above (MPC) that ignores valuable edge weight information. Here, work in a so-called genome-guided setting in which also a we make two advances that bring us closer to being able reference genome of the studied species is available. This to use the complete version of the problem in practical makes the splice graph acyclic (i.e. a DAG). While both tools. On the theoretical side, we formalize the problem and the original flow decomposition problem and our variant give the first algorithm to determine whether an instance with subpath constraints can be defined in flow networks is feasible (Theorem 18), and produce a solution if it is. with cycles (which would correspond to a de novo assembly The algorithm works via a reduction to the standard flow setting), in this paper we focus on DAGs only. decomposition problem where any solution must translate to a solution in the original graph that satisfies all of the subpath constraints. Based on this reduction, in Section 3.2 1.2 Related work we also propose a “bridge-reweighting” heuristic algorithm Finding a flow decomposition with the minimum number to solve the minimum FDSC problem. Additionally, we of weighted paths is a well-studied problem in computer give an FPT algorithm for the minimum FDSC problem science. Even when restricted to DAGs, the minimum FD (Theorem 20), extending the one of Kloster et al. [9]. Finally, problem is NP-hard [11], and thus various practical ap- in Section 4, to add to the complexity picture around the proaches to it exist: approximation algorithms [12], [13], [14], FDSC problem, we show that two application-oriented FD [15], [16], [17], FPT algorithms [9], greedy algorithms [11], problems related to FDSC are NP-hard in the strong sense, [18]. By taking the set of subpaths constraints to be empty even without requiring a solution with a minimum number (or to correspond to all edges of the graph with non-zero of paths. flow), it follows that also finding a solution to the FDSC We implement both algorithms for FDSC, and perform a problem with a minimum number of paths is NP-hard. proof-of-concept study of their usefulness in RNA assembly. The idea of improving RNA assembly by multi-edge We experiment on a dataset developed by Shao et al. [18] subpath information is in fact used by several flow-based to study their heuristic for the minimum FD problem. The tools, such as Scallop [4] and StringTie [5]. However, both same dataset was then used by Kloster et al. [9], who approaches integrate subpaths in a heuristic manner, with focused on studying the usefulness of standard minimum no overall formulation of the computational problem they flow decompositions in RNA assembly, as explained above. are solving. The same holds also for the viral quasispecies We find that our FDSC algorithms increase our ability to assembler [6]. Recently, the method TransBorrow [19] uses uncover the ground truth RNA transcripts, as more and partial assemblies from different RNA assembly tools, and longer subpath constraints are included in the input. This works by heuristically extending the subpaths they corre- holds both when minimality is enforced, through our FPT spond to in a splice graph. algorithm, and when it is only heuristically sought, through Moreover, our FDSC problem generalizes a related prob- the flow decomposition reduction and an associated path lem on DAGs. Recall that in the minimum path cover (MPC) reduction heuristic. For example, when there are seven problem, we are looking for a minimum-cardinality set of ground truth transcripts, we increase the accuracy by 13% path that together cover all nodes of a DAG (e.g. “explain” when an optimal solution is found (via FPT) and 30% when all exons of a splice graph). The problem is behind early a heuristic solution is found. RNA assembly methods such as Cufflinks [20], and early Our FDSC algorithm runs in polynomial time (i.e. al- virus quasispecies assembly methods such as ShoRAH [21]. ways finds a solution to the FDSC problem, not necessarily The MPC problem has been extended to include subpath minimum). Though we desire a minimum such path de- constraints as well [22], [23], [24], [25], by analogously composition, algorithms that guarantee such solutions in requiring that each constraint is a subpath of some solution general may be too slow to be used in practice. Despite a path. While these generalizations are polynomially-time lack of minimality guarantees in our heuristic FDSC algo- solvable, they (together with the initial MPC formulations) rithm, our experiments show that the addition of subpath are usually unsatisfactory since they ignore the weights of constraints yields solutions that approach the accuracy of a the graph (i.e. the abundances of the reads)—recall that minimum decomposition without subpath constraints; thus, most state-of-the-art RNA assembly methods cited above our results show that heuristic FDSC is a practical substitute are flow-based. Moreover, MPCs and MPCs with subpath for minimum FD without subpath constraints. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 3 x a x x d x 3 FDSC ALGORITHMS s c t 3.1 FDSC reduces to FD 1 b 1 1 e 1 We now describe a reduction from the FDSC problem to the FD problem. The idea is to convert subpath constraints into Fig. 1. An example FDSC flow network with the flow values of the edges edges in an FD instance so that any path decomposition of being 1 or x; the dashed paths indicate the subpath constraints. If x = 1, the FD instance translates into a path decomposition for the then the instance is infeasible. If x = 2, then the instance is feasible and requires three paths to decompose (whereas the associated FD FDSC instance that covers the subpath constraints. instance without subpath constraints can be decomposed with two paths Given a flow network G = (V,E, f) with subpath in both cases). constraints R, we define the overdemand of an edge as do(e) = max(0, |{i : e ∈ Ri}| − f(e)), (4) 2 PRELIMINARIES and say that e is overdemanded if do(e) > 0. The FDSC prob- Since flow weights represent read counts, we restrict atten- lem (G,R) may be feasible if multiple subpaths covering e tion to integral flow networks and flow decompositions. are satisfied by a single path in a path decomposition. If no edges are overdemanded, we can give a simple Definition 1. A flow network G = (V,E, f) is a directed reduction from FDSC to FD by transforming all subpath acyclic graph (DAG) comprised of a set of vertices V con- constraints in the FDSC instance into edges in the FD taining a source s and sink t, a set of directed edges E, and instance. We address this case in Section 3.1.1 and the case a flow function∑f : E → N, such t∑hat for v ∈ V \ {s, t}, with overdemanded edges in Section 3.1.2. f(u, v) = f(v, w). (1) 3.1.1 Instances without overdemanded edges u:(u,v)∈E w:(v,w)∈E Lemma 7. Let G = (V,E, f) be a flow network with Finally, for each ∈ , there is an - path in that includes subpath constraints R such that no edge is overdemanded.v V s t G . Let G ′ = (V,E′, f ′) be the flow network that results from v converting every subpath constraint Ri = [v1, v2, . . . , v|Ri|] Definition 2 (Flow decomposition). A flow decomposition for to a bridge edge e ′i = (v1, v|R ) with f (ei| i) = 1 and a flow network G = (V,E, f) consists of a set of s-t paths subtracting one from the flow values on the edges it covers. P = (P ′ ′1, . . . , Pk) and associated integral flow weights w = That is, for all e ∈ E, f (e) = f(e) − |{i : e ∈ Ri}|. G is a (w1, . . . , wk) with wi ∈∑N such that for each edge e ∈ E, flow network.Proof. Consider building G′ from G iteratively by convert- wi = f(e). (2) ing each subpath constraint into a new edge and subtracting i:e∈Pi its flow from the edges it covers. At each step, conservation We define several problems concerning finding decom- of flow holds. Thus, after the final step, f ′ is a flow on G′. positions of flow networks into paths. Additionally, because no edge is overdemanded, all flow values in f ′ are nonnegative. Thus,G′ is a flow network. Problem 3 (MFD). Given a flow network G = (V,E, f), the minimum flow decomposition problem is to find a decomposi- Figure 3 shows an example of the reduction of an FDSC tion (P, w) such that |P| is minimized. instance to a FD instance with a bridge edge. ′ Definition 4 (Flow decomposition with subpath con- Lemma 8. Let G and G be as described in Lemma 7. Let′ straints). Let G = (V,E, f) be a flow network. Subpath (P , w) be a size k solution to the FD problem on G ′. There constraints are simple paths R = {R , . . . , R } in G such exists a size k solution to the FDSC problem on G.1 ` that no Ri ⊆ Rj . A flow decomposition (P, w) satisfies the Proof. We show how to construct a size k solution to FDSC subpath constraints if and only if on G from (P ′, w). For each path P ′ ∈ P ′, create a new path P by replacing all bridge edges e′i with the original sequence∀Ri ∈ R ∃Pj ∈ P such that Ri is a subpath of Pj (3) of nodes Ri. By construction, Ri must form a path from the (in short, Ri ∈ Pj). start node of ei to the end node of ei in P , and so P is a valid path from s to t in G. We take P to be the set of all k Figure 1 shows an example of a flow network with such paths P . We now must show that (P, w) forms a flow subpath constraints. decomposition with subpath constraints for G. ′ Problem 5 (FDSC). Given a flow networkG = (V,E, f) and Let e be any edge in G and let R ⊆ R be the set of subpath constraints R, the flow decomposition with subpath subpath constraints containing e. We can divide the paths constraints problem is to determine if there exists, and if so, in P that cover e into two disjoint sets: PB , those that′ find a flow decomposition (P, w) satisfying (3). covered bridge edges ei : Ri ∈ R , and PO, those those that covered the original edge e in G′. Because (P ′, w) is a Problem 6 (MFDSC). Given a flow network G = (V,E, f) flow decomposition for G′, each path in PB must have unit and subpath constraints R, the minimum flow decomposition weight. Thus, paths in PB contribute |{i : R ′i ∈ R }| to e’s with subpath constraints problem is to determine if there flow. On the other hand, paths in PO must cover e’s flow exists, and if so, find a flow decomposition (P, w) satisfying in G′, which is f(e) − |{i : R ′i ∈ R }|. Thus, paths from (3) such that |P| is minimized. PB and PO together cover e with exactly f(e) units of flow. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 Additionally, PB must satisfy all of the subpath constraints contain e. By Lemma 9, there exists a P ∈ P andRi, Rj ∈ R′ R′, so together PB and PO do as well. such that Ri =6 Rj and Ri, Rj ∈ P . Since Ri and Rj both belong to P and overlap (since they each contain e), Because any FDSC instance without overdemanded it follows that they are compatible. If R and R are not edges can be solved by reduction to FD, it follows that all i jdirectly compatible, there must exist some R such that R FDSC instances without overdemanded edges are feasible. k iand Rk both contain e and are directly compatible. In this Corollary 8.1. Let G = (V,E, f) be a flow network with case, take Rj to be Rk. Furthermore, the path P satisfies the subpath constraints R. A sufficient condition for a flow subpath constraint Ri ∪ Rj , so (P, w) is a feasible solution decomposition to exist is that no edge is overdemanded. for (G,R∪ {Ri ∪Rj} \ {Ri, Rj}). (←) Let Ri and Rj be directly-compatible subpaths that 3.1.2 Resolving overdemanded edges both contain e. Let (P, w) be a feasible solution to (G,R ∪ When an FDSC instance has an overdemanded edge, the {Ri ∪ Rj} \ {Ri, Rj}). It follows that there exists a path reduction given above fails, because any overdemanded P ∈ P that covers Ri ∪ Rj . Clearly, P also covers Ri and edge would have a negative flow value after subtracting all Rj , so (P, w) is also a feasible solution for (G,R). of its demands from its original flow. However, if the FDSC R Corollary 12.1. Let (G,R) be an FDSC instance. If thereinstance (G, ) is feasible, it is possible to first transform R R∗ are no compatible subpaths Ri and Rj containing some(G, ) to an FDSC instance (G, ), where no edge is overdemanded edge e, then (G,R) is infeasible. overdemanded and any path decomposition for (G,R∗) also provides a feasible path decomposition for (G,R). By Lemma 12 suggests that we can determine feasibility Lemma 7, (G,R∗) can be solved via reduction to an FD by finding combinations of subpath constraints that are instance. We now show how to obtain (G,R∗), if it exists. satisfied by the same paths. We can then think of merging R these subpath constraints together to form an equivalentLemma 9. Let (G, ) be a feasible FDSC instance with instance without overdemanded edges. One way to deter- overdemanded edge e and (P, w) be a path decomposi- R R′ ⊆ R mine feasibility, then, would be to consider every possibletion for (G, ). Let be the set of subpath con- ∈ P way of merging subpaths; however, there are an exponentialstraints that contain e. There is some P such that |{Ri : Ri ∈ R′, Ri ∈ P}| ≥ number of such possibilities. 2. To do this it polynomial time, we define a new graph Proof. Suppose not. That is, suppose (P, w) is a path decom- built from the subpath constraints, and show that certain position for (G,R) but no path in P covers two or more path coverings in this graph correspond to valid ways to subpath constraints inR′ completely. This means that every merge the subpath constraints. If no such path cover exists, subpath constraint in R′ must be satisfied by a different then the instance is infeasible. path; call this set of paths P ′ and let the total weight We first define the new graph. assigned to these paths be w′ ≥ |P ′| = |R′| = |{i : e ∈ Ri}|. |{ ∈ }| Definition 13 (Constraint graph). Let (G,R) be an FDSCAs e is overdemanded, we have i : e Ri > f(e). But ′ P instance. We define its constraint graph G c as the graph then w > f(e), contradicting the fact that ( , w) is a path where the vertices are constraints inR and an edge (R ,R ) decomposition for (G,R). i jindicates that Ri and Rj are directly compatible. Inspired by the above lemma, we consider pairs of An example constraint graph for an FDSC instance is subpath constraints that may be satisfied by the same path shown in Fig. 2. in a decomposition. Let L be the total length of all subpath constraints and Definition 10 (Compatible subpaths). Two subpaths recall that ` denotes the number of subpath constraints. R ,R ∈ R are compatible if and only if R and R have We can construct the constraint graph Gc in O(L + `3i j i j ) a suffix-prefix overlap (so that Ri ∪ Rj forms a simple path time by first using Gusfield’s algorithm for all pairs suffix- in G). prefix overlaps in O(L + `2) time [27] and then finding the transitive reduction of this relation in O(`3) time using Definition 11 (Directly-compatible subpaths). Two sub- ∈ R Aho’s algorithm with standard matrix multiplication [28].paths Ri, Rj are directly compatible if and only if Ri (Note that the original all pairs suffix-prefix overlap relation and Rj are compatible and there does not exist a subpath is acyclic since no subpath constraint is properly contained Rk such that Ri and Rk are compatible and Rk and Rj are inside another, and G is a DAG.) compatible. Remark 14. Gc is a DAG. We remark that the directly-compatible relation is just the transitive reduction of the compatible relation. Definition 15 (Edge-induced subgraph). Let e be an edge c c Lemma 12. Let (G,R) be an FDSC instance with some in G. The edge-induced subgraph G (e) is the subgraph of Gc overdemanded edge . Then P is a solution for R consisting of all vertices Ri in G (and induced edges) suche ( , w) (G, ) if and only if there exist directly-compatible subpaths that e ∈ R .R ii and Rj , each containing e, such that (P, w) is a solution for We now show that a certain path cover of the constraint (G,R∪ {Ri ∪Rj} \ {Ri, Rj}). graph corresponds to a valid way to merge subpath con- Proof. (→) Let (G,R) be a feasible FDSC instance with straints. overdemanded edge e. Let (P, w) be a path decomposition Lemma 16. Let (G,R) be an FDSC instance and let Gc be for (G,R). LetR′ ⊆ R be the set of subpath constraints that its constraint graph. (G,R) is feasible if and only if there This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 5 is a vertex-disjoint path cover Pc of Gc such that, for every a edge e in G, at most f(e) paths in Pc visit Gc(e). c d e f Proof. (→) Assume (G,R) is feasible. By Lemma 12 and b Corollary 12.1, it is possible to merge subpath constraints (a) FDSC instance until no edge is overdemanded. The resulting constraint subpaths are each formed from the union of original con- R1 straint subpaths in G that were directly compatible, so each R3 resulting subpath corresponds to a path in Gc. Since each R2 original subpath belongs to exactly one of the final subpaths, (b) Constraint graph all such paths provide a vertex-disjoint path cover Pc of Gc. Let e ∈ G. Since e is not overdemanded, the number of Fig. 2. Illustration of the greedy choice for extending paths used in paths from Pc that visit Gc(e) is at most f(e). Algorithm 1. When processing node R3 in the constraint graph in Fig- (←) Let Pc be a vertex-disjoint path cover for c such ure 2b, the path from R2 is extended because the overlap between R ’sG 2subpath constraint and R3’s subpath constraint is greater than R1’s and that, for every edge e in G, at most f(e) paths in Pc visit R3’s. Gc(e). Create a new FDSC instance (G,R′) as follows: For each path in Pc, add a single subpath constraint to R′ that corresponds to union of the subpath constraints in the path. in Gc, consider the set of e ∈ G for which e belongs to both In (G,R′) no edge is overdemanded, since no edge in G had Ri and R; these edges form a path in G that is a prefix of more associated paths in P c than its flow. Thus (G,R′) is R. Thus, choosing (Ri, R) will benefit — i.e., not increase feasible and has a solution (P, w). Since any path covering — the visitation counts to exactly those edges in this prefix a merged subpath constraint must cover the original un- of R. It follows that simply choosing the (Ri, R) that has merged subpath constraints it contains, (P, w) also provides the longest suffix-prefix overlap will minimize all visitation a solution to (G,R). counts. By considering the vertices in topological order, we can extend paths in O(1) time. There is a simple greedy strategy (Algorithm 1 Since Gc has ` vertices, we can perform the topological and Fig. 2) to find a vertex-disjoint path cover Pc of Gc sort in O(`2) time. Each edge is examined once during the that minimizes the number of paths intersecting Gc(e) for algorithm, and we assume that suffix-prefix overlaps have all edges e in G. been pre-computed during the construction of Gc, so the total running time of this step is O(`2). Algorithm 1 Greedy algorithm to find a vertex-disjoint path cover for Gc such that for all edges e in G, the minimum possible number of paths in the cover visit Gc(e). Because we can find an optimal path cover in polynomial 1: function PATHCOVER((G,R), Gc) time, we can check the feasibility of an FDSC instance (and, 2: V ← topological sorting R , . . . , R of vertices of Gc if it is feasible, find a solution) in polynomial time.1 ` 3: Pc ← ∅ Theorem 18. Let (G,R) be an FDSC instance with |R| = `, 4: for Ri ∈ V do total length of all subpath constraints L, and at least one 5: U ← all in-neighbors of Ri overdemanded edge. In O(L + `3) time, we can determine 6: if |U | > 0 then whether (G,R) is feasible, and if so, reduce (G,R) to an 7: ui ← the u ∈ U with greatest number of edges equivalent FD instance with at most ` additional edges. in suffix-prefix overlap with Ri 8: Extend the path ending at ui to end at Ri Proof. As discussed above, Gc and the suffix-prefix overlaps 9: else can be found inO(L+`3) time. We can then use Algorithm 1 10: Add new path starting and ending at Ri to Pc to find an optimal vertex-disjoint path cover P c of Gc in 11: end if O(`2) time. Because the path cover is optimal, Lemma 16 12: end for tells us that (G,R) is feasible if and only if for every 13: return Pc edge e of G, at most f(e) paths in P c visit Gc(e); this 14: end function can be checked in O(L) time. If the instance is feasible, by Lemma 16, we can merge the subpath constraints in each Lemma 17. In O(`2) time, Algorithm 1 finds a vertex- path found by Algorithm 1, yielding an equivalent FDSC disjoint path cover for Gc such that for every edge e in G, instance with no overdemanded edges. Then, we can use the minimum possible number of paths in the cover visit the method of Lemma 7 to transform that FDSC instance Gc(e). into an equivalent FD instance with at most ` additional edges. Proof. Consider any vertex R in Gc, with incoming edges from R1, . . . , Ra. Clearly, at most one such edge (Ri, R) can belong to any vertex-disjoint path cover. Observe that for 3.2 A heuristic algorithm for MFDSC any edge e in G, whether (Ri, R) belongs to a path in the In practice, we can run an MFD heuristic algorithm to deter- cover only affects the number of cover paths visiting Gc(e) mine a solution to the FD instance found via the reduction provided both R ci and R belong to G (e); in other words, e in the previous section. We use greedy-width, first proposed belongs to both Ri and R. For each incoming edge (Ri, R) in [11], which greedily chooses the heaviest (“widest”) paths This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 6 30 a 10 logarithm of the largest flow value present. To solve MFD, Toboggan tests increasing k values until a solution is found. s 20 c We briefly describe Toboggan’s approach and then discuss 980 1000 how to modify the algorithm so that it can also check if anb FD solution satisfies subpath constraints. (a) FDSC instance Toboggan considers the vertices ofG in topological order and computes a table Ti for each vertex vi using dynamic 10(29) a 10 programming. Table entries are of the form (g, L), where 0(19) g indicates how paths from the previous table Ti−1 ares 20(1) c extended, and L is a linear system indicating how the 980 1000 weights of these paths are constrained to satisfy the flowb requirements on all edges encountered so far. This linear (b) Reduced graph system can be written as Aw = b, where A is a binary matrix of k columns representing whether each row’s edge Fig. 3. Demonstration of reduction and bridge reweighthing procedure used in the heuristic MFDSC algorithm. The resulting FD instance in is covered by each column’s path, w is the length k solution Figure 3b is solved using greedy-width. The dashed edge is a bridge vector, and b is the flow on the row’s edge. Because there edge for the corresponding subpath constraint. Weights in parentheses are k weights and all coefficients are integers, each linear are the weights before bridge reweighting. system can be reduced to k linearly independent rows. As noted in [9], testing an integer linear system L for feasibility 2.5k+o(k) in order to decompose the flow1. As G′ is a DAG, a greedy- and finding a solution can be done in O(k |L|) time, O(1) width path can found in O(|V |+ |E|+ `) time, by standard where |L| is shown to be k λ. dynamic programming. In [11] it is shown that at most When the final vertex in the order is reached, these linear |E| − |V | + 2 greedy-width paths can be found, so the systems indicate the path flow constraints on all edges in G, total time to find an FD solution is O(|E|(|V | + |E| + `)). and so, if a particular system is feasible, the corresponding Translating the FD solution back to the original graph (fol- paths and weights provide an FD solution. lowing Lemma 8) yields a path decomposition for the FDSC To modify Toboggan to also consider the subpath con- problem. However, in applications, we are often interested straints, for the final table T|V |, we add a second linear in finding solution to the MFDSC problem, i.e. finding a so- system to simultaneously satisfy of the form Aw ≥ b, T lution with the minimum number of paths. The introduction where A is an ` × k binary matrix and b = (d1, . . . , d`). of bridge edges in the reduction described above may lead Here A(i, j) ∈ {0, 1} indicates whether path Pj contains Ri. to more paths being required to decompose the reduced FD We give an updated version of a lemma [9, Lemma 5] that instance than the original FDSC instance. This is because bounds the number of distinct linear systems in the final we now must find paths through bridge edges, as well as table. in the original flow network. For this reason, we apply a k2+ k`Lemma 19. The final table has at most 4 2 distinct linear bridge reweighting heuristic before decomposing the network k!kksystems. in order to reduce the number of paths. For some arbitrary ordering of the bridge edges, we do the following: Proof. We follow the proof of [9, Lemma 5]. Since A is k` 1) For each bridge edge, find the minimum flow f an ` × k binary matrix, there are 2 possible systems ofmin over the flow values on the edges of its correspond- the second form. We must multiply this by the number of ing subpath constraint in the original network. Since flow matching systems which was bounded ([9, Lemma 5]) k2 the FDSC is feasible, fmin ≥ 0. by 4 . So, the total number of possible combined linear k!kk 2 k` 2) Subtract fmin from each of the subpath constraint 2 k + systems is 2k` 4 k = 4 2k k . edges, and add f k!k k!kmin to the bridge edge. Theorem 20. Let (G,R) be an FDSC instance with |R| = ` Since the bridge edge starts at the first node of the subpath and λ is the logarithm of the largest flow value in the input. constraint and ends at the last, flow conservation holds Modifying the Toboggan algorithm as described provides and the mapping of the bridge paths back to the original O(k2) network again provides a solution to the FDSC instance. an FPT algorithm for MFDSC with running time 2 |V |+O(k2+k`) O(1) Figure 3 demonstrates the reduction to an FD instance and 2 (k + `) λ. the bridge reweighting step on an example FDSC instance. Proof. Kloster et al. prove ([9, Lemma 4]), that in any table √Ti, the number of distinct g values present is at most 3.3 An FPT scheme for MFDSC k(0.649k)k. This implies (following [9, Theorem 7]) that In this section, we describe an extension of Toboggan [9], there are at most an FPT algorithm for decomposing DAG flows, to also 4k 2+ k` 2 k`2 √ √ k + 2 k 4 0.649 k handle subpath constraints. Toboggan is able to find a k- · k(0.649k) = kk!kk k! path decomposition for a flow network G = (V,E, f), 2 if one exists, in 2O(k ) · (|V | + λ)) time, where λ is the final linear systems L to check for integer solutions. The encoding size of a linear system L is now bounded by (k + O(1) 1. Greedy-width is a common heuristic algorithm for MFD. Other `) λ, where λ is the logarithm of the largest flow value heuristics such as Catfish [18] could also be substituted. in the input. Checking feasibility and finding a solution for This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 7 L can now be done in O(k2.5k+o(k)(k+ `)O(1)λ) time, so the f(e1) = a1 f(b1) = B total time needed to check all such linear systems is at most f(e2) = a2 f(b2) = B √ k2+ k`4 2 0.649k k ·O(k2.5k+o(k) s t(k + `)O(1)λ) k! ≤ O(4k(k+ `2 )k1.5k1.765kko(k)(k + `)O(1)λ), (5) f(e3q) = a3q f(bq) = B k using the fact that k kk! ≤ e . The total running time of the 2 algorithm becomes 2O(k )|V |+ 2O(k2+k`)(k + `)O(1)λ. F∑ig. 4. Given an instance of 3-PARTITION A = {a1, a2, . . . , a3q} with a∈A a = qB , we construct the flow network with edges e1, . . . , e3q , b1, . . . , bq . For all i ∈ {1, . . . , 3q}, we set f(ei) = ai, and for all 4 H j ∈ {1, . . . , q} we set f(b ) = B. For each e we add the single-edgeARDNESS OF RELATED FLOW DECOMPOSI- j isubpath constraint [ei] with demand di = ai. TION PROBLEMS In this section we add to the computational complexity picture around the FDSC problem by studying two natural Given an input for 3-PARTITION, we construct the flow application-oriented variants of it. In contrast to the FDSC network G = (V,E, f) with subpath constraints R with de- problem, where deciding if the instance is feasible can be mands D as in Figure 4. We claim that this is a YES instance done in polynomial time (recall Theorem 18), for both of for 3-PARTITION{ if and only if the reduction cr}eates a YES-these problems feasibility checking is NP-complete in the instance for FDSCD. strong sense (i.e. even if the input flow values are bounded (→) Assume {aj1 , aj2 , aj3 | j ∈ {1, . . . , q}} is a proper by a polynomial in the size of the input; in fact, the former 3-partition of A. For each j ∈ {1, . . . , q}, we build the one is NP-complete even when all flow values equal 1). ∑three flow paths (ej1 , bj), (ej2 , bj), (ej3 , bj), with weightsSince their application setting is slightly different from the aj1 , aj2 , aj3 , respectively. This is possible since f(bj) = B =3 one of the FDSC problem (see our FDSC experiments in k=1 aj . As such, each subpath constraint [e ] of demandk i Section 5), in this paper we do not give algorithms for them, ai is satisfied. and leave that for future work. (←) Let (P, w) be a flow decomposition with subpath First, we consider the version of the problem where the constraints R and demands D of G as indicated. Since subpath constraints have associated demands that must be the demand of each constraint [ei] is ai, and f(ei) = ai met by the flow assigned to a path that covers them. This it follows that each edge ei is used by exactly one flow problem could arise if we would like a subpath constraint to path of weight ai, and thus that P consists of exactly 3q be covered by at least one flow path of higher weight, since paths P1, . . . , P3q , with weights a1, a2, . . . , a3q , respectively. a more heavily-weighted path may be less likely to be result For each j ∈ {1, . . . , q}, consider the flow paths passing of noisy data. thorough bj . Since B/4 < a < B/2 holds for all a ∈ A, there are exactly 3 such paths, say Pj , P1 j , Pj , each of2 3 Definition 21 (Flow decomposition with subpath con- weight aj , aj , aj respectively, passing through bj . As1 2 3 straints and demands). Let G = (V,E, f) be a flow net- such, {{aj , a , a | j ∈ {1, . . . , q}}} is a proper 3-partition1 j2 j3 work. Let R = {R1, . . . , R`} be a set of subpath constraints of A. in G, and let D = {d1, . . . , d`} be a set of demands, where Finally, since 3-PARTITION is NP-complete in the strong each di is associated to subpath constraint Ri. We say that a sense [29], it follows that FDSCD is as well. flow decomposition (P, w) ofG satisfies subpath constraints R and demands D if and only if Our second problem variant is motivated by paired-end reads, which naturally induce pairs of subpath constraints ∀Ri ∈ R ∃Pj ∈ P such that Ri ∈ Pj and di ≤ wj . (6) that must be covered by the same flow path (since e.g. they are sequenced from the same RNA transcript). Problem 22 (FDSCD). Given a flow network G = (V,E, f) and subpaths constraintsR and demands D, the flow decom- Definition 24 (Flow decomposition with paired subpath position with subpaths constraints and demands problem is to constraints). Let G = (V,E, f) be a flow network. Paired determine if there exists, and if so, find a flow decomposi- subpaths constraints are defined to be a set of pairs of simple tion (P, w) satisfying (6). paths R = {(R1, R′1), . . . , (R`, R′`)} in G. A flow decompo- sition (P, w) satisfies the paired subpaths constraints if and We note that FDSC is a special case of FDSCD, where only if all demands are equal to one. The following proof is very similar to the NP-completeness proof of [11] for MFD. ∀(R ′i, Ri) ∈ R ∃Pj ∈ P such that Ri ∈ Pj and R′i ∈ Pj . Theorem 23. FDSCD is NP-complete in the strong sense. (7) Problem 25 (FDPSC). Given a flow network G = (V,E, f) Proof. Clearly, FDSCD belongs to NP. For NP-hardness, and paired subpaths constraints R, the flow decomposition we reduce from the 3-P∑ARTITION problem. The input of with paired subpaths constraints problem is to determine ifthis problem consists of a set of positive integers A = { } there exists, and if so, find a flow decomposition (P, w)a1, a2, . . . , a3q , where a∈A a = qB and B/4 < a < B/2 for all a ∈ satisfying (7).A. The question is whether there exists a partition of A into q disjoint subsets, each with exactly three elements Our NP-completeness proof below is similar to the NP- summing up to B. completeness proof of the minimum path cover problem This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 8 with paired subpath constraints from [30]. Note that hard- 5.1 Datasets ness holds even if all flow values equal 1. As in previous studies on flow decomposition methods Theorem 26. FDPSC is NP-complete even when all flows for RNA-Seq assembly [9], [18], [8], we use a simulated values are 1. RNA-Seq dataset from [18] where each instance is a flow network generated by simulating RNA transcripts and their Proof. Clearly, FDPSC belongs to NP. For NP-hardness, we abundances with Flux-Simulator [32]. The original dataset reduce from the problem of deciding whether the chromatic includes human, mouse, and zebrafish genes, but we restrict number of an undirected graph G = (V,E) is 3. Given our attention to instances in the human dataset, which G, we construct the flow network G∗ = (V ∗, E∗, f), with contains 100 independently generated transcriptomes. As paired subpaths constraints R, as in Figure 5. We claim that in [9], [8], we use only instances with at least two ground the chromatic number of G is 3 if and only if (G∗,R) is a truth paths (since a single ground truth path is trivial to feasible FDPSC instance. decompose). We also restrict the dataset to instances with 10 (→) Let V , V , V be a partition of V such that no edge ground truth paths or fewer, yielding a total of 528,544 in-1 2 3 of G has endpoints in the same Vk. Then, we can construct stances. Because the transcripts and abundances are known, a flow decomposition (P , P , P ∗ we have ground truth paths and weights for each splice1 2 3) of G , where each path has weight 1, as follows. In the vertex part of G∗, suppose graph instance. We measure accuracy as the proportion of v ∈ V ; P follows the edge labeled v , and the other two instances for which the algorithm returns the ground truthi k k i paths follow the other two edges parallel to it, respectively. set of paths and weights exactly. In the edge part of G∗, for each edge e = (vi, vj), where vi ∈ Vk and vj ∈ Vk , path Pk follows the edge labeledi j i v′ and path P follows the edge labeled v′ 5.2 Simulating subpath constraints i kj j . The remaining path follows the middle unlabeled edge. Clearly, all subpath In order to simulate subpath constraints, we take subpaths constraint pairs are satisfied. of the ground truth paths according to two parameters: the ← Any flow decomposition of ∗ has exactly 3 paths, number of subpaths `, and a fixed length for all subpaths( ) G each of weight 1. Let |R|. As noted in [9, Lemma 8], we can simplify the graphP1, P2, P3 be a flow decomposition with paired subpaths constraints of (G∗,R). We construct a by bypassing any vertex with out-degree or in-degree equal partition one. We set |R| as the length of subpaths in this contractedV1, V2, V ∗3 of V : in the vertex part of G , an edge labeled belongs exactly to one , and we assign to . graph. To generate subpath constraints that are consistentvi Pk vi Vk Assume for a contradiction that some edge of across experiments, we fix an arbitrary ordering for thee = (vi, vj) G has endpoints in the same Vk. Recall that for e, R contains ground truth paths for each instance, and take the first a constraint pairing the edge labeled with the one labeled |R| edges of the first ` (contracted) paths as the subpaths.vi v′i, and one constraint pairing the edge labeled vj with the We note that the method of generating subpath constraints one labeled ′ . Because of these constraints, since passes described here does not yield any overdemanded edges.vj Pk through both vi and vj (in the vertex part of G∗), in the edge part of G∗ corresponding to e we have that Pk passed 5.3 Accuracy results through both the edge labeled v′i and the one labeled v ′ j , To study the effect of the subpath constraints on the accu- which is a contradiction. racy of the RNA-Seq assembly, we vary ` and |R| indepen- dently, letting ` ∈ [0, 4] and |R| ∈ {3, 4}. Because instances become more difficult to solve correctly as the number of 5 EXPERIMENTS ground truth paths increases, we separate results by k. The algorithms described in Section 3 were implemented Accuracy results for both algorithms are reported in Table 1. in Python in a package called Coaster2. We refer to the For each k value, we also report the percentage of instances algorithm of Section 3.2 as heuristic MFDSC and Section 3.3 that completed for all parameter combinations tested. As as FPT MFDSC. Experiments were performed on a high already shown by Kloster et al. [9], the MFD solutions found performance research cluster, where each run was executed by Coaster for ` = 0 (for them, Toboggan) do correspond to on a single Intel Xeon Ivy Bridge E (3.4 Ghz) or similar the the ground truth paths and weights most of the time. CPU. We denote the number of groundtruth paths for an However, for larger k values, we can see that FPT MFD instance by k, and set a CPU time limit of 30 seconds for solutions (without subpath constraints) do not necessarily smaller instances (2 ≤ k ≤ 8) and 1 hour for larger instances recover the correct set of paths and weights. For k = 7, for (k = 9, 10). For fairness of comparison, we report only on example, only 81% of the optimal decompositions produced graph instances that ran to completion for all algorithm by Coaster are the ground truth decomposition that we and parameter combinations, unless otherwise mentioned. are seeking. Similarly, we see that FPT MFDSC solutions Overall, we find that heuristic MFDSC completes for all test tend to be correct, with accuracy decreasing as k increased. instances in under one second, and FPT MFDSC completes However, FPT MFDSC has higher accuracy for all parameter in under 30 seconds for all instances with k ≤ 5, which combinations than FPT MFD at the same k value. For k = 7, includes the majority of instances. We give additional details when we add four subpath constraints of length four each, and results in the following sections. the ground truth decomposition is found 91% of the time, a 13% increase over FPT MFD. 2. Coaster is based on the codebase for Toboggan [31] and is available When ` = 0, our heuristic MFDSC algorithm is equiva- at github.com/msu-alglab/coaster. lent to the often-used greedy-width heuristic for MFD; our This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 9 vi′ v1 v2 vs n t vj′ e = (vi, vj) Vertex part Edge part Fig. 5. Given an undirected graph G = (V,E), with V = {v1, v2, . . . , vn}, we construct a flow network G∗ as illustrated. For every vertex vi we add three parallel edges, labelling one of them with vi. For every edge e = (vi, vj) of G, we add three parallel edges, one labeled v′ and one labeled v′ i j . For e we also add one constraint pairing the edge labeled vi from the vertex part, with the edge labeled v ′ i, and one constraint pairing the edge labeled vj from the vertex part with the edge labeled v′j . The flow value of every edge is 1. results show that adding subpath constraints to greedy- 5.4 Effect of the bridge reweighting width increases its accuracy considerably for larger k values, To confirm the effectiveness of the bridge reweighting for example, by 30% when k = 7. The increased accuracy of heuristic for MFDSC, as opposed to simply using a path heuristic MFDSC is also good news for the use of MFDSC in decomposition found by the method of Lemma 8, we mea- practical methods, since heuristic MFD methods are already sured the accuracy of the FDSC algorithm without bridge commonly used in RNA-Seq tools. In fact, the inclusion reweighthing on the same dataset studied above. In that of many long subpath constraints makes heuristic MFDSC case, the addition of subpath constraints in our experiments more accurate than FPT MFD for k values up to 5, which reduces the accuracy of the path decompositions returned, account for 95.6% of the full dataset studied (all k ≥ 2). as shown in Table 2. Bridge reweighting allows the max- Part of the success of the heuristic MFDSC can be imum flow that can cover a subpath constraint to do so, attributed to the fact that it finds optimal solutions in without introducing extra weight-one paths. most cases. Without subpath constraints, heuristic MFDSC (i.e. greedy-width MFD) finds an optimal solution in 98.0% and the ground truth solution in 95.3% of instances in 5.5 Performance results our dataset. With two subpath constraints of length 4, that We measured CPU runtime of the implementation for both increases to 99.0% and 98.1%, respectively. (For ` > 2, small algorithms using all instances for the given k range, even k values are excluded, so results are not comparable with those that timed out for some experimental conditions. For ` = 0 experiments.) heuristic MFDSC on 2 ≤ k ≤ 8 instances, the average, mini- With and without subpath constraints, the vast majority mum, and maximum runtimes were 0.0059s, 0.00096s, and of incorrectly predicted path decompositions are due to the 0.977s. For FPT instances, they were 0.076s, 0.001s, and 30s algorithm returning an optimal decomposition of the same (the maximum time allowed). On k = 9, 10 instances, the size as the ground truth one, but different from it, rather average, minimum, and maximum runtimes were 0.018s, than a too-small optimal decomposition. As found in [9], in 0.004s, 0.155s for heuristic MFDSC and 289.3s, 0.932s, and nearly all instances, the ground truth path decomposition 3600s (the maximum time allowed) for FPT FDSC. is also an optimal decomposition. (They find that 0.043% of Because of the optimizations made in Toboggan [9] on instances of all ground truth k that ran to completion in 50 which our FPT MFDSC implementation is based, memory seconds had non-optimal ground truth decompositions; we use is generally very limited even as k increases. We mea- find that 0.100% of instances that completed in 30 seconds sured the peak memory use for large instances (k = 9, 10) for all parameter combinations and had ground truth k with a timeout limit of 1 hour for all experimental combi- less than 9 had non-optimal ground truth decompositions.) nations (` and |R| values) for the FPT MFDSC algorithm. However, most instances are solved correctly, so it could be All but four instances used under 100MB of memory in the case that the few instances that are not solved correctly all experimental combinations; the average memory use are those that had non-optimal ground truths. This tends over all instances and all experimental combinations, even not to be the case. Overall, only 0.027% of instances for those that timed out after an hour, was 47MB. For most which the FPT for MFD yields incorrect solutions have non- instances, the memory use is dominated by loading the optimal ground truth path decompositions. This is domi- required Python packages (about 40MB). nated by the k = 2 instances, however, for which no instance had a non-optimal ground truth; for k = 3 through k = 8, between 0.1% and 0.3% of instances that were predicted 6 DISCUSSION incorrectly had non-optimal ground truths. With many and In this work, we initiate the formal study of the MFDSC longer subpath constraints (|R| = 4 and ` = 4), it is still problem, which is used as a model in applications such as only a very small number – 0.052% – of incorrect solutions RNA sequencing and viral quasispecies assembly. We give that have non-optimal ground truth path decompositions. both a heuristic algorithm, based on a novel reduction to Thus, this implies that the addition of subpath constraints flow decomposition, and an FPT algorithm, which extends restricts the solution space, allowing the algorithm to return the FPT MFD algorithm of Kloster et al. [9]. Through exper- the correct one more frequently and explaining the increase iments on a previously-studied simulated transcriptomics in accuracy when they are included. dataset, we verify the base assumption underlying the use of This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 10 TABLE 1 Accuracy results using the MFDSC with bridge-reweighting heuristic algorithm (labeled “H-br”) and the MFDSC FPT algorithm (labeled “FPT”). For k = 2 through k = 8 we use a CPU time limit of 30 seconds; for k = 9, 10 we use 1 hour. We only report instances that finished in the time limit for all `, |R|, and for both algorithms for each k; the “pc” column reports the percentage of instances that completed for all runs for each k value. The italicized values agree with the ones reported in [9, Figure 3], with some slight differences due to the fact that we restrict to the human dataset (they studied two additional datasets) and timeout differences. |R| = 3 |R| = 4 k n pc alg ` = 0 ` = 1 ` = 2 ` = 3 ` = 4 ` = 1 ` = 2 ` = 3 ` = 4 2 291734 100% H-br 0.992 0.999 0.999 1.000 1.000 FPT 0.991 0.999 0.999 1.000 1.000 3 130867 100% H-br 0.961 0.977 0.983 0.986 0.985 0.993 0.994 FPT 0.969 0.983 0.990 0.994 0.986 0.996 0.998 4 58167 100% H-br 0.901 0.926 0.941 0.948 0.958 0.942 0.964 0.974 0.979 FPT 0.934 0.952 0.964 0.974 0.983 0.958 0.976 0.987 0.995 5 25933 100% H-br 0.822 0.853 0.873 0.887 0.900 0.876 0.911 0.930 0.944 FPT 0.892 0.913 0.928 0.940 0.953 0.922 0.944 0.962 0.976 6 11774 99.6% H-br 0.727 0.763 0.784 0.805 0.816 0.787 0.831 0.862 0.883 FPT 0.849 0.870 0.885 0.898 0.911 0.881 0.906 0.928 0.944 7 5095 94.6% H-br 0.617 0.659 0.692 0.706 0.729 0.681 0.738 0.775 0.802 FPT 0.810 0.835 0.855 0.871 0.883 0.845 0.872 0.894 0.912 8 2109 83.7% H-br 0.495 0.523 0.558 0.589 0.611 0.545 0.607 0.664 0.702 FPT 0.787 0.808 0.822 0.833 0.845 0.819 0.840 0.863 0.884 9 1323 83.14% H-br 0.455 0.495 0.527 0.565 0.592 0.522 0.582 0.643 0.698 FPT 0.714 0.731 0.742 0.762 0.781 0.745 0.769 0.795 0.821 10 699 69.53% H-br 0.420 0.442 0.459 0.484 0.508 0.465 0.506 0.541 0.578 FPT 0.726 0.747 0.757 0.772 0.784 0.757 0.776 0.807 0.825 TABLE 2 Accuracy values for FD heuristic with (“H-br”) and without (“H-b”) bridge reweighthing, averaged over all instances with k ≤ 2 ≤ 10. If all bridges are kept at weight one, subpath constraints reduce the accuracy of the path decomposition, though less if they are longer. |R| = 3 |R| = 4 ` = 0 ` = 1 ` = 2 ` = 3 ` = 4 ` = 1 ` = 2 ` = 3 ` = 4 H-br 0.951 0.965 0.970 0.942 0.899 0.971 0.979 0.963 0.930 H-b 0.951 0.926 0.809 0.582 0.379 0.945 0.913 0.793 0.633 MFDSC in practical RNA-Seq tools: that the minimum-size accuracy decreases because there are multiple optimal solu- path decomposition should correspond to the ground truth tions to choose from, even with the maximum length and set of paths and weights. Additionally, we show that the use number of subpath constraints that we tested. To increase of subpath constraints increases accuracy when compared accuracy, either more subpath constraints are needed (which to MFD without subpath constraints. We also find that may be possible, depending on the domain), or additional our heuristic algorithm is practical, completing in less than optimality criteria could be used. For example, the two 1 second for all instances studied, and achieves accuracy FDSC variants considered in Section 4 (together with yet levels near those of FPT MFDSC. This is an encouraging to be discovered algorithms for them and possible problem result, because while RNA sequencing data tends toward generalizations) could guide the search for such optimality very small ground truth path sets, other multiassembly criteria. problems such as viral quasispecies assembly may not – for example, some benchmarking datasets of [33] contain 10 and 15 strains, meaning that MFDSC (or even MFD without subpath constraints) would be intractable without REFERENCES a heuristic. [1] W. Li, J. Feng, and T. Jiang, “Isolasso: A LASSO regression The research presented here suggests a number of future approach to RNA-Seq based transcriptome assembly,” Journal directions. One is to develop MFDSC algorithms for graphs of Computational Biology, vol. 18, no. 11, pp. 1693–1707, 2011. [Online]. Available: http://dx.doi.org/10.1089/cmb.2011.0171 containing cycles. Though splice graphs for RNA assembly [2] A. I. Tomescu, A. Kuosmanen, R. Rizzi, and V. Mäkinen, are usually DAGs, graphs for de novo assembly of viral or “A novel min-cost flow method for estimating transcript other genomes would likely contain cycles due to repeated expression with RNA-Seq,” BMC Bioinformatics, vol. 14, no. sequences. Another is to explore additional methods to S-5, p. S15, 2013, proceedings paper from RECOMB-seq: Third Annual RECOMB Satellite Workshop on Massively Parallel increase the accuracy of the decompositions found. Our Sequencing Beijing, China. 11-12 April 2013. [Online]. Available: experiments show that as the size of ground truth gets large, http://dx.doi.org/10.1186/1471-2105-14-S5-S15 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 11 [3] E. Bernard, L. Jacob, J. Mairal, and J. Vert, “Efficient RNA isoform assembly,” BMC Bioinform., vol. 15, no. S-9, p. S5, 2014. [Online]. identification and quantification from RNA-Seq data with network Available: https://doi.org/10.1186/1471-2105-15-S9-S5 flows,” Bioinformatics, vol. 30, no. 17, pp. 2447–2455, 2014. [Online]. [23] A. Kuosmanen, A. Sobih, R. Rizzi, V. Mäkinen, and A. I. Tomescu, Available: http://dx.doi.org/10.1093/bioinformatics/btu317 “On using longer RNA-Seq reads to improve transcript prediction [4] M. Shao and C. Kingsford, “Accurate assembly of transcripts accuracy,” in Proceedings of the 9th International Joint Conference on through phase-preserving graph decomposition,” Nature biotech- Biomedical Engineering Systems and Technologies (BIOSTEC 2016) - nology, vol. 35, no. 12, pp. 1167–1169, 2017. Volume 3: BIOINFORMATICS, Rome, Italy, February 21-23, 2016, [5] M. Pertea, G. M. Pertea, C. M. Antonescu, T.-C. Chang, J. T. J. P. Gilbert, H. Azhari, H. H. Ali, C. Quintão, J. Sliwa, C. Ruiz, Mendell, and S. L. Salzberg, “Stringtie enables improved recon- A. L. N. Fred, and H. Gamboa, Eds. SciTePress, 2016, pp. 272–277. struction of a transcriptome from RNA-Seq reads,” Nature Biotech- [Online]. Available: https://doi.org/10.5220/0005819702720277 nology, vol. 33, no. 3, pp. 290–295, 2015. [24] E. Bao, T. Jiang, and T. Girke, “Branch: boosting RNA-Seq assem- [6] J. A. Baaijens, L. Stougie, and A. Schönhuth, “Strain-aware assem- blies with partial or related genomic sequences.” Bioinformatics, bly of genomes from mixed samples using flow variation graphs,” vol. 29, no. 10, pp. 1250–1259, 2013. in International Conference on Research in Computational Molecular [25] A. Kuosmanen, T. Norri, and V. Mäkinen, “Evaluating approaches Biology. Springer, 2020, pp. 221–222. to find exon chains based on long reads,” Briefings in bioinformatics, [7] J. A. Baaijens, B. V. der Roest, J. Köster, L. Stougie, vol. 19, no. 3, pp. 404–414, 2018. and A. Schönhuth, “Full-length de novo viral quasispecies [26] J. Bang-Jensen and G. Z. Gutin, Digraphs Theory, Algorithms and assembly through variation graph construction,” Bioinform., Applications, 1st ed. Berlin: Springer-Verlag, 2000. vol. 35, no. 24, pp. 5086–5094, 2019. [Online]. Available: [27] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer https://doi.org/10.1093/bioinformatics/btz443 Science and Computational Biology. New York, NY, USA: Cam- [8] L. Williams, G. Reynolds, and B. Mumey, “RNA transcript assem- bridge University Press, 1997. bly using inexact flows,” in 2019 IEEE International Conference on [28] A. V. Aho, M. R. Garey, and J. D. Ullman, “The transitive reduction Bioinformatics and Biomedicine (BIBM), 2019, pp. 1907–1914. of a directed graph,” SIAM Journal on Computing, vol. 1, no. 2, pp.131–137, 1972. [9] K. Kloster, P. Kuinke, M. P. O’Brien, F. Reidl, F. S. Villaamil, B. D. [29] M. R. Garey and D. S. Johnson, “Complexity results for multipro- Sullivan, and A. van der Poel, “A practical FPT algorithm for cessor scheduling under resource constraints,” SIAM Journal on flow decomposition and transcript assembly,” in 2018 Proceedings Computing, vol. 4, no. 4, pp. 397–411, 1975. of the Twentieth Workshop on Algorithm Engineering and Experiments [30] R. Rizzi, A. I. Tomescu, and V. Mäkinen, “On the complexity (ALENEX). SIAM, 2018, pp. 75–86. of minimum path cover with subpath constraints for multi- [10] T. Steijger, J. F. Abril, P. G. Engstrom, F. Kokocinski, T. R. assembly,” in BMC Bioinformatics, vol. 15, no. S9. Springer, 2014, Consortium, T. J. Hubbard, R. Guigo, J. Harrow, and P. Bertone, p. S5. “Assessment of transcript reconstruction methods for RNA-Seq,” [31] K. Kloster, P. Kuinke, M. P. O’Brien, F. Reidl, F. S. Villaamil, B. D. Nat Meth, vol. 10, no. 12, pp. 1177–1184, 12 2013. [Online]. Sullivan, and A. van der Poel, “Toboggan: Version 1.0,” Jun. 2017. Available: http://dx.doi.org/10.1038/nmeth.2714 [Online]. Available: http://dx.doi.org/10.5281/zenodo.821634 [11] B. Vatinlen, F. Chauvet, P. Chrétienne, and P. Mahey, “Simple [32] T. Griebel, B. Zacher, P. Ribeca, E. Raineri, V. Lacroix, R. Guigó, bounds and greedy algorithms for decomposing a flow into a and M. Sammeth, “Modelling and simulating generic RNA-Seq minimal set of paths,” European Journal of Operational Research, vol. experiments with the flux simulator,” Nucleic acids research, vol. 40, 185, no. 3, pp. 1390–1401, 2008. no. 20, pp. 10 073–10 083, 2012. [12] T. Hartman, A. Hassidim, H. Kaplan, D. Raz, and M. Segalov, [33] J. A. Baaijens, A. Z. El Aabidine, E. Rivals, and A. Schönhuth, “How to split a flow?” in 2012 Proceedings IEEE INFOCOM. IEEE, “De novo assembly of viral quasispecies using overlap graphs,” 2012, pp. 828–836. Genome research, vol. 27, no. 5, pp. 835–848, 2017. [13] V. Suppakitpaisarn, “An approximation algorithm for multiroute flow decomposition,” Electronic Notes in Discrete Mathematics, vol. 52, pp. 367 – 374, 2016, iNOC 2015 – 7th International Network Optimization Conference. [Online]. Available: http:// www.sciencedirect.com/science/article/pii/S1571065316300531 [14] K. Pieńkosz and K. Kołtyś, “Integral flow decomposition with minimum longest path length,” European Journal of Operational Brendan Mumey obtained his Ph.D. in Com- Research, vol. 247, no. 2, pp. 414–420, 2015. puter Science and Engineering at the University [15] B. Mumey, S. Shahmohammadi, K. McManus, and S. Yaw, “Parity of Washington in 1997 and joined the faculty balancing path flow decomposition and routing,” in 2015 IEEE at Montana State University in 1998 where he Globecom Workshops (GC Wkshps). IEEE, 2015, pp. 1–6. currently serves as Professor of Computer Sci- [16] G. Baier, E. Köhler, and M. Skutella, “On the k-splittable flow ence. In the spring of 2020, he held a visiting problem,” in European Symposium on Algorithms. Springer, 2002, Fulbright-Nokia Distinguished Chair position at pp. 101–113. the University of Helsinki. He served as the Edi- [17] ——, “The k-splittable flow problem,” Algorithmica, vol. 42, no. tor of ACM SIGACT News, a quarterly periodical 3-4, pp. 231–248, 2005. for the theoretical computer science community, [18] M. Shao and C. Kingsford, “Theory and a heuristic for the mini- from 2008 to 2015. mum path flow decomposition problem,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 16, no. 2, pp. 658– 670, 2017. [19] T. Yu, Z. Mu, Z. Fang, X. Liu, X. Gao, and J. Liu, “Transborrow: genome-guided transcriptome assembly by borrowing assemblies from different assemblers,” Genome Research, vol. 30, no. 8, pp. Alexandru I. Tomescu obtained his PhD in 1181–1190, 2020. [Online]. Available: http://genome.cshlp.org/ Computer Science in 2012 at the University of content/30/8/1181.abstract Udine, Italy. Between 2012 and 2020, he worked [20] C. Trapnell, B. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. van at the University of Helsinki in several postdoc- Baren, S. Salzberg, B. Wold, and L. Pachter, “Transcript assembly toral positions on Algorithmic Bioinformatics, in- and quantification by RNA-Seq reveals unannotated transcripts cluding Academy of Finland Postdoctoral Fellow and isoform switching during cell differentiation,” Nature Biotech- (2014–2017), and Researcher (2019–). In 2020 nology, vol. 28, pp. 511–515, 2010. he was appointed as Associate Professor at the [21] O. Zagordi, A. Bhattacharya, N. Eriksson, and N. Beerenwinkel, University of Helsinki, where he currently leads “ShoRAH: estimating the genetic diversity of a mixed sample from the Graph Algorithms team, of the wider Algo- next-generation sequencing data.” BMC Bioinformatics, vol. 12, rithmic Bioinformatics research group. In 2019 no. 1, pp. 119+, 2011. he obtained the ERC Starting Grant. [22] R. Rizzi, A. I. Tomescu, and V. Mäkinen, “On the complexity of minimum path cover with subpath constraints for multi- This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2022.3147697, IEEE/ACM Transactions on Computational Biology and Bioinformatics 12 Lucia Williams is pursuing a Ph.D. in Computer Science at Montana State University. She com- pleted her B.S. at the University of Washington in 2014. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/