Browsing by Author "Zou, Peng"


Computing a consensus trajectory in a vehicular network
(Springer Science and Business Media LLC, 2022-09) Zou, Peng; Qingge, Letu; Yang, Qing; Zhu, Binhai
In this paper, we study the problem of computing a consensus trajectory of a vehicle given the history of Points of Interest visited by the vehicle over a certain period of time. The problem arises when a system tries to establish the social connection between two vehicles in a vehicular network. Three versions of the problem are studied. Formally, given a set of m trajectories S_1, ..., S_m (sequences over an alphabet Σ), the first version of the problem is to compute a target (median) sequence T over Σ such that the sum of the similarity measure (i.e., the number of adjacencies) between T and all S_i is maximized. For this version, we show that the problem is NP-hard and we present a simple factor-2 approximation based on a greedy method. We implement the greedy algorithm and a variation of it which is based on a more natural greedy search on a new data structure called an adjacency map. In the second version of the problem, where the sequence T is restricted to be a permutation, we show that the problem remains NP-hard but the approximation factor can be improved to 1.5. In the third version, where the sequence T has to contain all letters of Σ, we again prove that it is NP-hard. We implement a simple greedy algorithm and a variation of the 1.5-approximation algorithm for the second version, both of which are used to construct a solution for the third version. Our algorithms are tested on simulated data and the empirical results are very promising.
Computing the Tandem Duplication Distance is NP-Hard
(Society for Industrial & Applied Mathematics, 2022-03) Lafond, Manuel; Zhu, Binhai; Zou, Peng
In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment; this can be represented as the string operation AXB⇒AXXB. Tandem exon duplications have been found in many species such as human, fly, and worm and have been widely studied in computational biology. The tandem duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet Σ, compute the length of the shortest sequence of TDs required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that TDs have received much attention ever since. In this paper, we focus on the special case when all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. We first prove that this problem is NP-hard when the alphabet size is unbounded, settling the 16-year-old open problem. We then show how to adapt the proof to |Σ|=4, hence proving the NP-hardness of the TD problem for any |Σ|≥4. One of the tools we develop for the reduction is a new problem called Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems.
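The string operation AXB⇒AXXB in the abstract above can be illustrated with a short Python sketch. It is not taken from the paper: the function names `tandem_duplicate` and `td_distance` are hypothetical, and the distance routine is a plain breadth-first search that takes exponential time in general (consistent with the NP-hardness result), so it is only meant for very short strings.

```python
from collections import deque


def tandem_duplicate(s, start, end):
    """Return s with the segment s[start:end] copied right after itself (AXB -> AXXB)."""
    if not (0 <= start < end <= len(s)):
        raise ValueError("segment must be a non-empty slice of s")
    return s[:end] + s[start:end] + s[end:]


def td_distance(s, t):
    """Brute-force TD distance from s to t, or None if t is unreachable.

    Every tandem duplication makes the string longer, so the search only
    generates strings of length at most len(t) and therefore terminates.
    """
    if s == t:
        return 0
    seen = {s}
    queue = deque([(s, 0)])
    while queue:
        cur, steps = queue.popleft()
        budget = len(t) - len(cur)  # longest segment we may still duplicate
        for start in range(len(cur)):
            max_end = min(len(cur), start + budget)
            for end in range(start + 1, max_end + 1):
                nxt = tandem_duplicate(cur, start, end)
                if nxt == t:
                    return steps + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    return None
```

For example, `tandem_duplicate("abc", 1, 2)` duplicates the middle letter, and `td_distance("ab", "aabb")` searches over the intermediate strings of length 3.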
Dispersing and grouping points on planar segments
(Elsevier BV, 2022-09) He, Xiaozhou; Lai, Wenfeng; Zhu, Binhai; Zou, Peng
Motivated by (continuous) facility location, we study the problem of dispersing and grouping points on a set of segments (of streets) in the plane. In the former problem, given a set of n disjoint line segments in the plane, we investigate the problem of computing a point on each of the n segments such that the minimum Euclidean distance between any two of these points is maximized. We prove that this 2D dispersion problem is NP-hard; in fact, it is NP-hard even if all the segments are parallel and are of unit length. This is in contrast to the polynomial solvability of the corresponding 1D problem by Li and Wang (2016), where the intervals are in 1D and are all disjoint. With this result, we also show that the Independent Set problem on Colored Linear Unit Disk Graph (meaning the convex hulls of points with the same color form disjoint line segments) remains NP-hard, and the parameterized version of it is in W[2]. In the latter problem, given a set of n disjoint line segments in the plane, we study the problem of computing a point on each of the n segments such that the maximum Euclidean distance between any two of these points is minimized. We present a factor-1.1547 approximation algorithm which runs in time. Our results can be generalized to the Manhattan distance.
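To make the two objectives in the abstract above concrete, the following Python sketch evaluates a candidate solution: one point chosen on each segment. It does not solve either problem and is not the paper's algorithm. The helper names `point_on_segment` and `pairwise_extremes`, and the choice to describe a point by a parameter t in [0, 1] along its segment, are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def point_on_segment(segment, t):
    """Point at fraction t (0 <= t <= 1) along a segment ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = segment
    return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))


def pairwise_extremes(points):
    """Return (min, max) Euclidean distance over all pairs; needs at least two points.

    Dispersion wants the first value as large as possible,
    grouping wants the second value as small as possible.
    """
    distances = [dist(p, q) for p, q in combinations(points, 2)]
    return min(distances), max(distances)
```

A candidate is then scored by mapping each segment and its parameter through `point_on_segment` and passing the resulting points to `pairwise_extremes`.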
Duplications and deletions in genomes: theory and applications
(Montana State University - Bozeman, College of Engineering, 2022) Zou, Peng; Chairperson, Graduate Committee: Binhai Zhu
In computational biology, duplications and deletions in genome rearrangements are important for understanding an evolutionary process. In cancer genomics research, intra-tumor genetic heterogeneity is one of the central problems. Gene duplications and deletions are observed occurring rapidly in cancer during tumour formation. Hence, they are recognized as critical mutations of cancer evolution. Understanding these mutations is important for understanding the origins of cancer cell diversity, which could help with cancer prognostics as well as with explaining drug resistance. In this dissertation, first, we prove that the tandem duplication distance problem is NP-complete, even if |Σ| ≥ 4, settling a 16-year-old open problem. We also obtain a positive result by showing that if one of the input sequences, S, is exemplar, then one can decide if S can be transformed into T using at most k tandem duplications in time 2^{O(k^2)} + poly(n). Motivated by computing duplication patterns in sequences, a new fundamental problem called the longest letter-duplicated subsequence (LLDS) is investigated. We investigate several variants of this problem. Due to fast mutations in cancer, genome rearrangements on copy number profiles are used more often than on the genomes themselves. We explore the Minimum Copy Number Generation problem. We prove that it is NP-hard to even obtain a constant factor approximation. We also show that the corresponding parameterized version is W[1]-hard. These results either improve the previous hardness result or solve an open problem. We then give a polynomial algorithm for the Copy Number Profile Conforming problem. Finally, we investigate the pattern matching with 1-reversal distance problem. With the known results on Longest Common Extension queries, one can design an O(n+m) time algorithm for this problem. However, we find empirically that this algorithm is very slow for small m.
We then design an algorithm based on the Karp-Rabin fingerprints which runs in an expected O(nm) time. The algorithms are implemented and tested on a real bacterial sequence dataset. The empirical results show that the shorter the pattern length is (i.e., when m < 200), the more substrings with 1-reversal distance the bacterial sequences have.
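The pattern matching problem with 1-reversal distance can be sketched with a naive O(nm) scan in Python. This is a baseline for illustration only: it uses neither Longest Common Extension queries nor Karp-Rabin fingerprints, so it is not the dissertation's algorithm. It assumes that a window matches when it equals the pattern after reversing at most one substring (plain reversal, no complementing), and the function names are hypothetical. The check relies on the observation that if a single reversal works, then reversing exactly the span from the first to the last mismatch also works.

```python
def within_one_reversal(window, pattern):
    """True if window equals pattern after reversing at most one substring of window."""
    if len(window) != len(pattern):
        return False
    mismatches = [k for k in range(len(pattern)) if window[k] != pattern[k]]
    if not mismatches:
        return True  # exact match, zero reversals needed
    first, last = mismatches[0], mismatches[-1]
    # Reverse the span covering all mismatches and compare it with the pattern.
    return window[first:last + 1][::-1] == pattern[first:last + 1]


def find_one_reversal_matches(text, pattern):
    """Start positions in text whose length-m window is within one reversal of pattern."""
    m = len(pattern)
    return [
        pos
        for pos in range(len(text) - m + 1)
        if within_one_reversal(text[pos:pos + m], pattern)
    ]
```

Each window costs O(m) to check, which gives O(nm) in the worst case for a text of length n.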
Finding disjoint dense clubs in an undirected graph
(Montana State University - Bozeman, College of Engineering, 2016) Zou, Peng; Chairperson, Graduate Committee: Binhai Zhu
For over a decade, software like Twitter, Facebook and WeChat has changed people's lives by making it easy to create social groups and networks. These platforms give people a new, convenient 'world' where we can share everything that happens around us, and social networks have grown enormously in recent years. In essence, social networks are full of data and have become an indispensable part of our life. Trust is an important feature of the relationship between two users in a social network. With the development of social networks, the trust among their members has become a big issue. In a social network, the trust among its members usually cannot be carried over many users. In the corresponding social network modeled as a graph, a user is denoted by a vertex and an edge between two vertices means that these two users communicate a lot (above some threshold) and they trust each other. An online social community usually corresponds to a dense region in such a graph. A complex social network is usually composed of several groups/communities (the regions with a lot of edges), and this characterization of community structure means the appearance of densely connected groups of vertices, with only sparse connections between groups. For analyzing the structure of social networks and the relationship between users, it is important to find disjoint groups/communities with a small diameter and with a decent size, formally called dense clubs in this thesis. We focus on handling this NP-complete problem in this thesis. First, from the parameterized computational complexity point of view, we show that this problem does not admit a polynomial kernel (implying that it is unlikely that reduction rules can be applied to obtain a practically small problem size). Then, we focus on the dual version of the problem, i.e., deleting 'd' vertices to obtain some disjoint dense clubs.
We show that this dual problem admits a simple FPT algorithm using a bounded search tree method (the running time is still too high for practical datasets). Finally, we combine a simple reduction rule with some heuristic methods to obtain a practical solution (verified by extensive testing on practical datasets). Empirical results show that this heuristic algorithm is very sensitive to all parameters. The algorithm is suitable for graphs which have a mixture of dense and sparse regions. Such graphs are very common in the real world.
The longest letter-duplicated subsequence and related problems
(Springer Science and Business Media LLC, 2024-07) Lai, Wenfeng; Liyanage, Adiesha; Zhu, Binhai; Zou, Peng
Motivated by computing duplication patterns in sequences, a new problem called the longest letter-duplicated subsequence (LLDS) is proposed. Given a sequence S of length n, a letter-duplicated subsequence is a subsequence of S in the form of x_1^{d_1} x_2^{d_2} ... x_k^{d_k} with x_i ∈ Σ, x_j ≠ x_{j+1} and d_i ≥ 2 for all i in [k] and j in [k − 1]. A linear time algorithm for computing a longest letter-duplicated subsequence (LLDS) of S can be easily obtained. In this paper, we focus on two variants of this problem: (1) the 'all-appearance' version, i.e., all letters in Σ must appear in the solution, and (2) the weighted version. For the former, we obtain dichotomous results: We prove that, when each letter appears in S at least 4 times, the problem and a relaxed version on feasibility testing (FT) are both NP-hard. The reduction is from (3^+, 1, 2^−)-SAT, where all 3-clauses (i.e., containing 3 literals) are monotone (i.e., containing only positive literals) and all 2-clauses contain only negative literals. We then show that when each letter appears in S at most 3 times, the problem admits an O(n) time algorithm. Finally, we consider the weighted version, where the weight of a block x_i^{d_i} (d_i ≥ 2) could be any positive function which might not grow with d_i. We give a non-trivial O(n^2) time dynamic programming algorithm for this version, i.e., computing an LD-subsequence of S whose weight is maximized.
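The abstract above notes that the unweighted, unconstrained LLDS can be computed in linear time. One possible way to do so is the dynamic program sketched below in Python. It is an illustration under stated assumptions and may differ from the authors' algorithm; the function name `llds_length` and the tables `best` and `ending` are hypothetical, and the linear bound assumes constant-time dictionary operations. The sketch uses the fact that a block of letter c ending at position i can always take every occurrence of c between its first and last position.

```python
def llds_length(s):
    """Length of a longest letter-duplicated subsequence of s (0 if none exists)."""
    n = len(s)
    neg_inf = float("-inf")
    best = [0] * (n + 1)          # best[i]: optimum over the prefix s[:i]
    ending = [neg_inf] * (n + 1)  # ending[i]: optimum whose last block ends at s[i-1]
    last_seen = {}                # letter -> most recent 1-based position
    for i in range(1, n + 1):
        letter = s[i - 1]
        prev = last_seen.get(letter)
        if prev is not None:
            # Either open a new block with the two occurrences (prev, i),
            # or extend a block that already ended at prev by one letter.
            ending[i] = max(best[prev - 1] + 2, ending[prev] + 1)
        best[i] = max(best[i - 1], ending[i])
        last_seen[letter] = i
    return best[n]
```

Two adjacent blocks of the same letter can be merged without changing the total length, so the requirement x_j ≠ x_{j+1} does not need to be enforced explicitly in the recurrence.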