Clustering exact matches of pairwise sequence alignments by weighted linear regression

Cited by: 2
Authors
Gonzalez, Alvaro J. [1]
Liao, Li [1]
Affiliations
[1] Univ Delaware, Comp & Informat Sci Dept, Lab Bioinformat, Newark, DE 19716 USA
Funding
US National Science Foundation
DOI
10.1186/1471-2105-9-102
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: At intermediate stages of genome assembly projects, when a number of contigs have been generated and their validity needs to be verified, it is desirable to align these contigs to a reference genome when one is available. The interest is not in analyzing a detailed base-level alignment between a contig and the reference genome, but rather in obtaining a rough estimate of where the contig aligns to the reference genome, specifically by identifying the starting and ending positions of that region. This information is very useful in ordering the contigs and facilitates post-assembly analysis such as gap closure and repeat resolution. There exist programs, such as BLAST and MUMmer, that can quickly align and identify high-similarity segments between two sequences. When seen in a dot plot, these segments tend to agglomerate along a diagonal but can also be disrupted by gaps or shifted away from the main diagonal due to mismatches between the contig and the reference. Visually inspecting dot plots to identify the regions covered by the large number of contigs produced in sequence assembly projects is tedious and practically impossible. A forced global alignment between a contig and the reference is not only time-consuming but often meaningless.
Results: We have developed an algorithm that uses the coordinates of all the exact matches or high-similarity local alignments, clusters them with respect to the main diagonal in the dot plot using a weighted linear regression technique, and identifies the starting and ending coordinates of the region of interest.
Conclusion: This algorithm complements existing pairwise sequence alignment packages by replacing the time-consuming seed extension phase with a weighted linear regression for the alignment seeds. It was experimentally shown that the gain in execution time can be outstanding without compromising the accuracy. This method should be of great utility to sequence assembly and genome comparison projects.
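The idea summarized in the Results paragraph can be illustrated with a minimal Python sketch. This is not the authors' published algorithm, whose details are not given in this record. It assumes each exact match is a hypothetical `(ref_start, contig_start, length)` triple, that match length serves as the regression weight, that the cluster is seeded from the length-weighted median diagonal offset, and that a fixed tolerance `tol` (in bases) decides cluster membership. The function name `estimate_region` and all parameters are illustrative only.

```python
import numpy as np


def estimate_region(matches, tol=50.0):
    """Estimate where a contig lands on the reference from seed matches.

    matches: iterable of (ref_start, contig_start, length) triples (assumed format).
    tol: maximum distance in bases from the fitted line (assumed parameter).
    Returns (region_start, region_end, keep_mask) in reference coordinates.
    """
    m = np.asarray(matches, dtype=float)
    ref, contig, length = m[:, 0], m[:, 1], m[:, 2]

    # Seed the cluster with matches whose diagonal offset lies close to the
    # length-weighted median offset, so a far-away stray match cannot drag the fit.
    offset = ref - contig
    order = np.argsort(offset)
    cum = np.cumsum(length[order])
    median_offset = offset[order][np.searchsorted(cum, 0.5 * cum[-1])]
    seed = np.abs(offset - median_offset) <= tol

    # Weighted least squares on the seed matches. np.polyfit minimises
    # sum((w * residual)**2), so sqrt(length) weights each squared residual by length.
    # Assumes the seed contains at least two matches.
    slope, intercept = np.polyfit(ref[seed], contig[seed], 1, w=np.sqrt(length[seed]))

    # Keep every match that lies within tol of the fitted line.
    resid = np.abs(contig - (slope * ref + intercept))
    keep = resid <= tol

    region_start = int(ref[keep].min())
    region_end = int((ref[keep] + length[keep]).max())
    return region_start, region_end, keep


# Four matches near one diagonal plus one short stray match elsewhere.
example = [(1000, 0, 200), (1250, 250, 300), (1605, 600, 150),
           (1805, 800, 400), (5000, 300, 20)]
start, end, keep = estimate_region(example)
print(start, end, keep)
```

Weighting by match length lets long, reliable seeds dominate the fitted line, while the median-offset seeding is a design choice of this sketch to limit the leverage of isolated matches far from the main diagonal.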
Pages: 8
Related papers
50 in total
  • [21] AN EXACT SEQUENCE OF WEIGHTED NASH COMPLEXES
    Taalman, Laura
    ILLINOIS JOURNAL OF MATHEMATICS, 2008, 52 (02) : 591 - 610
  • [22] Kraken: ultrafast metagenomic sequence classification using exact alignments
    Wood, Derrick E.
    Salzberg, Steven L.
    GENOME BIOLOGY, 2014, 15 (03)
  • [23] Speeding up pairwise sequence alignments: A scoring scheme reweighting based approach
    Gao, Yong
    Henderson, Michael
    PROCEEDINGS OF THE 7TH IEEE INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING, VOLS I AND II, 2007, : 1194 - 1198
  • [24] New amino acid substitution matrix brings sequence alignments into agreement with structure matches
    Jia, Kejue
    Jernigan, Robert L.
    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2021, 89 (06) : 671 - 682
  • [25] Instance weighted linear regression
    Faculty of Mathematics, China University of Geosciences, Wuhan 430074, China
    J. Comput. Inf. Syst., 2008, 6 (2395-2402):
  • [26] A weighted linear quantile regression
    Huang, Mei Ling
    Xu, Xiaojian
    Tashnev, Dmitry
    JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2015, 85 (13) : 2596 - 2618
  • [27] WEIGHTED PAIRWISE GAUSSIAN LIKELIHOOD REGRESSION FOR DEPRESSION SCORE PREDICTION
    Cummins, Nicholas
    Epps, Julien
    Sethu, Vidhyasaharan
    Krajewski, Jarek
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4779 - 4783
  • [28] Lightweight comparison of RNAs based on exact sequence-structure matches
    Heyne, Steffen
    Will, Sebastian
    Beckstette, Michael
    Backofen, Rolf
    BIOINFORMATICS, 2009, 25 (16) : 2095 - 2102
  • [29] Exact Subspace Clustering in Linear Time
    Wang, Shusen
    Tu, Bojun
    Xu, Congfu
    Zhang, Zhihua
    PROCEEDINGS OF THE TWENTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2014, : 2113 - 2120
  • [30] A linear programming based algorithm for multiple sequence alignments
    Hunt, FY
    Kearsley, AJ
    O'Gallagher, A
    PROCEEDINGS OF THE 2003 IEEE BIOINFORMATICS CONFERENCE, 2003, : 532 - 533