fulltext.study @t Gmail

Efficient tools for comparative substring analysis

Paper ID Volume ID Publish Year Pages File Format Full-Text
24216 43505 2010 7 PDF Available
Title
Efficient tools for comparative substring analysis
Abstract

This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve allk-mers for anyk up to any preset m aximum length K (including K = ∞). Remarkably, although there may be Θ(L2) distinct strings that occur in a given sequence of length L (and Θ(KL) of length k ≤ K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K).A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2 s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use. This paper •describes the details of this implementation—an implementation that allows the user to compute composition-based distances for a wide range of instances on data sets of unprecedented size which may be useful in assessing the validity of the approach and to fine-tune the identification of those values of k (or K) yielding the best separators and descriptors in correspondence with different inputs,•indicates how the proposed algorithm can also be used for other tasks related to the identification and comparative analysis of highly over- or under-represented (sub)strings in given genomes, meta-genomes, or any other sequence families of interest (e.g., all proteins encoded by a given genome, all strings of non-coding or regulatory RNA, all introns, etc.),•and thus conforms with the increasing need for radically new, fast, and massive techniques for comparative genome analysis.

Keywords
Suffix tree; Phylogeny; Maximal words
First Page Preview
Efficient tools for comparative substring analysis
Get Full-Text Now
Don't Miss Today's Special Offer
Price was $35.95
You save - $31
Price after discount Only $4.95
100% Money Back Guarantee
Full-text PDF Download
Online Support
Any Questions? feel free to contact us
Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Biotechnology - Volume 149, Issue 3, 1 September 2010, Pages 120–126
Authors
, , ,
Subjects
Physical Sciences and Engineering Chemical Engineering Bioengineering
Get Full-Text Now
Don't Miss Today's Special Offer
Price was $35.95
You save - $31
Price after discount Only $4.95
100% Money Back Guarantee
Full-text PDF Download
Online Support
Any Questions? feel free to contact us