Applying Constraint Programming to sequence alignment/analysis


My Masters program is focused on formal methods such as SAT solving and constraint programming. I am interested in applying such techniques to problems in sequence alignment and sequence analysis, areas that have been dominated by statistical methods.

So can you give me examples of some research problems suitable to be approached as constraints satisfaction problems?

Note: By constraint programming I also include more flexible paradigms like weighted constraint programming, where a solution may violate some of the constraints.


First read how local and global alignment differ from each other. Then decide an objective function, i.e. how to measure the similarity between two sequences? Maybe use Hamming distance? Or Levenshtein distance? After this the optimization part probably comes out quite naturally.
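As a rough illustration of the two candidate objective functions named above, the following Python sketch computes both distances for plain strings. The function names are chosen here for illustration only.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]
```

Hamming distance only compares position by position, so it cannot model insertions or deletions; Levenshtein distance can, which is why it is the closer relative of alignment scoring.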

Be sure to check the existing software. The easy things have already been tried. (And by the way, for some reason biologists call n-grams k-mers.)
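For readers coming from the n-gram side, a minimal sketch of k-mer counting might look like this (the helper name `kmers` is an assumption made for this example):

```python
from collections import Counter


def kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    if k <= 0:
        raise ValueError("k must be positive")
    # range() is empty when the sequence is shorter than k.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, `kmers("ATATA", 2)` counts the 2-mers AT and TA twice each.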


SIAM Journal on Applied Mathematics

The study and comparison of sequences of characters from a finite alphabet is relevant to various areas of science, notably molecular biology. The measurement of sequence similarity involves the consideration of the different possible sequence alignments in order to find an optimal one for which the “distance” between sequences is minimum. By associating a path in a lattice to each alignment, a geometric insight can be brought into the problem of finding an optimal alignment. This problem can then be solved by applying a dynamic programming algorithm. However, the computational effort grows rapidly with the number N of sequences to be compared ($O(l^N)$, where l is the mean length of the sequences to be compared).

It is proved here that knowledge of the measure of an arbitrarily chosen alignment can be used in combination with information from the pairwise alignments to considerably restrict the size of the region of the lattice in consideration. This reduction implies fewer computations and less memory space needed to carry out the dynamic programming optimization process. The observations also suggest new variants of the multiple alignment problem.



Write a program to compute the optimal sequence alignment of two DNA strings. This program will introduce you to the field of computational biology in which computers are used to do research on biological systems. Further, you will be introduced to a powerful algorithmic design paradigm known as dynamic programming.

Biology review. A genetic sequence is a string formed from a four-letter alphabet of biological macromolecules referred to together as the DNA bases. A gene is a genetic sequence that contains the information needed to construct a protein. All of your genes taken together are referred to as the human genome, a blueprint for the parts needed to construct the proteins that form your cells. Each new cell produced by your body receives a copy of the genome. This copying process, as well as natural wear and tear, introduces a small number of changes into the sequences of many genes. Among the most common changes are the substitution of one base for another and the deletion of a substring of bases; such changes are generally referred to as point mutations. As a result of these point mutations, the same gene sequenced from closely related organisms will have slight differences.

The problem. Through your research you have found the following sequence of a gene in a previously unstudied organism.

What is the function of the protein that this gene encodes? You could begin a series of uninformed experiments in the lab to determine what role this gene plays. However, there is a good chance that it is a variant of a known gene in a previously studied organism. Since biologists and computer scientists have laboriously determined (and published) the genetic sequence of many organisms (including humans), you would like to leverage this information to your advantage. We'll compare the above genetic sequence with one which has already been sequenced and whose function is well understood.

Edit-distance. In this assignment we will measure the similarity of two genetic sequences by their edit distance, a concept first introduced in the context of coding theory, but which is now widely used in spell checking, speech recognition, plagiarism detection, file revisioning, and computational linguistics. We align the two sequences, but we are permitted to insert gaps in either sequence (e.g., to make them have the same length). We pay a penalty for each gap that we insert and also for each pair of characters that mismatch in the final alignment. Intuitively, these penalties model the relative likeliness of point mutations arising from deletion/insertion and substitution. We produce a numerical score according to the following table, which is widely used in biological applications:

operation                             cost
insert a gap                          2
align two characters that mismatch    1
align two characters that match       0

Here are two possible alignments of the strings x = "AACAGTTACC" and y = "TAAGGTCA":

The first alignment has a score of 8, while the second one has a score of 7. The edit-distance is the score of the best possible alignment between the two genetic sequences over all possible alignments. In this example, the second alignment is in fact optimal, so the edit-distance between the two strings is 7. Computing the edit-distance is a nontrivial computational problem because we must find the best alignment among exponentially many possibilities. For example, if both strings are 100 characters long, then there are more than 10^75 possible alignments.
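To get a feel for how quickly the number of alignments grows, here is a small Python sketch that counts them as lattice paths, where each step consumes a character of x, a character of y, or one of each. This counting convention is an assumption; the text does not say how it counts alignments, and other conventions give different totals.

```python
def count_alignments(m: int, n: int) -> int:
    """Count lattice paths from (0, 0) to (m, n) using right, down and diagonal steps."""
    # row[j] is the number of paths that reach column j of the current row.
    row = [1] * (n + 1)
    for _ in range(m):
        new = [1]
        for j in range(1, n + 1):
            # Arrive from above, from the left, or along the diagonal.
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[n]
```

Under this convention, two strings of length 100 give a count above 10^75, in line with the figure quoted above.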

We will explain a recursive solution which is an elegant approach. However it is far too inefficient because it recalculates each subproblem over and over. Once we have defined the recursive definition we can redefine the solution using a dynamic programming approach which calculates each subproblem once.

A recursive solution. We will calculate the edit-distance between the two original strings x and y by solving many edit-distance problems on smaller suffixes of the two strings. We use the notation x[i] to refer to character i of the string. We also use the notation x[i..M] to refer to the suffix of x consisting of the characters x[i], x[i+1], ..., x[M-1]. Finally, we use the notation opt[i][j] to denote the edit distance of x[i..M] and y[j..N]. For example, consider the two strings x = "AACAGTTACC" and y = "TAAGGTCA" of length M = 10 and N = 8, respectively. Then, x[2] is 'C', x[2..M] is "CAGTTACC", and y[8..N] is the empty string. The edit distance of x and y is opt[0][0].

Now we describe a recursive scheme for computing the edit distance of x[i..M] and y[j..N]. Consider the first pair of characters in an optimal alignment of x[i..M] with y[j..N]. There are three possibilities:
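The list of cases is not reproduced here, so the following Python sketch assumes the usual three: x[i] is aligned with y[j], x[i] is aligned with a gap, or y[j] is aligned with a gap. It uses the cost table above (gap 2, mismatch 1, match 0) and caches each subproblem so it is solved only once. The assignment itself asks for Java; Python is used here only for illustration.

```python
from functools import lru_cache

GAP = 2  # cost of inserting a gap, from the cost table


def penalty(a: str, b: str) -> int:
    """Cost of aligning two characters: 0 for a match, 1 for a mismatch."""
    return 0 if a == b else 1


def edit_distance(x: str, y: str) -> int:
    M, N = len(x), len(y)

    @lru_cache(maxsize=None)
    def opt(i: int, j: int) -> int:
        # One string is exhausted: the rest of the other is aligned with gaps.
        if i == M:
            return GAP * (N - j)
        if j == N:
            return GAP * (M - i)
        return min(opt(i + 1, j + 1) + penalty(x[i], y[j]),  # x[i] with y[j]
                   opt(i + 1, j) + GAP,                      # x[i] with a gap
                   opt(i, j + 1) + GAP)                      # y[j] with a gap

    return opt(0, 0)
```

Because the recursion depth grows with M + N, this form is only suitable for short strings; a bottom-up table avoids that limit.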

API specification. Your program EditDistance.java must be organized as a library of static methods with the following API:

public class EditDistance
--------------------------------------------------------------------------------
int penalty(char a, char b)     // return the penalty for aligning char a and char b
int min(int a, int b, int c)    // return the min of 3 integers
void main(String[] args)        // read 2 strings from standard input.
                                // compute and print the edit distance between them.
                                // output an optimal alignment and associated penalties.

Your program. Write a program EditDistance.java that reads, from standard input, two strings of characters. (Although, in the application described, the characters represent genetic sequences, your program should handle any sequence of alphanumeric characters.) Your program should then compute and print the edit distance between the two strings. Finally, it should recover the optimal alignment and print it out along with the individual penalties, using the following format:
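The sketch below shows one way the table could be filled bottom-up and the alignment recovered, written in Python rather than the Java the assignment requires. The required output format is not reproduced in this text, so the function simply returns the alignment as a list of (x character, y character, cost) columns. The name `align` is hypothetical.

```python
GAP = 2  # cost of inserting a gap


def penalty(a: str, b: str) -> int:
    return 0 if a == b else 1


def align(x: str, y: str):
    """Return (edit distance, list of (x_char, y_char, cost) columns)."""
    M, N = len(x), len(y)
    # opt[i][j] is the edit distance of the suffixes x[i..M] and y[j..N].
    opt = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(M - 1, -1, -1):
        opt[i][N] = GAP * (M - i)
    for j in range(N - 1, -1, -1):
        opt[M][j] = GAP * (N - j)
    for i in range(M - 1, -1, -1):
        for j in range(N - 1, -1, -1):
            opt[i][j] = min(opt[i + 1][j + 1] + penalty(x[i], y[j]),
                            opt[i + 1][j] + GAP,
                            opt[i][j + 1] + GAP)

    # Walk forward from opt[0][0], re-deriving which case produced each value.
    columns = []
    i = j = 0
    while i < M or j < N:
        if i < M and j < N and opt[i][j] == opt[i + 1][j + 1] + penalty(x[i], y[j]):
            columns.append((x[i], y[j], penalty(x[i], y[j])))
            i, j = i + 1, j + 1
        elif i < M and opt[i][j] == opt[i + 1][j] + GAP:
            columns.append((x[i], "-", GAP))
            i += 1
        else:
            columns.append(("-", y[j], GAP))
            j += 1
    return opt[0][0], columns
```

Several alignments can share the optimal score; this walk returns whichever one its tie-breaking order reaches first.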

Be sure to test thoroughly using the short test files and the longer actual data files. Also, make up a short test file of your own and describe it in your readme.txt file.

Analysis. After you have tested your program using not only the example provided above, but also the many short test data files in the sequence subdirectory, it is time to analyze its running time and memory usage. Using the genomic data sets referred to in the readme.txt file, use the doubling method to estimate the running time (in seconds) of your program as a function of the lengths of the two input strings M and N. For simplicity, assume M = N in your analysis. Also analyze the memory usage (in bytes). Be sure to enter these results in your readme and answer all the questions.
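A doubling experiment could be sketched as follows. This is a Python illustration on random strings, not the genomic data sets the assignment refers to, and `doubling_experiment` is a hypothetical helper.

```python
import random
import time

GAP = 2


def edit_distance(x: str, y: str) -> int:
    """Edit distance with gap cost 2 and mismatch cost 1, keeping two rows only."""
    prev = [GAP * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [GAP * i]
        for j in range(1, len(y) + 1):
            curr.append(min(prev[j - 1] + (x[i - 1] != y[j - 1]),
                            prev[j] + GAP,
                            curr[j - 1] + GAP))
        prev = curr
    return prev[-1]


def doubling_experiment(start: int = 250, rounds: int = 4, seed: int = 0):
    """Time edit_distance on random strings of length start, 2*start, 4*start, ..."""
    rng = random.Random(seed)
    results = []
    n = start
    for _ in range(rounds):
        x = "".join(rng.choice("ACGT") for _ in range(n))
        y = "".join(rng.choice("ACGT") for _ in range(n))
        t0 = time.perf_counter()
        edit_distance(x, y)
        results.append((n, time.perf_counter() - t0))
        n *= 2
    return results
```

If the time roughly quadruples each time the length doubles, that is consistent with running time proportional to M × N.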

See the checklist for information about giving Java more memory and running timing tests.

Submission. One partner should submit the files EditDistance.java and readme.txt (including the analysis and test data you created). If you are partnering, the second partner should only submit this abbreviated partner readme.txt. One application and sample data set is for spell checking. Is the extra credit interesting enough?

Extra credit. Among the most powerful tools available today are the databases that allow a user to submit a genetic sequence and query for a similar sequence found in another organism's genome. The National Center for Biotechnology Information hosts many powerful examples of such databases and alignment software. For extra credit, use the alignment code you have written above to implement a program that takes as input one source string, followed by a list of target strings (one per line), and outputs the target string(s) that are most similar to the source string.

This assignment was created by Thomas Clarke, Robert Sedgewick, Scott Vafai and Kevin Wayne. Copyright © 2002.


Sequence Alignment and Dynamic Programming

Sequence alignment is a standard method to compare two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences [1]. Also, it is a way of arranging two or more sequences of characters to recognize regions of similarity [2].

Importance of sequence alignment

Sequence alignment is significant because in biomolecular sequences (DNA, RNA, or protein), high sequence similarity usually implies important functional or structural similarity, and establishing it is the first step of many biological analyses [3]. In addition, sequence alignment can address significant questions such as detecting gene sequences that cause disease or susceptibility to disease, identifying changes in gene sequences that cause evolution, finding relationships between gene sequences that can indicate common ancestry [4], detecting functionally important sites, and demonstrating mutation events [5].

Analysis of the alignment can reveal important information. If the proteins are involved in similar processes, it is possible to identify the parts of the sequences that are likely to be important for function. Random mutations can accumulate more easily in parts of a protein sequence that are not essential for its function. In the parts of the sequence that are essential for function, hardly any mutations will be accepted, because nearly all changes in such regions will destroy the function [6]. Moreover, sequence alignment is important for assigning function to unknown proteins [7]. The alignment of two residues implies that those residues perform similar roles in the two different proteins [8].

The main purpose of sequence alignment methods is to find the maximum degree of similarity and the minimum evolutionary distance. Generally, computational approaches to sequence alignment problems can be divided into two categories: global alignments and local alignments. Global alignments traverse the entire length of all query sequences and match as many characters as possible from end to end. These methods are most useful when the sequences have approximately the same size or are similar. The alignment is performed from the beginning of the sequences to the end to find the best possible alignment. Local alignments, on the other hand, find local regions with a high level of similarity. They are more useful for sequences that are suspected to contain regions of similarity within their larger sequence context. [9]

Besides, pairwise sequence alignment is used to find the regions of similarity between two sequences. As the number of sequences increases, comparing each and every sequence to every other may be impossible. So, we need multiple sequence alignment, where all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other, so that a coordinate system is set up, where each row is the sequence for one protein, and each column is the same position in each sequence. [10]

There are many different approaches and implementations of the methods to perform sequence alignment. These include techniques such as dynamic programming, heuristic algorithms (BLAST and FASTA similarity searching), probabilistic methods, dot-matrix methods, progressive methods, ClustalW, MUSCLE, T-Coffee, and DIALIGN.

Dynamic programming (DP) is a problem-solving method for a class of problems that can be solved by dividing them into simpler sub-problems. It finds the alignment by assigning scores to matches and mismatches (scoring matrices). This method is widely used in sequence alignment problems. [11] However, when there are more than two sequences, multi-dimensional dynamic programming is infeasible because of its large storage and computational requirements. [16]

Dynamic programming algorithms use gap penalties to make alignments more biologically meaningful [9]. There are different gap penalties, such as linear gap, constant gap, gap open, and gap extension. The gap score is a penalty given to an alignment when there is an insertion or deletion. During evolution, a single event may produce a run of consecutive gaps in a sequence, in which case the linear gap penalty is not suitable for the alignment. Therefore, the gap opening penalty and the gap extension penalty have been introduced for runs of consecutive gaps. The gap opening penalty is applied at the start of the gap, and each further gap position following it is given a gap extension penalty, which is smaller than the opening penalty. Different gap penalty functions require different dynamic programming algorithms [12]. There is also a substitution matrix to score alignments. The most commonly used predefined scoring matrices for sequence alignment are PAM (Point Accepted Mutation) and BLOSUM (Blocks Substitution Matrix).
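To make the difference between the penalty schemes concrete, here is a small Python sketch. The default values are placeholders, and the affine convention used (the opening penalty covers the first gap position, each further position costs the extension penalty) is one of several in use.

```python
def linear_gap(length: int, per_position: int = 2) -> int:
    """Linear penalty: every gap position costs the same."""
    return per_position * length


def affine_gap(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Affine penalty: gap_open for the first position, gap_extend for each further one."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)
```

With these placeholder values, one gap of length 5 costs 14 under the affine scheme, while five separate gaps of length 1 cost 50, so the scheme favors a few long gaps over many short ones.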

The two algorithms, Smith-Waterman for local alignment and Needleman-Wunsch for global alignment, are based on dynamic programming.

The Needleman-Wunsch algorithm requires the alignment score for a pair of residues to be greater than or equal to zero. No gap penalty is required, and the score cannot decrease between two cells of a pathway. Smith-Waterman requires a gap penalty to work efficiently. The residue alignment score may be positive or negative, and the score can increase, decrease, or stay level between two cells of a pathway [13].

Sequence Alignment Problems

For an n-character sequence s and an m-character sequence t, we construct an (n+1)×(m+1) matrix F.

Global alignment: F ( i, j ) = score of the best alignment of s[1…i ] with t[1…j]

Local alignment: F ( i, j ) = score of the best alignment of a suffix of s[1…i ] and a suffix of t[1…j]

There are three steps in the sequence alignment algorithms:

Initialization. In the initialization phase, we assign values to the first row and column of the alignment matrix. The next step of the algorithm depends on these values.

Fill. In the fill stage, the entire matrix is filled with scores from top to bottom and left to right, with values that depend on the gap penalties and the scoring matrix.

Traceback. For each F(i, j), we save a pointer to the cell that produced the best score. For global alignment, we trace pointers back from F(n, m) to F(0, 0) to recover the alignment. For local alignment, we look for the maximum value of F(i, j), which can be anywhere in the matrix. We trace pointers back from that cell and stop when we reach a cell with value 0.
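The three steps for the global case can be sketched in Python as follows. The scoring values (match +1, mismatch -1, gap -2) are placeholders chosen for this example rather than values from the text, and a real application would use a substitution matrix instead.

```python
def needleman_wunsch(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Initialization: the first row and column are all-gap prefixes.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = j * gap, "left"

    # Fill: top to bottom, left to right, remembering the best predecessor.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            choices = [(F[i - 1][j - 1] + sub, "diag"),
                       (F[i - 1][j] + gap, "up"),
                       (F[i][j - 1] + gap, "left")]
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])

    # Traceback: follow the pointers from F[n][m] back to F[0][0].
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))
```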

Local alignment with scoring matrix

After creating and initializing the alignment matrix ( F ) and trace back matrix, the score of F (i, j) for every cell is calculated as follows:

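diagonal_score = F[i-1][j-1] + PAM250(s[i], t[j])

left_score = F[i][j-1] - gap_penalty

up_score = F[i-1][j] - gap_penalty

F[i][j] = max(0, left_score, diagonal_score, up_score)

Here gap_penalty stands for the cost of a single gap; the lines for left_score and up_score are reconstructed from the names used in the max expression.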

Also, we should keep the reference to each cell to perform backtracking.

After filling the F matrix, we find the optimal alignment score and the optimal end point by finding the highest-scoring cell, max_{i,j} F(i, j). best_score has a default value of -1, and whenever a cell has F[i][j] > best_score we update:

best_score = F[i][j]

i_maximum_score, j_maximum_score = i, j

To recover the optimal alignment, we trace back from i_maximum_score, j_maximum_score position , terminating the trace back when we reach a cell with score 0 .

The time and space complexity of this algorithm is O(mn), where n is the length of sequence s and m is the length of sequence t.
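Putting the fill, maximum search, and traceback together, a minimal Python sketch of local alignment could look like this. It swaps the PAM250 lookup for a simple match/mismatch score so that it stays self-contained; the parameter values and the name `smith_waterman` are assumptions made for the example.

```python
def smith_waterman(s: str, t: str, match: int = 2, mismatch: int = -1, gap_penalty: int = 2):
    """Local alignment; returns (best_score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best_score, i_max, j_max = -1, 0, 0

    def score(a: str, b: str) -> int:
        # Stand-in for a substitution matrix lookup such as PAM250.
        return match if a == b else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal_score = F[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            left_score = F[i][j - 1] - gap_penalty
            up_score = F[i - 1][j] - gap_penalty
            F[i][j] = max(0, left_score, diagonal_score, up_score)
            if F[i][j] > best_score:
                best_score, i_max, j_max = F[i][j], i, j

    if best_score < 0:  # at least one input was empty
        return 0, "", ""

    # Trace back from the best cell until a cell with score 0 is reached.
    a, b = [], []
    i, j = i_max, j_max
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - gap_penalty:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best_score, "".join(reversed(a)), "".join(reversed(b))
```

Instead of storing a separate pointer matrix, this version re-derives the move at each cell during traceback, which saves memory but still needs the full F matrix.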

Local alignment with affine gap penalty

For this problem, there is a gap opening penalty and a gap extension penalty. The gap opening penalty is applied at the start of the gap, and each further gap position following it is given the gap extension penalty.

There are four different matrices: up_score, left_score, m_score, and trace_back.
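A score-only sketch of the affine-gap recurrences is given below in Python. It keeps the three score matrices named above and omits trace_back for brevity. The scoring values are placeholders, a match/mismatch score stands in for a substitution matrix, and the recurrences follow one common formulation in which a gap can only be opened from m_score.

```python
def local_affine_score(s: str, t: str, match: int = 2, mismatch: int = -3,
                       gap_open: int = 3, gap_extend: int = 1):
    """Best local alignment score with affine gaps; returns (score, (i, j) end cell)."""
    n, m = len(s), len(t)
    NEG = float("-inf")
    # m_score: best alignment ending with s[i] aligned to t[j] (never below 0).
    # up_score: best alignment ending with s[i] aligned to a gap.
    # left_score: best alignment ending with t[j] aligned to a gap.
    m_score = [[0] * (m + 1) for _ in range(n + 1)]
    up_score = [[NEG] * (m + 1) for _ in range(n + 1)]
    left_score = [[NEG] * (m + 1) for _ in range(n + 1)]
    best_score, best_cell = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap after a match state or extend an existing gap.
            up_score[i][j] = max(m_score[i - 1][j] - gap_open,
                                 up_score[i - 1][j] - gap_extend)
            left_score[i][j] = max(m_score[i][j - 1] - gap_open,
                                   left_score[i][j - 1] - gap_extend)
            sub = match if s[i - 1] == t[j - 1] else mismatch
            prev = max(m_score[i - 1][j - 1],
                       up_score[i - 1][j - 1],
                       left_score[i - 1][j - 1])
            m_score[i][j] = max(0, prev + sub)
            if m_score[i][j] > best_score:
                best_score, best_cell = m_score[i][j], (i, j)
    return best_score, best_cell
```

With these placeholder values, aligning "AAACCCGGG" against "AAAGGG" bridges the CCC run with one gap of length 3 (cost 3 + 1 + 1 = 5), for a score of 12 - 5 = 7, which beats the best gap-free local match of 6.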



A compilation of data from the NIAID Influenza Genome Sequencing Project and GenBank. It provides tools for flu sequence analysis, annotation and submission to GenBank. This resource also has links to other flu sequence resources, and publications and general information about flu viruses.

Downloads

BLAST executables for local use are provided for Solaris, LINUX, Windows, and MacOSX systems. See the README file in the ftp directory for more information. Pre-formatted databases for BLAST nucleotide, protein, and translated searches also are available for downloading under the db subdirectory.

Sequence databases for use with the stand-alone BLAST programs. The files in this directory are pre-formatted databases that are ready to use with BLAST.

Sequence databases in FASTA format for use with the stand-alone BLAST programs. These databases must be formatted using formatdb before they can be used with BLAST.

This site contains the UniVec and UniVec_Core databases in FASTA format. See the README.uv file for details.

Tools

Performs a BLAST search for similar sequences from selected complete eukaryotic and prokaryotic genomes.

Performs a BLAST search of the genomic sequences in the RefSeqGene/LRG set. The default display provides ready navigation to review alignments in the Graphics display.

Finds regions of local similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

COBALT is a protein multiple sequence alignment tool that finds a collection of pairwise constraints derived from conserved domain database, protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

Identifies the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (Reverse Position-Specific BLAST) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD).

Tool for aligning a query sequence (nucleotide or protein) to GenBank sequences included on microarray or SAGE platforms in the GEO database.

This tool compares nucleotide or protein sequences to genomic sequence databases and calculates the statistical significance of matches using the Basic Local Alignment Search Tool (BLAST) algorithm.

A genome browser for interactive navigation of eukaryotic RefSeq genome assemblies with comprehensive inspection of gene, expression, variation and other annotations. GDV offers easy-to-load analytical track pre-configurations, a menu of data tracks for easy display and customization, and supports upload and analysis of user data. This browser also enables the production of displays for publishing.

NCBI's Remap tool allows users to project annotation data and convert locations of features from one genomic assembly to another or to RefSeqGene sequences through a base by base analysis. Options are provided to adjust the stringency of remapping, and summary results are displayed on the web page. Full results can be downloaded for viewing in NCBI's Genome Workbench graphical viewer, and annotation data for the remapped features, as well as summary data, is also available for download.

An integrated application for viewing and analyzing sequence data. With Genome Workbench, you can view data in publicly available sequence databases at NCBI, and mix these data with your own data.

An interactive web application that enables users to visualize multiple alignments created by database search results or other software applications. The MSA Viewer allows users to upload an alignment and set a master sequence, and to explore the data using features such as zooming and changing of coloration.

A graphical analysis tool that finds all open reading frames in a user's sequence or in a sequence already in the database. Sixteen different genetic codes can be used. The deduced amino acid sequence can be saved in various formats and searched against protein databases using BLAST.

The Primer-BLAST tool uses Primer3 to design PCR primers to a sequence template. The potential products are then automatically analyzed with a BLAST search against user specified databases, to check the specificity to the target intended.

A utility for computing alignments of proteins to genomic nucleotide sequences. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, ProSplign is accurate in determining splice sites and tolerant to sequencing errors.

Provides a configurable graphical display of a nucleotide or protein sequence and features that have been annotated on that sequence. In addition to use on NCBI sequence database pages, this viewer is available as an embeddable webpage component. Detailed documentation including an API Reference guide is available for developers wishing to embed the viewer in their own pages.

A utility for computing cDNA-to-Genomic sequence alignments. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, Splign is accurate in determining splice sites and tolerant to sequencing errors.

A tool for creating and displaying phylogenetic tree data. Tree Viewer enables analysis of your own sequence data, produces printable vector images as PDFs, and can be embedded in a webpage.

A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. VecScreen searches a query sequence for segments that match any sequence in a specialized non-redundant vector database (UniVec).


How to compute multiple sequence alignment for text strings

I'm writing a program which has to compute a multiple sequence alignment of a set of strings. I was thinking of doing this in Python, but I could use an external piece of software or another language if that's more practical. The data is not particularly big, I do not have strong performance requirements, and I can tolerate approximations (i.e., I just need to find a good enough alignment). The only problem is that the strings are regular strings (i.e., UTF-8 strings, potentially with newlines that should be treated as a regular character); they aren't DNA sequences or protein sequences.

I can find tons of tools and information for the usual cases in bioinformatics, with specific complicated file formats and a host of features I don't need, but it is unexpectedly hard to find software, libraries, or example code for the simple case of strings. I could probably reimplement any one of the many algorithms for this problem or encode my strings as DNA, but there must be a better way. Do you know of any solutions?


Benchmarking multiple aligners accuracy

Quantifying the accuracy of multiple aligners is just as critical as aligning sequences, especially when considering the aligners' approximate nature. This seemingly obvious aspect has been generally overlooked by the community, as reflected by the relative lack of correlation between the packages' overall usage and their reported accuracy. ClustalW, for instance, whose 42 000 citations suggest a global usage level higher than all other packages put together, has not been consistently reported as the most accurate method. This surprising observation probably reflects a combination of factors. The most obvious is the relationship between benchmark rankings and day-to-day usability. It is likely that ClustalW, even though it does not rank #1 on all benchmarks, is nonetheless sufficiently accurate for many modeling activities, especially when dealing with orthologous data sets. One may also speculate on the existence of a strong methodological inertia within the biological community, where tool usage tends to snowball through protocol recycling.

The most critical component of an MSA is its scoring/objective function, the mathematical formula that quantifies the total score and therefore defines optimality, given a set of sequences. The rest of the algorithm is an optimization procedure attempting to generate an MSA model that maximizes the objective function. It is well established that even the best objective functions are merely approximations trying to model the behavior of biological sequences [107]. As a consequence, there is no guarantee that a perfectly optimized MSA will systematically result in the most biologically meaningful MSA. This is the reason why multiple aligners also need to be evaluated/benchmarked for their capacity to produce correct alignments. A benchmarking procedure relies on existing collections of reference alignments considered as gold standards. These reference MSAs are routinely used as predictors for the accuracy of a given aligner on a given type of data sets and have had a major influence on methodological developments. Existing protein benchmark collections were recently extensively and critically reviewed in [108] and [109], where the authors propose to group benchmarks in four categories: simulation based, consistency based, structure based and phylogeny based. The latter three categories meet the criterion of reference data sets, in that they can be pre-compiled and used to quantify the relative merits of one aligner over another. The simulation-based benchmarks, however, define an objective function rather than a benchmark procedure and cannot be considered a benchmark measure in the same sense as the others.


Genetics and string algorithms

Strands of genetic material — DNA and RNA — are sequences of small units called nucleotides. For purposes of answering some important research questions, genetic strings are equivalent to computer science strings — that is, they can be thought of as simply sequences of characters, ignoring their physical and chemical properties. (Although, strictly speaking, their chemical properties are usually coded as parameters to the string algorithms you’ll be looking at in this article.)

This article’s examples use DNA, which consists of two strands of adenine (A), cytosine (C), thymine (T), and guanine (G) nucleotides. DNA’s two strands are reverse complements of each other. A and T are complementary bases, and C and G are complementary bases. This means that A s in one strand are paired with T s in the other strand (and vice versa), and C s in one strand are paired with G s in the other strand (and vice versa). So, if you know the sequence of one strand’s A s, C s, T s, and G s, you can derive the other strand’s sequence. Hence, you can think of a DNA strand simply as a string of the letters A, C, G, and T.
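The pairing rule above translates directly into code. The following Python sketch derives the other strand from a given one; the name `reverse_complement` is chosen here for illustration.

```python
# Translation table for the base pairing rule: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Complement every base, then reverse the result to read the other strand."""
    return strand.translate(COMPLEMENT)[::-1]
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.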


Where else can alignment-free sequence comparison methods be applied?

Progress over the past two decades has led alignment-free research from bioinformatics “curiosities” to a broadening range of successful applications that accompany mainstream biology [37].

Distantly related, remote sequences that have evolved beyond recognizable similarity are one of the most classic applications of alignment-free methods. For example, alignment-free approaches were successfully employed in functional annotation of unknown G-protein-coupled receptor (integral cell membrane proteins that play a key role in transducing extracellular signals and have great relevance for pharmacology) sequences that could not be assigned to any previously known receptor family [98]. Another rising trend for the use of word-based alignment-free methods is the detection of functional and/or evolutionary similarities among regulatory sequences (e.g., promoters, enhancers, and silencers) to estimate their in vivo activities in different organisms (flies and mammals, including humans) [99,100,101,102,103].

Sequence rearrangements are particularly well handled by alignment-free sequence analyses. Recent studies described the mosaic structure of viral and bacterial genomes (e.g., by characterizing the recombination break points in HIV-1 strain and Escherichia coli genomes). This analysis provides new evidence for the long-held suspicion that animal E. coli pathogens can also infect humans [104]. Another study [105] discovered a clear signal for a pair of E. coli genomes that had undergone an engineered 125-kb horizontal gene transfer 20 years ago. Alignment-free measures were also applied to detect domain shuffling signatures in proteins [106] and to identify the members of complex multidomain proteins, such as kinases [107].

Horizontal gene transfer strongly complicates the task of reconstructing the evolutionary history of genes and species, and alignment-free methods have also proved to be helpful in this field. For example, in a comprehensive study of bacterial genomes, the authors used oligonucleotides as genomic signatures and showed that horizontal gene transfers accounted for 6% of the genomes on average [108]. Furthermore, the statistical relationships between genomic signatures among several thousand species provided information about possible donor taxa for the identified foreign sequences. In other studies [109, 110], alignment-free approaches were applied to the genomes of the human pathogen Staphylococcus aureus and recovered regions of lateral origin that corresponded to genes involved in transport, antibiotic resistance, pathogenicity, and virulence.

Whole-genome phylogeny [111] is another area where alignment-free methods play an increasing role. Many studies [34, 112,113,114,115,116,117,118] addressed the phylogenetic reconstruction of prokaryotes, such as the whole-genome phylogeny of E. coli O104:H4, which was the strain that caused the 2011 outbreak in Germany. The analysis revealed a direct line of ancestry leading from a putative typical enteroaggregative E. coli ancestor through the 2001 strain to the 2011 outbreak strain [113]. The alignment-free based phylogeny of almost a hundred Zika virus strains suggested that this mosquito-borne flavivirus originated from Africa and then spread to Asia, the Pacific islands, and throughout the Americas [119]. Alignment-free methods have recently been applied to infer phylogenetic relationships among eukaryotic species (fungi [120], plants [121], and mammals [35]); the resulting trees were extremely similar to the species trees created by the manually curated NCBI taxonomic database, which reflects the current taxonomic consensus in the literature.

Sequence classification is another field that might benefit from bringing together different alignment-free approaches, such as grouping expressed sequence tags that originate from the same locus or gene family [122], clustering expressed sequence tag sequences with full-length cDNA data [123], and aggregating gene and protein sequences into functional families [124,125,126]. Alignment-free methods are also used to recognize and classify antigens that are encoded in a sequence in a subtle and recondite manner that is not identifiable by sequence alignment. A recent approach [127, 128] based on the statistical transformation of protein sequences into uniform vectors with various amino acid properties showed an impressive prediction accuracy of up to 89% in discriminating positive and negative sets of bacterial, viral, and tumor antigen datasets. Another common use of alignment-free methods is the classification of species based on short DNA sequence fragments that can act as true taxon barcodes [129,130,131,132,133].

The available alignment-free software for general sequence comparison is listed in Table 2. For convenience, we categorized the listed programs into basic research tasks, such as small-scale pairwise/multiple sequence comparisons, whole-genome phylogeny (from viral to mammalian scale), BLAST-like sequence similarity search, identification of horizontally transferred genes and recombination events, as well as annotation of long non-coding RNAs and regulatory elements.


Constrained Multiple Sequence Alignment Tool Development and Its Application to RNase Family Alignment

In this paper, we design a heuristic algorithm for computing a constrained multiple sequence alignment (CMSA for short) that guarantees that the generated alignment satisfies the user-specified constraints that some particular residues should be aligned together. If the number of residues needed to be aligned together is a constant α, then the time-complexity of our CMSA algorithm for aligning K sequences is O(αKn^4), where n is the maximum of the lengths of sequences. In addition, we have built such a CMSA software system and carried out several experiments on the RNase sequences, which mainly function in catalyzing the degradation of RNA molecules. The resulting alignments illustrate the practicability of our method.

A preliminary version of this paper appears in the Proceedings of the First IEEE Computer Society Bioinformatics Conference (CSB 2002). This research was supported partly by VTY89-P4-33, NSC89-2213-E-259-010, NSC91-2321-B-007-002, NSC91-3112-B-007-004 and MOE Program for Promoting Academic Excellence of Universities under the grant number 89-B-FA04-1-4.