Alignment

The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Bit score

An adjusted score assigned to an alignment that accounts for the type of scoring system used. It is calculated from the raw score by normalizing with the statistical variables that define a given scoring system. Therefore, bit scores from different alignments, even those employing different scoring matrices, can be compared.

The higher the score the better the alignment, although the significance of an alignment cannot be deduced from the score alone.
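As a minimal sketch of the normalization described above: under Karlin-Altschul statistics, the bit score S' is commonly computed from the raw score S as S' = (lambda * S - ln K) / ln 2, where lambda and K are the statistical parameters of the scoring system. The function name and the example parameter values below are illustrative assumptions; real values depend on the matrix and gap costs in use.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Assumed, illustrative parameters; substitute those of the actual scoring system.
example = bit_score(100, lam=0.267, k=0.041)
print(round(example, 1))
```

Because lambda and K absorb the scale of the scoring system, two alignments scored with different matrices can be compared on their bit scores.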


BLAST

Basic Local Alignment Search Tool. A sequence comparison algorithm, optimized for speed, that searches sequence databases for optimal local alignments to a query.

The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S." The "T" parameter dictates the speed and sensitivity of the search.
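The word-seeding step can be sketched as follows. This is a toy illustration, not the BLAST implementation: the function names, the three-letter alphabet, and the +4/-1 scoring are assumptions chosen to keep the example small. For each query offset it lists every word of length W that scores at least T against the query word.

```python
from itertools import product

def neighborhood_words(query, w, t, score, alphabet):
    """Map each query offset to the words of length w scoring >= t against it."""
    hits = {}
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        hits[i] = [
            "".join(cand)
            for cand in product(alphabet, repeat=w)
            if sum(score(a, b) for a, b in zip(q_word, cand)) >= t
        ]
    return hits

def toy_score(a, b):
    # Assumed toy scoring, not a real substitution matrix: +4 match, -1 mismatch.
    return 4 if a == b else -1

words = neighborhood_words("ACDA", w=3, t=7, score=toy_score, alphabet="ACD")
print(words[0])
```

Lowering T enlarges each neighborhood, which raises sensitivity at the cost of more word hits to extend; raising T does the opposite.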


BLOSUM

Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more than 62% identical are represented by a single sequence in the alignment so as to avoid overweighting closely related family members.

Conservation, conserved

Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physicochemical properties of the original residue.

EST

See Expressed Sequence Tags.

E value or Expect value

Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

The Expect threshold is the statistical significance threshold for reporting matches against database sequences. The default value of 10 means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990).

If the E value ascribed to a match is greater than the Expect threshold, the match will not be reported. Lower Expect thresholds are more stringent, leading to fewer chance matches being reported. Raising the threshold allows less stringent matches to be reported.
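As a rough sketch of how the E value and the reporting threshold interact: in the Karlin-Altschul model, the expected number of chance alignments with bit score at least S' is E = m * n * 2^(-S'), where m and n are the effective lengths of the query and the database. The function and parameter names below are assumptions made for illustration.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score: m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

def is_reported(e_value, threshold=10.0):
    """A match is reported only if its E value does not exceed the threshold."""
    return e_value <= threshold

# Hypothetical search space: a 250-residue query against 50 million residues.
e = expect_value(40.0, query_len=250, db_len=50_000_000)
print(e, is_reported(e))
```

Each additional bit of score halves the E value, and the same bit score is less significant in a larger database.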

FASTA

An input format characterized by a header line beginning with the "greater than" symbol (>). This format is automatically detected and accepted in a sequence query.
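A minimal sketch of reading FASTA-formatted text, assuming that each record is a ">" header line followed by one or more sequence lines. The function name and the example records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = ">seq1 hypothetical example\nMKTAYIAK\nQRQISFVK\n>seq2\nACGT\n"
records = parse_fasta(example)
print(records)
```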

Filtering (low complexity)

A technique for masking or removing segments of the query sequence that are repeated or have low compositional complexity, in order to improve the sensitivity of sequence similarity searches performed with that sequence. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST® output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

Filtering is applied to the query sequence (or its translation products) only, not to database sequences.
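The idea of masking low-complexity segments can be illustrated with a simple sliding-window entropy test. This is a simplified stand-in, not the SEG or DUST programs used with BLAST; the window size, entropy cutoff, and function names are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue inside a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)

print(mask_low_complexity("ACDEFGHIKLMNPPPPPPPPPPPPPPACDEFGHIKLMN"))
```

Only the query is altered, so masked positions simply cannot seed or extend matches against the database.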

Gap

A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.

Gap cost

There are two gap costs: the penalty to open a gap and the penalty to extend it. Because a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is frequently given more significance than its length. Hence the opening of a gap is penalized heavily, and a lesser penalty is assigned to each subsequent residue in the gap.

Increasing the gap costs results in alignments with fewer gaps.

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the (negative) scores for gaps.
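The raw score with open and extend penalties can be sketched as below. One convention is assumed here: the first position of a gap costs gap_open and each further position costs gap_extend. Some tools instead charge an existence cost plus an extension cost for every gapped position. The function names and the +1/-1 pair scoring are illustrative only.

```python
def raw_score(aligned_a, aligned_b, score, gap_open, gap_extend):
    """Sum pair scores and subtract affine gap penalties for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    prev_gap_in = None  # which row ("a" or "b") the previous column's gap was in
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            gap_in = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening a new one.
            total -= gap_extend if gap_in == prev_gap_in else gap_open
            prev_gap_in = gap_in
        else:
            total += score(x, y)
            prev_gap_in = None
    return total

def toy_score(x, y):
    # Assumed toy scoring: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1

print(raw_score("AC--GT", "ACTTGT", toy_score, gap_open=5, gap_extend=2))
```

With these penalties, one gap of length two costs 7, whereas two separate gaps of length one would cost 10, so alignments with fewer, longer gaps are favored.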

Global alignment

The alignment of two nucleic acid or protein sequences over their entire length.
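A compact sketch of scoring a global alignment by dynamic programming, in the style of Needleman and Wunsch. For brevity it assumes a single linear gap penalty and simple match/mismatch scores rather than a substitution matrix; all names and default values are illustrative.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of a against all of b (linear gap penalty)."""
    # prev[j] holds the best score for a[:i-1] versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "AGT"))
```

Every residue of both sequences must be accounted for, so unmatched ends are charged as gaps.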


GSS

See Genome Survey Sequences.

HSP

High-scoring segment pair. A local alignment with no gaps that achieves one of the top alignment scores in a given search.


HTGS

See High Throughput Genomic Sequences.

IUPAC-IUB

International Union of Pure and Applied Chemistry – International Union of Biochemistry

Local alignment

The alignment of some portion of two nucleotide or protein sequences.
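Local alignment scoring can be sketched with Smith-Waterman style dynamic programming, in which any cell may restart at zero so that only the best-matching portion contributes. The linear gap penalty, the match/mismatch values, and the function name are assumptions for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 option lets an alignment start fresh at any cell.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

Unlike a global alignment, poorly matching flanking regions are simply left out and incur no penalty.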

Low-complexity filter

See Filtering (low complexity).

Matrix

A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix," which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with.

MSP (Maximal-scoring segment pair)

Given two sequences and a scoring system, the highest scoring of all possible segment pairs that can be produced from the two sequences.
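A brute-force sketch of finding the MSP score: every ungapped segment pair lies on one diagonal of the comparison, so each diagonal can be scanned for its best-scoring run. The function names and the +1/-1 scoring are assumptions, and the sketch returns 0 when no segment pair scores positively.

```python
def msp_score(a, b, score):
    """Highest score of any ungapped segment pair between a and b."""
    best = 0
    # offset = j - i identifies one diagonal of the comparison.
    for offset in range(-(len(a) - 1), len(b)):
        i = max(0, -offset)
        j = i + offset
        running = 0
        while i < len(a) and j < len(b):
            # Drop the running segment whenever its score falls below zero.
            running = max(0, running + score(a[i], b[j]))
            best = max(best, running)
            i += 1
            j += 1
    return best

def toy_score(x, y):
    # Assumed toy scoring: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1

print(msp_score("TTTACGTTT", "GGGACGGGG", toy_score))
```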

PAM

Percent Accepted Mutation. A unit that quantifies the amount of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution that will change, on average, 1% of amino acids in a protein sequence.

A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced x amount of evolutionary divergence.

Raw score

The raw score "S" of the alignment is usually calculated by summing the scores for each letter-to-letter and letter-to-null position in the alignment. Scores for each position of an alignment are derived from a substitution matrix, such as the BLOSUM and PAM matrices.


CAS Registry Number assigned by Chemical Abstracts Service to a biosequence or other substance.

Short sequence

Depending on sequence composition, a short sequence is a sequence with fewer than 20 residues.


STS

See Sequence Tagged Sites.

Weight matrix

A weight matrix, or substitution matrix, assigns a score for aligning any possible pair of residues. Generally, different matrices are tailored to detect similarities among sequences that diverge by differing degrees, although a single matrix may be reasonably efficient over a relatively broad range of evolutionary change.

Word size

The length of the initial word match used by the BLAST® algorithm to nucleate regions of similarity. The default values are 11 bp for nucleotides and 3 residues for peptides.

