Pairwise Sequence Alignment

The Problem

A Warm-Up: Edit Distance

$$d(i,j) = \min!\begin{cases} d(i-1,\,j-1) & \text{if } a_i = b_j \quad\text{(characters match, no cost)} \ d(i-1,\,j-1) + 1 & \text{if } a_i \neq b_j \quad\text{(substitute)} \ d(i-1,\,j) + 1 & \text{(delete from } a \text{)} \ d(i,\,j-1) + 1 & \text{(insert into } a \text{)} \end{cases}$$

Why is $d(i,0) = i$ in the edit distance recurrence, rather than $d(i,0) = 0$?

Because the first $i$ characters of $a$ have already been aligned perfectly
Wrong: $d(i,0)$ represents aligning $a[1..i]$ against an empty string, which requires $i$ deletions, not zero.
Because you need $i$ deletions to transform $a[1..i]$ into the empty string
Correct: each of the $i$ characters in the prefix must be removed, costing one unit per character.
Because each character in $a$ must be substituted for a gap
Wrong: substitution replaces one character with another; removing a character is a deletion, not a substitution.
Because the minimum edit distance is always equal to the longer sequence's length
Wrong: if both sequences are identical the edit distance is zero, regardless of length.

Scoring an Alignment

MATCH_SCORE = 2      # reward for aligning identical characters
MISMATCH_PENALTY = -1  # penalty for aligning different characters
GAP_PENALTY = -1     # penalty per gap character inserted in either sequence

# Example sequences for the lesson.
SEQ_A = "ACGT"
SEQ_B = "AGT"

The Needleman-Wunsch Algorithm

$$H(i,j) = \max!\begin{cases} H(i-1,\,j-1) + s(a_i,\,b_j) & \text{(match or mismatch)} \ H(i-1,\,j) + g & \text{(gap in } b \text{)} \ H(i,\,j-1) + g & \text{(gap in } a \text{)} \end{cases}$$

where $s(a_i, b_j)$ is MATCH_SCORE or MISMATCH_PENALTY and $g$ is GAP_PENALTY.

def _nt_score(a, b):
    """Return match score or mismatch penalty for two nucleotide characters."""
    return MATCH_SCORE if a == b else MISMATCH_PENALTY
def fill_matrix(seq_a, seq_b):
    """Return the Needleman-Wunsch scoring matrix for two sequences.

    The matrix has shape (len(seq_a)+1, len(seq_b)+1).  Entry H[i, j] holds
    the score of the best global alignment of seq_a[:i] with seq_b[:j].
    The first row and column are initialised to cumulative gap penalties to
    represent aligning a non-empty prefix with an empty string.
    """
    m, n = len(seq_a), len(seq_b)
    H = np.zeros((m + 1, n + 1), dtype=int)
    for i in range(m + 1):
        H[i, 0] = i * GAP_PENALTY
    for j in range(n + 1):
        H[0, j] = j * GAP_PENALTY
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = H[i - 1, j - 1] + _nt_score(seq_a[i - 1], seq_b[j - 1])
            delete = H[i - 1, j] + GAP_PENALTY  # gap in seq_b
            insert = H[i, j - 1] + GAP_PENALTY  # gap in seq_a
            H[i, j] = max(diag, delete, insert)
    return H

Needleman-Wunsch does NOT include a max(0, ...) floor in its recurrence. What is the consequence?

Negative cells are discarded, making the algorithm run faster
Wrong: there is no discarding; every cell is filled and may hold a negative value.
The algorithm can produce alignments whose total score is negative
Correct: a global alignment must span both sequences end to end, so if the sequences share few matches the total score can be negative.
The alignment always starts and ends at the same position in both sequences
Wrong: this describes a property of global alignment (both sequences are fully consumed) but is not a direct consequence of removing the zero floor.
Mismatches are penalised more heavily than gaps
Wrong: the relative penalties are set by MISMATCH_PENALTY and GAP_PENALTY, not by the presence or absence of a zero floor.

An Example DP Table

Aligning SEQ_A = "ACGT" with SEQ_B = "AGT" using MATCH_SCORE = 2, MISMATCH_PENALTY = -1, GAP_PENALTY = -1:

A G T
0 -1 -2 -3
A -1 2 1 0
C -2 1 1 0
G -3 0 3 2
T -4 -1 2 5

Using the table above, what is the value of $H[2,2]$ (row C, column G)?

Traceback

def traceback(H, seq_a, seq_b):
    """Trace back from the bottom-right cell to recover the global alignment.

    Returns (aligned_a, aligned_b, score) where '-' marks a gap.
    Global alignment always ends at H[m, n] and traces back to H[0, 0].
    When the diagonal, delete, and insert moves give equal scores, the
    diagonal (match/mismatch) move is preferred to produce a unique result.
    """
    i, j = len(seq_a), len(seq_b)
    score = int(H[i, j])
    aligned_a, aligned_b = [], []
    while i > 0 or j > 0:
        if (
            i > 0
            and j > 0
            and H[i, j] == H[i - 1, j - 1] + _nt_score(seq_a[i - 1], seq_b[j - 1])
        ):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and H[i, j] == H[i - 1, j] + GAP_PENALTY:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b)), score
def align(seq_a, seq_b):
    """Return the best global alignment of seq_a and seq_b as (aligned_a, aligned_b, score)."""
    H = fill_matrix(seq_a, seq_b)
    return traceback(H, seq_a, seq_b)

Put these Needleman-Wunsch steps in the correct order.

  1. Initialise the first row and column of an (m+1) x (n+1) matrix with cumulative gap penalties
  2. Fill the remaining cells left-to-right and top-to-bottom using the recurrence
  3. Read off the alignment score from the bottom-right cell H[m, n]
  4. Follow the traceback path from H[m, n] back to H[0, 0]
  5. Reverse the collected characters to produce the aligned sequences

Label each traceback move with the alignment operation it represents.

Traceback Move Alignment Operation
Diagonal move (up-left) Match or mismatch: one character from each sequence is consumed
Upward move Gap inserted in sequence B: a character of A has no counterpart in B
Leftward move Gap inserted in sequence A: a character of B has no counterpart in A
Reached H[0, 0] End of the global alignment: both sequences are fully consumed

Displaying the Alignment

def format_alignment(aligned_a, aligned_b):
    """Return a three-line string representation of an alignment.

    The middle row marks matches with '|' and mismatches or gaps with ' '.
    """
    middle = "".join("|" if a == b else " " for a, b in zip(aligned_a, aligned_b))
    return f"a: {aligned_a}\n   {middle}\nb: {aligned_b}"

Running the algorithm on SEQ_A = "ACGT" and SEQ_B = "AGT" produces:

a: ACGT
   | ||
b: A-GT

The gap in seq_b at position 2 accounts for the C in seq_a that has no counterpart. The score is $3 \times 2 + (-1) = 5$: three matches at MATCH_SCORE = 2 and one gap at GAP_PENALTY = -1.

Aligning "AGTC" with "AGC" using MATCH_SCORE = 2, MISMATCH_PENALTY = -1, GAP_PENALTY = -1: the best global alignment has three matches and one gap. What is the total alignment score?

Visualizing the Scoring Matrix

def plot_matrix(H, seq_a, seq_b):
    """Return an Altair heatmap of the scoring matrix."""
    rows = [
        {
            "i": i,
            "j": j,
            "score": int(H[i, j]),
            "row_label": seq_a[i - 1] if i > 0 else "",
            "col_label": seq_b[j - 1] if j > 0 else "",
        }
        for i in range(H.shape[0])
        for j in range(H.shape[1])
    ]
    df = pl.DataFrame(rows)
    return (
        alt.Chart(df)
        .mark_rect()
        .encode(
            x=alt.X(
                "j:O",
                title=f"seq_b: {seq_b}",
                axis=alt.Axis(labelExpr="datum.value == 0 ? '' : datum.value"),
            ),
            y=alt.Y(
                "i:O",
                title=f"seq_a: {seq_a}",
                sort="descending",
                axis=alt.Axis(labelExpr="datum.value == 0 ? '' : datum.value"),
            ),
            color=alt.Color("score:Q", scale=alt.Scale(scheme="blues"), title="Score"),
            tooltip=["i", "j", "score"],
        )
        .properties(width=300, height=300, title="Needleman-Wunsch scoring matrix")
    )
Heatmap of the Needleman-Wunsch scoring matrix for ACGT vs AGT, with brightest cell at bottom right.
Figure 1: Scoring matrix for SEQ_A = 'ACGT' vs SEQ_B = 'AGT'. The traceback follows the brightest path from H[m,n] (bottom right) back to H[0,0] (top left).

Testing

Unlike the diffusion and Lotka-Volterra lessons, all scores here are integers computed by exact arithmetic: no floating-point rounding occurs and no tolerance is needed in any test.

Matrix shape

Border initialization

Identical sequences

Mismatch in alignment

Gap in alignment

All mismatches

from align import (
    GAP_PENALTY,
    MATCH_SCORE,
    MISMATCH_PENALTY,
    SEQ_A,
    SEQ_B,
    align,
    fill_matrix,
)


def test_matrix_shape():
    # The scoring matrix must have one extra row and column for the gap-penalty
    # border that represents aligning a non-empty prefix with an empty string.
    H = fill_matrix(SEQ_A, SEQ_B)
    assert H.shape == (len(SEQ_A) + 1, len(SEQ_B) + 1)


def test_border_initialisation():
    # Needleman-Wunsch initialises H[i, 0] = i * GAP_PENALTY and
    # H[0, j] = j * GAP_PENALTY to represent the cost of consuming
    # i or j characters as pure gaps.
    H = fill_matrix(SEQ_A, SEQ_B)
    for i in range(len(SEQ_A) + 1):
        assert H[i, 0] == i * GAP_PENALTY
    for j in range(len(SEQ_B) + 1):
        assert H[0, j] == j * GAP_PENALTY


def test_identical_sequences():
    # A sequence aligned with itself must match at every position with no gaps.
    # The global score is exactly len(seq) * MATCH_SCORE.
    # Integer arithmetic means no tolerance is needed.
    seq = "ACGT"
    aligned_a, aligned_b, score = align(seq, seq)
    assert aligned_a == seq
    assert aligned_b == seq
    assert score == len(seq) * MATCH_SCORE


def test_mismatch_in_alignment():
    # "AAAC" vs "AGAC": the second character is a mismatch (A vs G) but the
    # surrounding matches make a gapless alignment preferable.
    # Score = 3 * MATCH_SCORE + MISMATCH_PENALTY.
    aligned_a, aligned_b, score = align("AAAC", "AGAC")
    assert aligned_a == "AAAC"
    assert aligned_b == "AGAC"
    assert score == 3 * MATCH_SCORE + MISMATCH_PENALTY


def test_gap_in_alignment():
    # "ACGT" vs "AGT": C in seq_a has no counterpart in seq_b so the best
    # global alignment inserts a gap.
    # Score = 3 * MATCH_SCORE + GAP_PENALTY.
    aligned_a, aligned_b, score = align("ACGT", "AGT")
    assert aligned_a == "ACGT"
    assert aligned_b == "A-GT"
    assert score == 3 * MATCH_SCORE + GAP_PENALTY


def test_all_mismatches():
    # When every pair of characters is a mismatch, the best global alignment
    # still aligns them without gaps (each mismatch costs MISMATCH_PENALTY,
    # which equals GAP_PENALTY here, but diagonal is preferred on ties).
    # Score = len * MISMATCH_PENALTY for equal-length sequences.
    aligned_a, aligned_b, score = align("AAAA", "CCCC")
    assert score == 4 * MISMATCH_PENALTY


def test_example_sequences():
    # Verify the lesson example: SEQ_A = "ACGT", SEQ_B = "AGT".
    # The known optimal global alignment inserts a gap in seq_b for C.
    aligned_a, aligned_b, score = align(SEQ_A, SEQ_B)
    assert aligned_a == "ACGT"
    assert aligned_b == "A-GT"
    assert score == 5

Exercises

Effect of gap penalty

Change GAP_PENALTY from -1 to -3 (while keeping MISMATCH_PENALTY = -1) and re-run align("ACGT", "AGT"). Does the alignment still contain a gap, or does it change? Explain the result in terms of the recurrence relation.

Local alignment (Smith-Waterman)

The Smith-Waterman algorithm for local alignment differs from Needleman-Wunsch in exactly two ways: the first row and column are all zeros (not cumulative gap penalties), and each cell uses max(0, diag, delete, insert) instead of max(diag, delete, insert). Implement fill_local(seq_a, seq_b) and traceback_local(H, seq_a, seq_b) where the traceback starts at the cell with the highest value and stops when it reaches a zero. Verify that aligning "AAACGT" with "CCGT" locally produces a different result than aligning them globally.

Scoring matrix for amino acids

Biological sequence aligners use substitution matrices such as BLOSUM62 that assign different scores to each pair of amino acid characters, reflecting how often each substitution occurs in evolution. Replace _nt_score with a function that looks up scores in a small hand-written $4 \times 4$ substitution matrix for the four DNA bases A, C, G, T. Assign a higher score to transitions (A vs G, C vs T) than transversions (A vs C, A vs T, etc.), since transitions are more common mutations. Verify that aligning "ATCG" with "AGCG" now gives a different score than with the flat MATCH_SCORE / MISMATCH_PENALTY scheme.

Affine gap penalty

The linear gap penalty $g$ treats opening a gap and extending an existing gap identically. A more realistic affine gap penalty charges a higher cost $g_o$ to open a gap and a lower cost $g_e$ to extend it: a gap of length $k$ costs $g_o + (k-1)g_e$. Look up the three-matrix formulation of Needleman-Wunsch with affine gaps (using matrices $H$, $E$, $F$) and implement it. Verify that with $g_o = -3$ and $g_e = -1$ the alignment of "ACCCGT" with "AGT" produces a single gap of length 3 rather than three individual gaps.