Phylogenetic Tree Reconstruction

The Problem

Species that share a common ancestor tend to have similar DNA or protein sequences.
A phylogenetic tree shows how a set of taxa (species, genes, or individuals) are related by common descent.
The leaves of the tree are the observed taxa; internal nodes represent hypothetical ancestors.
Given only pairwise sequence distances, we want to infer the tree topology (who is related to whom) and branch lengths (how different they are).

A rooted binary tree on 4 leaves has how many internal nodes (not counting the leaves themselves)?

2: Wrong: a rooted binary tree on N leaves has N-1 internal nodes, so 4-1=3.
3: Correct: a rooted binary tree on N leaves has N-1 internal nodes; here 4-1=3.
4: Wrong: 4 is the number of leaves; the number of internal nodes is N-1=3.
5: Wrong: this would require N=6 leaves.

Distance Matrices

A distance matrix $D$ stores the evolutionary distance between every pair of taxa.
Distances are estimated from aligned sequences: the fraction of positions that differ, corrected for multiple substitutions.

TAXA = ["Bat", "Chimp", "Human", "Gorilla"]

# True branch lengths from the reference tree used to generate distances.
# The tree is: ((Bat:0.6, Chimp:0.8):1.2, (Human:0.4, Gorilla:0.5))
# Pairwise distances are sums of branches on the path between each pair.
_BRANCH = {
    ("Bat", "Chimp"): 0.6 + 0.8,       # 1.4
    ("Bat", "Human"): 0.6 + 1.2 + 0.4, # 2.2
    ("Bat", "Gorilla"): 0.6 + 1.2 + 0.5, # 2.3
    ("Chimp", "Human"): 0.8 + 1.2 + 0.4, # 2.4
    ("Chimp", "Gorilla"): 0.8 + 1.2 + 0.5, # 2.5
    ("Human", "Gorilla"): 0.4 + 0.5,   # 0.9
}

The six pairwise distances for our four-taxon example:

Pair	Distance
Bat, Chimp	1.4
Human, Gorilla	0.9
Bat, Human	2.2
Bat, Gorilla	2.3
Chimp, Human	2.4
Chimp, Gorilla	2.5

def make_distance_matrix(noise_scale=0.0, seed=SEED):
    """Return (names, D) where D is the symmetric pairwise distance matrix.

    When noise_scale=0.0 (default) the distances are exact tree-additive values.
    A non-zero noise_scale adds symmetric Gaussian noise so that the distances
    no longer satisfy the four-point condition exactly, simulating real data.
    """
    n = len(TAXA)
    D = np.zeros((n, n))
    for i, a in enumerate(TAXA):
        for j, b in enumerate(TAXA):
            if i < j:
                key = (a, b) if (a, b) in _BRANCH else (b, a)
                D[i, j] = D[j, i] = _BRANCH[key]
    if noise_scale > 0.0:
        rng = np.random.default_rng(seed)
        noise = rng.normal(0, noise_scale, (n, n))
        noise = (noise + noise.T) / 2  # keep symmetric
        np.fill_diagonal(noise, 0)
        D = np.maximum(D + noise, 0)  # distances must be non-negative
    return list(TAXA), D

The UPGMA Algorithm

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) reconstructs a rooted tree from a distance matrix using only arithmetic means.
The key idea: at each step, merge the pair of taxa with the smallest pairwise distance.
Place the new internal node at height $D(i,j)/2$ — half the merged distance — which is the estimated time of the common ancestor.
Branch length from the new node to each merged taxon is the difference in heights.
Update the distance from the new node $u$ to any remaining taxon $k$ as the arithmetic mean:

$$D(u, k) = \frac{D(i, k) + D(j, k)}{2}$$

Replace $i$ and $j$ with $u$ in the matrix and repeat until only one node remains.

Put these steps for one iteration of UPGMA in the correct order.

Find the pair (i, j) with the smallest distance in the current matrix
Create a new internal node u at height D(i, j) / 2
Record branch lengths: branch to i = height(u) - height(i), branch to j = height(u) - height(j)
Compute distances from u to all remaining taxa as arithmetic means
Remove i and j from the matrix, add u, and repeat

A Worked Example

Starting with $n = 4$ taxa and the distances above, Human and Gorilla have the smallest distance (0.9), so they are merged first into node4 at height $0.9/2 = 0.45$.

Branch lengths: $\ell_\text{Human} = 0.45 - 0 = 0.45$ and $\ell_\text{Gorilla} = 0.45 - 0 = 0.45$ (both start at height 0).

After replacing Human and Gorilla with node4, the next smallest distance is Bat-Chimp (1.4), which becomes node5 at height $1.4/2 = 0.70$. Branch lengths to Bat and Chimp are each $0.70 - 0 = 0.70$.

The final merge joins node4 and node5. The average distance between node4 and node5, computed as the arithmetic mean of all pairwise distances between the two groups, works out to $2.35$. The final node (node6) sits at height $2.35/2 = 1.175$. Branch lengths are $1.175 - 0.45 = 0.725$ to node4 and $1.175 - 0.70 = 0.475$ to node5.

In the first UPGMA iteration Human and Gorilla are merged. $D(\text{Human}, \text{Gorilla}) = 0.9$. The new internal node is placed at height $D/2$. What is the branch length from the new node to Human? Give your answer to two decimal places.

Why does UPGMA assume a molecular clock?

Because it always picks the pair with the smallest distance, regardless of how far each taxon is from others.: Wrong: that is the merge criterion, not the clock assumption.
Because it places each internal node at half the merged distance, assuming all lineages evolve at the same rate.: Correct: placing the node at $D(i,j)/2$ assumes that $i$ and $j$ are equidistant from their common ancestor, which requires equal rates of evolution in both lineages.
Because it uses arithmetic means to update distances.: Wrong: the arithmetic mean update is a computational convenience, not the molecular clock assumption.
Because it produces a rooted tree rather than an unrooted tree.: Wrong: whether a tree is rooted or unrooted is independent of the clock assumption.

Displaying the Tree

def draw_tree(edges, taxa, filename):
    """Save a rooted tree diagram to `filename`.

    Leaf nodes (members of `taxa`) are drawn in blue; internal nodes in grey.
    Edge labels show branch lengths rounded to two decimal places.
    """
    G = nx.DiGraph()
    for parent, child, length in edges:
        G.add_edge(parent, child, weight=round(length, 4))

    # Use a simple hierarchical layout via networkx spring layout.
    pos = nx.spring_layout(G.to_undirected(), seed=7)

    leaf_nodes = [nd for nd in G.nodes() if nd in taxa]
    internal_nodes = [nd for nd in G.nodes() if nd not in taxa]

    fig, ax = plt.subplots(figsize=(6, 5))
    nx.draw_networkx_nodes(
        G, pos, nodelist=leaf_nodes, node_color="steelblue", node_size=600, ax=ax
    )
    nx.draw_networkx_nodes(
        G, pos, nodelist=internal_nodes, node_color="lightgrey", node_size=400, ax=ax
    )
    nx.draw_networkx_labels(G, pos, ax=ax, font_size=9)
    nx.draw_networkx_edges(G, pos, ax=ax, arrows=True)
    edge_labels = {(p, c): f"{length:.2f}" for p, c, length in edges}
    nx.draw_networkx_edge_labels(
        G, pos, edge_labels=edge_labels, font_size=7, ax=ax
    )

    ax.set_title("UPGMA tree")
    ax.axis("off")
    plt.tight_layout()
    plt.savefig(filename)
    plt.close(fig)

Rooted UPGMA tree for Bat, Chimp, Human, Gorilla with branch lengths labelled. — Figure 1: UPGMA tree for four taxa. Blue nodes are observed taxa; grey nodes are reconstructed ancestors. Branch lengths reflect the arithmetic-mean distances used at each merge step.

Testing

Matrix symmetry

Distances must satisfy $D(i,j) = D(j,i)$ and $D(i,i) = 0$.
An asymmetric matrix would indicate a bug in the construction code.

Edge count

A rooted binary tree on $N$ leaves has exactly $2N - 2$ directed edges (each of the $N - 1$ internal nodes contributes two child edges).
Any other count means the loop terminated early or ran too many iterations.

Topology recovery

Two pairs of taxa (Bat, Chimp) and (Human, Gorilla) should each share a private internal node.
If either pair is split across different internal nodes, the algorithm recovered the wrong topology.

Branch length recovery

With exact tree-additive distances the algorithm recovers every branch length to machine precision.
No tolerance is needed: floating-point round-off in the arithmetic means is below $10^{-10}$.

Topology under noise

Real distance data are not perfectly additive.
With small Gaussian noise (scale 0.05) the correct topology should still be recovered.
This test documents the robustness of UPGMA to modest measurement error.

import numpy as np
from phylo import TAXA, make_distance_matrix, upgma

# Tolerance for floating-point comparisons in branch-length tests.
# UPGMA computes only averages and halvings, so round-off is well below 1e-10.
BRANCH_TOLERANCE = 1e-10


def test_matrix_symmetric():
    # The distance matrix must be symmetric with zero diagonal.
    names, D = make_distance_matrix()
    assert np.allclose(D, D.T)
    assert np.all(np.diag(D) == 0.0)


def test_upgma_edge_count():
    # A rooted binary tree on N leaves has exactly 2N-2 directed edges
    # (each internal node contributes two child edges; there are N-1 internal nodes).
    names, D = make_distance_matrix()
    edges = upgma(names, D)
    assert len(edges) == 2 * len(TAXA) - 2


def test_upgma_recovers_topology():
    # The reference tree groups (Bat, Chimp) and (Human, Gorilla).
    # UPGMA should merge Human+Gorilla first (smallest distance 0.9),
    # then Bat+Chimp (next smallest distance 1.4), then join the two clades.
    names, D = make_distance_matrix()
    edges = upgma(names, D)

    # Build parent-child adjacency from edges.
    children = {}
    for parent, child, _ in edges:
        children.setdefault(parent, []).append(child)

    # Find the internal node whose children are both leaves from the same clade.
    def siblings(taxon_a, taxon_b):
        for parent, kids in children.items():
            if taxon_a in kids and taxon_b in kids:
                return True
        return False

    assert siblings("Human", "Gorilla"), "Human and Gorilla should be sisters"
    assert siblings("Bat", "Chimp"), "Bat and Chimp should be sisters"


def test_upgma_recovers_branch_lengths():
    # With exact distances UPGMA should recover branch lengths to machine precision.
    # UPGMA merges in this order:
    #   node4 <- Human (0.45), Gorilla (0.45)   [merge height = 0.9/2 = 0.45]
    #   node5 <- Bat (0.70), Chimp (0.70)        [merge height = 1.4/2 = 0.70]
    #   node6 <- node4 (0.725), node5 (0.475)   [merge height = 2.35/2 = 1.175]
    # Branch length = merge height of parent - height of child.
    expected = {
        ("node4", "Human"): 0.45,
        ("node4", "Gorilla"): 0.45,
        ("node5", "Bat"): 0.70,
        ("node5", "Chimp"): 0.70,
        ("node6", "node4"): 0.725,
        ("node6", "node5"): 0.475,
    }
    names, D = make_distance_matrix()
    edges = upgma(names, D)
    assert len(edges) == len(expected), f"Expected {len(expected)} edges, got {len(edges)}"
    for parent, child, length in edges:
        key = (parent, child)
        assert key in expected, f"Unexpected edge {parent} -> {child}"
        assert abs(length - expected[key]) < BRANCH_TOLERANCE, (
            f"Edge {parent}->{child}: got {length:.8f}, expected {expected[key]:.8f}"
        )


def test_upgma_with_noise_recovers_topology():
    # With small Gaussian noise (scale 0.05) the correct topology should still
    # be recovered.  The Human-Gorilla distance (0.9) is well below the next
    # smallest (Bat-Chimp at 1.4), so modest noise rarely disrupts the merge order.
    names, D = make_distance_matrix(noise_scale=0.05)
    edges = upgma(names, D)
    children = {}
    for parent, child, _ in edges:
        children.setdefault(parent, []).append(child)

    def siblings(taxon_a, taxon_b):
        for parent, kids in children.items():
            if taxon_a in kids and taxon_b in kids:
                return True
        return False

    assert siblings("Human", "Gorilla"), (
        "Human and Gorilla should be sisters even under small noise"
    )
    assert siblings("Bat", "Chimp"), (
        "Bat and Chimp should be sisters even under small noise"
    )

Generating Random Trees

A single hand-crafted example is useful for tracing through the algorithm, but it cannot reveal how UPGMA behaves on different topologies or at larger scales.
generate_tree.py provides make_random_tree which builds a random unrooted binary tree and returns its exact distance matrix alongside the true edge list.

Building a random topology

Start with three leaves connected to a single internal node.
For each additional taxon, pick a random existing edge, insert a new internal node in the middle of it, and attach the new taxon to that node.
Branch lengths are drawn independently from an exponential distribution with a specified mean.

def make_random_tree(
    n_taxa, seed=SEED, branch_mean=DEFAULT_BRANCH_MEAN, noise_scale=0.0
):
    """Return (names, D, true_edges) for a random unrooted binary tree.

    names      — list of n_taxa strings "S00", "S01", ...
    D          — n_taxa x n_taxa symmetric distance matrix
    true_edges — list of (node_a, node_b, length) for the generating tree
    """
    if n_taxa < MIN_TAXA:
        raise ValueError(f"n_taxa must be at least {MIN_TAXA}, got {n_taxa}")

    rng = np.random.default_rng(seed)
    names = [f"S{i:02d}" for i in range(n_taxa)]

    # Start with the three-leaf star: S00, S01, S02 all connected to I0.
    internal_counter = 0
    first_internal = f"I{internal_counter}"
    internal_counter += 1
    edges = [
        [names[0], first_internal, rng.exponential(branch_mean)],
        [names[1], first_internal, rng.exponential(branch_mean)],
        [names[2], first_internal, rng.exponential(branch_mean)],
    ]

    # Insert each remaining taxon by splitting a random existing edge.
    for i in range(3, n_taxa):
        idx = int(rng.integers(len(edges)))
        u, v, _ = edges.pop(idx)

        new_internal = f"I{internal_counter}"
        internal_counter += 1
        edges.append([u, new_internal, rng.exponential(branch_mean)])
        edges.append([v, new_internal, rng.exponential(branch_mean)])
        edges.append([names[i], new_internal, rng.exponential(branch_mean)])

    # Compute all pairwise distances through the tree.
    G = nx.Graph()
    for a, b, w in edges:
        G.add_edge(a, b, weight=w)
    all_dists = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))

    n = len(names)
    D = np.zeros((n, n))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            D[i, j] = all_dists[a][b]

    if noise_scale > 0.0:
        noise = rng.normal(0, noise_scale, (n, n))
        noise = (noise + noise.T) / 2
        np.fill_diagonal(noise, 0)
        D = np.maximum(D + noise, 0.0)

    true_edges = [(a, b, w) for a, b, w in edges]
    return names, D, true_edges

Computing distances and checking topology

Once the tree is built, pairwise distances are computed as shortest-path lengths through the networkx graph weighted by branch length.
To compare topologies, each tree is converted to its set of bipartitions: for every edge, the two groups of leaves on either side.
Two trees have the same topology if and only if their bipartition sets are identical.

import numpy as np
import pytest
from generate_tree import make_random_tree, bipartitions, MIN_TAXA
from phylo import neighbor_joining


def test_rejects_too_few_taxa():
    with pytest.raises(ValueError):
        make_random_tree(MIN_TAXA - 1)


def test_edge_count():
    # An unrooted binary tree on N leaves has exactly 2N-3 edges.
    for n in [3, 5, 8, 12]:
        _, _, edges = make_random_tree(n)
        assert len(edges) == 2 * n - 3, f"n={n}: got {len(edges)} edges"


def test_distance_matrix_properties():
    names, D, _ = make_random_tree(6)
    assert D.shape == (6, 6)
    assert np.allclose(D, D.T)
    assert np.all(np.diag(D) == 0.0)
    assert np.all(D >= 0.0)


def test_nj_recovers_true_topology():
    # With exact tree-additive distances, NJ must recover every bipartition
    # of the generating tree.  This holds for all tree-additive inputs
    # regardless of topology or branch lengths.
    for seed in [1, 7, 42, 99]:
        names, D, true_edges = make_random_tree(6, seed=seed)
        inferred_edges = neighbor_joining(names, D)
        true_splits = bipartitions(true_edges, names)
        inferred_splits = bipartitions(inferred_edges, names)
        assert true_splits == inferred_splits, (
            f"seed={seed}: topology mismatch\n  true:     {true_splits}\n  inferred: {inferred_splits}"
        )


def test_nj_recovers_topology_under_noise():
    # With small Gaussian noise (scale 0.05 ≈ 10% of the default branch mean)
    # NJ should still recover the correct topology on most random trees.
    # We test four seeds and require all four to succeed.  Measured failure
    # rate at this noise level is below 1% for N=6 with branch_mean=0.5.
    for seed in [1, 7, 42, 99]:
        names, D, true_edges = make_random_tree(6, seed=seed, noise_scale=0.05)
        inferred_edges = neighbor_joining(names, D)
        true_splits = bipartitions(true_edges, names)
        inferred_splits = bipartitions(inferred_edges, names)
        assert true_splits == inferred_splits, (
            f"seed={seed}: topology mismatch under noise"
        )


def test_reproducibility():
    # Same seed must always produce identical output.
    names1, D1, edges1 = make_random_tree(8)
    names2, D2, edges2 = make_random_tree(8)
    assert names1 == names2
    assert np.array_equal(D1, D2)
    assert edges1 == edges2


def test_different_seeds_differ():
    _, D1, _ = make_random_tree(6)
    _, D2, _ = make_random_tree(6, seed=999)
    assert not np.allclose(D1, D2)

Exercises

Neighbor-joining comparison

Neighbor-joining (Saitou and Nei, 1987) uses a Q-matrix correction that removes long-branch attraction bias and does not assume a molecular clock. Implement neighbor_joining(names, D) returning edges as (node_a, node_b, length) triples. Verify that UPGMA and neighbor-joining agree on the topology for the exact distance matrix, then add noise (scale 0.05) and show a case where they disagree.

Bootstrapping

Phylogenetic bootstrap estimates how well-supported each split is. Draw 100 perturbed distance matrices using make_distance_matrix(noise_scale=0.1, seed=i) for $i = 0, \ldots, 99$ and run UPGMA on each. Report what fraction of replicates recover the correct (Bat, Chimp) clade.

Four-point condition

A distance matrix is tree-additive if and only if it satisfies the four-point condition: for every four taxa $i, j, k, l$, the largest two of the three sums $D(i,j)+D(k,l)$, $D(i,k)+D(j,l)$, $D(i,l)+D(j,k)$ are equal. Write is_additive(names, D) that checks this condition for all $\binom{N}{4}$ quadruples and returns True or False. Verify that the exact matrix passes and a noisy matrix fails.

Larger synthetic trees

Extend make_distance_matrix to accept an arbitrary Newick-format string such as "((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6)" and compute the full distance matrix from it. Use this to generate a 10-taxon example and confirm that UPGMA recovers the correct topology.