Vocabulary Richness in Historical Texts

The Problem

Why does a 1000-word text typically have a lower TTR than a 100-word text from the same author?

Because longer texts contain fewer unique words.
Wrong: longer texts contain more unique words in absolute terms; it is the ratio of unique words to total words that falls.
Because words are reused more often as text length increases, reducing the
proportion of unique word forms.
Correct: common words such as "the" and "of" appear multiple times; as total tokens grow, repetitions account for a larger share of the count.
Because the author's vocabulary is exhausted after 100 words.
Wrong: the full vocabulary is available throughout; it is reuse frequency that changes the ratio.
Because tokenisation introduces more errors in longer texts.
Wrong: TTR is a purely statistical property of the token sequence and is independent of tokenisation quality.
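The reuse effect is easy to confirm empirically. The sketch below (an illustration with assumed parameters, not the corpus generator defined later) samples one long Zipfian token stream and compares the TTR of its first 100 tokens with the TTR of the full 1000:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 400
ranks = np.arange(1, vocab + 1, dtype=float)
probs = ranks ** -1.0          # Zipfian: frequency proportional to 1/rank
probs /= probs.sum()

tokens = rng.choice(vocab, size=1000, p=probs)

def ttr(seq):
    return len(set(seq)) / len(seq)

ttr_100 = ttr(tokens[:100])    # prefix of the same stream
ttr_1000 = ttr(tokens)
# The longer text reuses high-frequency words more often, so its TTR is lower.
```

The same author, the same word distribution, the same stream: only the length changed, and the ratio fell.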

Type-Token Ratio and MATTR

$$\text{TTR} = \frac{V}{T}$$

$$\text{MATTR} = \frac{1}{T - w + 1} \sum_{i=0}^{T-w} \text{TTR}(t_i, t_{i+1}, \ldots, t_{i+w-1})$$

A 100-word text contains 60 distinct word types. What is its type-token ratio?

$\text{TTR} = V/T = 60/100 = 0.6$
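A worked instance of the MATTR formula above, small enough to check by hand (the inline `ttr` helper exists only for this illustration):

```python
def ttr(seq):
    return len(set(seq)) / len(seq)

words = ["a", "a", "b", "a"]
w = 2
# Windows of width 2: ["a","a"] -> 0.5, ["a","b"] -> 1.0, ["b","a"] -> 1.0
window_ttrs = [ttr(words[i : i + w]) for i in range(len(words) - w + 1)]
mattr_value = sum(window_ttrs) / len(window_ttrs)  # (0.5 + 1.0 + 1.0) / 3
```

There are $T - w + 1 = 3$ windows, matching the normalising factor in the formula.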

Generating Synthetic Texts

import numpy as np
import polars as pl
import altair as alt

# Illustrative defaults (assumed values; the vocabulary sizes follow the three
# synthetic authors and the window width matches the exercises).
AUTHORS = ["A", "B", "C"]
VOCAB_SIZES = [200, 400, 800]
TEXTS_PER_AUTHOR = 2
TEXT_LENGTH = 1000
ZIPF_EXP = 1.0
SEED = 42
MATTR_WINDOW = 50


def make_corpus(
    authors=AUTHORS,
    vocab_sizes=VOCAB_SIZES,
    texts_per_author=TEXTS_PER_AUTHOR,
    text_length=TEXT_LENGTH,
    zipf_exp=ZIPF_EXP,
    seed=SEED,
):
    """Return a Polars DataFrame with columns text_id, author, word.

    Each author has a distinct vocabulary size; words are sampled from
    a Zipfian frequency distribution (frequency proportional to 1/rank^zipf_exp).
    A larger vocabulary size produces more distinct words per token and thus
    a higher type-token ratio.
    """
    rng = np.random.default_rng(seed)
    records = []
    text_id = 0
    for author, vocab_size in zip(authors, vocab_sizes):
        ranks = np.arange(1, vocab_size + 1, dtype=float)
        probs = ranks ** (-zipf_exp)
        probs /= probs.sum()
        word_forms = [f"w{i:04d}" for i in range(vocab_size)]
        for _ in range(texts_per_author):
            sampled = rng.choice(word_forms, size=text_length, p=probs)
            for w in sampled:
                records.append({"text_id": text_id, "author": author, "word": w})
            text_id += 1
    return pl.DataFrame(records)
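The Zipfian weighting at the heart of make_corpus can be sanity-checked in isolation; with exponent 1, rank 1 should be exactly twice as probable as rank 2, and the weights should sum to 1 after normalisation:

```python
import numpy as np

vocab_size, zipf_exp = 400, 1.0
ranks = np.arange(1, vocab_size + 1, dtype=float)
probs = ranks ** (-zipf_exp)   # frequency proportional to 1/rank^zipf_exp
probs /= probs.sum()

ratio = probs[0] / probs[1]    # 2.0 for exponent 1: p(rank 1) = 2 * p(rank 2)
total = probs.sum()            # 1.0 after normalisation
```

The long tail of low-probability ranks is what supplies the rare words that keep MATTR sensitive to vocabulary size.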

Computing TTR and MATTR

def type_token_ratio(words):
    """Return unique words / total words (type-token ratio, TTR).

    TTR falls as text length increases even when vocabulary richness is
    constant, because longer texts inevitably repeat high-frequency words.
    Use MATTR for length-fair comparisons across texts of different sizes.
    """
    words = list(words)
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def mattr(words, window=MATTR_WINDOW):
    """Return the Moving-Average Type-Token Ratio.

    Computes TTR over each consecutive window of width `window`, then
    returns the mean of those window TTRs.  Because every window has the
    same length, the result is length-independent and can fairly compare
    texts of different sizes.  Falls back to the global TTR when the text
    is no longer than the window (a single window gives the same value).
    """
    words = list(words)
    n = len(words)
    if n <= window:
        return type_token_ratio(words)
    window_ttrs = [
        type_token_ratio(words[i : i + window]) for i in range(n - window + 1)
    ]
    return float(np.mean(window_ttrs))
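The length-independence claim can be checked directly. This sketch (assumed seed and parameters, with minimal inline re-implementations of the two metrics) draws a 200-token and an 800-token text from the same distribution; the global TTR drops with length, while the window-averaged value barely moves:

```python
import numpy as np

rng = np.random.default_rng(1)
ranks = np.arange(1, 401, dtype=float)
probs = ranks ** -1.0
probs /= probs.sum()

def ttr(seq):
    return len(set(seq)) / len(seq)

def moving_ttr(seq, w=50):
    # Mean TTR over all consecutive windows of width w.
    if len(seq) <= w:
        return ttr(seq)
    return float(np.mean([ttr(seq[i : i + w]) for i in range(len(seq) - w + 1)]))

short = list(rng.choice(400, size=200, p=probs))
long_ = list(rng.choice(400, size=800, p=probs))
# ttr(long_) < ttr(short), but moving_ttr is nearly identical for both.
```

The residual difference in the moving average is sampling noise, far smaller than the length-driven gap in the raw ratio.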

Aggregating Richness Scores

def compute_richness(df, window=MATTR_WINDOW):
    """Return a DataFrame with TTR and MATTR for each text.

    Input df must have columns text_id, author, word.
    Output has columns text_id, author, ttr, mattr.
    """
    rows = []
    for text_id in df["text_id"].unique().sort():
        subset = df.filter(pl.col("text_id") == text_id)
        author = subset["author"][0]
        words = subset["word"].to_list()
        rows.append(
            {
                "text_id": int(text_id),
                "author": author,
                "ttr": type_token_ratio(words),
                "mattr": mattr(words, window),
            }
        )
    return pl.DataFrame(rows)

Visualizing Results

import altair as alt


def plot_richness(richness_df, filename):
    """Save a bar chart of MATTR per text, coloured by author."""
    df = richness_df.with_columns(
        pl.concat_str(
            [
                pl.col("author"),
                pl.lit(" \u2013 text "),
                (pl.col("text_id") + 1).cast(pl.String),
            ]
        ).alias("label")
    )
    chart = (
        alt.Chart(df)
        .mark_bar()
        .encode(
            x=alt.X("label:N", title="Text", sort=None),
            y=alt.Y("mattr:Q", title="MATTR", scale=alt.Scale(zero=False)),
            color=alt.Color("author:N", title="Author"),
        )
        .properties(
            width=320,
            height=250,
            title="Moving-Average Type-Token Ratio by Text",
        )
    )
    chart.save(filename)
Figure 1: MATTR for each text in the synthetic corpus (six bars, one per text, coloured by author; Author A's bars are shortest, Author C's tallest). Author A (vocabulary 200), Author B (400), and Author C (800) produce texts with MATTR of roughly 0.63, 0.68, and 0.72 respectively, confirming that the metric distinguishes authors by lexical richness regardless of text length.

Testing

import polars as pl
import pytest

from generate_vocab import make_corpus
from vocab import type_token_ratio, mattr, compute_richness


def test_ttr_all_unique():
    # When every word is distinct the TTR is exactly 1.0.
    assert type_token_ratio(["a", "b", "c", "d"]) == pytest.approx(1.0)


def test_ttr_all_same():
    # When the same word is repeated n times the TTR is 1/n.
    assert type_token_ratio(["x"] * 10) == pytest.approx(0.1)


def test_ttr_empty():
    # Empty input returns 0.0 without raising an exception.
    assert type_token_ratio([]) == pytest.approx(0.0)


def test_mattr_short_text_equals_ttr():
    # When the text is shorter than the window, MATTR falls back to global TTR.
    words = ["a", "b", "c"]
    assert mattr(words, window=10) == pytest.approx(type_token_ratio(words))


def test_mattr_window_equals_length():
    # When text length equals the window width there is exactly one window,
    # so MATTR equals TTR exactly.
    words = ["a", "b", "c", "a"]
    assert mattr(words, window=4) == pytest.approx(type_token_ratio(words))


def test_richness_order_by_vocab_size():
    # Mean MATTR must increase with vocabulary size: Author A (200) <
    # Author B (400) < Author C (800).  The Zipfian generator guarantees
    # that a larger vocabulary produces more distinct tokens per window.
    df = make_corpus()
    richness = compute_richness(df)
    means = richness.group_by("author").agg(pl.col("mattr").mean()).sort("author")
    mattr_vals = means["mattr"].to_list()
    # Authors sort alphabetically: A, B, C -- matching ascending vocab sizes.
    assert mattr_vals[0] < mattr_vals[1] < mattr_vals[2]

Vocabulary richness key terms

Type-token ratio (TTR)
$V/T$ where $V$ is the number of distinct word types and $T$ is the total token count; decreases with text length even at constant vocabulary richness
Moving-average TTR (MATTR)
Mean of window TTRs computed over overlapping fixed-width windows; length-independent because every window has the same size
Zipf's law
The empirical regularity that the $k$-th most frequent word in a corpus appears roughly $1/k$ times as often as the most frequent word; produces the characteristic long tail of rare words in natural language
Vocabulary richness
A property of a text or author reflecting how varied word choices are; higher richness means proportionally more unique words per token
Window width (MATTR)
The fixed number of tokens in each local TTR window; must be long enough for word repetition to occur (otherwise every window TTR saturates at 1.0 and the metric loses discriminative power) yet short enough that many overlapping windows fit in the text and average out token-level noise

Exercises

Effect of window width

Compute MATTR for window widths of 10, 25, 50, 100, and 200 for one of the Author C texts. Plot MATTR against window width and explain why MATTR approaches the global TTR as the window grows toward the text length.

TTR length dependence

Generate texts of 100, 200, 400, and 800 tokens from the same Zipfian distribution (vocabulary size 400). Plot both TTR and MATTR against text length on the same axes. At what length does the difference between TTR and MATTR become noticeable?

Real corpus comparison

Download two public-domain texts of very different lengths from Project Gutenberg. Tokenise each into lowercase words (strip punctuation), then compute TTR and MATTR with a 50-word window. Does MATTR correct the length bias visible in TTR?

Hapax legomena rate

A hapax legomenon is a word that appears exactly once in a text. Implement hapax_rate(words) that returns the fraction of total tokens that are hapax legomena. Compare hapax rate with MATTR across the three synthetic authors and discuss whether hapax rate is also length-independent.
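One possible shape for the helper, as a starting point (a sketch only; the comparison with MATTR and the length-dependence question are left to you):

```python
from collections import Counter


def hapax_rate(words):
    """Return the fraction of total tokens that occur exactly once."""
    words = list(words)
    if not words:
        return 0.0
    counts = Counter(words)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(words)


# "b" and "c" each occur once: 2 hapaxes out of 4 tokens.
rate = hapax_rate(["a", "b", "a", "c"])
```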