Coding Schemes and Inter-Rater Reliability

Learning Goals

Lesson

"""Compute percent agreement and Cohen's kappa for two coders."""

from collections import Counter


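def cohen_kappa(coder_a, coder_b):
    """Return (percent agreement, Cohen's kappa) for two lists of categorical labels."""
    n = len(coder_a)
    if len(coder_b) != n:
        raise ValueError("Coders must label the same number of items")
    if n == 0:
        raise ValueError("Cannot compute agreement for zero items")
    categories = set(coder_a) | set(coder_b)
    # Observed agreement: fraction of items that both coders labeled identically
    observed_agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    count_a = Counter(coder_a)
    count_b = Counter(coder_b)
    # Expected chance agreement, from each coder's own label proportions
    expected_agreement = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)
    if expected_agreement == 1:
        # Both coders used one identical category for every item, so kappa is 0/0
        raise ValueError("Kappa is undefined when both coders use a single category")
    kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
    return observed_agreement, kappa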


if __name__ == "__main__":
    # Example: two coders classify pull request comments
    coder_a = [
        "change",
        "question",
        "approve",
        "change",
        "change",
        "question",
        "approve",
        "approve",
        "change",
        "question",
    ]
    coder_b = [
        "change",
        "approve",
        "approve",
        "change",
        "question",
        "question",
        "approve",
        "approve",
        "change",
        "question",
    ]
    pct, kappa = cohen_kappa(coder_a, coder_b)
    print(f"Percent agreement: {pct:.1%}")
    print(f"Cohen's kappa: {kappa:.3f}")
    label = (
        "poor"
        if kappa < 0.4
        else "moderate"
        if kappa < 0.6
        else "substantial"
        if kappa < 0.8
        else "near-perfect"
    )
    print(f"Interpretation: {label}")
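When kappa comes out lower than hoped, it helps to see which pairs of codes the coders confuse with each other. The following is a minimal sketch and not part of the script above: `disagreement_pairs` is a hypothetical helper name, and the short label lists are made up for illustration.

```python
from collections import Counter


def disagreement_pairs(coder_a, coder_b):
    """Count how often each unordered pair of different codes landed on the same item."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Coders must label the same number of items")
    pairs = Counter()
    for a, b in zip(coder_a, coder_b):
        if a != b:
            # Sort so (question, approve) and (approve, question) are counted together
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Made-up labels for illustration only
coder_a = ["change", "question", "approve", "question", "change"]
coder_b = ["change", "approve", "approve", "approve", "question"]
for (first, second), count in disagreement_pairs(coder_a, coder_b).most_common():
    print(f"{first} vs {second}: {count}")
```

The pair with the highest count is a reasonable place to start when deciding which code definition to rewrite.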

Check Understanding

Why is percent agreement not sufficient for measuring inter-rater reliability? What does Cohen's kappa add?

Percent agreement does not account for agreement that would occur by chance given the distribution of codes in the data. If one category dominates (for example, 90% of items are "not relevant"), two coders independently guessing in proportion to those base rates would still agree 82% of the time (0.9 × 0.9 + 0.1 × 0.1 = 0.82). Cohen's kappa subtracts the expected chance agreement from the observed agreement and divides the result by the largest possible improvement over chance (1 minus the expected agreement), so a high kappa means the coders are genuinely agreeing on meaning rather than accidentally choosing the same dominant category.
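The chance-agreement arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions: `chance_agreement` is a hypothetical helper name, both coders are assumed to use the same 90/10 split, and the 85% observed agreement is an illustrative value, not a result from any study.

```python
def chance_agreement(proportions_a, proportions_b):
    """Expected agreement if two coders label independently with these proportions."""
    categories = set(proportions_a) | set(proportions_b)
    return sum(proportions_a.get(c, 0.0) * proportions_b.get(c, 0.0) for c in categories)


# Assumed base rates: 90% "not relevant", 10% "relevant" for both coders
base_rates = {"not relevant": 0.9, "relevant": 0.1}
expected = chance_agreement(base_rates, base_rates)
print(f"Chance agreement: {expected:.0%}")

# Illustrative observed agreement of 85%: high as a percentage, barely above chance
observed = 0.85
kappa = (observed - expected) / (1 - expected)
print(f"Kappa at 85% observed agreement: {kappa:.3f}")
```

With these assumed numbers, an 85% agreement rate corresponds to a kappa of roughly 0.17, which shows how a percentage that sounds high can still be barely above chance.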

What is wrong with the following code, and how do you fix it?
# Two coders discuss each item together and then record their "independent" codes
coder_a = [code_item(item, discussion=True) for item in items]
coder_b = [code_item(item, discussion=True) for item in items]
kappa = cohen_kappa(coder_a, coder_b)

The coders discussed each item before recording their codes, so the codes are not independent. Any agreement reflects shared discussion rather than the clarity of the codebook. Computing kappa on codes that were negotiated before recording is meaningless as a reliability measure because it does not tell you whether a third coder, working alone with only the codebook, would reach the same conclusions. The fix is to have each coder work through all items independently, recording their codes without any communication, before comparing results and computing kappa. Only after computing kappa should they discuss disagreements to reach consensus.
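One way to express the fix in code is to collect each coder's labels in isolation and compare them only afterwards. This is a sketch under assumptions: the lesson does not define `code_item`, so the version below replaces it with hypothetical rule-based stand-ins (`coder_a_rules` and `coder_b_rules`) for two human coders, and the review comments are made up so the example runs.

```python
def collect_codes(items, code_one_item):
    """Label every item using one coder's judgment only; no other coder's labels are visible."""
    return [code_one_item(item) for item in items]


# Hypothetical stand-ins for two human coders applying the same codebook alone
def coder_a_rules(comment):
    if comment.endswith("?"):
        return "question"
    return "approve" if "LGTM" in comment else "change"


def coder_b_rules(comment):
    if "?" in comment:
        return "question"
    return "approve" if "LGTM" in comment else "change"


# Made-up review comments for illustration only
items = [
    "Why is this a list?",
    "LGTM",
    "Please rename this variable.",
    "Is the test flaky? Please check.",
]

# Step 1: each coder finishes all items before anyone sees the other's codes
codes_a = collect_codes(items, coder_a_rules)
codes_b = collect_codes(items, coder_b_rules)

# Step 2: only now compare, and keep the disagreements for later discussion
disagreements = [
    (item, a, b) for item, a, b in zip(items, codes_a, codes_b) if a != b
]
print(disagreements)
```

Kappa would then be computed on `codes_a` and `codes_b` before the coders discuss anything in `disagreements`.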

Wang et al. report kappa for each classification pass of their card sorting, and kappa improves between passes. Why might kappa improve for reasons other than the coders simply getting more practice?

Each time the coders compare results and discuss disagreements, they sharpen the codebook implicitly. After the first pass, they discover which distinctions were ambiguous and refine how they think about those boundaries. The second pass is therefore performed with an improved shared understanding of the categories, even if the written codebook has not changed. In addition, the act of explaining your reasoning to another coder forces you to make tacit distinctions explicit, which helps both coders apply the scheme more consistently on the next pass.

A researcher codes 200 issue comments solo and reports "95% internal consistency." What is wrong with this claim?

Internal consistency (often measured by Cronbach's alpha or similar) is a property of survey scales, not of qualitative coding, so applying it here makes no methodological sense. More importantly, one person coding alone cannot produce an inter-rater reliability estimate at all, because that requires at least two coders working independently. The researcher should have recruited a second coder, given them the codebook, had them code a sample independently, and then computed Cohen's kappa. A solo coder producing a high "consistency" number is measuring something, but not inter-rater reliability.

Exercises

Computing kappa on your own codes

Return to the ten Stack Overflow comments you coded in Lesson 20. Give the same comments to a partner, along with the codebook you wrote, but do not discuss the comments before your partner codes them. After both of you have recorded your codes independently, compute percent agreement and Cohen's kappa using the kappa.py script. If kappa is below 0.6, identify the one code definition that caused the most disagreements and rewrite it so that it includes an explicit example and a non-example.

Writing and applying a codebook

Write a three-code codebook for classifying pull request review comments as "requesting a change," "asking a question," or "approving." For each code, write a definition, one example of a real comment that fits, and one example of a real comment that does not fit. Apply the codebook to ten review comments from any public GitHub repository, then give the codebook and the same ten comments to a partner and have them code independently. Compute Cohen's kappa and note which pair of codes caused the most disagreements.

Explaining kappa improvement across passes

Wang et al. report kappa values for each coding pass of their card sorting study and observe improvement between passes. Write two explanations for why kappa might improve between the first and second pass that go beyond simply saying "we practiced." For each explanation, write one sentence describing what a researcher could do to take advantage of that mechanism deliberately rather than just hoping improvement occurs.

Critiquing a solo coding claim

A researcher codes 200 GitHub issue comments alone and reports 95% internal consistency as evidence that their coding scheme is reliable. Write three sentences explaining what is wrong with this claim. In a fourth sentence, describe exactly what the researcher should have done instead, naming the measure they should have computed and the minimum acceptable value by the conventions in this lesson.

Computing expected agreement by hand

Consider a two-category coding scheme where coder A assigns 60% of items to category 1 and 40% to category 2, and coder B assigns 70% to category 1 and 30% to category 2. Compute the expected agreement by chance. Then compute the Cohen's kappa you would achieve if the observed agreement were 80%. Show your arithmetic at each step. State whether the resulting kappa meets the threshold for "substantial" agreement by the conventions in this lesson, and write one sentence interpreting what that means for a paper reporting this result.