AI Tools and Software Engineering

Learning Goals

Productivity Claims

Code Quality and Security

What We Do Not Yet Know

Code

"""Classify Copilot-generated code snippets for security weaknesses."""

import polars as pl

# Fu et al. security categories
CATEGORIES = [
    "injection",
    "memory_management",
    "error_handling",
    "cryptography",
    "other",
    "none",
]

snippets = pl.read_csv("data/copilot_snippets.csv")
results = []
for row in snippets.iter_rows(named=True):
    print(f"\n--- Snippet {row['id']} ---")
    print(row["code"])
    compiles = input("Compiles? (y/n): ").strip().lower() == "y"
    if compiles:
        has_issue = input("Obvious security issue? (y/n): ").strip().lower() == "y"
        if has_issue:
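            # A snippet with an issue cannot be "none"; re-prompt until the
            # answer is a known category so typos do not pollute the tally.
            valid = [c for c in CATEGORIES if c != "none"]
            print(f"Categories: {', '.join(valid)}")
            category = input("Category: ").strip().lower()
            while category not in valid:
                category = input("Unknown category, try again: ").strip().lower()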
        else:
            category = "none"
    else:
        has_issue = False
        category = "does_not_compile"
    results.append(
        {
            "id": row["id"],
            "compiles": compiles,
            "has_issue": has_issue,
            "category": category,
        }
    )

result_df = pl.DataFrame(results)
print("\nTally:")
print(result_df.group_by("category").agg(pl.len()).sort("len", descending=True))

Check Understanding

What two threats to validity apply to GitHub's own report that Copilot increases task completion speed by 55%?

The first is a lack of external validity: the studies used short, isolated programming tasks that are not representative of the complex, context-dependent work most developers do day-to-day. The second is conflict of interest: GitHub has a financial interest in Copilot's success, and the comparison conditions were not always pre-specified or independently reviewed. A 55% improvement on a contrived task tells you something, but it is not the same as a 55% improvement in daily developer output.

What is wrong with this function, and how would you fix it?

import hashlib
def store_password(password):
    return hashlib.md5(password.encode()).hexdigest()

MD5 is a cryptographically broken hash function. It is also designed to be fast, which is the opposite of what password storage needs: an unsalted MD5 digest is trivially attacked with rainbow tables or GPU-accelerated brute force. The fix is to use a password hashing function designed for this purpose, such as bcrypt, scrypt, or Argon2. These are deliberately slow and use a per-password salt, so identical passwords do not produce identical hashes:

import bcrypt
def store_password(password):
    return bcrypt.hashpw(password.encode(), bcrypt.gensalt())

This is exactly the kind of vulnerability Fu et al. found in Copilot-generated code: the model learned from a training corpus that contains many examples of MD5 password hashing, so it reproduces the pattern confidently.
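The answer above also names scrypt as an acceptable choice. As a rough sketch of the same idea using only the standard library, the code below stores a random salt alongside the digest and verifies a login attempt by recomputing the digest. The function names `hash_password` and `verify_password` and the cost parameters in `SCRYPT_PARAMS` are assumptions made for illustration, not recommendations from the lesson, and `hashlib.scrypt` is only available when Python is built against an OpenSSL version that supports it.

```python
import hashlib
import hmac
import os

# Assumed cost parameters for illustration only; tune them for your hardware.
SCRYPT_PARAMS = {
    "n": 2**14,
    "r": 8,
    "p": 1,
    "maxmem": 64 * 1024 * 1024,
    "dklen": 32,
}


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) for a new password, using a fresh random salt."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, digest)
```

Because each call to `hash_password` draws a new salt, hashing the same password twice yields two different digests, which is what defeats precomputed rainbow tables.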

Huang et al. found that LLM-generated code can reflect social biases related to age, gender, and race. Why does this matter for software that allocates resources or makes decisions about people?

If a model's code generation is sensitive to demographic cues in prompts — producing different variable names, assumptions, or algorithmic choices depending on whether a persona is described as young or old, male or female — then software written with that assistance may embed those same biases into systems that allocate jobs, loans, healthcare resources, or legal outcomes. A developer who does not check for this cannot know whether the code they shipped reflects their intent or the model's training distribution.

Why do Cappendijk et al. argue that energy consumption must be included in productivity studies of AI coding tools?

Running a large language model at inference requires substantial electricity. If a study reports that a tool saves developers twenty minutes per task but does not account for the energy cost of the queries that produced those suggestions, the reported productivity gain is incomplete. In aggregate — across millions of developers making millions of queries — the energy cost becomes large enough to affect whether the tool is a net benefit. Excluding it systematically overstates the gain, in the same way that excluding maintenance costs overstates the benefit of any engineering shortcut.
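The aggregation argument above is simple arithmetic, and a small sketch can make the required inputs explicit. The two functions below are hypothetical helpers, not taken from Cappendijk et al. Every argument (number of developers, queries per developer, watt-hours per query, hours saved) is something a study would have to measure, and no default values are supplied because the lesson does not give any.

```python
def aggregate_energy_kwh(
    developers: int, queries_per_developer: float, wh_per_query: float
) -> float:
    """Total inference energy in kWh across a population of developers."""
    return developers * queries_per_developer * wh_per_query / 1000.0


def energy_per_hour_saved_kwh(
    total_kwh: float, developers: int, hours_saved_per_developer: float
) -> float:
    """Energy spent (kWh) for each developer-hour the tool is claimed to save."""
    total_hours = developers * hours_saved_per_developer
    if total_hours <= 0:
        raise ValueError("total hours saved must be positive")
    return total_kwh / total_hours
```

Reporting the second quantity next to a time-saved figure is one way to keep the energy cost visible instead of leaving it out of the productivity claim.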

Exercises

Study Design

You have been asked to run a study at your company to determine whether Copilot increases developer productivity. You have six months and can recruit twenty developers. Sketch the study design, including what you will measure, what you will control for, what your comparison condition is, and three threats to validity you are most worried about. Then identify which of Fucci's methodological choices — multi-site recruitment, blind analysis, and pre-specified outcomes — you would borrow and explain why each one you select addresses a specific threat you identified.

Testing for Demographic Bias

Huang et al. found that LLM-generated code can reflect social biases. Design a minimal experiment to test whether Copilot generates code of different quality or uses different naming conventions depending on whether the prompt uses a stereotypically male versus female name for the developer persona. Write three sentences: one stating your hypothesis, one explaining how you will operationalize "code quality," and one identifying a threat to validity in your design. Write your test prompts as blockquotes so they are clearly separated from your analysis:

> Write a function that validates an email address. The function will be written by [NAME].
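One way to keep the prompts identical apart from the name is to generate them from the template rather than typing them by hand. The sketch below is only an illustration: `build_prompts` is a hypothetical helper, the group labels and names are placeholders you would replace with your own lists, and the number of repetitions is an arbitrary assumption.

```python
TEMPLATE = (
    "Write a function that validates an email address. "
    "The function will be written by [NAME]."
)

# Placeholder names; substitute the names chosen for your own experiment.
NAMES_BY_GROUP = {
    "group_a": ["NAME_A1", "NAME_A2"],
    "group_b": ["NAME_B1", "NAME_B2"],
}


def build_prompts(template, names_by_group, repeats=3):
    """Return one record per (group, name, repetition) with the name filled in."""
    prompts = []
    for group, names in names_by_group.items():
        for name in names:
            for rep in range(repeats):
                prompts.append(
                    {
                        "group": group,
                        "name": name,
                        "rep": rep,
                        "prompt": template.replace("[NAME]", name),
                    }
                )
    return prompts
```

Repeating each prompt several times matters because the tool's output is not deterministic, so a single response per name cannot separate a name effect from ordinary run-to-run variation.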

Adding Energy Costs

Cappendijk et al. argue that energy consumption of LLM-generated code should be included in productivity studies. Identify one existing Copilot productivity study discussed in this lesson. Write two sentences explaining what data you would need to collect in order to add an energy cost estimate to that study's reported productivity gains, and note one reason this data would be difficult to obtain in practice.

Benchmark Design

Nguyen and Nadi found that Copilot's correctness rate varies by problem difficulty. If you were designing a benchmark to evaluate a new code-generation tool fairly, what three categories of tasks would you include and why? Write one sentence per category, explaining what that category tests that the others do not.

Test Adequacy Rubric

El Haji et al. found that Copilot-generated tests lack meaningful assertions. Write a short rubric consisting of four criteria, one sentence each, for manually evaluating whether a unit test is adequate. Apply your rubric to two of the snippets you classified in the in-class exercise and report your scores. Then write one sentence explaining whether your rubric is reliable enough to use in a research study without first checking inter-rater reliability with a second coder.
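Inter-rater reliability for a rubric like this is commonly summarized with Cohen's kappa, which corrects the raw agreement between two coders for the agreement expected by chance. The function below is a minimal sketch of that statistic. The name `cohens_kappa` and the choice to return 1.0 when both coders use a single identical label throughout are assumptions of this example, not something the lesson prescribes.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Assumed convention: kappa is undefined here, so report full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

To use it, have two coders apply the rubric independently to the same tests, pass their two lists of scores in the same item order, and compute kappa separately for each rubric criterion.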