Introduction

Goals

What Is Data Science?

What is data science, and how is it different from statistics or spreadsheet work?

What kind of work will this course prepare you to do?

What LLMs Do

What does an LLM do when I ask it to write Python code?

Why does it matter that LLMs do not actually understand data?

The Costs of LLMs

What are the environmental and labor costs of using LLM tools?

What This Course Does Not Do

What should I not use an LLM for in a research context?

Check Understanding

A classmate says LLMs "understand" data because they can write correct code. What is wrong with this claim, and what evidence would help you explain the difference?

LLMs predict plausible text based on patterns in their training data. They produce code that looks correct because they have seen many similar examples, not because they understand what the code does. To demonstrate this, ask an LLM a question about a dataset and include a false column name in the prompt. The LLM will usually use that column name without questioning it, because it sees only your prompt and not the data itself.
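
One way to catch a false column name before running generated code is to compare the names the code relies on against the real file header. The following is a minimal sketch using Python's standard csv module; the sample data, the column names, and the helper missing_columns are hypothetical and only illustrate the check.

```python
import csv
import io

# Hypothetical sample data standing in for a real file.
SAMPLE_CSV = """station,year,temp_c
A,2020,11.5
B,2020,14.2
"""


def missing_columns(csv_text, expected):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    actual = reader.fieldnames or []
    return [name for name in expected if name not in actual]


# "temperature" plays the role of the false name included in the prompt.
missing = missing_columns(SAMPLE_CSV, ["station", "temperature"])
print(missing)  # ['temperature']
```

An empty list means every name the generated code depends on exists in the file. It does not show that the code uses those columns correctly.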

You ask an LLM to compute the average temperature in a dataset and it returns 2847.3. You expected a number between -30 and 45. List two things you would check before running the code again.

First, check whether the LLM used the correct column: it may have computed the average of a year or station identifier instead of a temperature. Second, check the units: some temperature datasets store values in hundredths of a degree Celsius (so 2847 means 28.47), and others use Kelvin or Fahrenheit instead of Celsius. Ask the LLM "what units did you assume for the temperature column?" and confirm its answer against the dataset's documentation before trusting any number.
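
The range check described above can be written down so that it runs every time. This is a sketch under stated assumptions: the readings are invented, the bounds of -30 and 45 come from the question, and the helper names out_of_range and mean are hypothetical.

```python
def out_of_range(values, low=-30.0, high=45.0):
    """Return the values that fall outside the expected range."""
    return [v for v in values if not (low <= v <= high)]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical readings stored in hundredths of a degree Celsius.
raw = [2847, 2790, 2905]
print(round(mean(raw), 1))  # 2847.3
print(out_of_range(raw))    # every value is flagged as implausible

# Assumption: the dataset documentation confirms the hundredths encoding.
converted = [v / 100 for v in raw]
print(out_of_range(converted))  # []
print(round(mean(converted), 2))
```

The conversion is justified only if the dataset's documentation confirms the encoding. Dividing by 100 because the result then looks plausible would repeat the LLM's mistake of guessing.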

Why is saving an analysis as a Python script more reproducible than saving it as a spreadsheet?

A Python script records every step: which file was read, which rows were filtered, which formula was applied, and in which order. Anyone with the same input file can re-run the script and get the same result. A spreadsheet mixes data, formulas, and results in the same cells, making it easy to overwrite a value without any record of the change, and hard to verify that the visible numbers follow from the raw data.
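
As an illustration of what "records every step" means, here is a short sketch of such a script. The inline data, the column names, and the choice of year are hypothetical; a real script would read from a file rather than from a string.

```python
import csv
import io

# Step 1: read the raw data. The inline text is a hypothetical stand-in
# for a file so that the sketch runs as written.
RAW = """station,year,temp_c
A,2019,10.0
A,2020,12.0
B,2020,14.0
"""
rows = list(csv.DictReader(io.StringIO(RAW)))

# Step 2: filter the rows. The rule is visible and can be reviewed.
rows_2020 = [r for r in rows if r["year"] == "2020"]

# Step 3: apply the formula (arithmetic mean of the remaining temperatures).
temps = [float(r["temp_c"]) for r in rows_2020]
mean_2020 = sum(temps) / len(temps)

print(mean_2020)  # 13.0
```

The raw data is never modified: each step creates a new variable, so a reader can inspect any intermediate result and see how the final number was reached.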

Name one situation where using an LLM to write code is worth its environmental cost, and one where it probably is not.

Worth the cost: automating a well-understood, repetitive task such as reading fifty CSV files in the same format and combining them into one table. Probably not worth it: asking an LLM to interpret what your results mean or to decide which statistical test suits your research design, because those are exactly the judgments you need to practice to become a competent researcher.
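
The file-combining task mentioned above might look like the following sketch, which uses only the Python standard library. The function name combine_csv_files, the added source_file column, and the three small demonstration files are assumptions made for illustration, not part of the course materials.

```python
import csv
import tempfile
from pathlib import Path


def combine_csv_files(folder):
    """Read every .csv file in folder and return all rows as one list of dicts."""
    combined = []
    header = None
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            if header is None:
                header = reader.fieldnames
            elif reader.fieldnames != header:
                # Stop instead of silently merging files with different columns.
                raise ValueError(f"{path.name} has a different header")
            for row in reader:
                row["source_file"] = path.name  # record where each row came from
                combined.append(row)
    return combined


# Demonstration with three small hypothetical files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for i in range(3):
        text = f"station,temp_c\nS{i},{10 + i}.0\n"
        Path(folder, f"site_{i}.csv").write_text(text)
    table = combine_csv_files(folder)

print(len(table))  # 3
```

Even for a routine task like this, the header comparison and the source_file column are the kind of checks worth asking for explicitly, since they make a bad merge visible instead of silent.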

Exercises

Map the Workflow

Draw a flowchart of the prompt-check-run-interpret loop that this course uses in every session. Label each step with one question you should ask yourself before moving to the next step.

Find the Deskilling Risk

Think of one task you already use an LLM for regularly, such as searching for references, summarizing papers, or writing emails. Write a paragraph describing how you would do that task without the LLM for one week, and what you expect you would notice.

Read a Methods Section

Find a published paper in your field that includes a "data analysis" or "methods" section. Write down every software tool, cleaning step, or statistical procedure the authors mention. Identify one step where an LLM could have helped, and one where relying on it uncritically could have introduced an error.