Dataframes in Gleam

A dataframe is a table of named, typed columns that all share the same number of rows.
Column-oriented storage makes operations on a single column fast, while row-oriented access requires zipping all columns together.
A custom type with one variant per supported element type (IntCol, StrCol) forces every access to pattern-match on the variant, so the compiler rejects code that ignores a column's type.
make validates that all columns have the same length at construction time, so downstream functions can assume a consistent shape.
Row filtering uses a Boolean mask derived from one column applied uniformly to every column in the dataframe.

What Is a Dataframe?

A dataframe is the core abstraction in pandas, Polars, and R: a rectangular table where every column has a name and a uniform type
Python programmers reach for pandas.DataFrame; this lesson builds a minimal version in Gleam to show how dataframes work
The key design choices:
- Store columns, not rows: a Dict(String, Column) where each column holds all its values in a list
- Record the row count once in the dataframe record, rather than recomputing list.length on every operation
- Validate shape at construction time so all other functions can trust it

Column Types

pub type Column {
  IntCol(List(Int))
  StrCol(List(String))
}

IntCol(List(Int)) and StrCol(List(String)) are the two supported types
Adding a FloatCol or BoolCol variant later means adding one branch to every case expression that pattern-matches on Column
The compiler will flag every case expression that fails to handle the new variant, turning a potential runtime bug into a compile error
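
The exhaustiveness check can be seen in a small sketch. type_name is a hypothetical helper used only for illustration; it is not part of the lesson's API.

```gleam
pub type Column {
  IntCol(List(Int))
  StrCol(List(String))
}

// Hypothetical helper: names the element type held by a column.
pub fn type_name(col: Column) -> String {
  case col {
    IntCol(_) -> "Int"
    StrCol(_) -> "String"
    // If a FloatCol(List(Float)) variant were added to Column, this case
    // expression would stop compiling until a FloatCol(_) branch is added.
  }
}
```

Every function in this lesson that matches on Column behaves the same way, so the compiler produces a complete to-do list when a variant is added.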

Building a Dataframe

pub type Dataframe {
  Dataframe(cols: dict.Dict(String, Column), nrows: Int)
}

pub fn make(pairs: List(#(String, Column))) -> Result(Dataframe, String) {
  case pairs {
    [] -> Ok(Dataframe(cols: dict.new(), nrows: 0))
    [#(_, first), ..] -> {
      let n = col_length(first)
      let bad =
        list.find(pairs, fn(p) {
          let #(_, col) = p
          col_length(col) != n
        })
      case bad {
        Ok(#(name, _)) -> Error("column '" <> name <> "' has wrong length")
        Error(_) -> Ok(Dataframe(cols: dict.from_list(pairs), nrows: n))
      }
    }
  }
}

make takes a list of (name, column) pairs and returns a Result
The first pair establishes the expected row count n
list.find checks whether any column has the wrong length: Ok(#(name, _)) means a bad column was found; Error(_) means all are fine
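dict.from_list converts the validated pairs into the column dictionary; if the same name appears twice, only the last pair with that name is kept, and make does not report this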
Returning Result here means callers must handle the bad-shape case rather than discovering it later as a silent bug
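
make calls col_length, which is not listed in this lesson. A minimal sketch, assuming it only needs to report the length of the wrapped list and lives in the same module as Column:

```gleam
import gleam/list

// Assumed helper (not shown in the lesson): the number of rows held by a
// column, whatever its element type.
fn col_length(col: Column) -> Int {
  case col {
    IntCol(xs) -> list.length(xs)
    StrCol(xs) -> list.length(xs)
  }
}
```

Each call walks one list, so make is linear in the total number of values. That cost is paid once, at construction time.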

Accessing Columns

pub fn nrows(df: Dataframe) -> Int {
  df.nrows
}

pub fn ncols(df: Dataframe) -> Int {
  dict.size(df.cols)
}

pub fn int_col(df: Dataframe, name: String) -> Result(List(Int), String) {
  case dict.get(df.cols, name) {
    Error(_) -> Error("no column '" <> name <> "'")
    Ok(StrCol(_)) -> Error("column '" <> name <> "' is not integer")
    Ok(IntCol(xs)) -> Ok(xs)
  }
}

pub fn str_col(df: Dataframe, name: String) -> Result(List(String), String) {
  case dict.get(df.cols, name) {
    Error(_) -> Error("no column '" <> name <> "'")
    Ok(IntCol(_)) -> Error("column '" <> name <> "' is not string")
    Ok(StrCol(xs)) -> Ok(xs)
  }
}

int_col returns Error for two distinct reasons: the column does not exist, or it exists but does not hold integers (with the current Column type, that means it holds strings)
The StrCol(_) branch is what reports the type mismatch; the order of the StrCol and IntCol branches does not matter because the two patterns cannot overlap
nrows and ncols are O(1): row count is stored directly, and dict.size is a constant-time operation
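
The three possible outcomes of int_col can be seen side by side in a short sketch. int_col_demo is a hypothetical function that assumes it sits in the same module as make and int_col; the comments give the value each call returns.

```gleam
import gleam/io
import gleam/string

pub fn int_col_demo() {
  let assert Ok(df) =
    make([#("age", IntCol([30, 25])), #("name", StrCol(["Alice", "Bob"]))])

  // The column exists and holds integers: Ok([30, 25])
  io.println(string.inspect(int_col(df, "age")))
  // The column exists but holds strings: Error("column 'name' is not integer")
  io.println(string.inspect(int_col(df, "name")))
  // No column has this name: Error("no column 'height'")
  io.println(string.inspect(int_col(df, "height")))
}
```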

Selecting a Subset of Columns

pub fn select(df: Dataframe, names: List(String)) -> Result(Dataframe, String) {
  list.fold(names, Ok([]), fn(acc_result, name) {
    case acc_result {
      Error(_) -> acc_result
      Ok(acc) ->
        case dict.get(df.cols, name) {
          Error(_) -> Error("no column '" <> name <> "'")
          Ok(col) -> Ok([#(name, col), ..acc])
        }
    }
  })
  |> result.map(fn(pairs) {
    Dataframe(cols: dict.from_list(list.reverse(pairs)), nrows: df.nrows)
  })
}

select folds over the requested names and builds a new pair list, short-circuiting on the first missing name
The accumulator starts as Ok([]) and stays Error(msg) once one name fails
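list.reverse restores the requested order in the pair list after the fold has prepended each pair; because Dict does not preserve insertion order, this has no visible effect on the resulting dataframe and is kept only for readability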
The resulting dataframe keeps the same nrows as the original
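
A usage sketch for select, assuming it lives in the same module as the functions above. select_demo is a hypothetical name.

```gleam
import gleam/int
import gleam/io

pub fn select_demo() {
  let assert Ok(df) =
    make([
      #("name", StrCol(["Alice", "Bob", "Carol"])),
      #("age", IntCol([30, 25, 35])),
      #("score", IntCol([88, 92, 79])),
    ])

  // Keep two of the three columns; the row count is carried over unchanged.
  let assert Ok(small) = select(df, ["name", "score"])
  io.println("ncols=" <> int.to_string(ncols(small)))
  io.println("nrows=" <> int.to_string(nrows(small)))

  // One missing name makes the whole selection fail.
  case select(df, ["name", "height"]) {
    Ok(_) -> io.println("unexpected success")
    Error(msg) -> io.println(msg)
  }
}
```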

Aggregation and Filtering

pub fn col_sum(df: Dataframe, name: String) -> Result(Int, String) {
  int_col(df, name)
  |> result.map(fn(xs) { list.fold(xs, 0, fn(acc, x) { acc + x }) })
}

pub fn filter_rows(
  df: Dataframe,
  name: String,
  pred: fn(Int) -> Bool,
) -> Result(Dataframe, String) {
  use xs <- result.try(int_col(df, name))
  let mask = list.map(xs, pred)
  let new_cols =
    dict.to_list(df.cols)
    |> list.map(fn(pair) {
      let #(n, col) = pair
      #(n, keep_by_mask(col, mask))
    })
    |> dict.from_list
  let new_nrows = list.length(list.filter(mask, fn(b) { b }))
  Ok(Dataframe(cols: new_cols, nrows: new_nrows))
}

fn keep_by_mask(col: Column, mask: List(Bool)) -> Column {
  case col {
    IntCol(xs) -> IntCol(keep_where(xs, mask))
    StrCol(xs) -> StrCol(keep_where(xs, mask))
  }
}

fn keep_where(values: List(a), mask: List(Bool)) -> List(a) {
  list.zip(values, mask)
  |> list.fold([], fn(acc, pair) {
    case pair {
      #(v, True) -> [v, ..acc]
      _ -> acc
    }
  })
  |> list.reverse
}

col_sum uses result.map to apply list.fold inside the Ok branch without unwrapping manually
filter_rows builds a Boolean mask by applying the predicate to the named column, then passes that mask to every column through keep_by_mask
keep_where zips values with the mask and keeps only the True entries, using list.reverse to restore the original order
Every column is filtered by the same mask, so rows stay aligned
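Conceptually, the mask partitions the rows into two groups (True and False) and keeps only the True group; this loosely resembles grouping by key in MapReduce, and it is the same idea as Boolean indexing in pandas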
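
The mask idea can be tried in isolation on two plain lists. This sketch is self-contained and uses list.filter_map in place of the lesson's fold-and-reverse; mask_demo is a hypothetical name.

```gleam
import gleam/list

pub fn mask_demo() -> List(String) {
  let ages = [30, 25, 35]
  let names = ["Alice", "Bob", "Carol"]

  // Step 1: derive the mask from one column.
  // Here mask is [True, False, True].
  let mask = list.map(ages, fn(age) { age >= 30 })

  // Step 2: apply the same mask to another column, keeping the values
  // whose mask entry is True. The result is ["Alice", "Carol"].
  list.zip(names, mask)
  |> list.filter_map(fn(pair) {
    case pair {
      #(name, True) -> Ok(name)
      _ -> Error(Nil)
    }
  })
}
```

list.filter_map keeps the original order, so no list.reverse is needed in this variant.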

Running the Example

import gleam/int
import gleam/io
import gleam/string

pub fn main() {
  // Three columns with three rows each.
  let data = [
    #("name", StrCol(["Alice", "Bob", "Carol"])),
    #("age", IntCol([30, 25, 35])),
    #("score", IntCol([88, 92, 79])),
  ]
  case make(data) {
    Error(msg) -> io.println("error: " <> msg)
    Ok(df) -> {
      io.println("nrows=" <> int.to_string(nrows(df)))
      io.println("ncols=" <> int.to_string(ncols(df)))
      io.println("total score=" <> string.inspect(col_sum(df, "score")))
      // Keep only the rows whose age is at least 30.
      case filter_rows(df, "age", fn(age) { age >= 30 }) {
        Error(msg) -> io.println("filter error: " <> msg)
        Ok(seniors) -> {
          io.println("age >= 30: " <> int.to_string(nrows(seniors)) <> " rows")
          io.println("names: " <> string.inspect(str_col(seniors, "name")))
        }
      }
    }
  }

  // Columns of different lengths are rejected at construction time.
  io.println(
    "bad lengths: "
    <> string.inspect(
      make([
        #("x", IntCol([1, 2, 3])),
        #("y", IntCol([4, 5])),
      ]),
    ),
  )
}

The output shows that age >= 30 keeps Alice and Carol but not Bob
str_col confirms that the name column was filtered by the same mask as age
The final make call returns Error for mismatched column lengths
filter_rows takes a predicate of type fn(Int) -> Bool and reads the named column with int_col, so it cannot filter on a string column; that needs a similar function that calls str_col and takes a fn(String) -> Bool predicate
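
One way such a string-column filter might look. filter_rows_str is a hypothetical name, and the sketch assumes it is placed in the same module so that it can reuse the private keep_by_mask helper.

```gleam
import gleam/dict
import gleam/list
import gleam/result

// Hypothetical companion to filter_rows for string columns.
pub fn filter_rows_str(
  df: Dataframe,
  name: String,
  pred: fn(String) -> Bool,
) -> Result(Dataframe, String) {
  // Fails early if the column is missing or does not hold strings.
  use xs <- result.try(str_col(df, name))
  let mask = list.map(xs, pred)
  // Apply the same mask to every column so rows stay aligned.
  let new_cols =
    dict.map_values(df.cols, fn(_name, col) { keep_by_mask(col, mask) })
  let new_nrows = list.length(list.filter(mask, fn(keep) { keep }))
  Ok(Dataframe(cols: new_cols, nrows: new_nrows))
}
```

Only the column accessor and the predicate type differ from filter_rows; the masking step is identical.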
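Because filter_rows and select both return Result(Dataframe, String), dataframe operations can be chained with result.try, either through |> or in use expressions; taking the dataframe as the first argument is what keeps the |> form convenient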
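
Both chaining styles are sketched below, assuming the functions live in the same module as the lesson's code. senior_score_total and senior_score_total_piped are hypothetical names.

```gleam
import gleam/result

// use-expression style: each step runs only if the previous one returned Ok.
pub fn senior_score_total(df: Dataframe) -> Result(Int, String) {
  use seniors <- result.try(filter_rows(df, "age", fn(age) { age >= 30 }))
  use slim <- result.try(select(seniors, ["name", "score"]))
  col_sum(slim, "score")
}

// Pipe style: the same steps, using function captures to place the
// dataframe in the first argument position.
pub fn senior_score_total_piped(df: Dataframe) -> Result(Int, String) {
  filter_rows(df, "age", fn(age) { age >= 30 })
  |> result.try(select(_, ["name", "score"]))
  |> result.try(col_sum(_, "score"))
}
```

With the example data from this section, both functions sum the scores of Alice and Carol (88 + 79). The first Error produced by any step is returned unchanged.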

Testing

pub fn make_valid_test() {
  make([#("x", IntCol([1, 2, 3])), #("y", StrCol(["a", "b", "c"]))])
  |> should.be_ok()
}

pub fn make_length_mismatch_test() {
  make([#("x", IntCol([1, 2, 3])), #("y", IntCol([4, 5]))])
  |> should.be_error()
}

pub fn make_empty_test() {
  make([])
  |> should.be_ok()
}

pub fn nrows_test() {
  let df = make([#("x", IntCol([1, 2, 3]))]) |> should.be_ok()
  nrows(df) |> should.equal(3)
}

pub fn ncols_test() {
  let df = make([#("a", IntCol([1])), #("b", StrCol(["x"]))]) |> should.be_ok()
  ncols(df) |> should.equal(2)
}

pub fn int_col_exists_test() {
  let df = make([#("n", IntCol([10, 20]))]) |> should.be_ok()
  int_col(df, "n") |> should.equal(Ok([10, 20]))
}

pub fn int_col_missing_test() {
  let df = make([#("n", IntCol([1]))]) |> should.be_ok()
  int_col(df, "z") |> should.be_error()
  Nil
}

pub fn col_sum_test() {
  let df = make([#("v", IntCol([1, 2, 3, 4]))]) |> should.be_ok()
  col_sum(df, "v") |> should.equal(Ok(10))
}

make_valid_test and make_length_mismatch_test cover the two construction paths
filter_rows_test checks both the row count and the string column values, catching bugs where the mask is applied to only one column
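
filter_rows_test is described above but not listed. One possible version, written in the same style as the other tests and assuming the same gleeunit/should import:

```gleam
import gleeunit/should

pub fn filter_rows_test() {
  let df =
    make([
      #("name", StrCol(["Alice", "Bob", "Carol"])),
      #("age", IntCol([30, 25, 35])),
    ])
    |> should.be_ok()
  let kept = filter_rows(df, "age", fn(age) { age >= 30 }) |> should.be_ok()
  // The stored row count must match the number of True mask entries.
  nrows(kept) |> should.equal(2)
  // The string column must have been filtered by the same mask.
  str_col(kept, "name") |> should.equal(Ok(["Alice", "Carol"]))
}
```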

Check Understanding

Why does make store nrows in the Dataframe record rather than computing it from a column each time it is needed?

Accessing the length of a list is O(n) in Gleam because lists are singly-linked: every call to list.length walks the whole list. Storing the row count once makes every subsequent nrows call O(1). Functions that change the number of rows, such as filter_rows, compute the new count once (from the mask) and store it in the new record. The trade-off is that nrows must be updated correctly in every function that changes the shape of the dataframe.

What happens if you call filter_rows with a column name that holds strings?

filter_rows calls int_col(df, name) first. int_col pattern-matches on the column variant: if the named column is StrCol(_), it returns Error("column '...' is not integer"). filter_rows uses result.try, so it propagates that error immediately without ever applying the predicate or building a mask. The caller gets an Error and no filtering is performed.

Exercises

Float column (15 minutes)

Add FloatCol(List(Float)) to the Column type. Add float_col(df, name) -> Result(List(Float), String) and col_mean(df, name) -> Result(Float, String) that computes the column mean. Update col_length (used by make), int_col, str_col, keep_by_mask, and any other functions that pattern-match on Column. Write at least three tests.

Group by (20 minutes)

Write group_by(df: Dataframe, name: String) -> Result(Dict(String, Dataframe), String) that partitions rows by the distinct string values in the named column. Each key in the result is one distinct string value; the associated dataframe contains only the rows where that column has that value. Because filter_rows only accepts a predicate on an integer column, build this on a string-column variant of filter_rows (one that calls str_col) rather than on filter_rows itself. Test with at least two distinct groups.

Add column (10 minutes)

Write add_col(df: Dataframe, name: String, col: Column) -> Result(Dataframe, String) that returns a new dataframe with the given column appended. Return Error if the column length does not match nrows(df) or if a column with that name already exists. Write three tests: one success, one length mismatch, one duplicate name.

Row at index (15 minutes)

Write row(df: Dataframe, idx: Int) -> Result(Dict(String, String), String) that returns all column values for a given row index as a dict mapping column name to its string representation (use int.to_string for integer columns). Return Error if idx is negative or out of range. Test with a valid index, a negative index, and an index equal to nrows.