Spatial Clustering of Archaeological Sites

The Problem

Why does a grid-cell density approach label some sites as noise (label -1)?

Because noise sites are placed outside the survey region.
Wrong: all sites in the synthetic data lie within the region bounds; noise sites land in cells just like cluster sites do.
Because any site whose cell has fewer than HOT_THRESHOLD sites is treated as
background, not part of any cluster.
Correct: a cell below the threshold is not hot, so sites in that cell receive label -1 regardless of how close they are to a cluster cell.
Because the DFS algorithm cannot reach sites at the edge of the grid.
Wrong: the DFS visits all hot cells, including those on the grid boundary; edge position alone does not determine whether a cell is hot.
Because cluster sites always outnumber noise sites.
Wrong: in the synthetic data there are 20 noise points and 60 cluster sites, but the labelling depends on cell density, not overall counts.

Dividing the Region into a Grid

$$\text{cell} = \left\lfloor \frac{v - \text{region_min}}{w} \right\rfloor$$


A survey region runs from 0 to 100 units. GRID_SIZE is 20, so each cell is 5 units wide. A site at easting 27.3 maps to which column index?

Identifying Concentration Cells


What happens to the number of detected clusters when GRID_SIZE is halved (from 20 to 10)?

The number of clusters always increases, because there are more cells to label.
Wrong: halving GRID_SIZE reduces the number of cells, not increases them; each coarser cell aggregates more sites.
Each cell covers four times the area and accumulates more sites, so cells are
more likely to meet HOT_THRESHOLD and previously separate clusters may merge.
Correct: coarser cells combine sites from a wider area; two formerly distinct concentration zones may land in the same or adjacent hot cells and merge into one connected component.
The number of clusters is unaffected because HOT_THRESHOLD does not change.
Wrong: cell size and threshold interact; the same threshold applied to cells of different sizes produces different hot-cell patterns.
Each cluster splits into more fragments because the cells are larger.
Wrong: larger cells aggregate more sites; the tendency is to merge clusters, not split them.

Finding Clusters with Depth-First Search


Put these steps of the DFS connected-component labelling in the correct order.

  1. Scan the grid row by row; find the first hot cell with label -1
  2. Assign the current cluster_id to that cell and push it onto the stack
  3. Pop a cell from the stack; iterate over its 8 neighbors
  4. For each neighbor that is hot and unlabelled, assign cluster_id and push
  5. When the stack empties, increment cluster_id and scan for the next unlabelled hot cell

What happens when HOT_THRESHOLD is raised from 2 to 5?

Every cell qualifies as hot, so the entire grid becomes one large cluster.
Wrong: raising the threshold makes it harder for cells to qualify; fewer cells are hot, not more.
Cells need more sites to qualify, so some former cluster cells become background
and clusters may shrink, split, or disappear.
Correct: cells that previously had 2-4 sites are no longer hot; if those cells were part of a cluster's boundary or interior, the component may fragment into smaller pieces or fall below connectivity.
The cluster boundaries become smoother because the threshold filters outlier cells.
Wrong: raising the threshold removes cells from clusters; it can fragment boundaries rather than smooth them.
The number of noise sites decreases because more sites fall into hot cells.
Wrong: raising the threshold removes cells from the hot set, so more sites end up in background cells and are labelled as noise.

Assigning Sites to Clusters


Site A is at easting 34.0, northing 24.0. Its cell has a count of 1 and HOT_THRESHOLD is 2. What label does site A receive?

The label of the nearest hot cell's cluster.
Wrong: assign_labels uses the cell containing the site, not the nearest hot cell; proximity to a hot cell does not transfer the label.
-1, because its cell does not meet the density threshold.
Correct: the cell count (1) is below HOT_THRESHOLD (2), so the cell is not hot and every site in it is labelled -1.
0, because it is the first site encountered in the scan.
Wrong: label 0 is the first cluster label assigned by find_clusters, not a default for unlabelled sites; unlabelled sites receive -1.
The label of the cluster whose centroid is closest to the site.
Wrong: the grid method assigns labels by cell membership, not by distance to cluster centroids.

Cluster Centroids


Four clusters are planted in the synthetic data. How many entries does the dict returned by cluster_centroids contain, assuming all four are recovered?

Testing

Grid shape and total count

Hot-cell mask

Two disconnected blocks become two clusters

At least two clusters on synthetic data

One centroid per label

import numpy as np

from generate_archcluster import (
    make_sites,
    REGION_MIN,
    REGION_MAX,
)
from archcluster import (
    build_grid,
    find_hot_cells,
    find_clusters,
    assign_labels,
    cluster_centroids,
    GRID_SIZE,
    HOT_THRESHOLD,
)


def test_build_grid_shape():
    # Grid must have the requested number of rows and columns.
    coords = np.array([[10.0, 20.0], [50.0, 50.0], [90.0, 80.0]])
    grid = build_grid(coords, REGION_MIN, REGION_MAX, GRID_SIZE)
    assert grid.shape == (GRID_SIZE, GRID_SIZE)


def test_build_grid_sum():
    # Every site must land in exactly one cell, so the grid sum equals site count.
    df = make_sites()
    coords = df.select(["easting", "northing"]).to_numpy()
    grid = build_grid(coords, REGION_MIN, REGION_MAX, GRID_SIZE)
    assert grid.sum() == len(coords)


def test_find_hot_cells_threshold():
    # Cells with count >= threshold must be True; cells below must be False.
    grid = np.array([[0, 1, 2], [3, 4, 0], [1, 2, 5]])
    hot = find_hot_cells(grid, threshold=2)
    expected = grid >= 2
    assert np.array_equal(hot, expected)


def test_find_clusters_two_disconnected_blocks():
    # Two separated 1-cell hot regions must receive distinct non-negative IDs.
    hot = np.zeros((3, 3), dtype=bool)
    hot[0, 0] = True   # top-left block
    hot[2, 2] = True   # bottom-right block (not 8-connected to top-left)
    labels = find_clusters(hot)
    lbl_a = labels[0, 0]
    lbl_b = labels[2, 2]
    assert lbl_a >= 0
    assert lbl_b >= 0
    assert lbl_a != lbl_b
    # Cold cells stay -1.
    assert labels[0, 1] == -1
    assert labels[1, 0] == -1


def test_find_clusters_8connectivity():
    # Diagonal neighbors must be merged into one cluster.
    hot = np.zeros((3, 3), dtype=bool)
    hot[0, 0] = True
    hot[1, 1] = True   # diagonal from (0,0)
    labels = find_clusters(hot)
    assert labels[0, 0] == labels[1, 1]
    assert labels[0, 0] >= 0


def test_assign_labels_produces_two_or_more_clusters():
    # On the standard synthetic data the pipeline must find at least two clusters.
    df = make_sites()
    coords = df.select(["easting", "northing"]).to_numpy()
    grid = build_grid(coords, REGION_MIN, REGION_MAX, GRID_SIZE)
    hot = find_hot_cells(grid, HOT_THRESHOLD)
    cell_labels = find_clusters(hot)
    labels = assign_labels(coords, cell_labels, REGION_MIN, REGION_MAX, GRID_SIZE)
    n_clusters = len(set(labels[labels >= 0]))
    assert n_clusters >= 2


def test_cluster_centroids_one_per_label():
    # cluster_centroids must return exactly one entry for every unique non-negative label.
    coords = np.array([
        [10.0, 10.0],
        [11.0, 10.0],
        [50.0, 50.0],
        [51.0, 50.0],
        [90.0, 90.0],  # noise
    ])
    labels = np.array([0, 0, 1, 1, -1])
    centroids = cluster_centroids(coords, labels)
    assert set(centroids.keys()) == {0, 1}
    cx0, cy0 = centroids[0]
    assert abs(cx0 - 10.5) < 1e-9
    assert abs(cy0 - 10.0) < 1e-9

Grid-cell density clustering key terms

Spatial clustering
The task of partitioning geographic point locations into groups whose members are spatially proximate, without specifying the number of groups in advance
Grid cell
One element of a regular lattice that tiles the survey region; sites are counted per cell to estimate local density
Hot cell
A grid cell whose site count meets or exceeds HOT_THRESHOLD; hot cells are the building blocks of detected clusters
Connected component
A maximal set of hot cells that are all linked to one another through shared edges or corners (8-connectivity); each component is one cluster
Depth-first search
A graph traversal that explores as far as possible along each branch before backtracking, implemented here with a stack to label connected components of hot cells
Cluster centroid
The mean easting and northing of all sites assigned to a cluster; summarizes the cluster's geographic location

DBSCAN handles clusters of irregular shape and density without discretizing space; it requires a more advanced algorithms course.

Exercises

Parameter sensitivity

Run the pipeline on the synthetic data with GRID_SIZE values of 10, 20, and 40 and HOT_THRESHOLD values of 1, 2, and 4 (nine combinations in total). Record the number of clusters and noise sites for each combination. Which settings merge the four planted clusters into fewer groups? Which settings fragment them into many small pieces?

Cluster purity

For each recovered cluster, compute the fraction of its member sites that share the same true_cluster value from the generator. A perfectly recovered cluster has purity 1.0. What is the minimum purity across all recovered clusters at the default parameters?

Boundary sensitivity

Move one planted cluster center 10 units closer to another and regenerate the synthetic data. Do the two clusters merge into one under the default parameters? At what separation distance do they first appear as a single cluster?

Centroid error

For each recovered cluster, compute the Euclidean distance between its centroid and the nearest planted cluster center. Is the error consistent across all four clusters, or does it vary with cluster size?