Mineral Classification by Geochemistry
The Problem
- Geochemists report rock compositions as oxide weight percentages (wt%): the mass of each oxide expressed as a percentage of the total sample mass.
- A classifier that assigns samples to mineral classes from their chemistry automates a task that otherwise requires expert visual inspection of thin sections under a microscope.
- We train a logistic regression model on two features — $\text{SiO}_2$ (silica) and $\text{Al}_2\text{O}_3$ (alumina) — to separate felsic rocks (granite-like, silica-rich) from mafic rocks (basalt-like, silica-poor).
Generating Synthetic Geochemical Data
- Real geochemical datasets are messy; we start with synthetic data drawn from two Gaussian distributions that mimic the composition ranges reported in igneous petrology textbooks.
- Felsic rocks average 70 wt% $\text{SiO}_2$ and 14 wt% $\text{Al}_2\text{O}_3$; mafic rocks average 50 wt% $\text{SiO}_2$ and 9 wt% $\text{Al}_2\text{O}_3$.
- The two distributions overlap — real mineral classification is never perfectly clean.
import numpy as np

# Synthetic geochemical data: two mineral classes distinguished by oxide concentrations.
# Felsic rocks (e.g. granite) are silica-rich; mafic rocks (e.g. basalt) are silica-poor.
# Feature 0: SiO2 concentration (wt%)
# Feature 1: Al2O3 concentration (wt%)
SEED = 7493418 # RNG seed
FELSIC_MEAN = [70.0, 14.0] # granite-like: high silica, moderate alumina
FELSIC_STD = [3.0, 1.5]
MAFIC_MEAN = [50.0, 9.0] # basalt-like: lower silica, lower alumina
MAFIC_STD = [3.0, 1.5]
N_FELSIC = 80 # training + test samples per class
N_MAFIC = 80
TRAIN_FRAC = 0.75 # fraction used for training
def make_mineral_data(n_felsic=N_FELSIC, n_mafic=N_MAFIC, seed=SEED):
"""Return (X, y) for a two-class geochemical dataset.
X is (n_felsic + n_mafic, 2): columns are SiO2 and Al2O3 wt%.
y is (n_felsic + n_mafic,): 0 for felsic, 1 for mafic.
Rows are shuffled so classes are interleaved.
"""
rng = np.random.default_rng(seed)
X_felsic = rng.normal(FELSIC_MEAN, FELSIC_STD, (n_felsic, 2))
X_mafic = rng.normal(MAFIC_MEAN, MAFIC_STD, (n_mafic, 2))
X = np.vstack([X_felsic, X_mafic])
y = np.concatenate([np.zeros(n_felsic), np.ones(n_mafic)])
idx = rng.permutation(len(y))
return X[idx], y[idx]
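The test module later imports a `train_test_split` helper that is not shown here. Below is a minimal sketch consistent with how the tests call it; the exact signature and return order are assumptions inferred from that usage.

```python
import numpy as np

def train_test_split(X, y, train_frac=0.75, seed=0):
    """Shuffle (X, y) together and split into train and test subsets.

    Returns (X_train, y_train, X_test, y_test); the first
    int(n * train_frac) shuffled rows form the training set.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(len(y) * train_frac)
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```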
The Mahalanobis distance between two univariate Gaussians with means $\mu_1$, $\mu_2$ and
equal standard deviation $\sigma$ is $|\mu_1 - \mu_2| / \sigma$.
Using FELSIC_MEAN[0] = 70, MAFIC_MEAN[0] = 50, FELSIC_STD[0] = 3,
what fraction of the $\text{SiO}_2$ distributions overlap?
(Hint: use the 68-95-99.7 rule: the distance is about $6.7\sigma$, so overlap $\approx 0$.)
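The hint can be checked numerically. For two equal-sigma Gaussians separated by $d$ standard deviations, the overlapping area is $2\Phi(-d/2)$, where $\Phi$ is the standard normal CDF. A quick sketch using only the standard library:

```python
import math

mu_felsic, mu_mafic, sigma = 70.0, 50.0, 3.0
d = abs(mu_felsic - mu_mafic) / sigma  # separation in sigma units

def norm_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Overlap area of two equal-sigma Gaussians: 2 * Phi(-d/2).
overlap = 2.0 * norm_cdf(-d / 2.0)
print(f"separation: {d:.2f} sigma, overlap: {overlap:.2e}")
```

The separation comes out to about 6.7 sigma, and the overlapping probability mass is well under 0.1%, which is why near-perfect accuracy is a reasonable expectation.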
The Logistic Regression Model
- The model predicts the probability that a sample is mafic (class 1):
$$p = \sigma(w_0 \cdot x_0 + w_1 \cdot x_1 + b) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + b)}}$$
- $\sigma$ is the sigmoid function, which maps any real number to $(0, 1)$.
- $x_0$ and $x_1$ are the (normalised) $\text{SiO}_2$ and $\text{Al}_2\text{O}_3$ concentrations.
- $w_0$, $w_1$, and $b$ are learned from data.
def sigmoid(z):
"""Map any real number to (0, 1): sigma(z) = 1 / (1 + exp(-z))."""
return 1.0 / (1.0 + np.exp(-z))
Why must features be normalized before training logistic regression with gradient descent?
- Normalisation makes the sigmoid output exactly 0.5 at the start of training.
- Wrong: initial weights are zero, so $z = 0$ and $\sigma(0) = 0.5$ regardless of feature scale.
- Features on very different scales cause gradient descent to take very small steps along one axis and large steps along another, slowing convergence.
- Correct: unnormalized gradients have components proportional to feature magnitude, creating an elongated loss surface that makes convergence slow and erratic.
- Normalization ensures the model cannot overfit by constraining the weight magnitudes.
- Wrong: normalization changes the input scale, not the weight values; regularization is the standard tool for constraining weights.
- The sigmoid function is only defined for inputs in $[0, 1]$.
- Wrong: the sigmoid is defined for all real numbers; the $[0, 1]$ interval is its output range.
def normalize(X, mean=None, std=None):
"""Standardize X to zero mean and unit standard deviation.
If mean and std are supplied (from the training set), apply them to X
without recomputing — this ensures test data is scaled identically.
"""
if mean is None:
mean = X.mean(axis=0)
if std is None:
std = X.std(axis=0)
return (X - mean) / std, mean, std
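The reuse pattern matters in practice. A standalone illustration of why the training statistics must be applied to the test split as well (the numbers below are made up for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal([70.0, 14.0], [3.0, 1.5], (60, 2))
X_test = rng.normal([70.0, 14.0], [3.0, 1.5], (20, 2))

# Fit statistics on the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_n = (X_train - mean) / std
X_test_n = (X_test - mean) / std  # reuse training stats, never refit

# Training columns are exactly standardized; test columns are only
# approximately so, because they were scaled with foreign statistics.
print(X_train_n.mean(axis=0))  # ~[0, 0]
print(X_test_n.mean(axis=0))   # close to, but not exactly, [0, 0]
```

Refitting the statistics on the test set would leak information about its distribution into the evaluation, which is exactly what `test_normalize_applies_training_stats` below guards against.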
Training by Gradient Descent
- At each iteration, the model computes the binary cross-entropy loss:
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
- The gradient of $\mathcal{L}$ with respect to the weights is $\mathbf{X}^\top (\mathbf{p} - \mathbf{y}) / n$, and with respect to the bias is $\text{mean}(\mathbf{p} - \mathbf{y})$.
- Each gradient-descent step subtracts the gradient scaled by the learning rate.
LEARNING_RATE = 0.1 # gradient-descent step size (illustrative default)
N_ITER = 1000 # number of full-batch iterations (illustrative default)
def train(X, y, lr=LEARNING_RATE, n_iter=N_ITER):
"""Fit a logistic regression model by gradient descent.
Parameters
----------
X : (n, p) array of features (should be normalized)
y : (n,) array of labels, 0 or 1
lr : learning rate
n_iter : number of full-batch gradient steps
Returns
-------
w : (p,) weight vector
b : scalar bias
losses : list of binary cross-entropy loss at each iteration
"""
n, p = X.shape
w = np.zeros(p)
b = 0.0
losses = []
for _ in range(n_iter):
z = X @ w + b
p_hat = sigmoid(z)
# Binary cross-entropy; clip to avoid log(0)
p_hat_clipped = np.clip(p_hat, 1e-12, 1 - 1e-12)
loss = -np.mean(y * np.log(p_hat_clipped) + (1 - y) * np.log(1 - p_hat_clipped))
losses.append(loss)
# Gradients
err = p_hat - y
dw = X.T @ err / n
db = err.mean()
w -= lr * dw
b -= lr * db
return w, b, losses
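A finite-difference check is a cheap way to confirm the analytic gradient used above. The sketch below re-creates the loss standalone with synthetic inputs; it uses the same formulas as `train`, and the bias gradient can be checked the same way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(X, y, w, b):
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (rng.random(20) > 0.5).astype(float)
w, b = np.array([0.3, -0.7]), 0.1

# Analytic gradient: X^T (p - y) / n, as used in train().
err = sigmoid(X @ w + b) - y
dw_analytic = X.T @ err / len(y)

# Central finite difference on each weight component.
eps = 1e-6
for j in range(2):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    dw_fd = (bce_loss(X, y, w_plus, b) - bce_loss(X, y, w_minus, b)) / (2 * eps)
    assert abs(dw_fd - dw_analytic[j]) < 1e-6
```

A sign error in the gradient would make these assertions fail immediately, long before any accuracy test could reveal it.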
Put the steps of one gradient-descent iteration in the correct order.
- Compute the linear combination $z = X w + b$
- Apply the sigmoid to get predicted probabilities $\hat{p} = \sigma(z)$
- Evaluate the binary cross-entropy loss
- Compute gradients $\partial \mathcal{L}/\partial w$ and $\partial \mathcal{L}/\partial b$
- Update parameters: $w \leftarrow w - \alpha \cdot \partial \mathcal{L}/\partial w$ and $b \leftarrow b - \alpha \cdot \partial \mathcal{L}/\partial b$
Decision Boundary and Classification
- The decision boundary is the line where $p = 0.5$, i.e., where $w_0 x_0 + w_1 x_1 + b = 0$.
- Rearranging for $x_1$ ($\text{Al}_2\text{O}_3$) as a function of $x_0$ ($\text{SiO}_2$):
$$x_1 = -\frac{w_0 x_0 + b}{w_1}$$
- Because features were normalised during training, the boundary must be mapped back to original units before plotting.
def decision_boundary_line(w, b, mean, std, x_range):
"""Return (x_vals, y_vals) for the decision boundary in original feature space.
The boundary satisfies w[0]*x_norm + w[1]*y_norm + b = 0.
Re-expressing in un-normalized coordinates:
w[0]*(x - mean[0])/std[0] + w[1]*(y - mean[1])/std[1] + b = 0
Solving for y:
y = mean[1] - std[1]/w[1] * (w[0]*(x - mean[0])/std[0] + b)
"""
x_vals = np.linspace(*x_range, 200)
y_vals = mean[1] - (std[1] / w[1]) * (w[0] * (x_vals - mean[0]) / std[0] + b)
return x_vals, y_vals
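A quick consistency check on the algebra: every point returned for the boundary should give a linear score of zero in normalized coordinates, and hence $p = 0.5$. The weights and statistics below are hypothetical values chosen only for the check.

```python
import numpy as np

# Hypothetical fitted parameters and normalization statistics.
w, b = np.array([2.0, -0.5]), 0.3
mean, std = np.array([60.0, 11.5]), np.array([9.0, 2.5])

x_vals = np.linspace(40.0, 80.0, 5)
# Same algebra as decision_boundary_line:
y_vals = mean[1] - (std[1] / w[1]) * (w[0] * (x_vals - mean[0]) / std[0] + b)

# Every (x, y) pair on the line should give a linear score of 0,
# i.e. sigmoid(score) = 0.5.
score = w[0] * (x_vals - mean[0]) / std[0] + w[1] * (y_vals - mean[1]) / std[1] + b
assert np.allclose(score, 0.0)
```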
def predict(X, w, b, threshold=0.5):
"""Return binary class predictions (0 or 1) for each row of X."""
return (sigmoid(X @ w + b) >= threshold).astype(int)
def accuracy(y_true, y_pred):
"""Return fraction of correct predictions."""
return np.mean(y_true == y_pred)
Match each component of the logistic regression model to its role.
| Component | Role |
|---|---|
| $w_0 x_0 + w_1 x_1 + b$ | Separating hyperplane in feature space |
| $\sigma(\cdot)$ | Maps the linear score to a probability |
| $\mathcal{L}$ | Penalises confident wrong predictions more heavily than uncertain ones |
| Decision boundary | The surface where the model is maximally uncertain ($p = 0.5$) |
Testing
Sigmoid known values
- $\sigma(0) = 0.5$ exactly; $\sigma(z) \to 1$ as $z \to +\infty$; $\sigma(z) \to 0$ as $z \to -\infty$. These are identities that must hold regardless of the training data.
Normalization properties
- After normalization, each feature column must have mean 0 and standard deviation 1.
- Applying training statistics to test data must reproduce $(\mathbf{x} - \mu) / \sigma$ without recomputing $\mu$ or $\sigma$ from the test set.
Loss decreases during training
- A correct gradient-descent implementation must reduce the cross-entropy loss over the first 100 iterations (the test compares the final loss to the initial loss). An increase would indicate a sign error in the gradient or an excessively large learning rate.
Accuracy on well-separated data
- The two classes are separated by roughly 6.7 standard deviations in $\text{SiO}_2$. A correctly trained classifier must achieve at least 95% accuracy; a lower score indicates a bug.
Held-out test accuracy
- With default parameters and the default seed, held-out test accuracy must reach at least 95%. The 5% margin is wide enough to absorb small variations between seeds while still catching a failed classifier.
Predict with known weights
- With $w = [10, 0]$ and $b = 0$, the boundary is at $x_0 = 0$. A sample with $x_0 = 1$ must be predicted class 1 and a sample with $x_0 = -1$ must be predicted class 0.
import numpy as np
import pytest
from generate_mineral import make_mineral_data, train_test_split, TRAIN_FRAC
from mineral import sigmoid, normalize, train, predict, accuracy
def test_sigmoid_at_zero():
# sigmoid(0) = 0.5 by definition.
assert sigmoid(0.0) == pytest.approx(0.5)
def test_sigmoid_large_positive():
# sigmoid(z) → 1 as z → +∞; at z=100 the deviation from 1 is negligible.
assert sigmoid(100.0) == pytest.approx(1.0, abs=1e-6)
def test_sigmoid_large_negative():
# sigmoid(z) → 0 as z → -∞.
assert sigmoid(-100.0) == pytest.approx(0.0, abs=1e-6)
def test_normalize_zero_mean_unit_std():
# Normalized training data must have mean ≈ 0 and std ≈ 1 for each feature.
X, _ = make_mineral_data()
X_n, mean, std = normalize(X)
assert np.allclose(X_n.mean(axis=0), 0.0, atol=1e-10)
assert np.allclose(X_n.std(axis=0), 1.0, atol=1e-10)
def test_normalize_applies_training_stats():
# Normalizing test data with training mean/std must not recompute stats.
X, _ = make_mineral_data()
n = len(X)
X_train, X_test = X[: n // 2], X[n // 2 :]
X_n, mean, std = normalize(X_train)
X_test_n, _, _ = normalize(X_test, mean, std)
expected = (X_test - mean) / std
assert np.allclose(X_test_n, expected)
def test_training_loss_decreases():
# The binary cross-entropy loss must decrease over the first 100 iterations.
X, y = make_mineral_data()
X_n, _, _ = normalize(X)
_, _, losses = train(X_n, y, n_iter=100)
assert losses[-1] < losses[0]
def test_perfect_separation():
# Two perfectly separated clusters must achieve 100% training accuracy.
# Felsic centred at (70, 14), mafic at (50, 9): about 6.7 sigma apart in SiO2.
X, y = make_mineral_data()
X_n, _, _ = normalize(X)
w, b, _ = train(X_n, y, n_iter=1000)
y_pred = predict(X_n, w, b)
# The two classes are well-separated; expect ≥ 95% training accuracy.
assert accuracy(y, y_pred) >= 0.95
def test_test_accuracy():
# With default parameters, test accuracy must be ≥ 95%.
# The two classes differ by ~6.7 std devs in SiO2; the boundary should be sharp.
X, y = make_mineral_data()
X_train, y_train, X_test, y_test = train_test_split(X, y, TRAIN_FRAC)
X_train_n, mean, std = normalize(X_train)
X_test_n, _, _ = normalize(X_test, mean, std)
w, b, _ = train(X_train_n, y_train)
y_pred = predict(X_test_n, w, b)
assert accuracy(y_test, y_pred) >= 0.95
def test_accuracy_all_correct():
assert accuracy(np.array([0, 1, 0, 1]), np.array([0, 1, 0, 1])) == pytest.approx(
1.0
)
def test_accuracy_all_wrong():
assert accuracy(np.array([0, 1, 0, 1]), np.array([1, 0, 1, 0])) == pytest.approx(
0.0
)
def test_predict_known_weights():
# With w=[10, 0], b=0: samples with X[:,0] > 0 should be predicted class 1.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
w = np.array([10.0, 0.0])
b = 0.0
y_pred = predict(X, w, b)
assert y_pred[0] == 1
assert y_pred[1] == 0
Mineral classification key terms
- Felsic rock
- Silica-rich igneous rock (granite, rhyolite); typically $\text{SiO}_2 > 63$ wt%
- Mafic rock
- Silica-poor, magnesium- and iron-rich igneous rock (basalt, gabbro); typically $\text{SiO}_2 < 52$ wt%
- Sigmoid function
- $\sigma(z) = 1/(1 + e^{-z})$; maps any real number to the open interval $(0, 1)$ for use as a probability
- Binary cross-entropy
- $-[y \log p + (1-y)\log(1-p)]$; loss function that penalises confident wrong predictions
- Decision boundary
- The surface in feature space where the model assigns equal probability to both classes; a line for two features
- Gradient descent
- Iterative optimisation that moves weights in the direction that most rapidly decreases the loss
Exercises
Three-class classification
Add a third class (ultramafic rocks, e.g. peridotite, with $\text{SiO}_2 \approx 42$ wt% and $\text{Al}_2\text{O}_3 \approx 3$ wt%).
Extend train to use the softmax function and categorical cross-entropy loss instead of the
binary formulation. Report per-class precision and recall on a held-out test set.
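A numerically stable softmax and its matching loss are the core building blocks for this extension. The sketch below is a starting point, not part of the module above; the function names are illustrative.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def categorical_cross_entropy(Y_onehot, P):
    """Mean negative log-likelihood of the true class under predictions P."""
    P = np.clip(P, 1e-12, 1.0)
    return -np.mean(np.sum(Y_onehot * np.log(P), axis=1))
```

With three classes, the weight vector becomes a (2, 3) matrix and the bias a length-3 vector; the gradient keeps the same `X.T @ (P - Y) / n` shape as the binary case.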
Learning-rate sensitivity
Run train with learning rates $\alpha \in \{0.001, 0.01, 0.1, 1.0, 10.0\}$ for 500
iterations each. Plot the final training loss as a function of $\alpha$ and identify the
range where gradient descent converges reliably. Explain why very large learning rates
cause divergence.
Decision boundary as a function of iteration count
Save the weights $w$ and bias $b$ every 50 iterations during training. Plot the decision boundary at each checkpoint on the same scatter plot. Show that the boundary rotates and shifts as training progresses and stabilises near the optimum.
Effect of class imbalance
Generate a dataset with 160 felsic samples and 20 mafic samples. Train the classifier and
report the confusion matrix. Show that the model achieves high overall accuracy by
predicting felsic almost everywhere, then implement class-weighted cross-entropy loss
(multiply the mafic term by n_felsic / n_mafic) and show that it recovers better
detection of the minority class.
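The class-weighted loss from the last exercise can be sketched as follows. The weight value follows the n_felsic / n_mafic rule stated above; the function name is illustrative.

```python
import numpy as np

def weighted_bce(y, p, weight_mafic):
    """Binary cross-entropy with the mafic (class-1) term up-weighted."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(weight_mafic * y * np.log(p) + (1 - y) * np.log(1 - p))

# With an 8:1 imbalance (160 felsic, 20 mafic), the mafic term gets weight 8.
y = np.array([0.0, 0.0, 1.0])
p = np.array([0.1, 0.2, 0.3])
loss = weighted_bce(y, p, weight_mafic=160 / 20)
```

The gradient picks up the same per-sample weights, so a misclassified mafic sample pulls the boundary eight times harder than a misclassified felsic one.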