Mineral Classification by Geochemistry

The Problem

Generating Synthetic Geochemical Data

# Synthetic geochemical data: two mineral classes distinguished by oxide concentrations.
# Felsic rocks (e.g. granite) are silica-rich; mafic rocks (e.g. basalt) are silica-poor.
# Feature 0: SiO2 concentration (wt%)
# Feature 1: Al2O3 concentration (wt%)
SEED = 7493418  # RNG seed
FELSIC_MEAN = [70.0, 14.0]  # granite-like: high silica, moderate alumina
FELSIC_STD = [3.0, 1.5]
MAFIC_MEAN = [50.0, 9.0]  # basalt-like: lower silica, lower alumina
MAFIC_STD = [3.0, 1.5]
N_FELSIC = 80  # training + test samples per class
N_MAFIC = 80
TRAIN_FRAC = 0.75  # fraction used for training
def make_mineral_data(n_felsic=N_FELSIC, n_mafic=N_MAFIC, seed=SEED):
    """Return (X, y) for a two-class geochemical dataset.

    X is (n_felsic + n_mafic, 2): columns are SiO2 and Al2O3 wt%.
    y is (n_felsic + n_mafic,): 0 for felsic, 1 for mafic.
    Rows are shuffled so classes are interleaved.
    """
    rng = np.random.default_rng(seed)
    X_felsic = rng.normal(FELSIC_MEAN, FELSIC_STD, (n_felsic, 2))
    X_mafic = rng.normal(MAFIC_MEAN, MAFIC_STD, (n_mafic, 2))
    X = np.vstack([X_felsic, X_mafic])
    y = np.concatenate([np.zeros(n_felsic), np.ones(n_mafic)])
    idx = rng.permutation(len(y))
    return X[idx], y[idx]
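
The test file later in this section imports a train_test_split helper from generate_mineral that is not shown above. A minimal version matching the call signature used in the tests (this exact implementation is an assumption; only the signature is fixed by the tests):

```python
import numpy as np

def train_test_split(X, y, train_frac):
    """Split (X, y) into (X_train, y_train, X_test, y_test).

    Assumes rows are already shuffled, which make_mineral_data guarantees,
    so a simple prefix/suffix split keeps both classes in both partitions.
    """
    n_train = int(len(y) * train_frac)
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]
```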

The Mahalanobis distance between two univariate Gaussians with means $\mu_1$, $\mu_2$ and equal standard deviation $\sigma$ is $|\mu_1 - \mu_2| / \sigma$. Using FELSIC_MEAN[0] = 70, MAFIC_MEAN[0] = 50, FELSIC_STD[0] = 3, what fraction of the $\text{SiO}_2$ distributions overlap? (Hint: use the 68-95-99.7 rule: the distance is about $6.7\sigma$, so overlap $\approx 0$.)
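
The overlap can also be computed exactly rather than estimated from the 68-95-99.7 rule. For two equal-variance Gaussians separated by $d$ standard deviations, one standard measure of overlap (the overlapping coefficient) is $2\Phi(-d/2)$, where $\Phi$ is the standard normal CDF. A quick check using only the standard library:

```python
import math

def normal_cdf(x):
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

d = abs(70.0 - 50.0) / 3.0            # Mahalanobis distance: ~6.67 sigma
overlap = 2.0 * normal_cdf(-d / 2.0)  # overlapping coefficient of the two densities
print(f"d = {d:.2f} sigma, overlap = {overlap:.2e}")  # overlap is essentially zero
```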

The Logistic Regression Model

$$p = \sigma(w_0 \cdot x_0 + w_1 \cdot x_1 + b) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + b)}}$$

def sigmoid(z):
    """Map any real number to (0, 1): sigma(z) = 1 / (1 + exp(-z))."""
    # Note: np.exp overflows (with a warning) for z below about -709;
    # normalized inputs keep z well inside the safe range here.
    return 1.0 / (1.0 + np.exp(-z))

Why must features be normalized before training logistic regression with gradient descent?

Normalisation makes the sigmoid output exactly 0.5 at the start of training.
Wrong: initial weights are zero, so $z = 0$ and $\sigma(0) = 0.5$ regardless of feature scale.
Features on very different scales cause gradient descent to take very small steps along one axis and large steps along another, slowing convergence.
Correct: unnormalized gradients have components proportional to feature magnitude, creating an elongated loss surface that makes convergence slow and erratic.
Normalization ensures the model cannot overfit by constraining the weight magnitudes.
Wrong: normalization changes the input scale, not the weight values; regularization is the standard tool for constraining weights.
The sigmoid function is only defined for inputs in $[0, 1]$.
Wrong: the sigmoid is defined for all real numbers; the $[0, 1]$ interval is its output range.
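
The correct answer can be verified directly: with zero-initialized weights, each component of the initial gradient is proportional to its feature's scale. A minimal sketch on made-up data (the 1000x scale ratio is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Feature 1 is ~1000x larger in scale than feature 0.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(float)

def initial_gradient(X, y):
    # At w = 0, b = 0 the sigmoid outputs 0.5 for every sample,
    # so the error vector is simply 0.5 - y.
    err = 0.5 - y
    return X.T @ err / len(y)

dw_raw = initial_gradient(X, y)
dw_norm = initial_gradient((X - X.mean(axis=0)) / X.std(axis=0), y)
print("raw gradient ratio:       ", abs(dw_raw[1] / dw_raw[0]))
print("normalized gradient ratio:", abs(dw_norm[1] / dw_norm[0]))
```

The raw ratio tracks the 1000x scale difference, so no single learning rate suits both axes; after standardization the two components are comparable and one step size works for both.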
def normalize(X, mean=None, std=None):
    """Standardize X to zero mean and unit standard deviation.

    If mean and std are supplied (from the training set), apply them to X
    without recomputing — this ensures test data is scaled identically.
    """
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0)
    return (X - mean) / std, mean, std

Training by Gradient Descent

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$

LEARNING_RATE = 0.1  # gradient-descent step size (assumed default)
N_ITER = 1000  # number of full-batch iterations (assumed default)
def train(X, y, lr=LEARNING_RATE, n_iter=N_ITER):
    """Fit a logistic regression model by gradient descent.

    Parameters
    ----------
    X : (n, p) array of features (should be normalized)
    y : (n,) array of labels, 0 or 1
    lr : learning rate
    n_iter : number of full-batch gradient steps

    Returns
    -------
    w : (p,) weight vector
    b : scalar bias
    losses : list of binary cross-entropy loss at each iteration
    """
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    losses = []

    for _ in range(n_iter):
        z = X @ w + b
        p_hat = sigmoid(z)
        # Binary cross-entropy; clip to avoid log(0)
        p_hat_clipped = np.clip(p_hat, 1e-12, 1 - 1e-12)
        loss = -np.mean(y * np.log(p_hat_clipped) + (1 - y) * np.log(1 - p_hat_clipped))
        losses.append(loss)
        # Gradients
        err = p_hat - y
        dw = X.T @ err / n
        db = err.mean()
        w -= lr * dw
        b -= lr * db

    return w, b, losses

Put the steps of one gradient-descent iteration in the correct order.

  1. Compute the linear combination $z = X w + b$
  2. Apply the sigmoid to get predicted probabilities $\hat{p} = \sigma(z)$
  3. Evaluate the binary cross-entropy loss
  4. Compute gradients $\partial \mathcal{L}/\partial w$ and $\partial \mathcal{L}/\partial b$
  5. Update weights: $w \leftarrow w - \alpha \cdot \partial \mathcal{L}/\partial w$
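
The five steps on a tiny hand-checkable example (two samples, zero-initialized parameters; the numbers are chosen so every intermediate quantity is easy to verify by hand):

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
w, b, lr = np.zeros(2), 0.0, 0.1

z = X @ w + b                       # step 1: linear score -> [0, 0]
p_hat = 1.0 / (1.0 + np.exp(-z))    # step 2: sigmoid -> [0.5, 0.5]
loss = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
                                    # step 3: BCE -> log(2) ~= 0.693
err = p_hat - y                     # [-0.5, 0.5]
dw = X.T @ err / len(y)             # step 4: gradients -> [-0.25, 0.25]
db = err.mean()                     # -> 0.0
w, b = w - lr * dw, b - lr * db     # step 5: update -> w = [0.025, -0.025]
```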

Decision Boundary and Classification

$$x_1 = -\frac{w_0 x_0 + b}{w_1}$$

def decision_boundary_line(w, b, mean, std, x_range):
    """Return (x_vals, y_vals) for the decision boundary in original feature space.

    The boundary satisfies w[0]*x_norm + w[1]*y_norm + b = 0.
    Re-expressing in un-normalized coordinates:
        w[0]*(x - mean[0])/std[0] + w[1]*(y - mean[1])/std[1] + b = 0
    Solving for y:
        y = mean[1] - std[1]/w[1] * (w[0]*(x - mean[0])/std[0] + b)
    """
    x_vals = np.linspace(*x_range, 200)
    y_vals = mean[1] - (std[1] / w[1]) * (w[0] * (x_vals - mean[0]) / std[0] + b)
    return x_vals, y_vals
def predict(X, w, b, threshold=0.5):
    """Return binary class predictions (0 or 1) for each row of X."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)


def accuracy(y_true, y_pred):
    """Return fraction of correct predictions."""
    return np.mean(y_true == y_pred)
Scatter plot of SiO₂ vs Al₂O₃ with felsic (blue circles) and mafic (red triangles) training samples and a dashed decision boundary.
Figure 1: Training data, 120 samples (75% of 160 total). The dashed line is the learned decision boundary. The two classes are well separated in SiO₂ but overlap slightly in Al₂O₃.

Match each component of the logistic regression model to its role.

$w_0 x_0 + w_1 x_1 + b$: the separating hyperplane in feature space
$\sigma(\cdot)$: maps the linear score to a probability
$\mathcal{L}$: penalises confident wrong predictions more heavily than uncertain ones
Decision boundary: the surface where the model is maximally uncertain ($p = 0.5$)

Testing

Sigmoid known values

Normalization properties

Loss decreases during training

Accuracy on well-separated data

Held-out test accuracy

Predict with known weights

import numpy as np
import pytest
from generate_mineral import make_mineral_data, train_test_split, TRAIN_FRAC
from mineral import sigmoid, normalize, train, predict, accuracy


def test_sigmoid_at_zero():
    # sigmoid(0) = 0.5 by definition.
    assert sigmoid(0.0) == pytest.approx(0.5)


def test_sigmoid_large_positive():
    # sigmoid(z) → 1 as z → +∞; at z=100 the deviation from 1 is negligible.
    assert sigmoid(100.0) == pytest.approx(1.0, abs=1e-6)


def test_sigmoid_large_negative():
    # sigmoid(z) → 0 as z → -∞.
    assert sigmoid(-100.0) == pytest.approx(0.0, abs=1e-6)


def test_normalize_zero_mean_unit_std():
    # Normalized training data must have mean ≈ 0 and std ≈ 1 for each feature.
    X, _ = make_mineral_data()
    X_n, mean, std = normalize(X)
    assert np.allclose(X_n.mean(axis=0), 0.0, atol=1e-10)
    assert np.allclose(X_n.std(axis=0), 1.0, atol=1e-10)


def test_normalize_applies_training_stats():
    # Normalizing test data with training mean/std must not recompute stats.
    X, _ = make_mineral_data()
    n = len(X)
    X_train, X_test = X[: n // 2], X[n // 2 :]
    X_n, mean, std = normalize(X_train)
    X_test_n, _, _ = normalize(X_test, mean, std)
    expected = (X_test - mean) / std
    assert np.allclose(X_test_n, expected)


def test_training_loss_decreases():
    # The binary cross-entropy loss must decrease over the first 100 iterations.
    X, y = make_mineral_data()
    X_n, _, _ = normalize(X)
    _, _, losses = train(X_n, y, n_iter=100)
    assert losses[-1] < losses[0]


def test_perfect_separation():
    # Two well-separated clusters should yield near-perfect training accuracy.
    # Felsic centred at (70, 14), mafic at (50, 9): about 6.7 sigma apart in SiO2.
    X, y = make_mineral_data()
    X_n, _, _ = normalize(X)
    w, b, _ = train(X_n, y, n_iter=1000)
    y_pred = predict(X_n, w, b)
    # The two classes are well-separated; expect ≥ 95% training accuracy.
    assert accuracy(y, y_pred) >= 0.95


def test_test_accuracy():
    # With default parameters, test accuracy must be ≥ 95%.
    # The two classes differ by ~6.7 std devs in SiO2; the boundary should be sharp.
    X, y = make_mineral_data()
    X_train, y_train, X_test, y_test = train_test_split(X, y, TRAIN_FRAC)
    X_train_n, mean, std = normalize(X_train)
    X_test_n, _, _ = normalize(X_test, mean, std)
    w, b, _ = train(X_train_n, y_train)
    y_pred = predict(X_test_n, w, b)
    assert accuracy(y_test, y_pred) >= 0.95


def test_accuracy_all_correct():
    assert accuracy(np.array([0, 1, 0, 1]), np.array([0, 1, 0, 1])) == pytest.approx(
        1.0
    )


def test_accuracy_all_wrong():
    assert accuracy(np.array([0, 1, 0, 1]), np.array([1, 0, 1, 0])) == pytest.approx(
        0.0
    )


def test_predict_known_weights():
    # With w=[10, 0], b=0: samples with X[:,0] > 0 should be predicted class 1.
    X = np.array([[1.0, 0.0], [-1.0, 0.0]])
    w = np.array([10.0, 0.0])
    b = 0.0
    y_pred = predict(X, w, b)
    assert y_pred[0] == 1
    assert y_pred[1] == 0

Mineral classification key terms

Felsic rock
Silica-rich igneous rock (granite, rhyolite); typically $\text{SiO}_2 > 63$ wt%
Mafic rock
Silica-poor, magnesium- and iron-rich igneous rock (basalt, gabbro); typically $\text{SiO}_2 < 52$ wt%
Sigmoid function
$\sigma(z) = 1/(1 + e^{-z})$; maps any real number to the open interval $(0, 1)$ for use as a probability
Binary cross-entropy
$-[y \log p + (1-y)\log(1-p)]$; loss function that penalises confident wrong predictions
Decision boundary
The surface in feature space where the model assigns equal probability to both classes; a line for two features
Gradient descent
Iterative optimisation that moves weights in the direction that most rapidly decreases the loss

Exercises

Three-class classification

Add a third class (ultramafic rocks, e.g. peridotite, with $\text{SiO}_2 \approx 42$ wt% and $\text{Al}_2\text{O}_3 \approx 3$ wt%). Extend train to use the softmax function and categorical cross-entropy loss instead of the binary formulation. Report per-class precision and recall on a held-out test set.
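
As a starting point for this exercise, a numerically stable softmax and the matching loss (the stability shift subtracts each row's maximum before exponentiating, which cancels in the ratio but avoids overflow):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax for an (n, k) array of class scores."""
    Z = Z - Z.max(axis=1, keepdims=True)  # stability shift; cancels in the ratio
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(Y_onehot, P):
    """Mean negative log-likelihood for one-hot labels Y_onehot and probabilities P."""
    P = np.clip(P, 1e-12, 1.0)  # avoid log(0)
    return -np.mean(np.sum(Y_onehot * np.log(P), axis=1))
```

With k = 2 this reduces to the binary formulation above, so the gradient code in train carries over with err generalized to P - Y_onehot.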

Learning-rate sensitivity

Run train with learning rates $\alpha \in \{0.001, 0.01, 0.1, 1.0, 10.0\}$ for 500 iterations each. Plot the final training loss as a function of $\alpha$ and identify the range where gradient descent converges reliably. Explain why very large learning rates cause divergence.

Decision boundary as a function of iteration count

Save the weights $w$ and bias $b$ every 50 iterations during training. Plot the decision boundary at each checkpoint on the same scatter plot. Show that the boundary rotates and shifts as training progresses and stabilises near the optimum.

Effect of class imbalance

Generate a dataset with 160 felsic samples and 20 mafic samples. Train the classifier and report the confusion matrix. Show that the model achieves high overall accuracy by predicting felsic almost everywhere, then implement class-weighted cross-entropy loss (multiply the mafic term by n_felsic / n_mafic) and show that it recovers better detection of the minority class.
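
For the class-weighted loss, only the mafic (positive-class) term needs scaling; a sketch with the weight exposed as a parameter:

```python
import numpy as np

def weighted_bce(y, p_hat, pos_weight):
    """Binary cross-entropy with the positive-class term scaled by pos_weight.

    With pos_weight = n_felsic / n_mafic (e.g. 160 / 20 = 8), each minority
    sample contributes as much to the loss as eight majority samples.
    """
    p = np.clip(p_hat, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))
```

The gradient changes accordingly: in train, err = p_hat - y becomes pos_weight * y * (p_hat - 1) + (1 - y) * p_hat, which reduces to the original expression when pos_weight = 1.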