asimpy Improvements

Following up on yesterday’s post about benchmarking, Claude and I have made some improvements to asimpy.

1. Eliminate functools.partial in resume()

Before, every time an event fired and needed to reschedule a waiting process, resume() did this:

from functools import partial

def resume(self, value=None):
    if not self._done:
        self._env.immediate(partial(self._loop, value))

partial is a C-level callable, so it is fast, but it still allocates a new object on every event completion. The fix stores the value directly on the process and schedules _loop without wrapping it:

def resume(self, value=None):
    if not self._done:
        self._resume_value = value
        self._env.immediate(self._loop)

def _loop(self):
    value = self._resume_value
    self._resume_value = None
    ...
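The allocation difference can be seen in a toy micro-comparison (hypothetical names, not asimpy's actual classes): one version wraps the callback in a fresh `functools.partial` per event, the other stashes the value on the object and schedules the bound method directly.

```python
# Hypothetical micro-comparison of the two scheduling styles.
from functools import partial
import timeit

class Proc:
    __slots__ = ("_resume_value",)

    def _loop_with_arg(self, value):
        return value

    def _loop(self):
        value = self._resume_value
        self._resume_value = None
        return value

p = Proc()

def with_partial():
    cb = partial(p._loop_with_arg, 42)  # one new partial object per event
    return cb()

def without_partial():
    p._resume_value = 42                # plain attribute write, no allocation
    return p._loop()

print(timeit.timeit(with_partial, number=100_000))
print(timeit.timeit(without_partial, number=100_000))
```

The absolute numbers depend on the machine, but the attribute-store version avoids one heap allocation per call.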

2. Inline _add_waiter in _loop

When a process parks on a pending event, the original code called through a method:

yielded._add_waiter(self.resume)

The inlined version writes directly to the event’s waiter slot:

if v is _PENDING:
    w = yielded._waiter
    if w is None:
        yielded._waiter = self.resume
    elif isinstance(w, list):
        w.append(self.resume)
    else:
        yielded._waiter = [w, self.resume]

This saves one Python function-call dispatch per await on a pending event.

3. Move the Exception Check from Event.__await__ to Process._loop

The original __await__ checked whether the received value was an exception on every resume, even when the event succeeded normally:

def __await__(self):
    value = yield self
    if isinstance(value, BaseException):
        raise value
    return value

Moving the check to _loop lets __await__ become a single expression with no branch:

def __await__(self):
    return (yield self)

In _loop, a failed event’s exception is delivered with coro.throw() instead of coro.send():

if isinstance(value, BaseException):
    yielded = self._coro.throw(value)
else:
    yielded = self._coro.send(value)

Every normal (non-failing) await now avoids the isinstance check entirely.
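The division of labour can be sketched with a stub event whose __await__ is the single-expression form above, driven by a hand-rolled loop that chooses between send() and throw(). This is a minimal illustration, not asimpy's actual Process class:

```python
# Minimal sketch: the driver, not __await__, decides how to deliver the value.
class StubEvent:
    def __await__(self):
        return (yield self)  # no branch; the driver handles failures

async def worker(results):
    ev = StubEvent()
    try:
        value = await ev
        results.append(("ok", value))
    except ValueError as exc:
        results.append(("err", str(exc)))

def drive(coro, value):
    coro.send(None)            # run up to the first await
    try:
        if isinstance(value, BaseException):
            coro.throw(value)  # failed event: raise inside the coroutine
        else:
            coro.send(value)   # normal event: deliver the value
    except StopIteration:
        pass

results = []
drive(worker(results), 42)
drive(worker(results), ValueError("boom"))
print(results)  # [('ok', 42), ('err', 'boom')]
```

The isinstance check still exists, but it now runs once per resume in the driver rather than once per resume inside every __await__.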

4. Lazy Waiter Slot

Every Event.__init__ previously allocated an empty list for _waiters, which is wasted work in the common case where at most one process ever waits on a given event. The fix replaces the list with a single _waiter slot that upgrades lazily:

__slots__ = ("_env", "_value", "_waiter", "_on_cancel")

def __init__(self, env):
    ...
    self._waiter = None  # None | callable | list[callable]

succeed() dispatches on the same three cases and avoids iterating an empty list for the typical single-waiter event.
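Both sides of the protocol can be sketched together (a simplified stand-in, not asimpy's Event): _waiter starts as None, upgrades to a single callable, and only becomes a list when a second process waits on the same event.

```python
# Hypothetical sketch of the lazy waiter slot and its three-way dispatch.
class TinyEvent:
    __slots__ = ("_waiter", "_value")

    def __init__(self):
        self._waiter = None   # None | callable | list[callable]
        self._value = None

    def add_waiter(self, cb):
        w = self._waiter
        if w is None:
            self._waiter = cb        # common case: first waiter
        elif isinstance(w, list):
            w.append(cb)             # third and later waiters
        else:
            self._waiter = [w, cb]   # second waiter: upgrade to list

    def succeed(self, value):
        self._value = value
        w, self._waiter = self._waiter, None
        if w is None:
            return                   # nobody was waiting
        if isinstance(w, list):
            for cb in w:
                cb(value)
        else:
            w(value)                 # single-waiter fast path: no iteration

fired = []
ev = TinyEvent()
ev.add_waiter(fired.append)
ev.add_waiter(fired.append)
ev.succeed("done")
print(fired)  # ['done', 'done']
```

The common single-waiter path touches no list at all: one attribute write when parking, one direct call when firing.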

Benchmark Results

Benchmarks were run with sys.monitoring (Python bytecode instruction counts) and time.perf_counter() (wall-clock time), each over 1000 executions of the feature under test. The comparison is against SimPy 5, which uses generator-based scheduling.
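The wall-clock side of the methodology looks roughly like this (a sketch; the function names are illustrative, not the actual benchmark suite):

```python
# Time n executions of a feature and report the mean cost in microseconds.
import time

def bench(fn, n=1000):
    start = time.perf_counter()
    for _ in range(n):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / n * 1e6  # mean µs per execution

def feature_under_test():
    sum(range(10))  # stand-in for e.g. creating and firing an Event

print(f"{bench(feature_under_test):.3f} µs")
```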

Feature Before (µs) After (µs) Wall-clock Δ Instr Δ
Event 0.522 0.446 -14.6% -1
Event (fail) 0.967 0.770 -20.4% -2
Timeout 0.686 0.623 -9.2% -1
Environment.run(until=) 0.749 0.638 -14.8% -1
Container 0.694 0.647 -6.8% -4
Container (float amounts) 0.745 0.640 -14.1% -4
Queue 0.729 0.657 -9.9% -4
Queue (blocking get) 1.733 1.605 -7.4% -4
Queue (blocking put) 1.870 1.723 -7.9% -4
Store 0.706 0.692 -2.0% -4
Store (filtered get) 0.721 0.695 -3.6% -4
PriorityQueue 0.710 0.673 -5.2% -4
Resource (uncontended) 0.364 0.339 -6.9% -2
Resource (context manager) 0.493 0.481 -2.4% -2
Resource (contention) 1.825 1.676 -8.2% +13
Resource (multi-capacity) 1.843 1.738 -5.7% +13
AllOf 1.059 1.022 -3.5% -4
AllOf (blocking) 2.243 2.054 -8.4% +4
FirstOf 1.067 1.009 -5.4% -3
FirstOf (blocking) 2.240 2.189 -2.3% +2
Interrupt 1.866 1.763 -5.5% -1
Interrupt (with cause) 1.918 1.770 -7.7% -1
Barrier 1.381 1.280 -7.3% -2
PreemptiveResource 3.196 3.179 -0.5% -5
PreemptiveResource (no-preempt) 2.558 2.422 -5.3% -3
Process 1.431 1.994 +39.3% +14

What the Numbers Show

Wall-clock time improved by 2-20% for every benchmark except Process. The largest gains appeared where events are created and triggered frequently: failing events (-20%), basic Event and Timeout (-15%), and Environment.run(until=) (-15%). The savings come primarily from eliminating one C-level allocation per event completion (the partial object in change 1) and one per event creation (the empty list in change 4). Neither of these shows up as fewer Python bytecodes because sys.monitoring counts only Python-level instructions, not C heap allocations. This is why instruction counts changed by only 1-5 for most benchmarks even when wall-clock time fell.

The instruction increases on Resource (contention/multi-capacity) (+13) and Process (+14) come from the three new bytecodes per _loop entry introduced by change 1 (read _resume_value, clear it to None) and the isinstance branch introduced by change 3. For resource contention, where multiple processes queue for the same resource, these extra instructions are more than offset by the C-level savings: wall-clock still fell 6-8%.

The Process microbenchmark is the one genuine regression (+39% wall-clock, +14 instructions). It measures the cost of creating and running a single minimal process to completion. At that scale the fixed per-_loop overhead from changes 1 and 3 has nothing to amortise against. In any simulation where a process awaits more than one event during its lifetime, which is the overwhelmingly common case, those costs are shared across many events and the net result is a speedup.

What I Can’t Benchmark

These changes definitely make asimpy harder to understand: for example, inlining _add_waiter moves an action from what feels like its natural home to an entirely different part of the library. I wouldn’t have thought of doing this on my own, and while I understand what Claude suggested in retrospect, I don’t know how long it would take to wrap my head around it if I were reading the code for the first time. People are using phrases like “cognitive debt” and “comprehension debt” to describe this phenomenon; my only consolation is that software systems have always struggled with this.