My work log tells me I’ve spent 54 hours since mid-November
building this discrete event simulation,
which works out to a little over an hour a day.
I’ve learned a few things about SimPy and Polars along the way,
and depending on what happens with my job search,
I may run an online workshop in 2026 to walk people through it.
For now,
though,
I need to put this aside and concentrate on completing a couple of small contracts
and revising some of the fiction I finally “finished”.
I supervised some student projects at the University of Toronto
for the first time in fifteen years.
I really enjoyed working with them,
and I hope they had fun as well,
but it brought me face-to-face with how bad the job market is right now
for newcomers in tech.
I was laid off for the third time in eight years,
which has brought me face-to-face with how bad the job market is right now
for oldcomers in tech.
The refugee that my spouse and I helped sponsor through Rainbow Railroad
finally arrived in Canada,
and has settled in well.
I started building a tutorial on how to simulate software development processes.
I played jazz in public for the first time in twenty-three years.
One wedding, no funerals.
I finally finished (first complete drafts of) three pieces of fiction
that I’ve been working on for longer than I want to admit.
I don’t know if there’s a market for any of them right now,
but it felt pretty good to finally write “The End”.
Time for the day’s last cup of tea.
If you came in peace, be welcome.
There are only two hard things in computer science:
cache invalidation and naming things.
— Phil Karlton
With respect,
I think that handling interrupts is harder than either of these.
Yesterday’s post explained how SimPy does this.
Today,
after several near misses,
we’ll look at how to add it to our simulation.
A Quick Recap
Our Simulation class now includes a process
that waits a random interval,
chooses a random developer,
and interrupts her by calling .interrupt:
When .interrupt is called,
SimPy injects a simpy.Interrupt exception into the target process;
the argument to .interrupt is attached to that exception object.
If we’re comfortable throwing away the task we’re currently working on,
the Developer process can catch the exception like this:
The trickiest part of this is canceling the outstanding request
to get something from the development queue.
As we discovered yesterday,
if we don’t do this,
the developer won’t ever get anything else from the queue.
Nearly Right
What if we don’t want to throw away our current task when we’re interrupted?
What if instead we want to handle the interruption and then resume that task?
Here’s an implementation that’s almost right:
# This code is wrong.
def work(self):
    task = None
    while True:
        req = None
        try:
            if task is None:
                req = self.sim.dev_queue.get()
                task = yield req
            t_start = self.sim.now
            yield self.sim.timeout(task.dev_required - task.dev_done)
            task.dev_done += self.sim.now - t_start
            if task.is_done():
                task = None
        except simpy.Interrupt as exc:
            if req is not None:
                req.cancel()
            yield self.sim.timeout(exc.cause)
Inside the try block we get a task if we don’t already have one,
then try to wait for as much time as the task still requires.
When we’re done we add some time to the task and check if it’s now complete.
If we’re interrupted,
we cancel any outstanding request for a new task
and then wait for as long as the interrupt tells us to.
There are (at least) two things wrong with this code.
The first and less important is that when we’re interrupted,
we throw away any work we’ve done on the task during this iteration of the loop.
That’s relatively easy to fix:
we just add some code to the except block to increment the .dev_done time in the task.
The bigger problem,
though,
is that sooner or later this code fails because of an uncaught Interrupt exception.
The problem is that we can be interrupted
inside our interrupt handler.
It isn’t likely,
which means it doesn’t happen very often,
but if we run the simulation often enough with different random seeds,
it eventually falls over.
A Corrected Version
The “fix” (I’ll explain the scare quotes around that word in a moment)
is to move interrupt handling into the try block.
To do that,
we have to add another state variable interrupt_delay
that tells the process if it’s currently supposed to be handling an interruption delay:
So why did I put scare quotes around the word “fix”?
Because I’m still not 100% sure this works.
It hasn’t failed yet,
despite multiple runs with different seeds and parameters,
but this code is now complex enough that I could well believe it contains
a one-in-a-million edge case.
I think the except block is now a critical region,
i.e.,
that no interrupts can occur within it
because none of those three lines hands control back to SimPy,
but I’m not completely sure.
And yes,
this code still throws away any work the developer has done on a task
during a particular loop iteration
if an interrupt occurs;
the interrupt handler should increment task.dev_done.
And yes,
it’s possible for an interrupt to be interrupted:
a more realistic implementation would stack interrupt delays,
but honestly,
if my boss interrupts me while I’m being interrupted,
I don’t have any qualms about discarding the first interruption.
Yaks and More Yaks
My goal with this series of blog posts was
to simulate the development process of a small software team.
I’ve spent most of the last week learning more about SimPy;
it feels like yak shaving,
but without it,
I don’t think I’d have confidence in the code shown above
(or be able to review its AI-generated equivalent).
The previous post in this series looked at the effects of re-work on throughput.
Its version of the simulation assumed that
a task that needed to be re-done was given back to the developer who had originally worked on it,
and that she would tackle that task once she finished whatever she was doing at the time.
In reality,
though,
developers (and testers) often interrupt what they’re doing,
switch to another task,
and then switch back.
This post looks at how to simulate that in SimPy.
A Simpler Model
To start with,
let’s go back to a simpler model in which one process adds new tasks to a queue at random intervals,
while developers take tasks,
work on them,
and then mark them as done.
While we’re building it,
we’ll use these parameters:
PARAMS = {
    "n_dev": 1,          # number of developers
    "rng_seed": 12345,   # random number generation seed
    "t_arrival": 8.0,    # mean task arrival time
    "t_dev": 5.0,        # median development time
    "t_interrupt": 3.0,  # time between interrupts
    "t_sim": 20,         # length of simulation
}
As before,
we define a Simulation class to hold all our bits and pieces.
Its constructor caches the parameters,
then creates the SimPy Environment,
the development queue,
and the developers.
It also creates a simple list called .log to store log messages:
For lack of anywhere better to put it,
we define the task generator as a method of Simulation.
The order of statements in the while loop ensures that
the first task is generated at t=0:
class Simulation:

    …previous code…

    def generate(self):
        """Generate tasks at random intervals starting at t=0."""
        while True:
            yield self.dev_queue.put(Task(self))
            yield self.timeout(self.t_arrival())

    def process(self, proc):
        """Shortcut for running a process."""
        return self.env.process(proc)
As before,
we also give Simulation some convenience methods like .process
so that we can type sim.process(…) rather than sim.env.process(…).
Creating Interrupts
What’s new in this simulation is another process
whose only purpose is to interrupt developers.
It does this by choosing a random developer dev every few clock ticks
and then calling dev.proc.interrupt():
Developer.work is (going to be) a generator that simulates
an individual developer’s behavior.
Calling it as self.work() in Developer.__init__ doesn’t start a process.
Instead, that call creates a generator object.
We pass that generator object to self.sim.process(),
which gives it to the SimPy Environment to run.
Doing this means that developers start running as soon as they are created.
Finally,
self.sim.process(…) returns a simpy.Process object that wraps the generator we passed in.
We save this in the Developer object as self.proc
so that we can access it later.
(This step was the one that initially confused me.
Developer isn’t a SimPy process:
the Process built from the generator returned by Developer.work is the process.
If we want to interrupt that process,
we need to get at the Process object,
and the logical place to store a reference to it is
in the object that defines its behavior.)
So let’s jump back to Simulation.annoy. It contains the lines:
The developer, meanwhile, does the following each time it goes around the while loop in its work method:
Create a request to get a task from the development queue.
yield that request,
which hands the request to the SimPy Environment
and suspends this process until the request can be satisfied.
The result of yielding the request is a task from the development queue,
so wait a while to simulate working on that task.
Unless a simpy.Interrupt exception occurs,
in which case the developer cancels the request for a task
and goes around the loop again.
We have run into the need to cancel before.
In the first version of this code,
I assumed that interrupts would only occur while a developer was working on a task,
so the body of the except block was just pass:
# This version is wrong!
def work(self):
    while True:
        req = None
        try:
            req = self.sim.dev_queue.get()
            task = yield req
            yield self.sim.timeout(task.dev_required)
        except simpy.Interrupt as exc:
            pass
What I found was that if the interrupt occurred while the developer was waiting on the queue,
and I didn’t cancel the request,
subsequent requests were never satisfied.
In other words,
once a developer had been interrupted,
she would never get any more work.
How It Behaves
I won’t bother to show the code that adds log messages to Simulation.log
or collects the states of all the tasks when the simulation is done,
but here’s the JSON output from a typical short run:
As you can see,
the first two tasks are interrupted and discarded while they’re being worked on,
while the developer manages to finish the third task before she’s interrupted.
The next step will be to resume tasks rather than just throwing them away.
The two parameters that affect re-work are p_rework_needed,
which is the probability that a tester is going to send a task back for more work,
and p_rework_chosen,
which is the probability that a developer will choose something from her re-work queue
rather than starting fresh work
when both are available.
The extreme cases are:
p_rework_needed = 0.0, meaning no tasks are ever sent back.
p_rework_needed = 1.0, meaning every task is always sent back.
(Note that this means no task will ever be finished:
instead,
tasks will circulate endlessly between development and testing,
and yes,
sometimes it does feel like that happens in real life.)
p_rework_chosen = 0.0, in which case
developers only re-do tasks when there is no new work queued up.
(We know from previous simulations that there are almost always
new tasks waiting to be started.)
p_rework_chosen = 1.0, in which case
developers only start new tasks when nothing needs to be re-done.
Finally,
this version of the simulation always sends tasks back to
the developers who originally worked on them
rather than allowing the first available developer
to re-do someone else’s work.
Final States
The figure below shows the number of tasks in each of the five possible states
at the end of the simulation
as a function of p_rework_needed and p_rework_chosen:
As expected,
lots of tasks are in the complete state when there is never any need for re-work,
while none are when re-work is always needed.
This finding may seem obvious,
but this simple check uncovered a couple of bugs in the simulation.
A more interesting finding is that
how developers choose tasks doesn’t seem to have much effect
on how many they get done:
while there is some variation,
the bars stay more or less the same height when we look at each row.
We can reach the same conclusion by looking at
the number of times tasks were developed and re-worked.
The sizes of the circles in the plots below reflect these counts:
Again,
the probability of needing re-work has an obvious impact,
while the probability of choosing new work vs. re-work doesn’t seem to.
It has taken me almost three hours to answer my earlier question about SimPy,
and once I recover,
I’m going to submit a couple of PRs for their documentation.
To recap,
the most recent version of the simulation
allows testers to bounce work back to the development queue if they find bugs:
I’m going to walk through the code I wrote,
the problem I had,
and the solution I (think I’ve) found step by step.
If there’s a simpler way,
or if this is clearly documented somewhere,
please let me know.
Tasks
A Task has a unique ID and keeps track of how many times it has been worked on:
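A minimal version might look like this
(the attribute names ident and count are my choices;
the __str__ format matches the task-0/0 labels in the output below):

```python
import itertools

class Task:
    """A task with a unique ID and a count of how many times it has
    been worked on."""
    _ids = itertools.count()

    def __init__(self):
        self.ident = next(Task._ids)
        self.count = 0

    def __str__(self):
        return f"task-{self.ident}/{self.count}"

first = Task()
second = Task()
```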
Since I don’t want to have to pass a whole bunch of parameters around,
Simulation stores the SimPy Environment,
our single developer,
our single tester,
and the three queues:
one for new tasks,
one for tasks that need to be reworked,
and one for tasks that are ready for testing:
class Simulation:
    def __init__(self):
        # Environment.
        self.env = simpy.Environment()

        # One developer and one tester.
        self.developer = Developer(self)
        self.tester = Tester(self)

        # Queues for new development work, rework, and testing.
        self.q_dev = simpy.Store(self.env)
        self.q_rework = simpy.Store(self.env)
        self.q_test = simpy.Store(self.env)
Simulation also defines the generator that creates new tasks
and puts them in the ready-to-develop queue:
N_TASK = 2  # number of tasks to create
T_GEN = 1   # time between tasks
T_DEV = 1   # time to do development
T_TEST = 1  # time to do testing
T_SIM = 4   # how long to run the whole simulation
Workers
A generic Worker (either a developer or a tester)
caches the object that holds the overall simulation
and provides a couple of convenience methods for getting the current time
and for printing itself:
The tester gets a task from the testing queue,
waits a tick to simulate work,
increments the task’s internal count of how many times it has gone around the loop,
and then puts it in the rework queue:
class Tester(Worker):
    def work(self):
        while True:
            # Get a task from the testing queue.
            print(f"{self}: wait for task(s)")
            task = yield self.sim.q_test.get()
            print(f"{self}: got {task}")

            # Do testing.
            yield self.sim.env.timeout(T_TEST)
            task.count += 1

            # Always put the task in the rework queue.
            yield self.sim.q_rework.put(task)
            print(f"{self}: put {task} in rework queue")
Developers (the wrong version)
This code is wrong, but will be fixed below.
My original Developer waited until a task was available in either
the development queue or the rework queue,
then selected one of those tasks (giving priority to rework),
waited a tick to simulate work,
and put the task in the testing queue:
class DeveloperWrong(Worker):
    def work(self):
        while True:
            # Get either a task for development or a task for rework.
            req_dev = self.sim.q_dev.get()
            req_rework = self.sim.q_rework.get()
            print(f"{self}: wait for task(s)")
            result = yield simpy.AnyOf(self.sim.env, [req_dev, req_rework])

            # Give priority to rework.
            if req_rework in result:
                task = result[req_rework]
                print(f"{self} got {task} from rework with {len(result.events)} events")
            elif req_dev in result:
                task = result[req_dev]
                print(f"{self} got {task} from dev with {len(result.events)} events")
            else:
                assert False, "how did we get here?"

            # Do development.
            yield self.sim.env.timeout(T_DEV)

            # Put the task in the testing queue.
            yield self.sim.q_test.put(task)
            print(f"{self}: put {task} in testing queue")
According to the documentation,
simpy.AnyOf creates an object that means “give me any of the following”.
Its result is a dictionary-like object
whose keys are the requests
and whose values are items from one or more of the queues being waited on:
“or more” because it’s possible for items to be available in multiple queues simultaneously.
Output from the Buggy Version
Here’s the output using the DeveloperWrong class shown above:
0.00 create task-0/0
DeveloperWrong/0.00: wait for task(s)
Tester/0.00: wait for task(s)
DeveloperWrong/0.00 got task-0/0 from dev with 1 events
1.00 create task-1/0
DeveloperWrong/1.00: put task-0/0 in testing queue
DeveloperWrong/1.00: wait for task(s)
Tester/1.00: got task-0/0
DeveloperWrong/1.00 got task-1/0 from dev with 1 events
2.00 create task-2/0
Tester/2.00: put task-0/1 in rework queue
Tester/2.00: wait for task(s)
DeveloperWrong/2.00: put task-1/0 in testing queue
DeveloperWrong/2.00: wait for task(s)
Tester/2.00: got task-1/0
DeveloperWrong/2.00 got task-2/0 from dev with 1 events
3.00 create task-3/0
Tester/3.00: put task-1/1 in rework queue
Tester/3.00: wait for task(s)
DeveloperWrong/3.00: put task-2/0 in testing queue
DeveloperWrong/3.00: wait for task(s)
Tester/3.00: got task-2/0
DeveloperWrong/3.00 got task-3/0 from dev with 1 events
A lot is going on in those 23 lines,
so let’s look at what the developer got from where:
DeveloperWrong/0.00 got task-0/0 from dev with 1 events
DeveloperWrong/1.00 got task-1/0 from dev with 1 events
DeveloperWrong/2.00 got task-2/0 from dev with 1 events
DeveloperWrong/3.00 got task-3/0 from dev with 1 events
Hm:
the developer is never taking tasks from the rework queue.
Is the tester putting them in?
Tester/0.00: wait for task(s)
Tester/1.00: got task-0/0
Tester/2.00: put task-0/1 in rework queue
Tester/2.00: wait for task(s)
Tester/2.00: got task-1/0
Tester/3.00: put task-1/1 in rework queue
Tester/3.00: wait for task(s)
Tester/3.00: got task-2/0
So what’s going on?
A Corrected Developer
Here’s the skeleton of the corrected developer’s work method:
class DeveloperRight(Worker):
    def work(self):
        which = "dev"
        take = None
        while True:
            # Get either a task for development or a task for rework.
            req_dev = self.sim.q_dev.get()
            req_rework = self.sim.q_rework.get()
            print(f"{self}: wait for task(s)")
            result = yield simpy.AnyOf(self.sim.env, [req_dev, req_rework])

            …magic happens here…

            # Do development.
            yield self.sim.env.timeout(T_DEV)

            # Put the task in the testing queue.
            yield self.sim.q_test.put(task)
            print(f"{self}: put {task} in testing queue")
The bit of magic in the middle is that as far as I can tell,
all of the requests that weren’t selected have to be canceled,
even if there wasn’t something in the queue at the time for the developer to take.
For example,
the first time the developer gets a task from the development queue,
the rework queue is guaranteed to be empty,
but we still have to cancel the request for something from it.
In order to demonstrate this,
I’ve filled in the magic portion of work with code that alternates between
taking work from the development queue (if available)
and taking work from the rework queue (ditto).
Of course,
if there is only one event in the result from AnyOf,
the code uses that,
but still cancels the other request:
def work(self):
    which = "dev"
    take = None
    while True:
        …get either a task for development or a task for rework…

        # Pick one.
        if len(result.events) == 2:
            print(f"{self}: ...got two events")
            if which == "dev":
                take = "dev"
                which = "rework"
            else:
                take = "rework"
                which = "dev"
        elif req_dev in result.events:
            print(f"{self}: ...got dev")
            take = "dev"
        elif req_rework in result.events:
            print(f"{self}: ...got rework")
            take = "rework"
        else:
            assert False, "how did we get here?"

        # Now take it.
        if take == "dev":
            print(f"{self}: ...taking dev")
            task = result[req_dev]
            req_rework.cancel()
        elif take == "rework":
            print(f"{self}: ...taking rework")
            task = result[req_rework]
            req_dev.cancel()
        else:
            assert False, "how did we get here?"
        print(f"{self} got {task}")

        …do development and put the task in the testing queue…
Here are the relevant bits of output when the simulation is run for 10 ticks instead of just 4:
DeveloperRight/0.00: ...taking dev
DeveloperRight/1.00: ...taking dev
DeveloperRight/2.00: ...taking dev
DeveloperRight/3.00: ...taking rework
DeveloperRight/4.00: ...taking dev
DeveloperRight/5.00: ...taking rework
DeveloperRight/6.00: ...taking dev
DeveloperRight/7.00: ...taking rework
DeveloperRight/8.00: ...taking dev
DeveloperRight/9.00: ...taking rework
As I said at the outset,
I’m going to submit some PRs for SimPy’s documentation
to shout loudly and clearly that outstanding requests need to be canceled
(or possibly recycled: I haven’t tried that yet).
Meanwhile,
I can now get back to playing with the impact of rework fractions on throughput.
I took a break from this series of posts for a couple of days
to focus on looking for a new role
and because I’ve hit a bit of a roadblock.
In the most recent version of the simulation,
testers can bounce work back to the development queue if they find bugs:
Tasks that have been sent back for more work always have higher priority than new tasks,
but that’s not realistic.
What I’d like to model is a system in which re-work is done by
the same developer who did the initial work.
I can do this by creating a re-work queue for each developer,
and use SimPy’s AnyOf operator to select from either queue:
Notice that the “get” request for the new-work queue is canceled
when there’s re-work to be done.
Doing this takes care of the case in which items were available in both queues
(which turns out to be very common);
I found by trial and error that it’s safe to cancel a “get” request
whether or not it has already been satisfied,
which simplifies the logic a little bit.
So here’s my problem:
when there is re-work to be done and the developer is working on a new task,
I’d like to interrupt the developer and have them do re-work,
then resume their previous task;
in other words,
I’d like re-work to take immediate precedence over new work.
However,
if the developer is already doing re-work,
I don’t want to interrupt them.
SimPy supports interrupts:
if you have a process p, you can call p.interrupt() to throw an exception into it.
The process can catch this and do whatever it wants:
So now what?
As near as I can figure,
when a developer is interrupted,
it checks to see whether it’s already doing re-work or not.
If it is,
it puts the task associated with the re-work interrupt in its re-work queue.
If not,
it puts the (new) task it’s working on in the re-work queue
and switches tasks.
Hm,
that means the developer’s queue has to be a priority queue
so that re-work takes precedence over new work.
And we need to keep track of time spent so far on each task.
And there’s probably at least one corner case I haven’t thought of yet,
because this is a concurrent system with interrupts
and there are always corner cases.
A few final thoughts:
The title of this post is “In Search of Sturdiness”
because I think that word best sums up what I’m looking for.
I don’t care all that much if a solution is elegant or graceful or clever;
I want something sturdy because I want to make more changes to the simulation
and I don’t want to have to come back and re-design this part.
AI tools aren’t much help here,
at least not the ones I’ve tried.
They’re good at generating plausible answers,
but this situation needs one that’s right as well.
This is where I’d like to make use of a tool like Alloy or TLA+,
but they aren’t already part of my repertoire,
and I don’t have time for a three-month side quest to master one of them.
I’ve tried to find an experienced SimPy user, without luck.
If you are one,
and are willing to offer some design advice,
I’d be very grateful if you would reach out.