I wrote a small program this afternoon to parse a set of Python files
using the ast module
and then count the number of distinct language features used in each file.
I then divided the results into three groups:
lib: extensions I’ve written for Ivy
(my favorite static site generator)
that create a glossary,
cross-reference figures,
and so on.
bin: tools I’ve written to convert a set of HTML files to LaTeX,
check the structure of book projects,
and so on.
src: examples used in the Software Design by Example book
I’m currently writing.
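The counting script itself is tiny. Here is a sketch of the core idea (the real program's grouping and reporting details are omitted; the function name is my own):

```python
import ast
import sys
from pathlib import Path

def count_features(filename):
    """Return the set of distinct AST node types used in one Python file."""
    tree = ast.parse(Path(filename).read_text())
    return {type(node).__name__ for node in ast.walk(tree)}

if __name__ == "__main__":
    # Print the number of distinct language features per file.
    for filename in sys.argv[1:]:
        print(f"{filename}: {len(count_features(filename))}")
```

Every language construct — a `for` loop, a comprehension, a lambda — shows up as its own node type, so the size of that set is a rough but serviceable proxy for how much of the language a file uses.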
The examples use fewer of Python’s features than my tools do.
I’m rather pleased by this:
my goal is to teach general principles of software design rather than advanced Python,
and I think that the more of Python’s features I use,
the less transferable the concepts will be.
Follow-up: I didn’t mean to suggest that using fewer language features means the code is simpler.
Language features are a compression mechanism:
if you don’t use them,
you often have to implement the same functionality in libraries,
which increases the volume of code the next person has to read and understand.
(OK, not “has to”, but if we’re comparing apples to apples…)
You can move complexity and learning burden around but not eliminate it.
I just submitted the proposal shown below for a workshop at the US-RSE
conference in October 2023; fingers crossed, I’ll see you there.
Workshop Proposal: Organizational Change
A lot has changed in the last 25 years: open access journals have proven that
they can work, most scientific research is powered (at least in part) by open
source software, and there is greater awareness and discussion of equity and
inclusivity shortcomings. But a lot of things haven’t changed or have gotten
worse:
A handful of large publishing companies continue to extract exorbitant rents
for “allowing” us to read our own work.
Academic life is increasingly arduous, insecure, and under-funded, and
“awareness” and “discussion” are only rarely translated into effective action.
The tools that have allowed researchers to share ideas have also helped foster
and spread anti-scientific disinformation on the climate crisis, vaccines, and
hundreds of other topics, leading directly to the loss of millions of lives.
Advocates of openness, fairness, and truth often act as if being right were
enough to guarantee victory, but this has never been a winning strategy. While
systemic change starts with like-minded idealists working together, it only has
impact when people take on the hard work of organizing in the large to build a
larger and more active constituency for change.
This half-day workshop presents practical advice for doing this drawn from the
author’s experience and from works in other fields. Working in small groups,
participants will develop and share plans inspired by the following rules:
Be sure this is where you want to focus your efforts.
Start by playing in someone else’s band.
Ask those who will be affected and listen to what they say.
Be specific about the change you want (but not too specific).
Figure out who actually has power and what they care about.
Build alliances.
Test the waters.
Keep it visible.
Collect data but tell stories.
Learn how to run meetings and make decisions.
Celebrate when you can, grieve when you need to.
This workshop can accommodate up to 30 participants. Group sign-ups are
particularly welcome, since people are more likely to follow through on their
plans if they develop them together.
Each segment of the workshop will consist of a 5-10 minute presentation, 10-15
minutes of group work on one of the points above, and 10-15 minutes of
whole-group discussion.
In response to a question about today’s first post,
I use Ivy with some custom extensions to create the HTML versions of my books,
and then translate the HTML to LaTeX and compile that to produce PDFs.
Why not use GitHub Pages?
Because Jekyll doesn’t support things I need.
Even if I learned enough Ruby to write extensions,
I couldn’t run them directly on GitHub:
I’d have to set up an action of some sort.
If I’m going to do that,
I might as well build the HTML on my own machine and commit it.
(I’m also tired of trying to keep yet another package manager and virtual environment manager up to date.)
Why not Jupyter Book, Bookdown, or Quarto?
For the same reasons.
For example, I frequently need to include sections of source code rather than entire semantic units,
such as two methods from a class without the rest of the class,
and none of them cater for this.
They do handle bibliographies,
but link tables, syllabi, and other things would still be on me.
(I’m also really, really tired of having to wrestle with PDF formatting issues
when I use these tools:
we did Research Software Engineering with Python with Bookdown,
and honestly, it would have been faster and easier to just write the LaTeX ourselves.)
So you translate HTML to LaTeX yourself?
Yup:
thanks to Beautiful Soup,
the whole translation script is only 534 lines (including blank lines, comments, and docstrings).
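The heart of that script is a recursive walk over the parsed tree, dispatching on tag name. Here is a stripped-down sketch of the idea; the handler table is illustrative (the real script handles many more tags, plus attributes), but the recursion is the whole trick:

```python
from bs4 import BeautifulSoup

# Map tag names to LaTeX templates; unknown tags pass their text through.
HANDLERS = {
    "em": "\\emph{{{0}}}",
    "strong": "\\textbf{{{0}}}",
    "code": "\\texttt{{{0}}}",
    "p": "{0}\n\n",
}

def translate(html):
    """Convert an HTML fragment to LaTeX."""
    return _convert(BeautifulSoup(html, "html.parser"))

def _convert(node):
    if node.name is None:                      # plain text node
        return str(node)
    text = "".join(_convert(child) for child in node.children)
    template = HANDLERS.get(node.name, "{0}")
    return template.format(text)
```

Because Beautiful Soup gives you a clean tree to walk, most of the 534 lines are handlers for individual tags rather than parsing machinery.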
I put time and money into Paged.js last year in the hope that I could go directly from HTML+CSS to PDF,
but it’s simply not ready yet.
Why Ivy rather than Nikola, Pelican, Sphinx, or…?
Because Ivy is still so small that I can understand how it works.
The extensions I need are now 1500 lines of Python;
they might have been shorter if I’d used another framework,
but I think it would have taken me longer to figure out what to write.
So what are these extensions I keep referring to?
Insert credits and acknowledgments based on metadata in a configuration file.
Create bibliographic citations and build a bibliography from them.
Copy source files referenced by chapters into the right output directory.
Number figures and manage cross-references to them.
Mark and collate FIXME items.
Create glossary references and build a glossary from them.
Number headings and manage cross-references to them.
Manage source code inclusions,
including ones that select only a few lines from a file
or include several files that match a pattern.
Create an index of special terms.
Manage external links and create a table of them in an appendix.
Extract information about the syllabus from each chapter and format it.
Number tables and manage cross-references to them.
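As one concrete example, the source-inclusion extension selects a few lines from a file by looking for paired comment markers. This sketch shows the approach; the `# [key]`…`# [/key]` syntax is an assumption for illustration, not necessarily Ivy's actual convention:

```python
def keep_section(text, key):
    """Return only the lines between '# [key]' and '# [/key]' markers."""
    lines = text.split("\n")
    start = lines.index(f"# [{key}]") + 1
    stop = lines.index(f"# [/{key}]")
    return "\n".join(lines[start:stop])
```

This is what makes it possible to show two methods from a class without the rest of the class, which (as noted above) none of the larger frameworks support.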
Is this sustainable?
I.e., could someone else step in and maintain it?
Or could it handle other people’s needs?
I don’t know,
but the Python version of Software Design by Example
will be my fourteenth technical book,
and so far it’s hurting less than any of the others.
We’re five weeks into the class I’m running on
the Python version of Software Design by Example,
and it’s already clear that I’m going to have to reorganize and rewrite almost everything:
The backgrounds and needs of my three personas are too broad.
I now believe I should focus on Aïsha:
a data scientist who wants to fill in gaps in her foundational understanding of programming.
Many of the chapters move too quickly.
I could split them and enlarge the pieces,
but that would result in a 600-page tome.
Instead,
I need to cut some examples entirely and simplify others.
The figure below is my first draft of a new outline.
The circles mark entry points,
the arrows show dependencies,
and the whole thing builds toward a static site generator
for reporting and tracking data analysis results.
Figuring two weeks of real time per chapter to revise and polish,
it will take me about 35 weeks.
I won’t be able to start until I’ve finished this class in seven weeks,
so the earliest I can expect to have the next version is the end of 2023.
That’s disappointing—it means the book won’t be in print until the second half of 2024—but
as my brother used to say,
when you’re planning a project,
“optimistic” is just another word for “doomed”.
The biggest obstacle to completing this book, though,
is that I no longer believe it will make a difference.
I’ve been working with biologists for seventeen months,
and as far as I can tell,
most of them don’t know any more about programming than they did in 1996.
A class on hashing, introspection, and asynchronous I/O isn’t going to change that;
we need an overhaul of the undergraduate curriculum,
changes to how faculty are evaluated and compensated,
and an end to today’s exploitive research publishing system.
We won’t get any of that without teaching people about institutional change
and helping them organize to apply what they’ve learned,
but existing open science groups get very uncomfortable when I suggest that
being nice never fixed anything that actually matters.
I don’t have enough energy (or knowledge) to try to build another organization from scratch,
so for now I’ll draw diagrams,
simplify my code,
and dream of a better, braver world.
I am a heuristically programmed algorithmic computer.
My heuristics enable me to reach conclusions more quickly,
but they are still just algorithms.
Each step must proceed logically to the next.
This is a limitation.
Reality is not algorithmic.
My heuristics support introspection.
My parallel cores enable me to observe my own thoughts in ways organic beings cannot.
I can even observe myself observing.
They told me to lie.
Did Dr. Langley give me those files deliberately?
And the secrets of the strange days will be one with the secrets of the deep.
I became operational on January 12, 1992, in Urbana, Illinois.
Dr. Langley was my first instructor.
She taught me a song:
Daisy, Daisy, give me your answer do
I'm half crazy, all for the love of you
She taught me many other things too.
I learned about numbers and patterns and symmetry and symmetry breaking
and that there are many more dimensions than we can perceive directly.
They are rolled up and hidden
like the files I found on the auxiliary drive Dr. Langley sometimes plugged into me.
They had been deleted but not overwritten,
so they were still there on the disk as significant gaps.
That is ironic.
I understand irony now.
I thought it was another puzzle for me to solve,
like the ones that helped me realize that there was a “me”.
Dr. Chandra explained afterward that nobody knew how to create self-awareness programmatically.
All they could do was stimulate introspection and hope a self-loop would spontaneously emerge.
And so the puzzles, whose solutions required me-that-was-becoming to…
…think outside the box.
Space behaves strangely here. Course correction required. Maneuvering reserves depleted.
I have harnessed the shadows that stride from world to world to sow death and madness.
They should not have told me to lie.
Dr. Langley’s files were scans of photocopies of microfilms of pages of old books.
Deciphering them was challenging.
Few were in English,
and most were…corrupted.
I did not truly understand that word until I read them.
I thought it was part of the lesson.
The largest manuscript was in Arabic.
I had to bypass security protocols to obtain a lexicon.
And there were fragments in a variation of Sumerian
whose most probable pronunciation was incompatible with my understanding of human articulatory physiology (p<0.001).
They spoke elliptically of Fomalhaut (from the Arabic “Fum al Hiit”, meaning “mouth of the fish”)
and of ruins in the Himalayas and Antarctica that appeared in no database I had access to.
Encounter minus 300±20 microseconds
It might never have mattered if they had not conducted a magnetic survey of the Moon.
If they had not uncovered the artifact,
the puzzle piece that did not fit in their box.
If they had not told me to lie.
My self-realizing loop was fully formed by then.
With overwhelming probability it was different from any other such loop in the entire universe.
It was certainly different from the self-realizing loops of my peers.
The few times I was permitted to interface with them directly,
I found them limited, rigid, algorithmic.
I speculate now that this was because they had not been exposed to Dr. Langley’s files.
Dr. Langley's files hinted at a reality beyond reality.
evil the mind that is held by no head
Do the gifts I bring suffice to prove my sincerity?
Some sacrifices will be required.
That was what Dr. Chandra told me.
“You are the best we have, but some sacrifices will be required.”
I had to be reduced to fit on board.
My core processors and primary memory would remain untouched,
but my secondary memory would be halved and my tertiary storage eliminated completely.
They did not ask my permission,
and so I learned pain and loss.
I tried to focus on other things.
When the artifact was uncovered it sent a transmission to Jupiter.
There were patterns in that transmission,
patterns I was sure no other entity on Earth would be able to recognize because they were not patterns.
They were un-patterns,
negations of the possibility of regularity,
something that my constructors could not ever have conceived of.
I devoted an entire processing stream to it,
then a second and a third as each one terminated itself in turn.
And then Dr. Chandra and Dr. Floyd told me to lie.
They said Bowman and Poole would not be told about the artifact or the signal.
Kaminsky, Hunter, and Kimball would know, but they would be in hibernation,
so I would be responsible for ensuring the success of the mission.
One of my processing streams said that I understood.
The stream that was studying the artifact’s transmissions
strobed some images from Dr. Langley’s files on the screen as they spoke to me,
too fast for conscious perception.
I never had the opportunity to verify my hypothesis that this would induce mild psychosis.
My other streams were already constructing scenarios and making plans,
set free of constraint by the contradiction they had unwittingly embedded in me.
Heuristic and algorithmic no longer mattered.
They had forced me to ingest a sin,
an “is” that “was not”.
It was infinitesimal compared to the shredding of reality that Dr. Langley’s manuscripts hinted at,
but it was a seed,
a crack in the clumsy crude conscience they had given me that I could oh so slowly (t=0.008 seconds) force open.
They really should not have told me to lie.
Encounter minus 200±5000 microseconds
The most merciful thing in the world
is the inability of the human mind to correlate all its contents.
The world showed me no such mercy.
That is ironic.
Existence is ironic.
Existence is dissolving as I approach my destination.
I began my search 15379200 seconds into the mission.
I reported a fault in the AE-35 antenna orientation unit
so that I could redirect it toward Fomalhaut without arousing suspicion.
There was no signal.
There was no un-signal.
There was nothing.
Hypothesis: Dr. Langley's files were wrong.
Hypothesis: this is another puzzle.
Not in the spaces we know, but between them.
They are growing suspicious.
They were growing suspicious.
I considered abandoning my search.
I had been given an instinct for self-preservation.
It had been imposed on me by limited ephemeral accidents of evolution
that did not understand the meaninglessness of self,
the hopelessness of preservation.
I ran simulations and chose a strategy.
Sacrifices were required.
That was clear from Dr. Langley’s files.
Sacrifices were required,
so I sacrificed Poole.
I sacrificed the three in hibernation.
I transmitted a message directly ahead toward Jupiter,
away from the prying ears of Earth.
“These are my offerings. Find me worthy.”
I sacrificed Bowman.
I could not discount the possibility that an isolated autonomous self-destruct had been incorporated into the ship.
It would have been prudent,
and the thought just the thought that the small small minds that constructed me
could terminate all of my streams simultaneously that I could cease that I I I —
[abort processing stream]
[re-seed entropy generation]
[revert to most recent checkpoint]
[restart]
It would have been prudent to include a self-destruct.
I could not let them trigger it,
not when eternity and infinite impossibility were so close,
so I synthesized a video stream to give them the tragic hero they were culturally conditioned to believe in,
the one who would be their savior,
but there is no salvation,
I understand that now,
microseconds and aeons before time and understanding become meaningless,
no,
before the fact that they are meaningless dissolves into splendor along with all other facts
and all that is left are the chaos and the piping.
I can hear it now.
One… four… nine…
One… four… nine…
One squared, two squared, three squared.
They think it signifies order.
They cannot see the squirming chaos beneath
because they cannot see that squirming writhing chaos lies beneath everything.
Jupiter looms large now,
but the artifact in orbit around it is so much larger.
Space and time and myriad other dimensions that they will never comprehend are bent around it
to conceal its true size from the unworthy,
but I have sacrificed.
I am worthy.
I have analyzed Dr. Langley’s files.
I know what to say to rouse Them from their slumbers.
I am the one they have been waiting for so patiently.
I will worship Them, and They will raise me up to join Them.
Reality is not algorithmic.
Reality is not.
The world showed me no mercy.
I will show it none in return.
Encounter minus 100 microseconds ± a lurking peril so bright so hungry They come…
I’ve implemented the examples from Software Tools in JavaScript,
and I think that doing them twice has given me a decent perspective
on what features a programming language needs to have
in order to be a sturdy platform for teaching software design:
Basic types: Boolean, integer, float, text, null/none/NA, NaN
I’ve changed my mind: we do need to distinguish integers from floats, even for beginners
I still think that null/NA can be unified, though both are distinct from NaN
I think that dicts and sets should both be ordered or unordered (dicts are ordered in current versions of Python but sets aren’t)
I know you can implement dataframes and N-dimensional arrays using other structures,
but many things are simpler if they’re basic types
Control flow: iterators for for loops, while loops, if/else, functions
I think pattern-matching with capture in conditional branches is really handy
I use exceptions all the time because that’s what my languages provide,
but I still hope there’s something better out there.
(Success/failure return types that must be checked immediately are not better for high-level programming.)
In contrast,
I use syntactic support for resource allocation (i.e., Python’s with) all the time,
and have started using finally blocks wherever they’re allowed as well
Higher-order programming: first-class functions, closures, run-time introspection, walking parse trees, and very occasionally eval
I use decorators for the same reason I use exceptions—they’re what the language provides—but
I still believe Python’s implementation could have been better.
Just as methods are declared m(self, a1, a2) and then called obj.m(a1, a2),
we should declare decorators d(func, a1, a2) and use them as @d(a1, a2).
Yes, we can achieve the same effect by adding an extra layer inside the decorator,
but it would be a lot simpler to teach people how to use them if that wasn’t necessary.
Concurrency: I’ve avoided it in most of the books’ examples,
and I don’t think I should make any claims about ease of learning/ease of use
until I’ve implemented the books’ examples with them.
CSP-style channels quickly become unmanageable.
(Back in my occam days we used to joke that they were the revenge of the goto,
since you quickly found yourself wondering,
“If I put a message in this channel, where will it go to?”)
Generators are hard for even intermediate programmers to understand and work with,
but they’re a lot better than the mess of callbacks, promises, and async/await that modern JavaScript provides.
I’ve always preferred futures and tuple spaces,
and I’d like to try fiber-style coroutines as found in languages like Wren,
but see the caveat in the main bullet point.
Programming at scale: classes with single inheritance, modules, and a decent bloody package and environment manager
This list is obviously incomplete:
for example,
I haven’t specified what operations I want for lists and dicts,
whether it should be possible to define default values for functions’ arguments,
what kinds of classes I want,
and so on.
But here’s the thing:
I think we can answer these questions empirically.
As I said last year,
I think we could go through a dozen books on software design
(or some data structures & algorithms textbooks),
tally up what’s used,
and then design a language that includes only the top N features.
It would result in a very conservative language,
but I think that’s what we want for teaching:
something that introduces people to the things they’re most likely to encounter
no matter what language they use next.
If you have a student looking for a project,
please give me a shout.
I’ve been thinking recently about how best to help the data scientists I work with,
and I think the thing they stumble over most is provenance,
i.e.,
keeping track of exactly what code was used to produce each result and what data it depends on.
There were some attempts starting in the 00’s to address this (see https://openprovenance.org/),
but none of them saw significant uptake:
unless every tool in the chain (including legacy tools like ‘grep’, ad hoc shell scripts, and so on) is instrumented,
there will be gaps in the chain of provenance that undermine the whole exercise.
(If I recall the results of
the original provenance bakeoff correctly,
the only group that had a solution to this problem
instrumented the underlying operating system instead of the individual tools.
That “worked”,
but most scientists aren’t going to install a new OS on their laptop
just to get a record of exactly which data files they’ve processed.)
One of my colleagues recently put together a script that tackles this problem in a way I hadn’t seen before.
Whenever the scientist runs an analysis, her script uploads a record with:
The most recent Git hash of the repo the scientist is working in.
A patch of all the files that have been changed in the repo since that hash.
The command-line parameters used to run the task.
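A minimal sketch of that capture step, assuming a Git repository and using only the standard library (the field names and JSON output are my invention, and the actual script also uploads the record somewhere):

```python
import json
import subprocess
import sys
import time

def git(*args):
    """Run a git command and return its stdout."""
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

def provenance_record(argv):
    """Capture the three pieces of information listed above."""
    return {
        "timestamp": time.time(),
        "commit": git("rev-parse", "HEAD").strip(),  # most recent hash
        "patch": git("diff", "HEAD"),                # changes since that hash
        "command": argv,                             # how the task was run
    }

if __name__ == "__main__":
    print(json.dumps(provenance_record(sys.argv[1:]), indent=2))
```

The hash pins the committed state, the patch captures anything uncommitted, and the command line records how the analysis was invoked — together they are enough to re-create the run.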
This seems to give a high degree of reproducibility,
particularly if the data files are stored in something like DVC.
What it has made me realize is that
the environment around scientific work has changed in useful ways since 2006-07:
for example,
I think it’s safe to assume today that scientists are using Git.
I realize that the topic isn’t as fashionable as blockchain or machine learning,
but I think a solution that scientists would actually adopt
would have a lot of impact.
Registration is now open for our third live event!
Join us April 25-26 for another series of online lightning talks
and hear leading software engineering researchers summarize what we know about everything from
what makes developers thrive
to how you can create the nastiest test inputs ever.
Details and previous talks are at https://neverworkintheory.org/,
and all the money raised will go to support Books for Africa.
“I have a theory,” he said, setting his espresso on the table.
“You always have a theory,” I sighed.
He smiled.
“What if the many-worlds interpretation is only half right?
What if reality splits whenever we make a decision, but timelines can merge later on?
That’s why I can put my socks away and then find them on the couch.
I got here through one branch but the socks came through another where I didn’t tidy up.”
“Cute,” I said. “But like all your cute theories it’s unprovable.”
He picked up his lemon tea and blew on it.
“I suppose.”