Git, Graphs, and Software Engineering

Posted 2017-09-30

A couple of years ago, I complained that distributed version control still hadn’t had its structured revolution. After yet another discussion about how useful Git is versus how hard it is to learn, I have a proposal:

1. Download data from several thousand large projects on GitHub.
2. Use your favorite statistical techniques to identify patterns in those repositories’ branch-and-merge graphs. (To increase the likelihood of your proposal being funded, say that you’re using machine learning rather than statistics.)
3. Select a small set of common subgraphs that account for a large fraction of everyday use.
4. Build a tool that provides those, and only those, to users. (For bonus marks, do a field study to see if it’s actually easier for newcomers to learn using the methods that Stefik, Hannenberg, and others have pioneered.)
5. Profit. Well, fame. OK, will you settle for having made the world a better place?
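
To make the first two steps a little more concrete, here is a minimal sketch of the kind of raw material I have in mind. It assumes you have saved the output of `git log --all --format='%H %P'` (one commit hash per line, followed by its parent hashes) and only tallies the simplest local shapes in the graph. The function names and shape labels are my own invention for illustration; a real study would need proper subgraph mining, not just parent counts.

```python
from collections import Counter


def parse_log(text):
    """Map each commit hash to the list of its parent hashes.

    Expects the output of: git log --all --format='%H %P'
    """
    parents = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            parents[fields[0]] = fields[1:]
    return parents


def classify(parents):
    """Count commits by local shape (labels are illustrative, not standard)."""
    # How many children does each commit have?
    children = Counter(p for ps in parents.values() for p in ps)
    shapes = Counter()
    for commit, ps in parents.items():
        if len(ps) == 0:
            kind = "root"
        elif len(ps) == 1:
            kind = "linear"
        elif len(ps) == 2:
            kind = "merge"
        else:
            kind = "octopus-merge"
        shapes[kind] += 1
        # A commit with more than one child is where history forks.
        if children[commit] > 1:
            shapes["branch-point"] += 1
    return shapes


# A toy history: a -> b, then b forks into c and d, which e merges.
SAMPLE = """\
e d c
d b
c b
b a
a
"""

if __name__ == "__main__":
    for shape, count in sorted(classify(parse_log(SAMPLE)).items()):
        print(shape, count)
```

On the toy history this reports one root, three linear commits, one merge, and one branch point. The interesting work is in replacing `classify` with something that recognizes larger recurring subgraphs across thousands of repositories.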

Step 3 is speculative: I have no evidence that usage patterns follow a long-tailed distribution, but I think most of us would be surprised if they did not. Step 4 is the one that will lead to shouting: as happened when structured languages eliminated goto statements, a minority of very vocal programmers will cite fringe cases that can’t be handled by your chosen set of simple constructs. Everyone else (the people Hanselman dubbed “dark matter developers”) will thank you for applying the scientific method to the design of something useful. That was how one of my first professors defined “engineering”, and I think we’d all be better off if we did it.
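
If the pattern counts from Step 2 were in hand, checking the Step 3 guess would be straightforward: sort patterns by frequency and see how few are needed to cover a chosen share of all occurrences. The sketch below does that. The function name, the 80% target, and the counts in the example are all made up for illustration; I repeat that I have no real data here.

```python
def smallest_covering_set(counts, target=0.8):
    """Return the most frequent patterns needed to reach `target` coverage.

    `counts` maps a pattern label to how often it was observed.
    Returns (chosen_patterns, fraction_actually_covered).
    """
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    chosen, covered = [], 0
    # Greedily take patterns from most to least frequent.
    for pattern, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target * total:
            break
        chosen.append(pattern)
        covered += n
    return chosen, covered / total


if __name__ == "__main__":
    # Invented numbers, purely to show the calculation.
    invented = {
        "linear": 700,
        "two-parent merge": 200,
        "octopus merge": 60,
        "criss-cross merge": 40,
    }
    print(smallest_covering_set(invented))
```

With these invented counts, two of the four patterns cover 90% of occurrences. Whether real repositories look anything like that is exactly the open question.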

See also Perez De Rosso and Jackson’s thought-provoking papers “What’s Wrong With Git? A Conceptual Design Analysis” and “Purposes, Concepts, Misfits, and a Redesign of Git”. If anyone reading this has time, interest, and graph analysis expertise, please get in touch…

After reading some online discussion of this post, I’d like to clarify that:

I’m not suggesting yet another set of aliases for Git: Pascal wasn’t macros on top of Fortran, and I don’t think a structured distributed VCS will be a layer on top of Git. (I also believe that any such abstraction will leak early and leak often…)
I agree that a lot of how people use Git isn’t captured in commit logs, but that’s what we have easiest access to.
Yes, it would make a lot of sense to mine the histories of other distributed version control systems like Mercurial as well.

Categories: programming, proposal