What the World Needs Now Is Diffs, Diffs, Diffs
I first heard the term “grand challenge” used in the 1980s to describe the kinds of big projects that would give an entire generation of scientists a focus for their work—something on the scale of putting a person on the moon, or sequencing the human genome. The phrase has since been applied to many other things, most recently DARPA’s autonomous vehicle program.
So, here’s a grand challenge for the open source community: build a comprehensive library of file differencing tools, so that I can usefully put things other than flat text under version control. If someone modifies an image, a PDF, or a Word document, I ought to be able to pull up a view of the changes, just as I would for source code or a README file. And if someone adds a couple of attributes to a <table> element in an HTML page, show me that, not the hundred and one places where your editor rearranged the order of pre-existing attributes in ways that aren’t semantically meaningful. If I could see what you’ve added to our project’s use cases or class diagrams, I might just use UML more often; if I could visually merge Gnumeric spreadsheets, I might use them to store grades, rather than tab-separated text files.
There’s lots of research to be done here, lots to be invented. I have a hazy notion of how a diff tool for images might work, but what about sound? Or video? Is there some “deep structure” that unites AutoCAD and VHDL, or some unified algorithm capable of handling all vector graphics formats? Even if there isn’t—even if we only wind up able to handle the hundred most common file formats—we’ll have made our lives much, much better.
Have a look at this:
SSDDiff: a diff for semistructured data
ssddiff.alioth.debian.org
The author of the above tool also wrote this interesting thesis: “Structure-Preserving Difference Search in Semistructured Data”. It includes comparisons against Logilab xmldiff (written in Python!) that show how his algorithm improves the result.
The hard part isn’t the diff itself; it’s figuring out what the most meaningful linear tokenization of the source files is, and ensuring that the files are normalized in a way that does not obscure the semantics.
So, for your table example, the source files must first be parsed into a DOM tree and then re-serialized with a standard indentation and attribute order before diff will work. After that, it works very well at highlighting the semantic change, since the normalization ensures that the meaningful unit of change (a line) is tokenized identically between the two versions.
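Here is a minimal sketch of that normalize-then-diff idea, assuming well-formed XHTML-style input that Python's xml.etree.ElementTree can parse (real-world HTML would need a more forgiving parser). The names normalize and semantic_diff are made up for illustration, and putting each attribute on its own line is just one possible tokenization. The output is meant for diffing only; it is not a valid re-serialization, since attribute values are not escaped.

```python
import difflib
import xml.etree.ElementTree as ET


def normalize(markup: str) -> list[str]:
    """Parse well-formed markup and emit one token per line in a canonical order."""
    lines: list[str] = []

    def walk(node: ET.Element, depth: int) -> None:
        pad = "  " * depth
        lines.append(f"{pad}<{node.tag}")
        # Sort attributes by name so editor-induced reordering disappears.
        for name in sorted(node.attrib):
            lines.append(f'{pad}    {name}="{node.attrib[name]}"')
        lines.append(f"{pad}>")
        text = (node.text or "").strip()
        if text:
            lines.append(f"{pad}  {text}")
        for child in node:
            walk(child, depth + 1)
            # Text that follows a child element belongs to the parent.
            tail = (child.tail or "").strip()
            if tail:
                lines.append(f"{pad}  {tail}")
        lines.append(f"{pad}</{node.tag}>")

    walk(ET.fromstring(markup), 0)
    return lines


def semantic_diff(old: str, new: str) -> list[str]:
    """Line-based unified diff over the normalized forms of two documents."""
    return list(
        difflib.unified_diff(normalize(old), normalize(new), "old", "new", lineterm="")
    )


# Hypothetical inputs: the second version reorders two attributes and adds one.
old = '<table border="1" width="100%"><tr><td>x</td></tr></table>'
new = '<table width="100%" cellpadding="4" border="1"><tr><td>x</td></tr></table>'

for line in semantic_diff(old, new):
    print(line)
```

With these two inputs, the only changed line in the diff is the added cellpadding attribute; the swap of border and width does not show up at all.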
Being able to put stuff under version control and actually read/interpret the diff output is one of the reasons I stopped using word processors and switched every non-trivial document I create or edit to LaTeX (I know that MS Word has some sort of revision history metadata voodoo in its file format now, but storing the revision history with the document itself has proven to be an embarrassingly bad idea – just ask the British government).
I’m not sure why more programmers don’t use LaTeX — it seems like a natural fit for that segment of the population, looks a lot nicer when typeset, and has the aforementioned benefit of being CVS/SVNable, not to mention segmented across multiple files, just like regular source code. Perhaps the word processor is such a ubiquitous cultural element that by the time most people realize there are alternatives, they’re unwilling to put the effort into switching.
Storing things in a sparse, structural markup like LaTeX doesn’t completely solve the problem you pose regarding semantically insignificant edits, but barring some sort of semantic meta-markup to be superimposed on top of everything we do, the problem seems rather intractable. I look forward to someone much smarter than myself proving me wrong.