FIDEX: Filtering Spreadsheet Data using Examples
Xinyu Wang, Sumit Gulwani, and Rishabh Singh: "FIDEX: Filtering Spreadsheet Data using Examples". OOPSLA 2016, https://www.cs.utexas.edu/~xwang/pubs/oopsla16.pdf.
Data filtering in spreadsheets is a common problem faced by millions of end-users. The task of data filtering requires a computational model that can separate intended positive and negative string instances. We present a system, FIDEX, that can efficiently learn desired data filtering expressions from a small set of positive and negative string examples.
There are two key ideas of our approach. First, we design an expressive DSL to represent disjunctive filter expressions needed for several real-world data filtering tasks. Second, we develop an efficient synthesis algorithm for incrementally learning consistent filter expressions in the DSL from very few positive and negative examples. A DAG-based data structure is used to succinctly represent a large number of filter expressions, and two corresponding operators are defined for algorithmically handling positive and negative examples, namely, the intersection and subtraction operators. FIDEX is able to learn data filters for 452 out of 460 real-world data filtering tasks in real time (0.22s), using only 2.2 positive string instances and 2.7 negative string instances on average.
Another solid piece of engineering, and another reason to be sad that I won't be at SPLASH this year. The authors start by creating a small language to describe the things that people do when filtering spreadsheet data, like starts-with, ends-with, matches, and contains. They then create a sound, complete, and efficient algorithm that learns filter expressions built from these operations, using a DAG to represent many candidate expressions at once and intersection and subtraction operators to take positive and negative examples into account. Finally, they show that their algorithm learns both quickly enough and well enough to be useful to real people. And yes, it only works on strings (for now), and cannot yet handle noisy examples, but given the quality of the engineering shown off here, I'm confident the authors will have lots to report on both fronts soon.
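To make the role of positive and negative examples concrete, here is a toy sketch of my own, not the paper's algorithm. It assumes a far smaller language than FIDEX's DSL (only starts-with, ends-with, and contains over literal text, with no disjunction) and it stores candidates in a plain Python set rather than a DAG. The names `Filter`, `candidates`, and `learn` are hypothetical. What it does preserve is the shape of the idea: each further positive example intersects the candidate set, and each negative example subtracts from it.

```python
from __future__ import annotations

from typing import NamedTuple


class Filter(NamedTuple):
    """A hypothetical one-predicate filter over literal text."""
    kind: str  # "startswith", "endswith", or "contains"
    text: str

    def matches(self, s: str) -> bool:
        if self.kind == "startswith":
            return s.startswith(self.text)
        if self.kind == "endswith":
            return s.endswith(self.text)
        return self.text in s


def candidates(example: str) -> set[Filter]:
    """Every filter (with non-empty text) in the toy language that accepts `example`."""
    out: set[Filter] = set()
    n = len(example)
    for i in range(n):
        for j in range(i + 1, n + 1):
            piece = example[i:j]
            out.add(Filter("contains", piece))
            if i == 0:
                out.add(Filter("startswith", piece))
            if j == n:
                out.add(Filter("endswith", piece))
    return out


def learn(positives: list[str], negatives: list[str]) -> set[Filter]:
    """Return all toy filters that accept every positive and reject every negative."""
    if not positives:
        raise ValueError("need at least one positive example")
    consistent = candidates(positives[0])
    for p in positives[1:]:
        consistent &= candidates(p)    # intersection: must also accept p
    for neg in negatives:
        consistent -= candidates(neg)  # subtraction: must not accept neg
    return consistent


result = learn(positives=["INV-001", "INV-207"], negatives=["PO-001", "REF-INV"])
for f in sorted(result):
    print(f.kind, repr(f.text))
```

Enumerating every candidate explicitly like this blows up quickly on long strings, which is presumably why the paper invests in a compact DAG representation instead. A real tool would also need to rank the surviving filters, which this sketch does not attempt.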