2015-01-21 Tree Edit Distance

Today I attended a very interesting lecture on so-called tree edit distance. The lecturer, Mateusz Pawlik, who graduated from my university, has recently earned his PhD at Free University of Bozen-Bolzano in Italy and is currently working at Salzburg.

As I am not very well versed in combinatorial algorithms, the technicalities were a bit overwhelming for me (even though the presentation was really very well done – the balance between ideas and formality seemed just about right). The basic idea is that tree edit distance (as the name suggests) is an analogue of the Levenshtein metric, but for trees. Here, the basic operations are node deletion and insertion, and relabeling (we work on trees with ordered and labeled nodes).

It seems that Mateusz and the team he’s a member of are doing a very good job. What was especially interesting for me were two things. Since mathematical formulae are a good example of the numerous things with tree-like structure, one could apply the algorithm discussed to measure similarity between equations. This might help with the problem on searching for equations in e.g. databases of mathematical papers. (Apparently, some work has been done on exactly that; I will look into this in some Spare Timeā„¢.)

The second thing – when I realized that this might be used for that – immediately sent shivers down my spine. As all Lispers know, Lisp programs are basically abstract syntax trees without any syntactic sugar (well, almost any). What if we were able to find the (minimal) set of basic operations to get from tree A to tree B, and use it to display a diff between two versions of some Lisp code? It might be extremely useful (not that line-based diffs are useless, but they don’t survive any major reformatting too well). Of course, this is not trivial: their algorithm computes the distance and not the edit script (as Mateusz explained to me during the Q&A session after the lecture). But from what I understood, in principle it could be done. Of course, there remains the question of the exact repertoire of the “basic operations”: they might not be e.g. paredit's slurping and barfing, for instance. Nevertheless, it looks promising.

Now the “official” implementation (i.e. the one done by the researchers) was done in Java. Just wondering whether redoing it in a more reasonable language would be an interesting project (not one I would like to tackle, though, at least not in the foreseeable future: neither my Lisp-fu nor my combinatorics-fu are sufficient for that, I’m afraid.) Definitely, using Emacs Lisp might not be the best idea (even though being able to diff ASTs in Emacs would be so cool): even though their algorithm is fast, it would probably choke on any medium-sized Lisp project (a wc on one of my projects – and a small one, for that matter – showed 1240 words for about 200 SLOC. Even taking into account that many of these are in the docstrings, this amounts to, say, five or so nodes of the AST per one line of code. Now take simple.el‘s circa 8000 lines…)

As a side note: many of my fellows working in mathematical analysis seem to be thinking that algorithms/programming/graphs/combinatorics are trivial because everything is, well, finite. You know what? It bugs me. It’s really a pity that they did not attend the lecture. Let me dedicate to them this quote of E. W. Dijkstra:

From a bit to a few hundred megabytes, from a microsecond to a half an hour of computing confronts us with completely baffling ratio of 10^9! The programmer is in the unique position that his is the only discipline and profession in which such a gigantic ratio, which totally baffles our imagination, has to be bridged by a single technology. He has to be able to think in terms of conceptual hierarchies that are much deeper than a single mind ever needed to face before. Compared to that number of semantic levels, the average mathematical theory is almost flat. By evoking the need for deep conceptual hierarchies, the automatic computer confronts us with a radically new intellectual challenge that has no precedent in our history.

Summing up: it seems that a lot of fascinating research is being done. We live in an interesting world!

CategoryEnglish, CategoryBlog, CategoryLaTeX, CategoryEmacs