Marcin Borkowski: 2022-10-03 Converting words and sentences to identifiers

Some time ago I had a need to “convert” a phrase, or even a whole sentence, into an identifier. By “converting to an identifier” I mean lower-casing the whole thing and changing non-letter characters into underscores. For example, “Hello, world!” should become hello_world.

I came up with some simple code to do that – but it turned out that there were some pitfalls I did not expect. Here is an early version of the function which “identifierifies” the region:

(defun identifierify (beg end)
  "Convert region to an identifier-ish."
  (interactive "r")
  (downcase-region beg end)
  (replace-regexp-in-region "[^a-z]+" "_" beg end))

It’s not at all clear what’s wrong with it – it downcases the region, then converts every run of consecutive non-letter characters to a single underscore.

The problem manifests itself when there is a run of more than one non-letter character: the function then also replaces some characters past the end of the region. In case my explanation is unclear (which it probably is, a bit), here is an example. Assume these are the contents of a buffer:

"Hello, world!"

and assume that the region encompasses everything except the quotes. Saying M-x identifierify would then replace both the exclamation mark and the quote after it with a single underscore. I looked at the code of replace-regexp-in-region to learn why, and it turned out that it had a bug! When the replacement is shorter than the match it replaces, the end parameter (which is an integer and not a marker) does not “shift” together with the text being shortened. Here, replacing the comma and the space with one underscore makes the buffer one character shorter, so the unchanged end now points past the closing quote instead of before it. (Of course, a similar problem will happen when the replacement is longer than the match, only then it might happen that some characters at the end of the region won’t get replaced.) It turns out that the issue is corrected on the master branch (though in a different way), so after downloading a newer Emacs it went away.
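To illustrate the difference between an integer position and a marker, here is a minimal sketch of the same conversion written as an explicit search loop whose bound is a marker. This is not how the bug was fixed in Emacs, and the name my-identifierify-with-marker is made up for this example.

```elisp
;; A sketch only: `my-identifierify-with-marker' is a hypothetical name.
(defun my-identifierify-with-marker (beg end)
  "Downcase the region and replace runs of non-letters with underscores.
END is tracked with a marker, so it follows the text as it shrinks."
  (interactive "r")
  (let ((end-marker (copy-marker end)))
    (downcase-region beg end)
    (save-excursion
      (goto-char beg)
      ;; The marker moves whenever text before it is shortened, so the
      ;; search bound always stays at the real end of the region.
      (while (re-search-forward "[^a-z]+" end-marker t)
        (replace-match "_" t t)))
    ;; Detach the marker so that it no longer has to be updated.
    (set-marker end-marker nil)))
```

With the example buffer above and the region set as described, this version should leave the closing quote alone, because the bound shifts left by one character when the comma and the space are replaced.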

Also, I ended up using a slightly more sophisticated version of identifierify:

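(defun identifierify (beg end)
  "Convert region to an identifier-ish."
  (interactive "r")
  (save-restriction
    ;; Narrow so that `$' can match at the end of the region.
    (narrow-to-region beg end)
    (downcase-region beg end)
    ;; Drop trailing non-letters (e.g. final punctuation).
    (replace-regexp-in-region "[^a-z]+$" "" (point-min) (point-max))
    ;; Turn every remaining run of non-letters into one underscore.
    (replace-regexp-in-region "[^a-z]+" "_" (point-min) (point-max))))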

The first replace-regexp-in-region just deletes any non-letter characters at the end of the region. (This makes sense, since an (English) sentence usually ends with punctuation, but always starts with a letter, so deleting non-letters at the end is useful.) For that to work, I needed the save-restriction/narrow-to-region combo, so that the $ would indeed catch the “end”. Otherwise, $ would not match if the end parameter did not coincide with the end of a line. (Incidentally, this very change was a good workaround for the bug I mentioned – in fact, this is precisely the way the bug was fixed in Emacs.)
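The behavior of $ described above can be checked with a small experiment. The following is a sketch for illustration only: the function name my-dollar-demo and the sample text are made up, and it uses plain re-search-forward rather than replace-regexp-in-region.

```elisp
;; Illustration only: `my-dollar-demo' is a hypothetical helper.
(defun my-dollar-demo ()
  "Return the results of a bounded and a narrowed search for \"[^a-z]+$\"."
  (with-temp-buffer
    (insert "foo! bar")
    (list
     ;; Bounded search: position 5 is not the end of a line, so `$'
     ;; cannot match there even though the search stops at that point.
     (progn
       (goto-char (point-min))
       (re-search-forward "[^a-z]+$" 5 t))
     ;; Narrowed search: position 5 is now the end of the accessible
     ;; text, which `$' treats like the end of a line.
     (save-restriction
       (narrow-to-region 1 5)
       (goto-char (point-min))
       (re-search-forward "[^a-z]+$" nil t)))))
```

Evaluating (my-dollar-demo) should give (nil 5): the bounded search finds nothing, while the narrowed one matches the exclamation mark and stops right after it.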

By the way, replace-regexp-in-region and its cousin replace-string-in-region seem so useful (they are basically programmatic equivalents of query-replace-regexp and query-replace, respectively) that I immediately regretted not mentioning them in my Elisp book. However, I actually couldn’t have done that, since they were introduced in Emacs about a year ago, when the book was close to being finished (and I didn’t use the most recent Emacs right from the Git repo). Even after all those years, there are still useful (and pretty basic) things people add to Emacs!
