Some time ago I had a need to “convert” a phrase, or even a whole sentence, into an identifier. By “converting to an identifier” I mean lower-casing the whole thing and changing non-letter characters into underscores. For example, “Hello, world!” should become hello_world.
I came up with some simple code to do that – but it turned out that there were some pitfalls I did not expect. Here is an early version of the function which “identifierifies” the region:
(defun identifierify (beg end)
  "Convert region to an identifier-ish."
  (interactive "r")
  (downcase-region beg end)
  (replace-regexp-in-region "[^a-z]+" "_" beg end))
It’s not at all clear what’s wrong with it – it downcases the region, then converts every run of consecutive non-letter characters to a single underscore.
The problem manifests itself when there is a run of more than one non-letter character. It turned out that this function replaced some characters even after the region. In case my explanation is unclear (which it probably is, a bit), here is an example. Assume these are the contents of a buffer:
"Hello, world!"
and assume that the region encompasses everything except the quotes. Saying M-x identifierify would then replace both the exclamation mark and the quote after it with a single underscore. I looked at the code of replace-regexp-in-region to learn why, and it turned out that it had a bug! When the replacement is shorter than the match it replaces, the end parameter (which is an integer and not a marker) does not “shift” together with the text being shortened, so the search can run past the original end of the region. (Of course, a similar problem will happen when the replacement is longer than the match – only then it might happen that some characters near the end of the region won’t get replaced.) It turns out that the issue is corrected on the master branch (though in a different way), so after downloading a newer Emacs it went away.
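To make the mechanism concrete, here is a minimal sketch of the two behaviors. It is not the actual source of replace-regexp-in-region; the helper names (my-replace-with-integer-end, my-replace-with-marker-end, my-demo) are made up for this illustration, and the loop is just my simplified reading of what such a function does.

```elisp
;; Illustration only: these helpers are hypothetical and are NOT the
;; actual source of `replace-regexp-in-region'.

(defun my-replace-with-integer-end (regexp replacement start end)
  "Replace REGEXP with REPLACEMENT between START and END.
END is a plain integer, so it stays put when the text shrinks."
  (save-excursion
    (goto-char start)
    (let ((case-fold-search nil))
      (while (re-search-forward regexp end t)
        (replace-match replacement t)))))

(defun my-replace-with-marker-end (regexp replacement start end)
  "Like `my-replace-with-integer-end', but track END with a marker."
  (let ((end-marker (copy-marker end)))
    (save-excursion
      (goto-char start)
      (let ((case-fold-search nil))
        ;; The marker moves together with the text, so the search
        ;; bound keeps pointing at the real end of the region.
        (while (re-search-forward regexp end-marker t)
          (replace-match replacement t))))
    ;; Detach the marker once we are done with it.
    (set-marker end-marker nil)))

(defun my-demo (replace-fn)
  "Run REPLACE-FN on the example buffer and return the buffer contents."
  (with-temp-buffer
    (insert "\"hello, world!\"")
    ;; The region is everything except the two quotes.
    (funcall replace-fn "[^a-z]+" "_" 2 (1- (point-max)))
    (buffer-string)))

(my-demo #'my-replace-with-integer-end) ; the closing quote is swallowed
(my-demo #'my-replace-with-marker-end)  ; the closing quote survives
```

With the integer bound, replacing the comma and the space by one underscore shortens the text by one character, so the old end position now lies after the closing quote and the next match grabs both the exclamation mark and the quote. With the marker, the bound moves back by one and only the exclamation mark is replaced.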
Also, I ended up using a slightly more sophisticated version of that code:
(defun identifierify (beg end)
  "Convert region to an identifier-ish."
  (interactive "r")
  (save-restriction
    (narrow-to-region beg end)
    (downcase-region beg end)
    (replace-regexp-in-region "[^a-z]+$" "" (point-min) (point-max))
    (replace-regexp-in-region "[^a-z]+" "_" (point-min) (point-max))))
The first replace-regexp-in-region just deletes any non-letter characters at the end of the region. (This makes sense, since an (English) sentence usually ends with punctuation, but nearly always starts with a letter, so deleting non-letters at the end is useful.) For that to work, I needed the save-restriction/narrow-to-region combo, so that the $ would indeed catch the “end” of the region. Otherwise, $ would not match unless the end parameter happened to coincide with the end of a line. (Incidentally, this very change was a good workaround for the bug I mentioned – in fact, this is precisely the way the bug was fixed in Emacs.)
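Here is a small sketch showing why the narrowing matters for $. The helper my-trailing-match-end is hypothetical and exists only for this demonstration; it searches the same example text for trailing non-letters, once with just a search bound and once after narrowing to the region.

```elisp
(defun my-trailing-match-end (narrow)
  "Search for trailing non-letters in the example region.
Hypothetical helper for illustration.  Return the end position of
the match for \"[^a-z]+$\", or nil if there is none.  When NARROW
is non-nil, narrow to the region before searching."
  (with-temp-buffer
    (insert "\"hello, world!\"")
    (let ((beg 2)
          (end (1- (point-max))))       ; exclude both quotes
      (save-restriction
        (when narrow
          (narrow-to-region beg end))
        (goto-char beg)
        ;; END only limits how far the match may extend; it does not
        ;; change what `$' considers to be the end of a line.
        (re-search-forward "[^a-z]+$" end t)))))

(my-trailing-match-end nil) ; bound only: `$' still sees the closing quote
(my-trailing-match-end t)   ; narrowed: `$' matches at the end of the region
```

The first call finds nothing, because the character right after the exclamation mark is the closing quote and not the end of a line. The second call succeeds, since after narrowing the exclamation mark sits at the end of the accessible portion of the buffer.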
By the way, replace-regexp-in-region and its cousin replace-string-in-region seem so useful (they are basically programmatic equivalents of query-replace-regexp and query-replace, respectively) that I immediately regretted not mentioning them in my Elisp book. It turned out, however, that I actually couldn’t have done that, since they were introduced in Emacs about a year ago, when the book was close to being finished (and I didn’t use the most recent Emacs right from the Git repo). Turns out that even after all those years, people still add useful (and pretty basic) things to Emacs!