Some time ago I had a need to “convert” a phrase, or even a whole sentence, into an identifier. By “converting to an identifier” I mean lower-casing the whole thing and changing non-letter characters into underscores. For example, “Hello, world!” should become hello_world.
I came up with some simple code to do that – but it turned out that there were some pitfalls I did not expect. Here is an early version of the function which “identifierifies” the region:
(defun identifierify (beg end)
  "Convert region to an identifier-ish."
  (interactive "r")
  (downcase-region beg end)
  (replace-regexp-in-region "[^a-z]+" "_" beg end))
It’s not at all clear what’s wrong with it – it downcases the region, then converts every run of consecutive non-letter characters to a single underscore.
The problem manifests itself when there is a run of more than one non-letter character. It turned out that this function replaced some characters even after the region. In case my explanation is unclear (which it probably is, a bit), here is an example. Assume these are the contents of a buffer:
"Hello, world!"
and assume that the region encompasses everything except the quotes. Saying M-x identifierify would then replace the exclamation mark and the quote after it with a single underscore. I looked at the code of replace-regexp-in-region to learn why, and it turned out that it had a bug! When the replacement is shorter than the match it replaces, the end parameter (which is an integer and not a marker) does not “shift” together with the text being shortened. (Of course, a similar problem will happen when the replacement is longer than the match – only then it might happen that some characters won’t get replaced.) It turns out that the issue is corrected on the master branch (though in a different way), so after downloading a newer Emacs it went away.
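To make the difference between an integer bound and a marker bound concrete, here is a minimal sketch of a hand-written replacement loop. It is not the code of replace-regexp-in-region, only an illustration of the same kind of problem; the name my-replace-up-to is made up for this example.

```elisp
;; Hypothetical helper (not part of Emacs): replace every match of REGEXP
;; between BEG and BOUND.  BOUND may be an integer or a marker.
(defun my-replace-up-to (regexp replacement beg bound)
  "Replace matches of REGEXP with REPLACEMENT from BEG up to BOUND."
  (save-excursion
    (goto-char beg)
    ;; An integer BOUND stays fixed while the text shrinks or grows;
    ;; a marker BOUND moves together with the text.
    (while (re-search-forward regexp bound t)
      (replace-match replacement t t))))

;; Integer bound: after ", " shrinks to "_", position 15 lies past the
;; closing quote, so the quote gets swallowed together with "!".
(with-temp-buffer
  (insert "\"hello, world!\"")
  (my-replace-up-to "[^a-z]+" "_" 2 15)
  (buffer-string))   ; should give "\"hello_world_"

;; Marker bound: the marker follows the text, so the quote survives.
(with-temp-buffer
  (insert "\"hello, world!\"")
  (let ((end (copy-marker 15)))
    (my-replace-up-to "[^a-z]+" "_" 2 end)
    (set-marker end nil)          ; detach the marker when done
    (buffer-string)))  ; should give "\"hello_world_\""
```

The marker variant is just one way to keep the bound in sync with the text; as mentioned, Emacs itself took a different route.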
Also, I ended up using a slightly more sophisticated version of that code:
(defun identifierify (beg end)
  "Convert region to an identifier-ish."
  (interactive "r")
  (save-restriction
    (narrow-to-region beg end)
    (downcase-region beg end)
    (replace-regexp-in-region "[^a-z]+$" "" (point-min) (point-max))
    (replace-regexp-in-region "[^a-z]+" "_" (point-min) (point-max))))
The first replace-regexp-in-region just deletes any non-letter characters at the end of the region. (This makes sense, since an (English) sentence usually ends with punctuation, but always starts with a letter, so deleting non-letters at the end is useful.) For that to work, I needed the save-restriction/narrow-to-region combo, so that the $ would indeed catch the “end”. Otherwise, $ would not match if the end parameter did not coincide with the end of a line. (Incidentally, this very change was a good workaround for the bug I mentioned – in fact, this is precisely the way the bug was fixed in Emacs.)
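The effect of narrowing on $ can be checked with a small experiment. The sketch below uses a made-up helper, my-search-in-region, which searches between two positions and optionally narrows to them first; it is only an illustration, not something the function above depends on.

```elisp
;; Hypothetical helper (not part of Emacs): search for REGEXP between
;; BEG and END and return the position of the match end, or nil.
(defun my-search-in-region (regexp beg end &optional narrow)
  "Search for REGEXP between BEG and END; return match end or nil.
When NARROW is non-nil, narrow to the region first, so that \"$\" can
match at END even if END falls in the middle of a line."
  (save-excursion
    (save-restriction
      (when narrow
        (narrow-to-region beg end))
      (goto-char beg)
      (re-search-forward regexp end t))))

(with-temp-buffer
  (insert "abc!!! def")
  (list
   ;; Bounded search only: position 7 is mid-line, so "$" cannot match.
   (my-search-in-region "[^a-z]+$" 1 7)
   ;; Narrowed: position 7 is now the end of the accessible text.
   (my-search-in-region "[^a-z]+$" 1 7 t)))
;; should evaluate to (nil 7)
```

In other words, a search bound only limits how far a match may extend, while narrowing changes what the regexp engine considers the end of the text.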
By the way, replace-regexp-in-region and its cousin replace-string-in-region seem so useful (they are basically programmatic equivalents of query-replace-regexp and query-replace, respectively) that I immediately regretted not mentioning them in my Elisp book. It turned out, however, that I actually couldn’t have done that, since they were introduced in Emacs about a year ago, when the book was close to being finished (and I didn’t use the most recent Emacs right from the Git repo). Turns out that even after all those years, there are still useful (and pretty basic) things people add to Emacs!