2018-08-27 Whitespace in Emacs regexen

Some time ago I wanted to search for a sequence of two words in a buffer. Normally I’d search for lorem<space>ipsum, but what if there is a newline between them? Or a non-breaking space? Or a tab?

Well, Emacs has you covered – but in a thick layer of mud. I think I untangled a bit of that mess, and here’s what I’ve found.

First of all, every character has a “syntax class”. This is mostly useful in programming modes, and while you can use syntax classes in text-mode buffers and their ilk, it may not be the best idea. For instance, take this (text-mode) buffer:

lorem ipsum dolor sit amet
lorem	ipsum dolor sit amet
lorem ipsum dolor sit amet
ipsum dolor sit amet

(In the second line, there is a tab between “lorem” and “ipsum”, and in the third one a non-breaking space.)

If you search for lorem\s-ipsum (\s- means “any character with whitespace syntax”), which can also be written as lorem[[:space:]]ipsum, you will find every occurrence except the one with the non-breaking space, which has “punctuation” syntax. (This looks nonsensical until you realize, possibly with the help of Eli Zaretskii, that it actually does make sense.)

Of course, you can search for lorem[[:space:] ]ipsum (with a literal non-breaking space inside the brackets), but this is kind of ugly.
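If you want to check this on your own setup, here is a minimal Emacs Lisp sketch. The names my/count-matches-in-text, my/syntax-classes and my/sample are made up for this post, and the result for the non-breaking space depends on the syntax table of your Emacs version, so I am not stating the numbers here.

```elisp
;; Sketch only: the my/ names are hypothetical helpers, not part of Emacs.

(defun my/count-matches-in-text (regexp text)
  "Return the number of matches for REGEXP in TEXT, using `text-mode' syntax."
  (with-temp-buffer
    (text-mode)                         ; \s- depends on the buffer's syntax table
    (insert text)
    (count-matches regexp (point-min) (point-max))))

(defun my/syntax-classes (chars)
  "Return an alist of (CHAR . SYNTAX-DESIGNATOR) for CHARS in `text-mode'."
  (with-temp-buffer
    (text-mode)
    (mapcar (lambda (c) (cons c (string (char-syntax c)))) chars)))

(defvar my/sample
  (concat "lorem ipsum dolor sit amet\n"        ; plain space
          "lorem\tipsum dolor sit amet\n"       ; tab
          "lorem\u00a0ipsum dolor sit amet\n")  ; non-breaking space
  "Three sample lines that differ only in the separator after \"lorem\".")

;; How many of the three lines does each regexp find?
(my/count-matches-in-text "lorem\\s-ipsum" my/sample)
(my/count-matches-in-text "lorem[[:space:]]ipsum" my/sample)

;; Which syntax class do the three separators have?  A designator of " "
;; means whitespace and "." means punctuation.
(my/syntax-classes (list ?\s ?\t #xa0))
```

Evaluate the last three forms with C-x C-e to see the results. If the two counts come out one lower than the number of lines, the non-breaking space is the likely culprit.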

Happily, there is not only [:space:], but also [:blank:], which matches horizontal whitespace, as defined by Annex C of the Unicode Technical Standard #18. Unfortunately, it doesn’t match a newline (which makes sense, but is annoying in this context), so you probably want to search for lorem[[:blank:]^J]ipsum (where ^J is the newline character, entered with C-q C-j). Ugh.
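In Lisp code the C-q C-j dance is not needed, because a newline inside a string can be written as \n. Below is a small sketch of that; my/lorem-ipsum-re and my/find-lorem-ipsum are names invented for this example, and I am assuming Emacs 25 or later, where [:blank:] follows the Unicode definition.

```elisp
;; Sketch only: both names below are hypothetical.

(defconst my/lorem-ipsum-re "lorem[[:blank:]\n]ipsum"
  "Match \"lorem\" and \"ipsum\" separated by one blank character or a newline.")

(defun my/find-lorem-ipsum ()
  "Move point past the next match of `my/lorem-ipsum-re'.
Return nil (without moving point) if there is no match."
  (interactive)
  (re-search-forward my/lorem-ipsum-re nil t))

;; Quick check against the separators discussed above.
(mapcar (lambda (separator)
          (string-match-p my/lorem-ipsum-re
                          (concat "lorem" separator "ipsum")))
        (list " " "\t" "\u00a0" "\n"))
```

The last form returns a list with one entry per separator; a nil entry means that separator was not matched.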

If that were not enough, you can also regex-search for “categories”. This is a concept I didn’t know about until very recently. It is similar to the syntax classes in that it is an Emacs-only concept (so not related to e.g. Unicode) and buffer-local, but different in that one character may belong to several categories. Apparently, this is not widely used in the Emacs sources (I noticed that filling functions use it, though), but I can imagine it’s potentially interesting. For instance, you can regex-search for consonants or Japanese characters.
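To make this a bit more concrete: in a regex, a category is written as \c followed by its one-letter mnemonic, and M-x describe-categories lists the mnemonics. Here is a small sketch; my/char-categories is a made-up helper name, and I am assuming the default category table, where j is the mnemonic for Japanese.

```elisp
;; Sketch only: `my/char-categories' is a hypothetical helper.

(defun my/char-categories (char)
  "Return the category mnemonics of CHAR in the current buffer, as a string."
  (category-set-mnemonics (char-category-set char)))

;; One character can belong to several categories at once.
(my/char-categories ?a)
(my/char-categories ?あ)

;; \cj matches a character in the Japanese category; this returns the
;; position of the first such character, or nil if there is none.
(string-match-p "\\cj" "lorem あ ipsum")
```

Since the category table is buffer-local, the results may differ in a buffer whose mode changes it.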

As you can see, searching for whitespace turns out to be quite a rabbit hole. If you don’t need to match newlines (i.e., you don’t use hard newlines in your prose), [[:blank:]] is probably your best bet, since unlike \s- it also catches the non-breaking space, but in general you have to be careful. I hope this helps a bit.
