2023-12-11 Replacing TeX control words behind the point

Two weeks ago, a friend from the Polish TeX Users’ Group mailing list asked about an Emacs tool to replace TeX control sequences with their Unicode counterparts. I also have this need from time to time, and I usually go with the TeX input method. He is not satisfied with it, though, because it replaces too much for him – for instance, he doesn’t want a_1 to get translated to a₁. He remembered some utility (written by another Polish TeX user) which replaces a TeX sequence with a Unicode equivalent, but only on demand. Since that one seems to be lost in the depths of time, he was left without a solution.

Being me, I decided to write it – after all, it should be fairly easy even for a moderately experienced Elisp hacker. So, here’s a proof of concept.

(defcustom TeX-to-unicode-alist
  '(("in" . "∈")
    ("emptyset" . "∅"))
  "Alist of LaTeX control words and their Unicode equivalents."
  :type '(alist :key-type string :value-type string))

(defun TeX-to-unicode ()
  "Replace a TeX control word with its Unicode equivalent.
The control word must be a sequence of one or more letters after
a backslash and be located directly behind the point."
  (interactive "*")
  (when-let ((replacement
              (and (looking-back "\\\\\\([a-zA-Z]+\\)"
                                 (line-beginning-position))
                   (alist-get (match-string 1)
                              TeX-to-unicode-alist
                              nil nil #'string=))))
    (delete-region (match-beginning 0) (match-end 0))
    (insert replacement)))
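Now, with point right after typing \in, running M-x TeX-to-unicode replaces it with ∈. If you use it often, a key binding helps – the key below is just an illustration, not a recommendation:

```elisp
;; Bind the command globally; the key choice is only an example
;; (it loosely mirrors the C-x 8 family of insertion commands).
(global-set-key (kbd "C-c 8") #'TeX-to-unicode)
```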

One thing I have learned recently is the when-let macro. It works much like let*, but if any of the bindings is nil, it does not evaluate its body and just returns nil. (Go read its docstring if you find such a concept useful – in fact, it has a few more features, and there are relatives like if-let and while-let.)
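A minimal illustration (the buffer names here are made up for the example – the point is only that the body runs when every binding is non-nil):

```elisp
;; The body runs only if both BUF and WIN turn out non-nil.
(when-let ((buf (get-buffer "*scratch*"))
           (win (get-buffer-window buf)))
  (select-window win))

;; Compare with `let*', which would evaluate the body regardless
;; and signal an error when WIN is nil:
;; (let* ((buf (get-buffer "*scratch*"))
;;        (win (get-buffer-window buf)))
;;   (select-window win))
```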

This code could easily be made more performant – looking up stuff in an alist would most probably be faster with symbols than with strings, and a hash table would be faster if there were really many control words in it. On the other hand, this is an interactive function, not something running thousands of times in a loop, so this probably doesn’t matter.
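For the record, the hash-table variant could look more or less like this – just a sketch, not part of the actual code above:

```elisp
;; A sketch of a hash-table-based lookup table (an assumption, not
;; what I actually use).  `equal' is the right test for string keys.
(defvar TeX-to-unicode-table
  (let ((table (make-hash-table :test #'equal)))
    (puthash "in" "∈" table)
    (puthash "emptyset" "∅" table)
    table)
  "Hash table mapping LaTeX control words to Unicode equivalents.")

;; The lookup in `TeX-to-unicode' would then become
;; (gethash (match-string 1) TeX-to-unicode-table)
;; instead of the `alist-get' call.
```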

Of course, filling up TeX-to-unicode-alist is the real challenge here. In this PoC I just put two control words in, but TeX has hundreds of control words, and Unicode has thousands of symbols. Making a comprehensive list is a lot of work. The good thing is, someone already did it – after all, Emacs has the TeX input method! Our next problem is how to leverage the existing table Emacs uses. A quick search reveals that the table is located in emacs/lisp/leim/quail/latin-ltx.el. About 85% of that file is just one invocation of the latin-ltx--define-rules macro, which contains (more or less) what we need. Unfortunately, using it is far from straightforward. I can envision three strategies. One is just copying that file, deleting the things I don’t need and converting the list to the format I need. This sounds a bit ugly, but makes sense, and if I wanted a production-grade, actually useful solution, I could do this. One thing that makes it a bit difficult is that the file doesn’t contain the list of Greek letters, for example – Emacs uses the fact that it is possible to map the names of TeX commands for Greek letters to the Unicode names of their characters. Clever, but it doesn’t help us a lot.

Another way is to analyze what the latin-ltx--define-rules macro does – it must put the results somewhere, after all – and use those results. Unfortunately, it seems that the results are in a format which is hardly usable for our purpose (see quail-package-alist to see for yourself!). It’s still possible, of course, to do an automated conversion, but it’s a bit of un-fun work I’d prefer to avoid if possible.

Yet another is doing some clever trickery to redefine things like latin-ltx--define-rules and eval-ing the latin-ltx.el file. (This is probably doable, but rather tricky – the file contains both the definition and invocation of that macro, so for this to work, we would probably have to temporarily redefine defmacro. This is definitely not the rabbit hole I’d prefer to go into…)

Let’s do something else instead. When researching for this post, I ran M-x apropos-value RET omega RET, hoping to find the variable keeping the data about the TeX input method. (I imagined that omega is probably not part of the value of many Emacs variables, but should appear in any list of TeX control words or related places. Of course, now that I have seen quail-package-alist, I know it wasn’t going to work.) I found something else instead: org-entities. This is almost exactly what we need. When exporting, Org can translate various things into (among others) LaTeX, HTML entities – and UTF-8. Bingo! Every entry in org-entities is a list (well, some entries are plain strings, which act as a kind of comment, used to make the output of org-entities-help nicer). The second element of such a list is a LaTeX command (by the way, for most of the stuff we discuss here, plain TeX and LaTeX commands are the same), and the last, seventh element is a UTF-8 string. Since my command only allows control words, we’ll disregard entries like \'{A}, but use the ones of the form: backslash, one or more letters, optional {}. (If you really need to input accented letters in your file, the go-to solution is to either use a suitable keyboard mapping in your OS, or use a suitable Emacs input method, or – if you only need this occasionally – use C-x 8 followed by an accent character and a letter.)
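To see the shape of an entry for yourself, you can evaluate something like this (the exact strings may differ between Org versions):

```elisp
(require 'org-entities)

;; Each non-string entry is a seven-element list:
;; (NAME LATEX LATEX-MATH-P HTML ASCII LATIN1 UTF8)
(nth 1 (assoc "in" org-entities)) ; the LaTeX command, "\\in"
(nth 6 (assoc "in" org-entities)) ; the UTF-8 string, "∈"
```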

One thing I discovered when coding this was that org-entities contained some symbols more than once. It turns out that Org mode has more than one name for some symbols. For example, unlike in TeX, you can say both \AA and \Aring to get Å. On the other hand, like in TeX, you can say both \le and \leq to get ≤. Unfortunately, Org mode maps both of them to \le when exporting to LaTeX, which means that my trick with org-entities will not put \leq on the list. That’s not ideal, but not very bad, either. Anyway, I decided to remove the duplicates from the resulting list, just for the sake of elegance.

Since I did not want to include all of the entries in org-entities (it contains a lot of things like accented letters, horizontal whitespace like \hspace{.5em} and other stuff I didn’t want to have in TeX-to-unicode-alist), I could not just use mapcar. The usual way to perform a transformation on a list which omits some elements and transforms others is either composing map and filter functions (in Elisp, that would be mapcar and seq-filter), or resorting to reduce (seq-reduce in Elisp). I went the latter way, without a good reason – the choice is a matter of personal preference (or a whim). Then, I applied seq-uniq to delete the duplicates (since the entries are conses of strings, I needed to provide a suitable TESTFN) and nreverse to preserve the order of the entries.
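For comparison, the map-and-filter route might look more or less like this – a sketch, not the code I actually used:

```elisp
;; A sketch of the mapcar + seq-filter alternative.  First keep only
;; the list entries whose LaTeX command is a bare control word
;; (backslash, letters, optional {}), then convert each to a
;; (WORD . UTF8) cons.
(mapcar (lambda (entity)
          (string-match "\\`\\\\\\([a-zA-Z]+\\)\\(?:{}\\)?\\'"
                        (nth 1 entity))
          (cons (match-string 1 (nth 1 entity)) (nth 6 entity)))
        (seq-filter (lambda (entity)
                      (and (listp entity)
                           (string-match
                            "\\`\\\\\\([a-zA-Z]+\\)\\(?:{}\\)?\\'"
                            (nth 1 entity))))
                    (append org-entities-user org-entities)))
```

Note that the regexp gets matched twice per kept entry, which is one small reason to prefer the single-pass reduce.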

(defcustom TeX-to-unicode-alist
  (nreverse
   (seq-uniq
    (seq-reduce (lambda (acc entity)
                  (when (listp entity)
                    (let ((TeX (nth 1 entity))
                          (utf (nth 6 entity)))
                      (when (string-match
                             "\\`\\\\\\([a-zA-Z]+\\)\\(?:{}\\)?\\'"
                             TeX)
                        (push (cons (match-string 1 TeX) utf) acc))))
                  acc)
                (append org-entities-user org-entities)
                nil)
    (lambda (a b) (string= (car a) (car b)))))
  "Alist of LaTeX control words and their Unicode equivalents."
  :type '(alist :key-type string :value-type string))

And that’s pretty much it for today! As usual, Emacs turns out to be an extremely malleable tool you can shape in almost any way to suit your needs. And also as usual, let me remind you that if you want to learn to write little utilities like this, one of the best sources you can start with is the classic Introduction to programming in Emacs Lisp by the late Robert J. Chassell. If you want to dig deeper, you can then buy my book about Emacs Lisp, Hacking your way around in Emacs, which is (sort of) a spiritual successor to that book.

Happy hacking!

CategoryEnglish, CategoryBlog, CategoryEmacs, CategoryTeX, CategoryLaTeX