2017-10-02 Converting TeX sequences to Unicode characters

I quite often deal with LaTeX files using stuff like \'a or \"e, and I really prefer having those encoded in UTF-8. So the natural question arises: how do you convert one into the other?

The problem is especially frustrating because Emacs can do this – either via the C-x 8 prefix, or with the TeX input method. It is not trivial, however, to find out how it does these things, and to get hold of the data used to actually perform the conversion. (At least, I didn’t find a way to do it.)

After a bit of searching, however, I came up with another solution. I’m hesitant to call it “clever”; it’s rather hackish, but hey, it works, so who cares.

(defvar TeX-to-Unicode-accents-alist
  '((?` . "grave")
    (?' . "acute")
    (?^ . "circumflex")
    (?\" . "diaeresis")
    (?H . "double acute")
    (?~ . "tilde")
    (?c . "cedilla")
    (?k . "ogonek")
    (?= . "macron")
    (?. . "dot above")
    (?u . "breve")
    (?v . "caron"))
  "A mapping from TeX control characters to accent names used in
Unicode.")

(defun combine-letter-diacritical-mark (letter mark)
  "Return the Unicode character for LETTER combined with MARK.
LETTER is a character or a one-character string.  MARK can be any
character that can be used in TeX accenting commands.  Return nil
if no such character is found."
  (let* ((letter (if (stringp letter)
                     (string-to-char letter)
                   letter))
         (uppercase (= letter
                       (upcase letter))))
    ;; `char-from-name' needs Emacs 26 or later.  On older versions,
    ;; where `ucs-names' returned an alist, the equivalent lookup was
    ;; (cdr (assoc-string NAME (ucs-names) t)).
    (char-from-name
     (upcase (format "LATIN %s LETTER %c WITH %s"
                     (if uppercase "CAPITAL" "SMALL")
                     letter
                     (cdr (assq mark TeX-to-Unicode-accents-alist)))))))

Notice how strange the Unicode conventions for naming accented characters are. Also, beware that this function has no error detection: if you say something like (combine-letter-diacritical-mark ?w ?H), which asks for a character Unicode does not have, you’ll just get nil.
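
To make the naming scheme a bit more tangible, here is a minimal sketch that builds a few full Unicode names and looks them up directly. It assumes Emacs 26 or later (for char-from-name), and my-accented-char is a name made up for this illustration only, not something Emacs provides.

```elisp
;; Sketch only: `my-accented-char' is a hypothetical helper, not part
;; of Emacs.  It assumes Emacs 26+ for `char-from-name'.
(defun my-accented-char (letter accent-name)
  "Return LETTER (a character) with ACCENT-NAME, or nil if none exists.
ACCENT-NAME is the Unicode name of the accent, e.g. \"ogonek\"."
  (char-from-name
   (format "LATIN %s LETTER %c WITH %s"
           (if (= letter (upcase letter)) "CAPITAL" "SMALL")
           (upcase letter)
           (upcase accent-name))))

(my-accented-char ?a "ogonek")        ; the character ą
(my-accented-char ?O "double acute")  ; the character Ő
(my-accented-char ?w "double acute")  ; nil: Unicode has no such letter
```

The last call shows the silent failure mentioned above: a missing combination is reported as nil, not as an error.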

Now a word of warning. In order to actually convert LaTeX-like stuff into Unicode, we have to deal with quite a lot of cases. For instance, the Polish letter ą can be encoded as \k a, \k{a}, {\k a} or even {\k{a}}. (I’m not sure why anyone would do the last thing, but from experience I’m 99% sure that someone somewhere does it. And every time they do, a typo fairy comes and changes their v into a ν or something like that, all in math formulas of course, and they deserve it.) And don’t forget about edge cases like \\k a (a line break followed by the letters “k a”, not an accent), which are improbable, but possible! But all these are details which only complicate the main problem. So, let’s forget about most of these cases for now and convert our TeX-like things into Unicode in a buffer. The code below is simple enough to serve as a proof of concept, but for production it really should be able to deal with all the strange cases mentioned above. Ah, and it doesn’t convert \l, since this is not an accented character and has to be dealt with separately! Making the code below more robust is therefore left as an exercise for the reader. (Well, I’ll also do it some day, and I’ll then probably publish it on this blog, too.)
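
Just to show what “quite a lot of cases” means in practice, here is a rough sketch of a regexp that recognizes all four spellings for one single accent, \k. Everything named my-… is invented for this example. It only checks whether a string is exactly one of the four forms; using something like this for search-and-replace in running text is harder, because a pair of braces around \k a may just as well belong to an enclosing command.

```elisp
;; Sketch only: the `my-' names are hypothetical, made up for this
;; example; they are not part of Emacs or AUCTeX.
(defconst my-ogonek-core
  (concat "\\\\k"                      ; the control sequence \k
          "\\(?: +\\([a-zA-Z]\\)"      ; spaces, then the letter
          "\\|{\\([a-zA-Z]\\)}\\)")    ; or the letter in braces
  "Regexp for the two ogonek spellings without an enclosing group.")

(defconst my-ogonek-regexp
  (concat "\\`\\(?:" my-ogonek-core          ; the bare form
          "\\|{" my-ogonek-core "}\\)\\'")   ; the same, inside braces
  "Regexp matching a string that is exactly one ogonek spelling.")

(defun my-ogonek-letter (string)
  "Return the letter under the ogonek in STRING, or nil.
STRING has to be exactly one of the four spellings."
  (let ((case-fold-search nil))
    (when (string-match my-ogonek-regexp string)
      ;; Groups 1-2 belong to the bare form, 3-4 to the braced one.
      (or (match-string 1 string) (match-string 2 string)
          (match-string 3 string) (match-string 4 string)))))

(my-ogonek-letter "\\k a")      ; "a"
(my-ogonek-letter "{\\k{a}}")   ; "a"
(my-ogonek-letter "\\ka")       ; nil: \ka is a different control sequence
```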

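(defun TeX-convert-accented-letters-to-Unicode ()
  "Convert accented letters in TeX notation to Unicode.
Operate on the whole buffer."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((case-fold-search nil) mark letter char)
      ;; `TeX-search-unescaped' comes from AUCTeX.
      (while (TeX-search-unescaped "\\\\\\([`'^\"H~ck=.uv]\\)\\(?: ?\\([a-zA-Z]\\)\\|{\\([a-zA-Z]\\)}\\)" 'forward t nil t)
        (setq mark (string-to-char (match-string 1))
              letter (or (match-string 2) (match-string 3))
              char (save-match-data
                     (combine-letter-diacritical-mark letter mark)))
        ;; Skip combinations unknown to Unicode, and control words
        ;; such as \cdot, where a letter directly follows the "accent".
        (unless (or (null char)
                    (and (memq mark '(?H ?c ?k ?u ?v))
                         (eq (match-beginning 2) (match-end 1))))
          (replace-match (char-to-string char) t t))))))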
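
As for \l and its friends, which are letters in their own right and not accented ones, one possible approach is a plain lookup table. The sketch below is only an illustration under a few assumptions: all the my-… names are invented here, the table is far from complete, and braces around a control word (as in {\l}) are left in place.

```elisp
;; Sketch only: all `my-' names are hypothetical, and the table is
;; deliberately short.
(defvar my-TeX-special-letters-alist
  '(("l" . "ł") ("L" . "Ł")
    ("o" . "ø") ("O" . "Ø")
    ("ae" . "æ") ("AE" . "Æ")
    ("ss" . "ß"))
  "Alist mapping TeX control words to the letters they produce.")

(defun my-TeX-escaped-p (pos)
  "Return non-nil if an odd number of backslashes precedes POS."
  (save-excursion
    (goto-char pos)
    (= 1 (% (- (skip-chars-backward "\\\\")) 2))))

(defun my-TeX-convert-special-letters ()
  "Replace control words listed in the alist with Unicode letters.
Operate on the whole buffer."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((case-fold-search nil))
      ;; Match a whole control word (so \lambda is not mistaken for
      ;; \l), plus the "{}" or the spaces that TeX would skip after it.
      (while (re-search-forward "\\\\\\([a-zA-Z]+\\)\\(?:{}\\| *\\)" nil t)
        (let ((replacement
               (cdr (assoc (match-string 1)
                           my-TeX-special-letters-alist))))
          (when (and replacement
                     (not (my-TeX-escaped-p (match-beginning 0))))
            (replace-match replacement t t)))))))
```

Running M-x my-TeX-convert-special-letters in a buffer containing Micha\l{} should leave you with Michał, while \lambda and \\l stay untouched.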
