Marcin Borkowski: 2018-01-15 Counting LaTeX commands in a bunch of files

I hope that I won’t bore anyone to death with blog posts related to the journal I’m working for, but here’s another story about my experiences with that.

I am currently writing a manual for authors wanting to prepare a paper for Wiadomości Matematyczne. We accept LaTeX files, of course, but we have our own LaTeX class (not yet public), and adapting what others wrote (usually using article) is sometimes a lot of work. Having the authors follow our guidelines could make that slightly less work, which is something I’d be quite happy with. (Of course, making a bunch of university mathematicians do something reasonable would be an achievement in itself.)

When I presented (the current version of) the manual to my colleagues on the editorial board, we agreed that nobody would read it anyway. And then I had an idea of preparing a TL;DR version, just a few sentences, where I could mention the one thing I want to get across: dear authors, please do not do anything fancy, just stick with plain ol’ LaTeX. And one component of that message could be a list of LaTeX commands people should stick to. (If you have never worked for a journal or somewhere where you get to look at other people’s LaTeX files, you probably have no idea what they are capable of doing.)

So here I am, having 200+ LaTeX files (there are twice as many, but I had only about 200 on my current laptop), meticulously converted to our template (which means our class and our local customs, like special commands for various dashes or avoiding colons at all costs), and I want to prepare a list of the LaTeX commands used in them, together with how often each one occurs.

In ye olden days, people would use Perl for that. Nowadays, Python would probably be a more common choice. But if you learn to use a hammer, everything starts to look like a nail, no? Enter Emacs Lisp.

Actually, I decided to use it also because I have already written some stuff for parsing LaTeX files. (I’ll blog about it some day; the coolest thing I have there is the analogue of show-paren-mode for “pairs” like \bigl( ... \bigr], and the ability to change this into e.g. \Bigl( ... \Bigr] etc. with one command.) In the end, it turned out that I didn’t need most of those features anyway. The only thing I used was TeX+-info-about-token-beginning-at-point, which returns a cons cell whose car is the TeX token starting at point and whose cdr is a symbol describing its type.
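
Since tex+ is not public, here is a minimal sketch of what such a function might look like, in case you want something to run the code below against. The name my-TeX-token-at-point, the type symbol other, and the assumption that the token string includes the leading backslash are all mine, for illustration only; the real function may well classify tokens differently.

```elisp
;; Hypothetical stand-in, NOT the real tex+ function.
(defun my-TeX-token-at-point ()
  "Return (TOKEN . TYPE) for the TeX token starting at point.
TYPE is `control-word', `control-symbol' or `other'."
  (cond
   ;; A backslash followed by one or more letters, e.g. \emph.
   ((looking-at "\\\\[a-zA-Z]+")
    (cons (match-string-no-properties 0) 'control-word))
   ;; A backslash followed by a single non-letter, e.g. \, or \\.
   ((looking-at "\\\\[^a-zA-Z]")
    (cons (match-string-no-properties 0) 'control-symbol))
   ;; Anything else: just the character at point (or "" at end of buffer).
   (t
    (cons (if (eobp) "" (string (char-after))) 'other))))

;; Example: with point on the backslash of \emph{x}, this returns
;; ("\\emph" . control-word).
(with-temp-buffer
  (insert "\\emph{x}")
  (goto-char (point-min))
  (my-TeX-token-at-point))
```

With a stand-in like this, the only symbols the counting code needs to recognize are control-word and control-symbol.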

I approached the problem in a truly Lispy, bottom-up style. I started with a count-TeX-macros-in-current-buffer function, receiving and returning an alist of macros and their frequencies. Then count-TeX-macros-in-file followed, which first visited a file (using with-temp-buffer and insert-file-contents-literally, of course). Finally, count-TeX-macros-recursively received a directory and a regex and performed the count in all files whose names matched the regex in and below the given directory. Sorting (by descending frequency) and displaying the results were just the topping.

The thing that astonished me the most was the speed of this. Since I did not attempt any premature optimization, I expected my code to run for anywhere between maybe ten seconds and a few minutes. I certainly did not expect less than one second, which was really cool.

Also, please note that this is quick-and-dirty, one-shot code, which is therefore not very clean. I don’t intend to waste too much time polishing it; it’s simple enough that if you want to play with it, you should be able to understand the code in ten minutes or so.

Finally, I didn’t bother to count environments, only commands. I might extend my code to environments one day, too, but I do not expect any surprises. (document, enumerate, maybe itemize, a sprinkling of figure, table and tikzpicture, and the obvious math stuff – that would pretty much be it, I guess.)

(require 'cl-lib)
(require 'tex+)

(defun count-TeX-macros-in-current-buffer (histogram)
  "Count TeX macros in the current buffer.
HISTOGRAM is an alist of (MACRO . COUNT) pairs we should add to.
Return the updated alist."
  (save-mark-and-excursion
    (save-restriction
      (widen)
      (goto-char (point-min))
      (while (search-forward "\\" nil t)
        ;; Move back onto the backslash so that the token starts at point.
        (backward-char)
        (let* ((token (TeX+-info-about-token-beginning-at-point))
               (freq (assoc (car token) histogram)))
          (when (memq (cdr token) '(control-symbol control-word))
            (if freq
                (cl-incf (cdr freq))
              (push (cons (car token) 1) histogram))))
        ;; Skip the backslash (or both of them in the case of \\),
        ;; so that the second half of \\ is not counted again.
        (skip-chars-forward "\\\\" (+ (point) 2)))
      histogram)))

(defun count-TeX-macros-in-file (file histogram)
  "Count TeX macros in FILE and add that info to HISTOGRAM.
Return the updated alist."
  (with-temp-buffer
    (insert-file-contents-literally file)
    (count-TeX-macros-in-current-buffer histogram)))

(defun count-TeX-macros-recursively (directory regex)
  "Count TeX macros in files in and below DIRECTORY.
Only files whose names match REGEX are analyzed."
  (let ((histogram '()))
    ;; Search DIRECTORY itself, not whatever the current directory is.
    (dolist (file (directory-files-recursively directory regex) histogram)
      (message "Analyzing %s..." (file-name-nondirectory file))
      (setq histogram (count-TeX-macros-in-file file histogram))
      (message "Analyzing %s...done" (file-name-nondirectory file)))))

(defun sort-histogram (histogram)
  "Sort HISTOGRAM (destructively) by frequency."
  (sort histogram (lambda (a b) (> (cdr a) (cdr b)))))

(defun insert-histogram (histogram)
  "Insert frequency data from HISTOGRAM in a human-readable format."
  (newline)
  (dolist (entry (sort-histogram histogram))
    (insert (format "%-24s %d\n" (car entry) (cdr entry)))))
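
To try it out, something along these lines should work once the functions above (and tex+) are loaded. The directory and the buffer name are made up for the example, and the regex simply assumes that the files of interest end in .tex.

```elisp
;; Usage sketch; "~/papers" and the buffer name are hypothetical.
(with-current-buffer (get-buffer-create "*TeX macro histogram*")
  (erase-buffer)
  (insert-histogram
   ;; "\\.tex\\'" matches file names ending in ".tex".
   (count-TeX-macros-recursively "~/papers" "\\.tex\\'"))
  (pop-to-buffer (current-buffer)))
```

Collecting the output in a dedicated buffer keeps the list out of whatever file you happen to be editing.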

And the winner is, of course, the results. And they did surprise me. It turns out that the most common macros are \( and \) (which is not surprising, since we automatically convert $...$ to them). The silver medal goes to \emph (again, no surprise here). Then, we have (in roughly this order):

\cite and \bib
\begin and \end
\ppauza (which is a Polish version of an en-dash, with proper spacing around and a non-breakable space before the dash; this one is defined in the polski package)
\, (used in math a lot)
\' (the first surprise)
\item
\dywiz (a Polish version of a hyphen, which, when the word is actually hyphenated, should be repeated at the end of the former line and at the beginning of the latter one; also defined in the polski package)
\\
\polishendash (which is a stupid name, but this is our macro which acts more or less like \dywiz, but has the length of an en-dash; this is different than \ppauza, since there is no spacing around it and it is repeated when hyphenated, just like \dywiz)
\!, which we use quite a lot
\label (which is – surprisingly – used more often than \ref!)
\usepackage (on average, twice per document, and every one of them uses inputenc!)
\newcommand (which was another surprise)
\[ and \]
\" – for some strange reason
\citelist
\ref, promptly followed by \eqref
stuff like \documentclass and \footnote
\section
metadata like \author

All that is interspersed with some of our internal macros, and a lot of stuff used in math, like \in and \int and \ln and \left and so on.

The bottom line of this research is this: if you are an author of a paper for Wiadomości Matematyczne (or most other math journals, I presume), you should not use any fancy TeX stuff. Basically, the only commands you will most probably need (outside the preamble/template, of course – I don’t count stuff like \author here), are \emph (never \em or even \textit!), \section (and sometimes \subsection), \label and \ref, probably \cite and an occasional \footnote or \item. And, of course, various math symbols. Anything above that and you may safely assume that you are a troublemaker for the editors. (And by the way, if you claim that “LaTeX is too hard”, here’s my (a bit unpleasant) answer: if you are a mathematician and can’t learn how to use about ten commands, probably another ten environments plus the math symbols you actually need, please stop whining about “difficulty” and choose another profession.)
