Content AND Presentation


2018-01-15 Counting LaTeX commands in a bunch of files

I hope that I won’t bore anyone to death with blog posts related to the journal I’m working for, but here’s another story about my experiences with that.

I am currently writing a manual for authors wanting to prepare a paper for Wiadomości Matematyczne. We accept LaTeX files, of course, but we have our own LaTeX class (not yet public), and adapting what others wrote (usually using article) is sometimes a lot of work. Having the authors follow our guidelines could make that slightly less work, which is something I’d be quite happy with. (Of course, making a bunch of university mathematicians do something reasonable would be an achievement in itself.)

When I presented (the current version of) the manual to my colleagues on the editorial board, we agreed that nobody would read it anyway. And then I had the idea of preparing a TL;DR version, just a few sentences, where I could mention the one thing I want to get across: dear authors, please do not do anything fancy, just stick with plain ol’ LaTeX. And one component of that message could be a list of LaTeX commands people should stick to. (If you have never worked for a journal or somewhere where you get to look at other people’s LaTeX files, you probably have no idea about what they are capable of doing.)

So here I am, having 200+ LaTeX files (there are twice as many, but I had only about 200 on my current laptop), meticulously converted to our template (which means our class and our local customs, like special commands for various dashes or avoiding colons at all costs), and I want to prepare a list of LaTeX commands used throughout together with the information about the frequency of using them.

In ye olden days, people would use Perl for that. Nowadays, Python would probably be a more common choice. But if you learn to use a hammer, everything starts to look like a nail, no? Enter Emacs Lisp.

Actually, I decided to use it also because I have already written some stuff for parsing LaTeX files. (I’ll blog about it some day; the coolest thing I have there is the analogue of show-paren-mode for “pairs” like \bigl( ... \bigr], and the ability to change this into e.g. \Bigl( ... \Bigr] etc. with one command.) In the end, it turned out that I didn’t need those features that much anyway. The only thing I used was the TeX+-info-about-token-beginning-at-point function, which returns a cons cell whose car is the TeX token starting at point and whose cdr is a symbol describing its type.

I approached the problem in a truly Lispy, bottom-up style;-). I started with a count-TeX-macros-in-current-buffer function, receiving and returning an alist of macros and their frequencies. Then count-TeX-macros-in-file followed, which first visited a file (using with-temp-buffer and insert-file-contents-literally, of course). Finally, count-TeX-macros-recursively received a directory and a regex and performed the count in all files whose names matched the regex in and below the given directory. Sorting (by descending frequency) and displaying the results were just the topping.

The thing that astonished me the most was the speed of this. Since I did not attempt any premature optimization;-), I expected my code to run for anywhere between maybe ten seconds and a few minutes. I certainly did not expect it to finish in less than one second, which was really cool.

Also, please note that this is quick-and-dirty, one-shot code, and therefore not very clean. I don’t intend to waste too much time polishing it; it’s simple enough that if you want to play with it, you should be able to understand the code in ten minutes or so.

Finally, I didn’t bother to count environments, only commands. I might extend my code to environments one day, too, but I do not expect any surprises. (document, enumerate, maybe itemize, a sprinkling of figure, table and tikzpicture, and the obvious math stuff – that would pretty much be it, I guess.)

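(require 'tex+)  ; my own helpers for parsing TeX files

(defun count-TeX-macros-in-current-buffer (histogram)
  "Count TeX macros in the current buffer.
HISTOGRAM is the alist of (MACRO . COUNT) pairs we should add to;
return the updated alist."
  (goto-char (point-min))
  (while (search-forward "\\" nil t)
    ;; Assumption: the token function expects point on the backslash,
    ;; so step back onto it.
    (backward-char)
    (let* ((token (TeX+-info-about-token-beginning-at-point))
           (freq (assoc (car token) histogram)))
      (when (memq (cdr token) '(control-symbol control-word))
        (if freq
            (setcdr freq (1+ (cdr freq)))
          (push (cons (car token) 1) histogram))))
    ;; Skip at most two backslashes, so that "\\" is counted only once.
    (skip-chars-forward "\\\\" (+ (point) 2)))
  histogram)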

(defun count-TeX-macros-in-file (file histogram)
  "Count TeX macros in FILE and add that info to HISTOGRAM.
Return the updated histogram."
  (with-temp-buffer
    (insert-file-contents-literally file)
    (count-TeX-macros-in-current-buffer histogram)))

(defun count-TeX-macros-recursively (directory regex)
  "Count TeX macros in files in DIRECTORY (recursively) whose
names match REGEX.  Return the resulting histogram."
  (let ((histogram '()))
    ;; Search DIRECTORY itself, not the current directory.
    (dolist (file (directory-files-recursively directory regex))
      (message "Analyzing %s..." (file-name-nondirectory file))
      (setq histogram (count-TeX-macros-in-file file histogram))
      (message "Analyzing %s...done" (file-name-nondirectory file)))
    histogram))

(defun sort-histogram (histogram)
  "Sort HISTOGRAM (destructively) by frequency."
  (sort histogram (lambda (a b) (> (cdr a) (cdr b)))))

(defun insert-histogram (histogram)
  "Insert frequency data from HISTOGRAM in a human-readable
form at point, most frequent macros first."
  (dolist (entry (sort-histogram histogram))
    (insert (format "%-24s %d\n" (car entry) (cdr entry)))))
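
If you want to play with the idea without tex+ at hand, here is a minimal sketch of the same counting loop. It assumes that a regex is a good enough tokenizer: a control word is a backslash followed by letters, and a control symbol is a backslash followed by a single non-letter. The name my-count-TeX-macros-in-string is made up for this illustration, and comments, verbatim text and macros with @ in their names get no special treatment.

```elisp
(defun my-count-TeX-macros-in-string (string)
  "Return an alist of (MACRO . COUNT) for the TeX macros in STRING.
Hypothetical helper: a regex stands in for the tex+ tokenizer."
  (let ((histogram '()))
    (with-temp-buffer
      (insert string)
      (goto-char (point-min))
      ;; A control word is a backslash plus letters; a control symbol
      ;; is a backslash plus exactly one non-letter.
      (while (re-search-forward "\\\\\\(?:[a-zA-Z]+\\|[^a-zA-Z]\\)" nil t)
        (let* ((macro (match-string-no-properties 0))
               (entry (assoc macro histogram)))
          (if entry
              (setcdr entry (1+ (cdr entry)))
            (push (cons macro 1) histogram)))))
    ;; Most frequent macros first.
    (sort histogram (lambda (a b) (> (cdr a) (cdr b))))))

;; Example: \emph occurs twice, so it ends up first in the result.
(my-count-TeX-macros-in-string "\\emph{a} \\(x\\) \\emph{b} \\\\")
```

The alist is searched linearly for every macro found, which is fine for a few hundred files; a hash table would be the obvious thing to try if it ever became too slow.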

And the winner is, of course, the results. And they did surprise me. It turns out that the most common macros are \( and \) (which is not surprising, since we automatically convert $...$ to them). The silver medal goes to \emph (again, no surprise here). Then, we have (in roughly this order):

  • \cite and \bib
  • \begin and \end
  • \ppauza (which is a Polish version of an en-dash, with proper spacing around and a non-breakable space before the dash; this one is defined in the polski package)
  • \, (used in math a lot)
  • \' (the first surprise)
  • \item
  • \dywiz (a Polish version of a hyphen, which, when the word is actually hyphenated, should be repeated at the end of the former line and at the beginning of the latter one; also defined in the polski package)
  • \\
  • \polishendash (which is a stupid name, but this is our macro which acts more or less like \dywiz, but has the length of an en-dash; this is different than \ppauza, since there is no spacing around it and it is repeated when hyphenated, just like \dywiz)
  • \!, which we use quite a lot
  • \label (which is – surprisingly – used more often than \ref!)
  • \usepackage (on average, twice per document, and every one of them uses inputenc!)
  • \newcommand (which was another surprise)
  • \[ and \]
  • \" – for some strange reason
  • \citelist
  • \ref, promptly followed by \eqref
  • stuff like \documentclass and \footnote
  • \section
  • metadata like \author

All that is interspersed with some of our internal macros, and a lot of stuff used in math, like \in and \int and \ln and \left and so on.

The bottom line of this research is this: if you are an author of a paper for Wiadomości Matematyczne (or most other math journals, I presume), you should not use any fancy TeX stuff. Basically, the only commands you will most probably need (outside the preamble/template, of course – I don’t count stuff like \author here), are \emph (never \em or even \textit!), \section (and sometimes \subsection), \label and \ref, probably \cite and an occasional \footnote or \item. And, of course, various math symbols. Anything above that and you may safely assume that you are a troublemaker for the editors. (And by the way, if you claim that “LaTeX is too hard”, here’s my (a bit unpleasant) answer: if you are a mathematician and can’t learn how to use about ten commands, probably another ten environments plus the math symbols you actually need, please stop whining about “difficulty” and choose another profession.)

CategoryEnglish, CategoryBlog, CategoryEmacs, CategoryTeX, CategoryLaTeX

Comments on this page

2018-01-07 A small editing tool for work with AMSrefs

As I mentioned many times, I often edit LaTeX files written by someone else for a journal. One thing which is notoriously difficult to get right when writing academic papers is bibliographies. At Wiadomości Matematyczne, we use AMSrefs, which is really nice (even if it has some rough edges here and there). (BTW, BibLaTeX was not as mature as it is today when we settled on our tool; also, AMSrefs might be a tad easier to customize, though I’m not sure about that anymore…) One of the commands AMSrefs offers is \citelist. Instead of writing things like papers \cite{1}, \cite{2} and~\cite{3}, you write papers \citelist{\cite{1}\cite{2}\cite{3}}, and AMSrefs sorts these entries and compresses runs into ranges (like in [1-3]).

The only problem is that most authors have no idea that this exists, and we often have to convert “manual” lists of citations into \citelist commands.

Well, as usual, Emacs to the rescue. Here’s what I have written.

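(defun skip-cite-at-point ()
  "Move point to the end of the \\cite at point."
  (when (looking-at "\\\\cite")
    (forward-char 5)
    ;; `eq' instead of `=', so that the end of buffer (where
    ;; `char-after' returns nil) falls through to the error clause.
    (cond ((eq (char-after) ?\[)
           ;; Traditional syntax: \cite[p. 123]{1}
           (forward-sexp 2))
          ((eq (char-after) ?\{)
           ;; AMSrefs syntax: \cite{1}, optionally followed by *{p. 123}
           (forward-sexp)
           (when (and (not (eobp))
                      (= (char-after) ?*))
             (forward-char)
             (forward-sexp)))
          (t (error "Malformed \\cite")))))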

(defun cite-to-citelist ()
  "Convert region to a \\citelist command.
All \\cite's are preserved and things between them deleted.
This command will be fooled by things like \"\\\\cite\"."
  (interactive)
  (if (use-region-p)
      (let ((end (copy-marker (region-end))))
        (goto-char (region-beginning))
        (insert "\\citelist{")
        (while (< (point) end)
          ;; Delete everything up to the next \cite, or up to the end
          ;; of the region if there are no more \cite's.
          (delete-region (point)
                         (if (search-forward "\\cite" end t)
                             (progn
                               (backward-char 5)
                               (point))
                           end))
          (skip-cite-at-point))
        (insert "}"))
    (message "Region not active")))

It might contain some subtle bug, but I really hope it doesn’t – and it will get thoroughly tested very soon.

Notice how nice it is to craft such little editing tools in Emacs. You basically mimic your editing process, i.e., tell the machine what you do by hand to accomplish the goal. And not only do you have obvious things like forward-char, but also more complicated building blocks like forward-sexp.
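
To see these two building blocks at work in isolation, here is a minimal sketch that can be evaluated in the scratch buffer. The function my-end-of-cite-demo is made up for this illustration; it relies on the standard syntax table of a temporary buffer, in which square brackets and curly braces are paired delimiters.

```elisp
(defun my-end-of-cite-demo ()
  "Return the text that `forward-char' and `forward-sexp' step over.
Hypothetical demo using the standard syntax table."
  (with-temp-buffer
    (insert "\\cite[p. 123]{1} and more")
    (goto-char (point-min))
    (forward-char 5)   ; over the five characters of \cite
    (forward-sexp 2)   ; over [p. 123] and then over {1}
    (buffer-substring (point-min) (point))))

;; Evaluating this returns the string \cite[p. 123]{1}.
(my-end-of-cite-demo)
```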

Also, in case you wonder about the intricacies of the skip-cite-at-point function: AMSrefs’s \cite supports the traditional \cite[p. 123]{1} syntax, but also introduces its own, \cite{1}*{p. 123}. While quite unorthodox for a LaTeX command, it makes life easier for everyone who wants to put a \cite in an optional argument to things like \begin{theorem} ... \end{theorem} (which is a very common use case). Since LaTeX does not do proper pairing of brackets when parsing optional arguments, normally you need to enclose the whole \cite[...]{...} in additional curly braces, as in \begin{theorem}[{\cite[p. 123]{1}}]. With AMSrefs’s syntax that is unnecessary, and you can just write \begin{theorem}[\cite{1}*{p. 123}].

Anyway, in case anyone needs something like that, here it is. And even if nobody does, maybe this can be an encouragement to write your own snippets like this to help automate your common tasks.

CategoryEnglish, CategoryBlog, CategoryEmacs, CategoryTeX, CategoryLaTeX


2017-12-31 LaTeX pillory – macros everywhere

A few years ago, my frustration with what people do with (or to…) LaTeX made me start a (now rather abandoned) series of blog posts (in Polish) with the common theme of a “LaTeX pillory”. The name is somewhat misleading, since I don’t really want to shame anyone – but I do want to put shame on some practices. This time I received something that is so terrifying that I decided to revive that project.

After anonymizing (=changing the words used) and translating into English, here’s what I got in a journal submission.

\E\r\ \u s

It was a bit different in Polish, where the macros were actually parts of words (since we have a lot of inflection in Polish), but you get the idea.

Now this is reasonable (for certain values of “reasonable” at least), since it saves typing. But I think that the overhead is not worth it. What’s more, the LaTeX source becomes much less readable with this (there are about a dozen words treated that way). Last but not least, the \expandafter trick, while neat, is not something anyone should use in the document.

Anyway, reasonable or not, it is definitely funny. So I had to share it. You’re welcome.

CategoryEnglish, CategoryBlog, CategoryTeX, CategoryLaTeX, CategoryLaTeXPillory



CategoryEnglish, CategoryBlog