2015-02-14 Counting visible characters

There was a discussion on the ConTeXt mailing list recently about counting visible characters in a PDF file. The pdftotext utility easily reduces this to the problem of counting visible characters in a text file. Just for fun, I decided to do this in Elisp. I knew that coding it wouldn’t really be a challenge, but that making it optimal might be. So, here’s the first (and final, as of now) version:

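(defun how-many-visible-chars ()
  "Count visible (i.e., other than spaces, tabs and newlines)
characters in the buffer."
  (interactive)
  (let ((count 0))
    ;; Restore point after walking through the buffer.
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        ;; Count the character at point unless it is blank.
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        (forward-char)))
    (message "%d visible characters" count)))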

First of all, note that this is not production-ready code. It works only as an interactive command (i.e., when called from a Lisp program it neither returns an integer nor suppresses the message), and it does not recognize the active region, which IMHO is a must for a command like this (though it does, obviously, respect narrowing).
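For comparison, here is a minimal sketch of what a more complete variant could look like. The name my-count-visible-chars and its BEG, END and INTERACTIVE arguments are made up for this illustration; the only built-ins assumed are use-region-p, region-beginning and region-end.

```elisp
;; Hypothetical name and argument list; a sketch, not a finished command.
(defun my-count-visible-chars (&optional beg end interactive)
  "Return the number of non-blank characters between BEG and END.
BEG and END default to the accessible portion of the buffer.
When called interactively, use the active region if there is one
and also show the result in the echo area."
  (interactive
   (if (use-region-p)
       (list (region-beginning) (region-end) t)
     (list nil nil t)))
  (let ((beg (or beg (point-min)))
        (end (or end (point-max)))
        (count 0))
    (save-excursion
      (goto-char beg)
      (while (< (point) end)
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        (forward-char)))
    ;; Only talk to the user when invoked as a command.
    (when interactive
      (message "%d visible characters" count))
    count))
```

Called from Lisp as (my-count-visible-chars), it stays silent and just returns the count, which makes it usable as a building block.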

It also counts hidden characters, e.g. all the folded text in Org buffers. I don’t exactly know how to overcome this. (Actually, I can imagine that someone might want to count only what is visible. For instance, count-lines-region counts invisible lines, too, and sometimes I want to count only the subheadings of an Org entry. There is no built-in way to do it, although it is simple to write a bit of Elisp to perform such a calculation.)
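One possible direction is to ask invisible-p about each position before counting it. The sketch below assumes that folded text is marked with the invisible property (via text properties or overlays), which as far as I know is how Org hides it, but it has not been checked against Org itself. The function name is made up.

```elisp
;; Hypothetical name; assumes hidden text carries an `invisible' property.
(defun my-count-displayed-chars ()
  "Return the number of non-blank characters that are not invisible."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        ;; Skip the character at point if it is hidden or blank.
        (unless (or (invisible-p (point))
                    (looking-at-p "[ \t\n]"))
          (setq count (1+ count)))
        (forward-char)))
    count))
```

Testing every single position this way is probably slow; jumping between property changes with something like next-single-char-property-change would likely be a better fit for large folded buffers.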

The biggest problem with this, or so I thought, is performance. The idea of walking through the buffer character by character and using a regex to check that each one is not blank seems outrageous (and rightfully so). I was quite surprised to learn that it’s not really that inefficient: I ran it on Emacs’ simple.el (which is more than 300 kB long, quite a feat for a text file), and it took my function one or two seconds to count all the non-blank characters.
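A rough way to put a number on this is benchmark-run from the benchmark library that ships with Emacs. The helper below is only a sketch: my-time-in-sample-buffer is a made-up name, and the sample text is arbitrary filler rather than simple.el, so the timings will not match the ones quoted above.

```elisp
(require 'benchmark)

;; Hypothetical helper: time FN in a throwaway buffer of sample text.
(defun my-time-in-sample-buffer (fn &optional lines)
  "Call FN in a temporary buffer of LINES sample lines.
Return the elapsed time in seconds.  LINES defaults to 10000."
  (with-temp-buffer
    (dotimes (_ (or lines 10000))
      (insert "Some sample text,\twith blanks.\n"))
    ;; `benchmark-run' returns (ELAPSED GC-COUNT GC-ELAPSED).
    (car (benchmark-run 1 (funcall fn)))))

;; Time the character-by-character regex loop.
(my-time-in-sample-buffer
 (lambda ()
   (let ((count 0))
     (goto-char (point-min))
     (while (not (eobp))
       (unless (looking-at-p "[ \t\n]")
         (setq count (1+ count)))
       (forward-char))
     count)))
```

Passing a different lambda makes it easy to compare alternative counting strategies on the same text.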

It might be an interesting exercise to think about possible optimizations. Getting rid of regexen is probably a good idea. Another one I could come up with is a kind of loop unrolling: since the bulk of most buffers is made up of letters, digits and a few other characters, using skip-chars-forward to jump over whole runs of them might help a lot. As a matter of fact, skip-syntax-forward might be even faster; OTOH, in buffers where the syntax property is used a lot, this might not be the best idea.
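The skip-chars-forward idea can be sketched as follows. It relies on the fact that skip-chars-forward returns the distance it moved, so each run of non-blank characters is added to the total in one step. The function name is made up, and whether this is really faster has not been measured here.

```elisp
;; Hypothetical name; a regex-free sketch of the same count.
(defun my-count-nonblank-by-skipping ()
  "Return the number of non-blank characters in the accessible buffer."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        ;; Jump over a run of blanks...
        (skip-chars-forward " \t\n")
        ;; ...then over a run of non-blanks.  The return value is the
        ;; distance moved, i.e. the number of characters in that run.
        (setq count (+ count (skip-chars-forward "^ \t\n")))))
    count))
```

The loop always makes progress: as long as point is not at the end of the buffer, the character there is either blank (the first call moves) or not (the second call moves).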

I have to admit that, having now thought and written about this, I can’t wait to try out the Emacs profiler.
