There was a discussion on the ConTeXt mailing list recently about counting visible characters in a PDF file. This is easily reduced (by the pdftotext utility) to the problem of counting visible characters in a text file. Just for fun, I decided to do this in Elisp. I knew it wouldn’t really be a challenge to code, but it might be a bit challenging to make it optimal. So, here’s the first (and final, as of now) version:
(defun how-many-visible-chars ()
  "Count visible (i.e., other than spaces, tabs and newlines) characters in the buffer."
  (interactive)
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        (forward-char)))
    (message "%d visible characters" count)))
First of all, note that this is not production-ready code: it works only as an interactive command (i.e., it neither returns an integer nor suppresses the message when called from a Lisp program), and it does not recognize an active region (which IMHO is a must for a command like this, though it does, obviously, respect narrowing).
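For illustration, here is a minimal sketch of how those two shortcomings might be addressed. The name my-count-visible-chars and its BEG, END and INTERACTIVE arguments are made up for this example; it assumes the usual convention of passing an explicit flag from the interactive spec so that the message is shown only for interactive calls.

```elisp
(defun my-count-visible-chars (&optional beg end interactive)
  "Return the number of non-blank characters between BEG and END.
BEG and END default to the accessible portion of the buffer.
When called interactively, use the active region (if any) and
also report the result in the echo area."
  (interactive
   (if (use-region-p)
       (list (region-beginning) (region-end) t)
     (list nil nil t)))
  (let ((beg (or beg (point-min)))
        (end (or end (point-max)))
        (count 0))
    (save-excursion
      (goto-char beg)
      ;; Same character-by-character walk as above, but bounded by END.
      (while (< (point) end)
        (unless (looking-at-p "[ \t\n]")
          (setq count (1+ count)))
        (forward-char)))
    ;; Only talk to the user when invoked as a command.
    (when interactive
      (message "%d visible characters" count))
    count))
```

Called from Lisp as (my-count-visible-chars), it just returns the integer; with M-x and an active region, it counts only the region and shows the message.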
It also counts hidden characters, e.g., it takes into account all the folded text in Org buffers. I don’t exactly know how to overcome this. (Actually, I can imagine that someone might want to do things like this. For instance, count-lines-region counts invisible lines, too, and sometimes I want to count only the subheadings of an Org entry. There is no built-in way to do it, although it is simple to write a bit of Elisp to perform such a calculation.)
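One possible direction, offered only as a sketch: the built-in invisible-p reports whether the text at a given position is currently invisible according to its invisible property and the buffer’s invisibility spec. The function name my-count-displayed-chars is hypothetical, and I am assuming (without having checked every Org version) that folded text is hidden this way.

```elisp
(defun my-count-displayed-chars ()
  "Return the number of non-blank characters that are not invisible.
A character is skipped when `invisible-p' says the text at its
position is currently hidden (e.g., inside a folded section)."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        ;; Count the character only if it is neither hidden nor blank.
        (unless (or (invisible-p (point))
                    (looking-at-p "[ \t\n]"))
          (setq count (1+ count)))
        (forward-char)))
    count))
```

Checking every position like this is even slower than the original loop; jumping over hidden stretches with something like next-single-char-property-change would probably be the next thing to try.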
The biggest problem with this, or so I thought, is performance. The idea of walking through the buffer character by character and using a regex to check whether each one is non-blank seems outrageous (and rightfully so). I was quite surprised to learn that it’s not really that inefficient: I ran it on Emacs’ simple.el (which is more than 300 kB long, quite a feat for a text file), and it took my function one or two seconds to count all the non-blank characters.
I guess it might be an interesting exercise to think about possible optimizations. Getting rid of regexen may be a good idea. Another one I could come up with is a kind of loop unrolling: since the bulk of most buffers is made up of letters, digits and a few other characters, using skip-chars-forward might help a lot. As a matter of fact, skip-syntax-forward might be even faster; OTOH, in buffers where the syntax property is used a lot, this might not be the best idea.
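Here is a rough sketch of what the skip-chars-forward idea could look like. I have not measured it, so treat any speedup as a guess; the function name my-count-visible-chars-skipping is made up for this example. It relies on skip-chars-forward returning the distance it moved, so a whole run of non-blank characters is added to the count in one step.

```elisp
(defun my-count-visible-chars-skipping ()
  "Return the number of non-blank characters in the buffer.
Instead of testing each character with a regex, jump over whole
runs of non-blank and blank characters with `skip-chars-forward'."
  (let ((count 0))
    (save-excursion
      (goto-char (point-min))
      (while (not (eobp))
        ;; Move over a run of non-blanks; the return value is its length.
        (setq count (+ count (skip-chars-forward "^ \t\n")))
        ;; Move over the following run of blanks without counting it.
        (skip-chars-forward " \t\n")))
    count))
```

Something like (benchmark-run 1 (my-count-visible-chars-skipping)), evaluated in a large buffer, should give a first impression of how it compares with the character-by-character loop.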
I have to admit that, having now thought and written about this, I can’t wait to try out the Emacs profiler.