Content AND Presentation

2019-10-12 Challenge accepted - a Node.js grep

An old friend of mine posted a challenge on Twitter to implement a grep-like utility in one’s language of choice. Instead of complaining that he’s got an unfair advantage – he is a Pythonista, and Python is almost as well-suited to that kind of task as Perl – I decided to accept the challenge. Of course, I had to start with Emacs Lisp. For this, I decided to cheat a bit and use a buffer instead of a “file” or “stream” – this is, after all, the most natural data structure in Emacs for this kind of task.

After about ten minutes of coding, I came up with this:

(defun my-emacs-grep (regex)
  "Delete lines not matching REGEX in the current buffer."
  (interactive (list (read-regexp "Regex: ")))
  (goto-char (point-min))
  (while (not (eobp))
    (let ((eol-pos (line-end-position)))
      (if (re-search-forward regex eol-pos t)
	  (forward-line)
	(delete-region (point) (min (point-max) (1+ eol-pos)))))))

This is way shorter than delete-non-matching-lines, which is a built-in equivalent with some bells and whistles attached, but it seems to work fine.

Now, being a JavaScript programmer, I also had to code a JS-grep. This is actually quite an interesting task. Here is a first, naive attempt.

#!/usr/bin/env node
const fs = require('fs');
const re = new RegExp(process.argv[2]);
console.log(
	fs.readFileSync(process.stdin.fd, 'utf8')
		.split('\n')
		.filter(line => re.test(line))
		.join('\n')
);

This works, but not very well: it slurps everything from stdin into memory, which is not how the real grep works. On the other hand, it has the advantage of being very simple, and again it took less than ten minutes to code.
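One small wrinkle in the naive version: when the input ends with a newline, split('\n') returns an empty string as its last element, so a regex that also matches the empty string (say, x*) adds a spurious blank line to the output. Here is a minimal sketch of the same filter wrapped in a function that drops that trailing element. The name grepText is made up for this illustration, and the sample input is hard-coded so that the snippet runs without stdin.

```javascript
// Return the lines of `text` that match `pattern` (a regex source string).
function grepText(text, pattern) {
	const re = new RegExp(pattern);
	const lines = text.split('\n');
	// A trailing newline leaves an empty string at the end; it is not a line.
	if (lines[lines.length - 1] === '') {
		lines.pop();
	}
	return lines.filter(line => re.test(line));
}

// Prints the two lines starting with "b".
console.log(grepText('foo\nbar\nbaz\n', '^b').join('\n'));
```

This still holds the whole input in memory, so it only tidies up the output; it does not address the slurping problem.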

Anyway, let’s make a better one. Node.js has the readline library, and – quite helpfully – the manual has an example of reading a file line-by-line using it. After modifying it slightly I ended up with this:

#!/usr/bin/env node
const readline = require('readline');
const re = new RegExp(process.argv[2]);

const rl = readline.createInterface({
	// process.stdin is already a readable stream and, unlike '/dev/stdin',
	// it does not depend on the platform providing that device file
	input: process.stdin,
	crlfDelay: Infinity,
});

rl.on('line', line => (re.test(line) && console.log(line)));

The most interesting part is that it shows the web lineage of Node.js: even though Node.js has synchronous operations like fs.readFileSync, the readline library has an event-driven interface. This approach is not especially helpful when writing CLI scripts, but it shines in backends of web applications.
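For CLI scripts, the readline interface can also be consumed with a for await...of loop, which reads more like ordinary sequential code than the 'line' event does. Below is a minimal sketch under a few assumptions of mine: the function name grepStream is made up, the matching lines are collected into an array only to keep the demo easy to check, and the input is an in-memory stream built with Readable.from rather than stdin.

```javascript
const readline = require('readline');
const { Readable } = require('stream');

// Collect the lines of `input` (any readable stream) that match `pattern`.
async function grepStream(pattern, input) {
	const re = new RegExp(pattern);
	const rl = readline.createInterface({ input, crlfDelay: Infinity });
	const matches = [];
	// The readline interface is async-iterable: one iteration per line.
	for await (const line of rl) {
		if (re.test(line)) {
			matches.push(line);
		}
	}
	return matches;
}

// Demo on an in-memory stream delivered in two chunks.
grepStream('ba', Readable.from(['foo\nbar\n', 'baz\nqux\n']))
	.then(matches => console.log(matches.join('\n')));
```

A real script would pass process.stdin as the input and print each matching line inside the loop instead of collecting them, so that memory use stays constant.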

Anyway, here we have three implementations of a very simplistic grep. What should be done now is some benchmarking – but I guess this should wait until we have a Python version to compare with. :-)

CategoryEnglish, CategoryBlog, CategoryEmacs, CategoryJavaScript

2019-10-07 A tip with diffing (and committing) program structure changes

As I mentioned last week, the fact that diff works on a line basis is sometimes a source of trouble. Consider this case. Assume that we have a simple JavaScript module greets (I guess that a very similar case could be made for Java classes, Python modules etc.).

module.exports = {
	hello: function hello(who) {
		return `Hello ${who}!`;
	},

	bye: function bye(who) {
		return `Bye, ${who}!`;
	},
}

(Yes, it’s very primitive, it could make use of fat arrow functions etc., but please bear with me.)

Let us now assume that instead of exporting an object with two functions, we want to export a function accepting one argument (a language code) and returning an object containing the keys hello and bye, much like before. So, we introduce the following changes.

module.exports = function(locale) {
	if (locale === 'en') {
		return {
			hello: function hello(who) {
				return `Hello ${who}!`;
			},

			bye: function bye(who) {
				return `Bye, ${who}!`;
			},
		}
	}
}

Now, diffing the two results in a mess (note that I use a real diff instead of git-diff here, but this is irrelevant).

$ diff -u greets-1.js greets-2.js
--- greets-1.js 2019-09-15 22:54:33.421816250 +0200
+++ greets-2.js 2019-09-15 22:53:56.432355439 +0200
@@ -1,9 +1,13 @@
-module.exports = {
-       hello: function hello(who) {
-               return `Hello ${who}!`;
-       },
+module.exports = function(locale) {
+       if (locale === 'en') {
+               return {
+                       hello: function hello(who) {
+                               return `Hello ${who}!`;
+                       },

-       bye: function bye(who) {
-               return `Bye, ${who}!`;
-       },
+                       bye: function bye(who) {
+                               return `Bye, ${who}!`;
+                       },
+               }
+       }
 }

As you might imagine, for longer, real-life code the situation can get much worse. And if the whole thing is kept in Git, chances are that you are going to look at diffs a lot of the time, so they’d better be more readable!

There is, however, a simple trick which allows for (slightly) more readable diffs. Instead of committing this change in one go, let us first introduce a purely technical commit like this:

module.exports = {
			hello: function hello(who) {
				return `Hello ${who}!`;
			},

			bye: function bye(who) {
				return `Bye, ${who}!`;
			},
}

See what I did here? I have just indented everything inside the exported object by two more tabs. (Deciding what and how much to indent is, of course, a human’s call every time.)
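The purely technical change can of course be made in any editor, but to make the idea concrete, here is a small sketch that performs it mechanically. The function indentLines and its parameters are made up for this illustration; the line range and the number of tabs are still the human’s call mentioned above.

```javascript
// Indent lines `first`..`last` (1-based, inclusive) of `source` by `tabs`
// tabs, leaving blank lines untouched.
function indentLines(source, first, last, tabs) {
	const prefix = '\t'.repeat(tabs);
	return source
		.split('\n')
		.map((line, i) => {
			const lineNo = i + 1;
			const inRange = lineNo >= first && lineNo <= last;
			return inRange && line !== '' ? prefix + line : line;
		})
		.join('\n');
}

// The contents of the original greets module, one array element per line.
const greets = [
	'module.exports = {',
	'\thello: function hello(who) {',
	'\t\treturn `Hello ${who}!`;',
	'\t},',
	'',
	'\tbye: function bye(who) {',
	'\t\treturn `Bye, ${who}!`;',
	'\t},',
	'}',
].join('\n');

// Lines 2-8 are the body of the exported object.
console.log(indentLines(greets, 2, 8, 2));
```

The output is the intermediate version shown above: only whitespace changes, so the commit carries no change in behavior.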

Now, the diff between this and the previous version isn’t that bad:

$ diff -u greets-1.js greets-1b.js
--- greets-1.js	2019-09-15 22:54:33.421816250 +0200
+++ greets-1b.js	2019-09-21 08:22:17.826641850 +0200
@@ -1,9 +1,9 @@
 module.exports = {
-	hello: function hello(who) {
-		return `Hello ${who}!`;
-	},
+			hello: function hello(who) {
+				return `Hello ${who}!`;
+			},

-	bye: function bye(who) {
-		return `Bye, ${who}!`;
-	},
+			bye: function bye(who) {
+				return `Bye, ${who}!`;
+			},
 }

What is way more important, however, is the diff between this intermediate stage and the final one:

$ diff -u greets-1b.js greets-2.js
--- greets-1b.js	2019-09-21 08:22:17.826641850 +0200
+++ greets-2.js	2019-09-21 08:25:54.175578419 +0200
@@ -1,4 +1,6 @@
-module.exports = {
+module.exports = function(locale) {
+	if (locale === 'en') {
+		return {
			hello: function hello(who) {
				return `Hello ${who}!`;
			},
@@ -6,4 +8,6 @@
			bye: function bye(who) {
				return `Bye, ${who}!`;
			},
+		}
+	}
 }

See? This is now much more readable than before!

Now, there are perhaps other ways to solve this problem. Both the regular, GNU diff and git-diff have at least one option that might help here: --ignore-space-change, or -b for short. Here is the result of using it:

$ diff -bu greets-1.js greets-2.js
--- greets-1.js	2019-09-15 22:54:33.421816250 +0200
+++ greets-2.js	2019-09-21 08:25:54.175578419 +0200
@@ -1,4 +1,6 @@
-module.exports = {
+module.exports = function(locale) {
+	if (locale === 'en') {
+		return {
	hello: function hello(who) {
		return `Hello ${who}!`;
	},
@@ -6,4 +8,6 @@
	bye: function bye(who) {
		return `Bye, ${who}!`;
	},
+		}
+	}
 }

This is way better than what we started with, but I’d argue that it is still not as good as my semi-manual solution, since it uses the indentation from greets-1.js to show the “matching” lines (i.e., ones only differing by whitespace).
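To see why -b treats the re-indented lines as unchanged, here is a rough model of the comparison, as I understand the documentation: whitespace at the end of a line is ignored, and any other run of whitespace is considered equivalent to any other run. This is only a sketch of that rule, not how diff is actually implemented, and both function names are made up.

```javascript
// Normalize a line roughly the way -b compares it: trailing whitespace is
// dropped and every other run of whitespace counts as a single space.
function normalizeSpaceChange(line) {
	return line.replace(/\s+$/, '').replace(/\s+/g, ' ');
}

function sameIgnoringSpaceChange(a, b) {
	return normalizeSpaceChange(a) === normalizeSpaceChange(b);
}

// One tab versus three tabs of indentation: the same line under this rule.
console.log(sameIgnoringSpaceChange(
	'\thello: function hello(who) {',
	'\t\t\thello: function hello(who) {'
)); // true

// No indentation versus some indentation: still a difference.
console.log(sameIgnoringSpaceChange('hello', '\thello')); // false
```

The second case is why this trick works only for lines that were already indented; a line going from no indentation to some would need the stronger -w (--ignore-all-space) option.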

Yet another way to solve the problem of illegible diffs is to use something else in place of GNU diff. There are many such tools, and Git is capable of using them to show diffs. Run git difftool --tool-help to see the list, and git difftool -t <tool-name> <commit1> <commit2> to use a particular tool instead of a regular diff. I tried a few of them, and the results were varied. Some of them gave very nice diffs, some of them were closer to gibberish. Some of them gave nice diffs but colored in a way that didn’t help at all. In any case, I am not a great fan of GUI tools, but I admit that using e.g. kdiff3 instead of the regular GNU diff did help in this particular case, and basically rendered my trick with making two commits useless. On the other hand, this won’t help if you review pull requests in some web-based app like Bitbucket.

CategoryEnglish, CategoryBlog, CategoryJavaScript

2019-09-30 diff and ignoring lines

One of the best-known command-line tools is the classical diff program. On my system, it is (of course) the GNU diff, which is a part of the GNU diffutils package.

Recently, I found out that GNU diff has an interesting option, -I (or --ignore-matching-lines). You can give it a regex and it will ignore added or deleted lines if they contain a match for this regex.

This may be useful in many circumstances. Consider, for instance, INI-style files, with sections and assignments, like this:

[section]
variable=setting
another=something

[default]
this=doesnt
make=sense

[third]
one=more

Assume that you have another one, with identical sections but different settings:

[section]
variable=value
another=whatever


[default]
this=does
make=install

[different]
something=else

Of course, in real life, the files could be quite long, and we would like to know if they follow the same structure – in other words, disregard the settings (and blank lines) and only compare the section names. This is quite easy: diff -I '^$' -I '^[^[]' 1.ini 2.ini. (In fact, ignoring blank lines has its own shortcut, -B, so the first -I can be replaced by it.)

One caveat (which can be seen from the above example) is that diff only ignores the specified lines if the whole hunk consists of lines matching the regex. This may or may not be what you want, but remember that you can always pipe the result of diff through grep.
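The rule behind this caveat can be written down in a few lines. The sketch below is my reading of the documented behavior, not diff’s actual code: a hunk is dropped only if every inserted or deleted line in it matches at least one of the -I patterns. The function name hunkIsIgnored is made up, and JavaScript regexes stand in for the ones diff uses, which is close enough for these two simple patterns.

```javascript
// A hunk is skipped only if every inserted or deleted line in it
// matches at least one of the -I patterns.
function hunkIsIgnored(changedLines, patterns) {
	const regexes = patterns.map(p => new RegExp(p));
	return changedLines.every(line => regexes.some(re => re.test(line)));
}

const patterns = ['^$', '^[^[]'];

// Only settings and a blank line changed: the hunk is ignored.
console.log(hunkIsIgnored(
	['variable=setting', 'another=something', 'variable=value', 'another=whatever', ''],
	patterns
)); // true

// A section name changed in the same hunk as a setting: the whole hunk,
// settings included, is reported.
console.log(hunkIsIgnored(
	['[third]', 'one=more', '[different]', 'something=else'],
	patterns
)); // false
```

With the two INI files above, this is why the changed settings under the renamed section still show up in the output even though they match the regex.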

Consult the manpage of diff to learn more about its options. Various whitespace-ignoring possibilities may be of special interest.

As a side note, what I really miss is an AST-aware diff. Line-by-line comparison is nice, but often unsuitable for programs, which have an inherent tree structure.

CategoryEnglish, CategoryBlog
