2023-05-20 How to filter rows with duplicate ids in a csv

Some time ago I encountered an interesting problem. I had a CSV file (imported from a database), and I wanted to see only rows for which a certain column was duplicated. For example, like this (assume that the table is sorted by user_id, so the rows with the same ids are grouped):

Datapoint id,User id,Some more data
3,1,
1,2,
4,2,
2,3,
5,4,
6,4,
7,4,

So, imagine we have some datapoints (say, measurements), and each datapoint is associated with a user. I’d like to see only the rows showing datapoints for users who have more than one – so essentially this:

Datapoint id,User id,Some more data
1,2,
4,2,
5,4,
6,4,
7,4,

How to do that?

One way is to use the good old uniq. It turns out that you may tell it to skip N characters at the beginning of each line and stop comparing after M characters: -s N -w M. Coupled with -D (“print all duplicate lines”), we get exactly what we need (modulo the lack of the header line, which is not a big issue, and could probably be corrected with some clever invocation of pee and head). One problem with this solution is that it requires the widths of the columns to be constant. In my case it was not a problem, since they were UUIDs, but in the case of integers there would be problems.

Another solution would be to use a tool like AWK. This would probably work very well unless the CSV file contained (escaped) cells with newlines. Since AWK is line-based, and does not know proper CSV syntax, such a solution would break then (unless AWK has a way of dealing with things like that – I suppose it does not, but after all, it’s AWK…).

Yet another idea is to use a tool like Miller. I use Miller from time to time, and I’m almost sure it could handle this particular issue, but I would have to search through its manual to discover the way to do it. And I wanted a quick solution, so teaching myself more advanced AWK or Miller syntax was out of the question.

Well, I could (of course!) write some Elisp. But even though I like command-line tools, and I love Emacs, I wanted something really, really quick. So in the end, I went with the following trick with (gasp!) LibreOffice.

I introduced another column (moving all the other ones to the right), and inserted into the A2 cell a formula like this:

=OR(C2=C1,C2=C3)

and copied it to the whole A column. This way, I had true in A if, and only if, the corresponding cell in the C column had the same value as the next or previous one. Now the only thing that remained was to filter out all the false ones.

So, even though I have some issues with LibreOffice, I have to admit that there is a certain category of tasks where it’s really hard to beat a spreadsheet!

CategoryEnglish, CategoryBlog