Marcin Borkowski: 2025-10-13 Grouping rows by values in a column

A few days ago I needed to analyze some tabular data where some rows where grouped in a natural way. Imagine for example a database table which stores events, and every event is related to a user. For example, it could have a user_id column, a timestamp column and an event_type column. I want to see all the events, but events related to different users should be visually spearated (as they are completely unrelated). A natural way would be to sort the data by user_id first and by timestamp next. Ideally, groups of events related to different users would be separated by e.g. an empty row.

It seems that this is possible to achieve in PostgreSQL by some clever use of the lag window function, but I thought that maybe there’s an easier way. I could export the results of my query in the csv format and apply some clever Miller code. Trouble is, I don’t know Miller’s DSL.

After poking a bit on the Internet, I discovered that there’s no need to dabble in Miller’s DSL to do that – it has a separate verb just to do exactly what I want!

Given the following csv data

user_id,timestamp,event_type
1,2025-10-07T05:31:00Z,log_in
1,2025-10-07T05:32:00Z,edit_profile
1,2025-10-07T05:33:00Z,log_out
2,2025-10-07T05:30:00Z,log_in
2,2025-10-07T05:34:00Z,log_out

and applying mlr --csv gap -g user_id, I get

user_id,timestamp,event_type
1,2025-10-07T05:31:00Z,log_in
1,2025-10-07T05:32:00Z,edit_profile
1,2025-10-07T05:33:00Z,log_out
,,
2,2025-10-07T05:30:00Z,log_in
2,2025-10-07T05:34:00Z,log_out

which is great!

One problem I had with this is that on some machines I needed to do this, I only had access to an older version of Miller (v6.0.0) which did not add the commas in the empty rows – it just produced an empty line, which it then considers a “schema change” and outputs a new header whenever that happens. The solution was to say mlr --csv gap -g user_id then unsparsify.

Now, what I needed next was to plug this into psql. This is easy enough thanks to psql’s abilities – csv formatting and piping. While the most common way to tell psql “I’m done writing the query, execute it” is to type a semicolon at the end, you can also use the \g command which can take some parameters – for example, \g (format=csv) produces the results in csv format. You can also say \g |command to pipe the results into command – and you can obviously use both these features at once, too, so that generating a csv and piping it into Miller is easy. Of course, I don’t like to have to view the raw csv file – but I don’t have to. One option is to use mlr --icsv --opprint so that Miller’s output is in a nice, tabular format. Another option I prefer is to pipe Miller’s output to pspg, which can act as an insanely cool csv viewer. (Even if you don’t use any advanced features of pspg, seeing the header all the time even when you scroll through the table is very useful!)

select user_id, timestamp, event_type from event
order by user_id, timestamp
\g (format=csv) | mlr --csv gap -g user_id | pspg --csv

That’s it for today, see you next week!

CategoryEnglish, CategoryBlog, CategoryPostgreSQL