2023-09-18 Making Anki flashcards from subtitles

Those of you who follow my blog know that one of my hobbies is translating subtitles. The main reason I do this is to watch stuff with my daughter, who doesn’t yet speak English fluently. Some time ago it dawned on me that I could use my translations twice. Not only can I watch films and TV series with her, but I can also use them to help her learn English.

Here is the idea. Since I have both the original English subtitles and my Polish translation – and the srt files contain the timestamps – I could use ffmpeg to cut clips with individual sentences, extract Polish and English versions of the dialog and create Anki flashcards (almost) automatically!
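To make the role of the timestamps concrete, here is a minimal sketch in Python (not the Emacs Lisp code this post is about) of reading an srt file into a list of entries. It assumes the usual srt layout: a counter line, a timing line such as 00:00:01,000 --> 00:00:02,500, one or more text lines, and a blank line between entries. The names parse_srt and to_ms are made up for this illustration.

```python
import re

# Matches an srt timing line, e.g. "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_ms(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(content):
    """Return a list of (start_ms, stop_ms, text) tuples.

    Hypothetical helper: blocks without a timing line are skipped silently.
    """
    entries = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                fields = match.groups()
                text = "\n".join(lines[i + 1:])
                entries.append((to_ms(*fields[:4]), to_ms(*fields[4:]), text))
                break
    return entries
```

With start and stop times in milliseconds, both cutting clips and comparing two subtitle files become simple arithmetic.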

One thing that makes this a bit more difficult than it seems is that not every piece of dialog is suitable for importing into Anki. In many cases the translation is not (and cannot be) very faithful. For example, sometimes one subtitle contains only half (or less) of a sentence, and because of how English and Polish work, the order of these parts differs between the two versions. Sometimes parts of the dialog must be cut in translation because the characters speak too fast. And sometimes there are English cultural references which do not map to similar concepts in Polish, so I have to change things completely.

I decided that the best approach I could think of was to prepare the flashcards in three stages. The first stage (which is completely automatic) is merging the English source file and the Polish translation. This is done with the subed-anki-combine-subtitle-files command. In this stage, I create one srt file with subtitles in both languages – the Polish ones prefixed with Q: and the English ones prefixed with A:. (For some reason, Emacs Subed mode doesn’t like it when the subtitle text has that form – adding a space after the colon helped.) Coding this was fairly easy, although there’s one catch: when translating, I often tinker with the timestamps. This means that I cannot assume that the timestamps are exactly the same in both files. That is not a big problem, though – my code iterates over all subtitles in the “question” file (the Polish one) and for each of them finds the nearest subtitle in the “answer” file (the English one). (There are a few ways to define the “nearest” subtitle – I provided a function for one of them and stubs for two others in case anyone wants a different metric.)
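The nearest-subtitle matching can be sketched as follows. This is Python rather than the Emacs Lisp used by the actual commands, and several things here are assumptions made for illustration: the distance is the absolute difference of start times (the post does not say which metric is implemented), the merged entry keeps the timestamps of the “question” subtitle, and the names Subtitle, start_distance, nearest and combine are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start_ms: int
    stop_ms: int
    text: str


def start_distance(a, b):
    """One possible metric (an assumption): how far apart the start times are."""
    return abs(a.start_ms - b.start_ms)


def nearest(question, answers, distance=start_distance):
    """Return the answer subtitle closest to the question subtitle."""
    if not answers:
        raise ValueError("the answer file contains no subtitles")
    return min(answers, key=lambda answer: distance(question, answer))


def combine(questions, answers, distance=start_distance):
    """Merge two subtitle lists into one list of Q:/A: entries.

    Assumption: each merged entry keeps the question's timestamps.
    """
    merged = []
    for question in questions:
        answer = nearest(question, answers, distance)
        text = f"Q: {question.text}\nA: {answer.text}"
        merged.append(Subtitle(question.start_ms, question.stop_ms, text))
    return merged
```

Passing the metric as a parameter mirrors the idea of swappable definitions of “nearest”: a different function (say, one comparing the midpoints of the two subtitles) can be dropped in without touching the loop. Note that this linear scan checks every answer for every question, which is fine for a file with a few hundred subtitles.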

The second stage is manual – this is when I go through the merged subtitle file, delete the parts which are not suitable for flashcards, fix any issues with the timestamps, and so on. This can of course be done in Emacs Subed mode.

The third, final stage is automatic again – the merged and edited subtitle file is converted (using subed-anki-export) to a CSV file, which can be imported into Anki. Additionally, ffmpeg is called for every question/answer pair and a clip is put into a collection.media subdirectory of the current directory. After importing the CSV file into Anki and copying the clips to Anki’s media directory, the flashcards are ready for learning!
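A rough Python sketch of what such an export step might look like is below. It is not the code behind subed-anki-export, and it rests on several assumptions: every merged entry has the form “Q: …” followed by a line starting with “A: ”, the clip is attached to the question field using Anki’s [sound:…] media tag, clips are named clip-0001.mp4 and so on, and the cut uses ffmpeg’s -ss and -to options after the input file. All function names are made up for this illustration.

```python
import csv
import io


def ms_to_ffmpeg(ms):
    """Format milliseconds as HH:MM:SS.mmm, a time syntax ffmpeg accepts."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def ffmpeg_command(video, start_ms, stop_ms, clip):
    """Build (but do not run) the argument list for cutting one clip."""
    return ["ffmpeg", "-y", "-i", video,
            "-ss", ms_to_ffmpeg(start_ms), "-to", ms_to_ffmpeg(stop_ms), clip]


def split_card(text):
    """Split 'Q: ...\\nA: ...' into (question, answer); the format is an assumption."""
    question, answer = text.split("\nA: ", 1)
    if question.startswith("Q: "):
        question = question[len("Q: "):]
    return question, answer


def export(cards, video, media_dir="collection.media"):
    """cards: iterable of (start_ms, stop_ms, text) tuples.

    Returns the CSV text and the list of ffmpeg commands to run.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    commands = []
    for index, (start_ms, stop_ms, text) in enumerate(cards, start=1):
        question, answer = split_card(text)
        clip = f"clip-{index:04d}.mp4"  # hypothetical naming scheme
        commands.append(ffmpeg_command(video, start_ms, stop_ms, f"{media_dir}/{clip}"))
        writer.writerow([f"{question} [sound:{clip}]", answer])
    return buffer.getvalue(), commands
```

Keeping command construction separate from execution makes the sketch easy to inspect; each command could then be run with subprocess.run(command, check=True) once the media directory exists.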

The code is not very elegant – in part because it is still sort of a proof of concept, in part because the whole thing has a distinct DIY feel (I have to admit that the UX with all these stages is rather poor, but I didn’t have any idea how to do it better, at least not without a lot of work), and in part because it is an inherently complex process. If you want to try doing this yourself, I uploaded the code to GitLab. I will certainly be preparing lots of flashcards very soon!
