Content AND Presentation

2023-09-30 Confirming potentially dangerous actions

A common situation is when I want to do something on my computer with possibly destructive consequences, like deleting a file, truncating a database table etc. One common approach to making sure this won’t happen accidentally is to require the user to press y or even type yes (this is what Emacs does with its y-or-n-p and yes-or-no-p functions). Some time ago I had a similar need in a shell script. I was afraid, however, that even requiring the user to type yes is not enough – it is easy to condition oneself to type a three-letter word without much thinking, after all.

So, I came up with the idea of telling the user to type a different word every time. More precisely, the word to type is selected at random from a large list; in theory the same word may come up twice, but that is very improbable. (Interestingly, another person had almost the same idea very recently, too.)

And here is the code that accomplishes exactly that. You can put it at the beginning of a script doing some dangerous thing, or wrap it in a Bash function and call it only in the appropriate place, etc.

WORD=$(grep -E '^[a-z]{4,5}$' /usr/share/dict/words | shuf -n1)
read -rp "Type the word '$WORD' if you really want to turn the friction contrafibulator on: " TYPED
if [ "$WORD" != "$TYPED" ]; then
        echo You typed \'"$WORD"\' wrong, aborting.
        exit 1
fi

I decided to go only with 4- or 5-letter words, so that I don’t have to type anything too short or too long. Of course, this creates a small risk of getting, well, certain 4-letter words, but let’s not overthink that. On my machine, grep -E '^[a-z]{4,5}$' /usr/share/dict/words | wc -l reports well over 7000 results, which is more than enough for my purpose.
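
The “wrap it in a function” variant might look like the sketch below. The function name and the optional word-list argument are hypothetical, and the check for an empty word is an addition: without it, a missing word list would make an empty answer count as a confirmation.

```shell
# Sketch: ask the user to retype a randomly chosen word before proceeding.
# confirm_with_random_word and its optional argument are hypothetical names.
confirm_with_random_word() {
    wordlist=${1:-/usr/share/dict/words}
    # Pick one random lowercase word of 4 or 5 letters.
    word=$(grep -E '^[a-z]{4,5}$' "$wordlist" | shuf -n1)
    # Refuse to continue if no word was found (e.g. the list is missing);
    # otherwise an empty answer would match the empty word.
    if [ -z "$word" ]; then
        echo "No suitable word found in $wordlist, aborting." >&2
        return 1
    fi
    printf "Type the word '%s' to confirm: " "$word" >&2
    read -r typed
    if [ "$word" != "$typed" ]; then
        echo "You typed '$word' wrong, aborting." >&2
        return 1
    fi
}
```

A dangerous script could then start with a line such as confirm_with_random_word || exit 1.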

I installed this code in one of the scripts I’ve written which must not be run accidentally (but which I need to run from time to time), and so I’ll see how it works – so far, it seems to do its job.

CategoryEnglish, CategoryBlog


2023-09-18 Making Anki flashcards from subtitles

Those of you who follow my blog know that one of my hobbies is translating subtitles. The main reason I do this is to watch stuff with my daughter, who doesn’t yet speak English fluently. Some time ago it dawned on me that I can use my translations twice. Not only can I watch films and TV series with her, but I can use them to help her learn English.

Here is the idea. Since I have both the original English subtitles and my Polish translation – and the srt files contain the timestamps – I could use ffmpeg to cut clips with individual sentences, extract Polish and English versions of the dialog and create Anki flashcards (almost) automatically!

One thing which makes this a bit more difficult than it seems is that not every piece of dialog is suitable for importing into Anki. In many cases the translation is not (and cannot be) very faithful. For example, sometimes one subtitle contains only half (or less) of the sentence, and due to how English and Polish work, the order of these parts is different in both versions. Sometimes parts of dialog must be cut in translation because the characters speak too fast. And sometimes there are English cultural references which do not map to similar concepts in Polish, so I have to change things completely.

I decided that the best approach I can think of is to prepare the flashcards in three stages. The first stage (which is completely automatic) is merging the English source file and the Polish translation. This is done with the subed-anki-combine-subtitle-files command. In this stage, I create one srt file with subtitles in both languages – the Polish ones prepended by Q: and English ones prepended by A:. (For some reason, Emacs Subed mode doesn’t like it when the subtitle text has that form – adding a space after the colon helped.) Coding this was fairly easy, although there’s one catch: when translating, I often tinker with the timestamps. This means that I cannot assume that the timestamps are exactly the same in both files. That is not a big problem, though – my code iterates over all subtitles in the “question” file (the Polish one) and for each of them finds the nearest subtitle in the “answer” file (the English one). (There are a few ways to define the “nearest” subtitle – I provided a function for one of them and stubs for two other ones in case anyone wants a different metric.)
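
The “nearest subtitle” matching can be illustrated with a small sketch. This is not the Emacs Lisp code from the package; it assumes (hypothetically) that each srt file has already been reduced to lines of the form start time in milliseconds, a tab, and the text, and it uses the absolute difference of start times as the metric, which is only one of the possible definitions of “nearest”. The function name is made up.

```shell
# Sketch: for every "question" subtitle, find the "answer" subtitle whose
# start time is closest.  Both input files are assumed to hold one subtitle
# per line as "start_ms<TAB>text" (a hypothetical intermediate format).
# Usage: pair_nearest_subtitles QUESTIONS_FILE ANSWERS_FILE
pair_nearest_subtitles() {
    awk -F '\t' '
        # First file (answers): remember start times and texts.
        NR == FNR { start[FNR] = $1; text[FNR] = $2; n = FNR; next }
        # Second file (questions): linear scan for the nearest answer.
        {
            best = 0
            for (i = 1; i <= n; i++) {
                diff = $1 - start[i]
                if (diff < 0) diff = -diff
                if (best == 0 || diff < bestdiff) { best = i; bestdiff = diff }
            }
            if (best) printf "Q: %s\nA: %s\n\n", $2, text[best]
        }
    ' "$2" "$1"
}
```

The linear scan is quadratic in the number of subtitles, which should be fine for a single episode; the sketch also assumes the answers file is not empty.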

The second stage is manual – it is now that I can go through the merged subtitle file, delete the parts which are not suitable for flashcards, fix any issues with the timestamps etc. This can of course be done in Emacs Subed mode.

The third, final stage is automatic again – the merged and edited subtitle file is converted (using subed-anki-export) to a csv, which can be imported into Anki. Additionally, ffmpeg is called for every question/answer pair and a clip is put into a collection.media subdirectory of the current directory. After importing the csv into Anki and copying the clips to its directory the flashcards are ready for learning!
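
The clip-cutting step might be sketched in shell as follows. The actual ffmpeg invocation used by subed-anki-export is not shown in this post, so the function names, the file-naming scheme and the mp4 output format are all assumptions; only the srt timestamp format (HH:MM:SS,mmm) and the collection.media directory come from the description above.

```shell
# Sketch: cut one clip per subtitle with ffmpeg (names are hypothetical).

# srt uses a comma before the milliseconds; ffmpeg expects a dot.
srt_time_to_ffmpeg() {
    printf '%s\n' "$1" | tr ',' '.'
}

# Usage: cut_clip VIDEO START END NAME
# START and END are srt timestamps; the clip is re-encoded and written
# to collection.media/NAME.mp4 in the current directory.
cut_clip() {
    mkdir -p collection.media
    ffmpeg -i "$1" \
        -ss "$(srt_time_to_ffmpeg "$2")" \
        -to "$(srt_time_to_ffmpeg "$3")" \
        "collection.media/$4.mp4"
}
```

For example, cut_clip episode.mkv 00:01:02,500 00:01:05,000 clip-0001 would cut two and a half seconds starting at 1:02.5 (the file names here are made up).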

The code is not very elegant – in part because it is still sort of a proof-of-concept, in part because the whole thing has a distinct DIY feeling (I have to admit that the UX with all these stages is rather poor, but I didn’t have any idea how to do it better, at least not without a lot of work), and in part because it is an inherently complex process. If you want to try doing this yourself, I uploaded the code to GitLab. I will certainly be preparing lots of flashcards very soon!

CategoryEnglish, CategoryBlog, CategoryEmacs


2023-09-02 Irregular recurring TODOs in Org mode, part I

Warning: this is the first part of a series which is not even finished yet. And even though it’s not the whole story, it is still a long, a bit meandering post, with quite a lot of code.

Some time ago I mentioned a very peculiar type of TODOs I’d like to implement. These are things I’d like to do from time to time, but not necessarily on a regular basis. A canonical example is an inspirational blog post I’d like to reread once in a while. I admit that this idea is inspired by spaced repetition, where things I want to remember are presented to me repeatedly, but with increasing intervals. Here, however, the situation is a bit different. First of all, I don’t really need to remember these things actively – I just want to be reminded of them from time to time in the future. The second difference is that I’m not sure if increasing intervals would be the best choice here. In classical spaced repetition algorithms the intervals grow exponentially, so after, say, 10 repetitions of something remembered well, the intervals can become so long you are basically guaranteed to see that item at most once or twice again in your life. In this case, I still want to read that blog post a few times – maybe even once every two or three years, but not once per decade or even less often! The third issue is that with classical SR I can have days without any repetitions as well as days with many of them. Here, I’d prefer to be shown the same number of “recurring TODOs” per day (preferably one, but I assume that if I like the system, I may have more of them).

So, after some consideration, here is the heuristic I came up with. (Spoiler alert: it’s not the one that will get implemented eventually.) First of all, the “items” are going to be Org mode headlines (well, that is pretty obvious;-)) – they may contain links, but they may also be just pieces of text stored on my disk. I’d prefer to be flexible and not require all of them to be stored on the same level of Org hierarchy – for example, I might want a tree structure, with broad categories (like “blog posts”, “quotations” or “ideas for things to do”) as level-one headlines, narrower categories (like blog posts about “faith”, “computing” or “languages”) as level-two headlines etc.

When I’m introducing a new item to my system, it should be set to be shown again after, say, a random number of days between a week and a month (more or less). When I’m shown an item again, I should be able to decide what to do next – either schedule it for later with the interval doubled compared to the previous one, or schedule it for later for the interval shortened to 7-30 days (like in the beginning), or use an interval similar in length to the previous one.
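
To make the three choices concrete, here is a rough shell sketch of the heuristic. The function name and the keywords double, reset and same are hypothetical, and “similar in length to the previous one” is simplified to “exactly the same interval”.

```shell
# Sketch of the scheduling heuristic (names and keywords are hypothetical).
# Usage: next_interval PREVIOUS_INTERVAL_IN_DAYS CHOICE
next_interval() {
    case $2 in
        # Schedule for later, with the interval doubled.
        double) echo $(( $1 * 2 )) ;;
        # Start over: a random interval between a week and a month.
        reset)  shuf -i 7-30 -n 1 ;;
        # "Similar in length" is simplified here to "exactly the same".
        same)   echo "$1" ;;
        *)      echo "unknown choice: $2" >&2; return 1 ;;
    esac
}
```

A newly added item would get its first interval from next_interval 0 reset.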

This looks clever, and in fact much more similar to classic SR than I initially expected, but I am a bit worried whether it’s sustainable. Let’s do some estimation. Assume that I’m going to add one item every five days to the system. This means that after a year I’ll have about 73 items to review regularly. Assuming the first interval to be about 20 days, the subsequent one roughly 40 days, and all the next ones 80 days, it gives me an average load of roughly one item per day after a year of using the system (73 items, each coming back every 80 days or so). Let’s also assume that I’m going to use that system for the next 5 years (which I think is quite conservative – a more generous but still realistic estimate could well be 20 years). This means that I’ll have about 365 items and so roughly 5 of them to review every day. A bit too much.
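
The estimate can be checked with a few lines of shell and awk. This is only a back-of-the-envelope sketch (the function name is made up): it assumes one new item every 5 days and treats every item as coming back once every 80 days, ignoring the shorter first two intervals.

```shell
# Rough load estimate: one new item every 5 days, and (eventually) one
# review of each item every 80 days.
# Usage: estimate_load YEARS
estimate_load() {
    awk -v years="$1" 'BEGIN {
        items = years * 365 / 5
        printf "%d items, %.1f reviews per day\n", items, items / 80
    }'
}

estimate_load 1   # 73 items, 0.9 reviews per day
estimate_load 5   # 365 items, 4.6 reviews per day
```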

The tinkerer in me now wants to devise an extremely complex system where there is a cap for the number of items to be reviewed per day, the latest date (kept separately for every item) when I want it to be reviewed, a way for more important items to shift the review dates of the less important ones to later… This is fun to think about, probably a bit less fun to implement and almost surely not fun at all to use.;-) So, let’s keep things as simple as possible. Here is another idea. (Spoiler alert: it’s also not the final one.) Since I do not want to get 0 items on some days and 8+ items on other days (which can happen with classical SR), I could turn the whole thing on its head and just ask the system for “the most important/urgent thing to review now”. On a busy day, I might do it once; on a less busy one, I could do it 2-3 times. This means that for every item I should store some data like “when it was reviewed for the last time” and “how many times it was reviewed”, and devise some function which would then calculate the “urgency” of that item. (Note: I know about the famous Eisenhower matrix, but in this case, I consider all the items “equally important”, so I can just sort them by urgency and that’s fine.)

Here is one idea that came to my mind. Let’s have an item last reviewed D days ago, and reviewed N times altogether (the act of entering the item into the system is considered the first review, so N>0). The higher D is, and the lower N is, the more urgent the item is. Let’s compute U(N,D) := N²/D (excluding the items reviewed today) and consider the items with lower U to be more urgent. The square is there to make items reviewed more times less and less urgent, so that the average intervals between reviews will grow.

This formula implies a few things. First of all, items reviewed fewer times will have a preference over those reviewed many times. As (N+1)²/N² tends to 1 as N→∞, this preference will (in a sense) become less pronounced over time. For example, consider two items, one newly entered (N=1) and one entered and then reviewed once more (N=2). To achieve the same urgency, the interval between the last review and today will have to be four times larger for the latter item. On the other hand, if we have two items, one reviewed four times and one reviewed five times, the ratio of the intervals to have the same urgency will need to be only equal to (5/4)²≈1.6.
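
The two ratios above, and the ones in between, can be reproduced with a short sketch (the function name is made up). It prints (N+1)²/N², that is, how many times longer an item reviewed N+1 times has to wait than an item reviewed N times before both reach the same U.

```shell
# Ratio of waiting times giving equal U = N^2 / D for items reviewed
# N+1 and N times: (N+1)^2 / N^2.
# Usage: interval_ratio N
interval_ratio() {
    awk -v n="$1" 'BEGIN { printf "%.4f\n", (n + 1) ^ 2 / n ^ 2 }'
}

for n in 1 2 3 4 5; do
    printf 'N = %s: ' "$n"
    interval_ratio "$n"
done
```

For N = 1 this gives 4.0000 and for N = 4 it gives 1.5625, matching the examples in the text.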

The next thing is, the more time has elapsed since the last review, the more “urgent” the item is. This is pretty obvious, though the urgency will increase over time rather slowly. Another formula I considered was U(N,D) := N²/D^(1+ε) for ε somewhere between 0.001 and 0.01, where longer intervals “contribute more” to the urgency. The obvious downside of that formula is its complexity, which makes it more difficult to analyze.

Probably the most important thing about my approach is that I’d like to have a guarantee that no item will be postponed indefinitely – in other words, I’d like to be sure that every item will have the lowest value of U of them all (that is, will be the most urgent one) after a finite amount of time. Surprisingly, this seems a non-trivial property to prove! The intuition goes like this: every item’s U tends towards zero, and every review will make that item’s U jump above 1, so if we have some item I such that n items have U lower than I, its U will be the lowest one after (n-1) days. This “proof” has one flaw – it silently assumes that the change in urgencies happening over time does not change the order of items. This, however, is simply not true! For example, consider an item A reviewed for the second time 8 days ago and an item B reviewed for the third time 18 days ago. Today both have U = 0.5 (4/8 and 9/18). Assuming neither A nor B is reviewed from yesterday to tomorrow, this means that A was less urgent than B yesterday, but it will be more urgent tomorrow!
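
The change of order in this example is easy to check numerically. The sketch below (with a made-up helper name) evaluates U = N²/D for both items yesterday, today and tomorrow; remember that a lower U means a more urgent item.

```shell
# U(N, D) = N^2 / D; a lower value means a more urgent item.
# Usage: urgency N D
urgency() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.3f\n", n * n / d }'
}

# Item A: 2 reviews, the last one 8 days ago (as of today).
# Item B: 3 reviews, the last one 18 days ago (as of today).
echo "yesterday: U(A) = $(urgency 2 7), U(B) = $(urgency 3 17)"
echo "today:     U(A) = $(urgency 2 8), U(B) = $(urgency 3 18)"
echo "tomorrow:  U(A) = $(urgency 2 9), U(B) = $(urgency 3 19)"
```

Yesterday U(A) = 0.571 was above U(B) = 0.529, today both are 0.500, and tomorrow U(A) = 0.444 is below U(B) = 0.474, so the two items swap places without either of them being reviewed.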

Unless there exists some simple trick which escapes me now, I cannot prove that every item will be reviewed after a finite number of days, even with the simplifying assumption that no items are added to the system.

Well. If I can’t prove it, so be it. Let’s don my programmer’s hat and make a simulation. I whipped up some code to simulate doing reviews (a given number of them every day) while introducing a new item with a given probability (so 0.2 would mean one new item every 5 days on average). For reference, here is the code – it is definitely not the most beautiful thing in the world, but it is just a prototype to perform some experiments and then be got rid of.

;; Recurring TODOs - simulation

(require 'cl-lib)

(defvar recurring-todos ()
  "A list of \"TODO items\" as plists -- the properties are :id (an
integer) and :reviews (dates of review, integers, decreasing).")

(defvar recurring-next 0
  "The next value of :id.")

(defvar recurring-date 0
  "The \"date\" (number of days elapsed).")

(defvar recurring-buffer-name "*Recurring TODOs simulation data*"
  "Data about recurring TODOs simulation as csv.  Every row
corresponds to one review (including the first one, i.e.,
addition of the item to the system).")

(get-buffer-create recurring-buffer-name)
(with-current-buffer recurring-buffer-name
  (insert "date,id,review,interval\n"))

(defun recurring-add-review-datapoint (id date review interval)
  "Add a datapoint about a review to buffer `recurring-buffer-name'."
  (with-current-buffer recurring-buffer-name
    (goto-char (point-max))
    (insert (format "%s,%s,%s,%s\n"
                    date id review interval))))

(defun recurring-add-todo ()
  "Add a new recurring todo to `recurring-todos'."
  (let ((new-item (list :id recurring-next
                        :reviews (list recurring-date))))
    (push new-item recurring-todos)
    (recurring-add-review-datapoint recurring-next
                                    recurring-date
                                    1
                                    "")
    (cl-incf recurring-next)))

(defun recurring-next-day ()
  "Increment `recurring-date'."
  (cl-incf recurring-date))

(defun recurring-last-review (todo)
  "The date of the last review of TODO."
  (car (plist-get todo :reviews)))

(defun recurring-number-of-reviews (todo)
  "The number of reviews of TODO so far."
  (length (plist-get todo :reviews)))

(defun recurring-urgency (date todo)
  "Compute the urgency of TODO."
  (let ((n (recurring-number-of-reviews todo))
        (d (- date
              (recurring-last-review todo))))
    (/ (* n n) d 1.0)))

(defun recurring-review (todo)
  "Review TODO.  Destructive."
  (when todo
    (recurring-add-review-datapoint (plist-get todo :id)
                                    recurring-date
                                    (1+ (length (plist-get todo :reviews)))
                                    (- recurring-date (car (plist-get todo :reviews))))
    (push recurring-date (plist-get todo :reviews))))

(defun recurring-find-most-urgent (date todo-list)
  "Return the most urgent todo."
  (let* ((result nil)
         (urgency most-positive-fixnum))
    (mapc (lambda (todo)
            (let ((new-urgency (recurring-urgency date todo)))
              (when (< new-urgency urgency)
                (setq urgency new-urgency
                      result todo))))
          (cl-remove-if (lambda (todo)
                          (= (recurring-last-review todo)
                             date))
                        todo-list))
    result))

(defun recurring-reset ()
  "Reset the recurring reviews simulation."
  (setq recurring-todos ()
        recurring-next 0
        recurring-date 0))

(defun recurring-simulate (iterations new-frequency review-frequency)
  "Simulate ITERATIONS days of reviewing TODOs.
NEW-FREQUENCY is the probability of adding a new TODO every day.
REVIEW-FREQUENCY is the number of reviews done every day.  Do not
reset the variables, so that a simulation can be resumed."
  (dotimes-with-progress-reporter
      (_ iterations)
      "Simulating reviews..."
    (when (< (cl-random 1.0) new-frequency)
      (recurring-add-todo))
    ;; Review the most urgent item REVIEW-FREQUENCY times; items already
    ;; reviewed today are skipped by `recurring-find-most-urgent'.
    (dotimes (_ review-frequency)
      (recurring-review (recurring-find-most-urgent recurring-date recurring-todos)))
    (recurring-next-day)))

So, my first experiment was to run one year of simulation, with one review per day and the probability of introducing a new item equal to 0.2. After running my code I imported the resulting CSV to SQLite and ran a few queries to analyze it.

The number of items reached 84. It turned out that 5 items were reviewed 9 times and the average interval between reviews for those 5 items was about 142 days. The maximum interval between reviews turned out to be 262 days. The one item which achieved such a long interval between reviews was reviewed on days 20, 21, 22, 25, 30, 40, 55, 89, and 351.

The next experiment assumed the same probability of adding a new item, but now I ran the simulation for 10 years. Again, the item with the longest interval between reviews had the intervals between them rise (more or less) exponentially. Each of reviews 2-9 happened less than two weeks after the previous one, and then every interval was at least 10 times longer than the previous one!

So, back to square one. It turns out that my initial approach was pretty elegant mathematically, but practically useless. How surprising.

(to be continued…)

CategoryEnglish, CategoryBlog, CategoryEmacs, CategoryOrgMode

