2019-08-19 Transplanting a directory to another Git repository

Recently, I had a very specific need. I wanted to move a directory to another Git repo, but I really needed to preserve its history.

There is a quite well-known instance of a similar thing – the famous coolest merge ever is basically importing one project into another, preserving its history. My use-case, however, was a bit more difficult because I wanted to simultaneously move things to a directory with another name. (I could, of course, start with making a temporary clone – or a branch – in the “source” repo, delete everything I do not want to merge in it, change the directory structure to reflect what I want in the “destination” repo and commit all these changes. I wanted to avoid that, though. One of the reasons is that file renaming, while supported by Git, introduces unnecessary complications when analysing history.)

So, let’s get started. Assume that we have two Git repos, source and dest. In the source repo we have (among other things) a subdirectory called directory, and we want to move all files from it to a directory called folder in dest.

To begin with, let us create the repos. (On my machine, /mem is a scratch directory, much like /tmp, with the advantage that it only contains stuff I put there instead of a whole lot of things some random programs decide to save in /tmp. Also, as the name suggests, it resides in a ramdisk, so nothing there sticks for too long.)

cd /mem
rm -rf source dest
git init source
git init dest
cd source
mkdir directory
echo "some file in the root dir" > some-file.txt
echo "another file in the directory dir" > directory/another-file.txt
git add .
git status
git commit -m "Initial commit"
echo "added line" >> directory/another-file.txt
git commit -am "Add a line"
echo "An unrelated commit" >> some-file.txt
git commit -am "An unrelated commit"
echo "A commit spanning everything" >> some-file.txt
echo "A commit spanning everything" >> directory/another-file.txt
git commit -am "Make huge changes"
cd ../dest
echo "The destination repo" > README.txt
git add README.txt
git commit -m "Add README.txt"

We now have two simple Git repos to experiment with. (Note that because of rm -rf, the snippet above will recreate them from scratch every time it is run, which is quite convenient for experimentation.)

If we were happy to just merge everything from source to dest, things would be very easy:

git remote add source ../source
git fetch source
git merge source/master --no-edit --allow-unrelated-histories

Note the options for git merge. The man page says explicitly that usually you do not want --no-edit, but since I want a smooth presentation of the main ideas here and not manually crafted merge commit messages, this is exactly what I need. The option --allow-unrelated-histories (which was the default in older versions of Git) is pretty self-explanatory.

This approach works well if we just want to merge two repositories (solving conflicts should they arise, but this is another story), but it is not what we want here. The first problem is that it also imports some-file.txt, and we only want directory and its contents. (For bonus points, notice how the last commit in source, “Make huge changes”, touches both a file in directory and a file outside of it – we would like to perform some surgery on this commit to preserve only the modification to another-file.txt.)

Well, this is Git, so all this is perfectly doable. There is even a dedicated Git command solving a very similar issue, called git-subtree. We will not resort to it, however (I will probably write another post on git-subtree one day), using lower-level git-filter-branch, git-read-tree and a few other commands instead.

In fact, a ready solution for the hard part can be easily found on the Internet. What I aim to do here is to (try to) explain the meaning of the commands involved. Note that, like in my previous post, I found this out by careful experiments and studying the manual, not by reading the Git sources, so there may be mistakes. Please point them out in the comments should you spot any.

So, let’s get down to business. First, we will excise everything but directory from the source repo (this is actually the easy part):

cd /mem
rm -rf source-tmp
git clone source source-tmp
cd source-tmp
git filter-branch --prune-empty --subdirectory-filter directory -- --all

Note how the contents of directory have just migrated to the root directory of our repo. Also, inspect the history and note how the “Make huge changes” commit now only touches the another-file.txt (which is logical, since it has nothing else to touch now, but still nice).
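If you want to check both observations without eyeballing the log, something like the sketch below should do. It is only an illustration under a few assumptions of mine: it works in a throwaway directory created by mktemp, uses a made-up "demo" identity, builds a shortened three-commit version of source, and filters that repo directly instead of a clone. FILTER_BRANCH_SQUELCH_WARNING merely silences the warning newer Git versions print before running filter-branch.

```shell
set -e
# Throwaway identity so that commits work even without a Git config
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"
git init -q source
cd source
git symbolic-ref HEAD refs/heads/master   # same branch name as in the post
mkdir directory
echo "root" > some-file.txt
echo "inside" > directory/another-file.txt
git add .
git commit -q -m "Initial commit"
echo "unrelated" >> some-file.txt
git commit -q -am "An unrelated commit"
echo "both" >> some-file.txt
echo "both" >> directory/another-file.txt
git commit -q -am "Make huge changes"

git filter-branch --prune-empty --subdirectory-filter directory -- --all >/dev/null 2>&1

# Files tracked at the root of the rewritten repo
files=$(git ls-files)
# Subjects of the commits that survived, newest first
subjects=$(git log --format=%s)
# Paths touched by the rewritten "Make huge changes" commit
touched=$(git diff-tree --no-commit-id --name-only -r HEAD)

echo "files:    $files"
echo "subjects: $subjects"
echo "touched:  $touched"
```

In this reduced history, the unrelated commit should disappear altogether, since it never touched directory.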

Interestingly, we have given the --all parameter to make Git operate on all the references (not “all the commits”!), local and remote. Without it, the history on branches other than “master” would be unaffected, thus leaving a terrible mess. Also, a good thing to know is that git-filter-branch will create a directory called .git/refs/original, where it stores all references it has changed. This means that the whole operation is easily undoable – just move .git/refs/original/refs to .git/refs, overwriting everything in the process, and you are done. (In particular, we did not really have to create source-tmp – but throwing it away is easier than manipulating stuff within .git.) You may read more about --all in e.g. the manpage of the plumbing command git-rev-list.
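Instead of moving files around inside .git by hand, the backup references can also be used through ordinary Git commands. The following is a minimal sketch of that idea, not something I needed in practice: it assumes a throwaway repo with a single master branch and a made-up "demo" identity, and it restores the branch with git reset --hard pointed at the backup reference.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master
mkdir directory
echo "root" > some-file.txt
echo "inside" > directory/another-file.txt
git add .
git commit -q -m "Initial commit"

before=$(git rev-parse HEAD)
git filter-branch --subdirectory-filter directory -- --all >/dev/null 2>&1
rewritten=$(git rev-parse HEAD)

# filter-branch keeps the old value of every ref it changed under refs/original
backup=$(git for-each-ref --format='%(refname)' refs/original)
echo "backup ref: $backup"

# Point master (and the working tree) back at the pre-filter commit
git reset -q --hard refs/original/refs/heads/master
restored=$(git rev-parse HEAD)
echo "before: $before, rewritten: $rewritten, restored: $restored"
```

With more branches, each of them would need the same treatment, which is one more reason why throwing away a temporary clone is the simpler route.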

Another thing worth mentioning is the --prune-empty switch. Here, things get a bit hazy for me. The manual says that its aim is to remove empty commits (apart from merge commits), but a quick experiment showed that the command seems to work the same way without it. (I asked about it on StackOverflow and learned that indeed, --prune-empty is superfluous in this case.)

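Now we need to import (merge) our temporary repo into dest. (If you tried the plain merge shown earlier, recreate the repos with the setup snippet first: dest would otherwise already contain the merged history, and git remote add would fail because a remote called source already exists.)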

cd /mem/dest
git remote add -f source ../source-tmp
git merge -s ours --no-commit --allow-unrelated-histories source/master

Now this is where the fun starts. First we add our temporary repo as a remote (and immediately fetch from it, using the -f option) – this is simple. Then, we prepare the merge. First of all, we supply the ours merge strategy. (Note: this must not be confused with the ours option for the recursive merge strategy. See the manpage of git-merge for more information.) This means that the “merge” will actually completely disregard the tree in the merged-in heads. In other words, after an ours merge, while the history will look as if two (or more) branches have been merged, all the changes from the merged-in branches will be completely lost. (This may actually be useful in rare cases, I guess.)

The next thing is the --no-commit option. It seems obvious, but actually it is not so in this case. I mean, with a “normal” merge, this just leaves the last step (the actual commit) to the user (much like in the case of conflicts). However, you might wonder what this does in the case of the ours strategy we have used. Turns out, the only thing our merge command does is update a few files in the .git directory: ORIG_HEAD (the reference to the head before the merge started – this reference is actually written by more operations in Git so that undoing is easier), MERGE_MSG (pretty obvious), MERGE_MODE (no-ff in our case, which is not surprising) – this seems a bit, erm, underdocumented, but I found some information here, and – most importantly – MERGE_HEAD, which contains the reference to the branch we are merging in (source/master). (In the case of an octopus merge, this file contains more references, of course.)
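The state left behind by such a merge is easy to look at directly. Here is a small sketch, with two tiny unrelated repos (src and dest, names and contents made up for this example) in a temporary directory. It prints the files mentioned above and checks that nothing at all has been staged.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
# Two unrelated one-commit repos
for repo in src dest; do
    git init -q "$repo"
    git -C "$repo" symbolic-ref HEAD refs/heads/master
    echo "$repo" > "$repo/file-in-$repo.txt"
    git -C "$repo" add .
    git -C "$repo" commit -q -m "Initial commit in $repo"
done

cd dest
git remote add -f source ../src >/dev/null 2>&1
git merge -s ours --no-commit --allow-unrelated-histories source/master >/dev/null 2>&1

merge_head=$(cat .git/MERGE_HEAD)
expected=$(git rev-parse source/master)
echo "MERGE_HEAD: $merge_head"
echo "MERGE_MODE: $(cat .git/MERGE_MODE)"
echo "MERGE_MSG:  $(head -n 1 .git/MERGE_MSG)"

# The index still matches HEAD, i.e. the pending merge carries no changes
if git diff --cached --quiet; then staged=no; else staged=yes; fi
echo "anything staged: $staged"
```

MERGE_HEAD should hold exactly the commit that source/master points to, which is what turns the next commit into a merge commit.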

The --allow-unrelated-histories option we already mentioned, and there is not much to explain here.

So, if we commit now, the commit would be “empty” (i.e., it would not introduce any changes), but the history would show that we have merged the source/master branch (and we would have two roots now). What we need to do is to put the contents of source/master (i.e., the current state of the directory in the source repo) into the folder directory. This is the easier part and can be done with yet another plumbing Git command, read-tree.

cd /mem/dest
mkdir folder
git read-tree --prefix=folder/ -u source/master

Now we could just copy the files from source-tmp and stage them instead. (This is a bit hazy again. I performed an experiment to check if copying and staging from source-tmp would lead to the same result. It did, in the sense that in both cases the .git directory contained almost exactly the same stuff (in particular, the objects were bit by bit the same). The “almost” part was the index file (i.e., the “staging area”). While git ls-files --stage also showed the same output, there were binary differences in .git/index. If some brave soul wants to perform a similar experiment and dig even deeper into this, here is the official description of the index file format. Also, to make sure the objects in both cases are the same so that the comparison is fair, you have to make sure that the timestamps of all commits are the same in both cases. One way to ensure that is to set the environment variables GIT_AUTHOR_DATE and GIT_COMMITTER_DATE (see e.g. here for some explanation) or use the datefudge utility with the -s option, which I did.)
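For the timestamp part, here is a minimal sketch of the environment variable route (I used datefudge, so treat this as an illustration rather than what I actually ran). The fixed date, the "demo" identity and the make_repo helper are assumptions made up for the example. The date is given in Git's internal format, that is, a Unix timestamp followed by a UTC offset.

```shell
set -e
# Identity and both dates are pinned, so nothing in the commit varies
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_DATE="1566216000 +0000"
export GIT_COMMITTER_DATE="1566216000 +0000"

work=$(mktemp -d)
cd "$work"

# Hypothetical helper: create a repo with one fixed file and one commit
make_repo() {
    git init -q "$1"
    echo "same content" > "$1/file.txt"
    git -C "$1" add file.txt
    git -C "$1" commit -q -m "Same message"
}

make_repo first
make_repo second

first_hash=$(git -C first rev-parse HEAD)
second_hash=$(git -C second rev-parse HEAD)
echo "first:  $first_hash"
echo "second: $second_hash"
```

Both repos should report the same commit hash, because the tree, the message, the identities and the dates are identical.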

However, git-read-tree does this in one step instead of two: it puts everything from the source/master commit into the index (aka staging area), which is its basic aim, but the -u option makes it also update the working directory. (Again, the man page is not very precise here – it says what the -u switch does after a successful merge. In our case, we do not request a merge, but OTOH we will not have any conflicts, since we assume the folder is empty. I made a few experiments, and it seems that -u is only meaningful with -m, --prefix or --reset. That kind of makes sense, although it is not stated explicitly in the manual.)
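To see what read-tree with --prefix and -u leaves behind, here is a small sketch, isolated from the merge step. The repo src is a made-up stand-in for source-tmp (it already has another-file.txt at its root), and dest contains just a README. Both live in a temporary directory and use a "demo" identity.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

git init -q src
git -C src symbolic-ref HEAD refs/heads/master
echo "another file" > src/another-file.txt
git -C src add .
git -C src commit -q -m "Initial commit"

git init -q dest
git -C dest symbolic-ref HEAD refs/heads/master
echo "The destination repo" > dest/README.txt
git -C dest add .
git -C dest commit -q -m "Add README.txt"

cd dest
git remote add -f source ../src >/dev/null 2>&1

# Put the tree of source/master into the index under folder/
# and (because of -u) check the files out into the working directory
git read-tree --prefix=folder/ -u source/master

index=$(git ls-files)
staged_blob=$(git rev-parse :folder/another-file.txt)
source_blob=$(git rev-parse source/master:another-file.txt)
echo "$index"
echo "staged blob: $staged_blob"
echo "source blob: $source_blob"
```

The staged entry points at the very same blob object as the file in source/master; only its path has changed.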

Since we now have everything we want in the index, the only thing that’s left is to commit the changes and delete the temporary repo:

git commit -m "Merge source into dest under folder"
git remote rm source
rm -rf ../source-tmp

And we are done!

Of course, again, this is Git, so this is definitely not the whole story. Instead of deleting source-tmp, we could pull further changes: under the assumption that source is being worked upon, we could repeat the filter-branch stuff in the future and pull the resulting changes into dest. In case you are afraid that this will mess up the history: no, it won’t. git-filter-branch generates perfectly deterministic commit hashes every time you repeat it (which is not surprising, taking into account what exactly goes into a commit hash). Also, Git has the subtree merge strategy, which is very useful in this case (I admit I haven’t experimented with it) and apparently does not even require you to specify the folder explicitly. Also, there is the git-subtree command I mentioned. In any case, the above was enough for me, so I decided to share it.
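The determinism claim is easy to test. In the sketch below (temporary directory, made-up "demo" identity, a two-commit stand-in for source), the same repo is cloned twice and each clone is filtered separately; the rewritten heads are then compared.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"
git init -q source
git -C source symbolic-ref HEAD refs/heads/master
mkdir source/directory
echo "root" > source/some-file.txt
echo "inside" > source/directory/another-file.txt
git -C source add .
git -C source commit -q -m "Initial commit"
echo "added line" >> source/directory/another-file.txt
git -C source commit -q -am "Add a line"

# Filter two independent clones of the same history
for clone in first second; do
    git clone -q source "$clone"
    (cd "$clone" &&
     git filter-branch --prune-empty --subdirectory-filter directory -- --all >/dev/null 2>&1)
done

original=$(git -C source rev-parse HEAD)
first_head=$(git -C first rev-parse HEAD)
second_head=$(git -C second rev-parse HEAD)
echo "original: $original"
echo "first:    $first_head"
echo "second:   $second_head"
```

Both clones should end up with the same rewritten head (different from the original one), since filter-branch keeps the author and committer data of every commit it rewrites.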

CategoryEnglish, CategoryBlog, CategoryGit