Saturday, 16 September 2017

Migrating (and Open-Sourcing) an Historical Codebase: SVN-to-Git

I have an SVN repo on my local machine that I have been shoving stuff into since before I knew how to use revision control systems properly (I still don't). Each of the directories in my repo represents a separate project. Some of my directories have a proper trunk/branches/tags structure, but most don't. I want to migrate each of my SVN repo's directories into its own new git repo, retaining all historical commits.


[My dodgy SVN repo]

The best tool I have found to do this so far is svn2git: https://github.com/nirvdrum/svn2git

This tool is a Ruby gem that essentially wraps git-svn, the native git feature for managing svn-to-git migration. If you use svn2git's "--verbose" flag, you can see what commands it is issuing to the wrapped tool.

Migrating a dodgy SVN repo to multiple git repos


I decided to blog about my experience because I had a bit of trouble with the documentation and ran into a bug in the svn2git tool. Mostly the tool is great, but for my particular purpose I needed the "--rootistrunk" flag, which, as it turns out, doesn't work. Instead (as the following GitHub issue reports - https://github.com/nirvdrum/svn2git/issues/144), you can work around this by using the "--trunk", "--nobranches" and "--notags" flags.


[Excerpt from StackOverflow]

So, with reference to the images I have provided (above) of my haphazard SVN repo/directory structure, the following command (which I used Bash on Ubuntu on Windows to run) worked fine in the end (although it certainly took me a bit of messing about to get to this point):

$ svn2git file:///mnt/d/Work/SVN/SimpleList.V2 --verbose --trunk / --nobranches --notags --no-minimize-url
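Since I have one SVN directory per project, the same command can be repeated for each of them. The following is only a sketch that prints the commands rather than running them; it assumes every project sits directly under the same SVN root, and "AnotherProject" is a made-up name standing in for your own directories.

```shell
# Assumed layout: each project is a directory directly under SVN_ROOT.
SVN_ROOT="file:///mnt/d/Work/SVN"
PROJECTS="SimpleList.V2 AnotherProject"   # AnotherProject is hypothetical

# Print (do not run) the svn2git command for one project directory.
svn2git_command() {
  printf 'svn2git %s/%s --verbose --trunk / --nobranches --notags --no-minimize-url\n' \
    "$SVN_ROOT" "$1"
}

for project in $PROJECTS; do
  # svn2git converts into the current directory, so each project
  # gets its own fresh directory to become its git repo.
  echo "mkdir -p $project && cd $project && $(svn2git_command "$project") && cd .."
done
```

Once the printed lines look right for your layout, you can paste them into your shell one at a time.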


[Git log following successful application of svn2git]

Managing sensitive data


The following article provides some information on how to remove checkins of historical sensitive data (such as passwords) from your (new or old) git repo - https://help.github.com/articles/removing-sensitive-data-from-a-repository/

This is especially useful if you want to open-source an historical codebase. I used the "BFG Repo-Cleaner" tool, which is written in Scala and run using Java. Exceptionally useful and highly effective. After I had downloaded the tool, I used the following command on my new repo:

$ java -jar ../../tools/BFG/bfg.jar --replace-text sensitive.txt

The following image shows what your file containing sensitive data's historical commits will look like on GitHub once you have run across it with BFG. 

[After using BFG tool]

Git's native functionality for this purpose only allows you to get down to the file level (delete files), whereas BFG enables you to search/replace specific text in files across the entire history of your git repo, which is fantastic.
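For reference, here is a sketch of what a replacement file for "--replace-text" can look like, based on the format described in BFG's own documentation. Every value in it is an invented placeholder; your real file would list your own secrets.

```shell
# Write a sample replacement file, one rule per line.
cat > sensitive.txt <<'EOF'
hunter2
oldApiKey123==>REPLACED_KEY
regex:password=\w+==>password=
EOF
# Line 1: literal text, replaced with BFG's default ***REMOVED*** marker.
# Line 2: literal text, replaced with the text after ==>.
# Line 3: a regular expression, replaced with the text after ==>.

# Then, from inside the repo (path to bfg.jar as used above):
#   java -jar ../../tools/BFG/bfg.jar --replace-text sensitive.txt
# BFG's documentation follows this with a clean-up of the old objects:
#   git reflog expire --expire=now --all && git gc --prune=now --aggressive
```

Note that, by default, BFG leaves the contents of your latest commit alone, so make sure the current version of your files is already clean before you run it.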

Mapping SVN authors to git authors


I think you can avoid having the Author come up as "bernard@87bffff6-b7f0-bf49-a188-06524d5e88c0" (as shown in the example above) by specifying a mapping from your SVN authors to your git authors. You can extract that information using an approach described in a(nother) helpful blog post - https://john.albin.net/git/convert-subversion-to-git - by using this relatively gnarly command:

$ svn log -q | awk -F '|' '/^r/ {sub("^ ", "", $2); sub(" $", "", $2); print $2" = "$2" <"$2">"}' | sort -u > authors-transform.txt
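To see what that pipeline actually produces, here is a small sketch that feeds it some invented "svn log -q" output instead of a live repo. The author names and dates are made up for illustration.

```shell
# Sample `svn log -q` output; authors and dates are invented.
sample_log='------------------------------------------------------------------------
r3 | bernard | 2017-09-02 10:15:00 +1200 (Sat, 02 Sep 2017)
------------------------------------------------------------------------
r2 | alice | 2017-09-01 14:30:00 +1200 (Fri, 01 Sep 2017)
------------------------------------------------------------------------
r1 | bernard | 2017-09-01 09:00:00 +1200 (Fri, 01 Sep 2017)
------------------------------------------------------------------------'

# Same awk filter as above: take the author field of each revision line,
# trim the surrounding spaces, and emit "svnname = svnname <svnname>".
printf '%s\n' "$sample_log" \
  | awk -F '|' '/^r/ {sub("^ ", "", $2); sub(" $", "", $2); print $2" = "$2" <"$2">"}' \
  | sort -u > authors-transform.txt

cat authors-transform.txt
```

Each SVN author appears once, mapped to itself. You then edit the right-hand side of each line by hand into a real git identity, for example "bernard = Bernard Example <bernard@example.com>" (a placeholder name and address).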

Once you have this file, you can use the svn2git "--authors" option to specify the file that holds your mappings (that would be "authors-transform.txt" if you used the above command to get the information). Clearly I forgot to apply this step before kicking off the migration for my SimpleList.V2 codebase. Bugger. If, like me, you forget to do this, never mind: you can use the following approach to modify the author of historical commits (if you can be bothered) - https://help.github.com/articles/changing-author-info/.
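If you do end up fixing authors after the fact, one way to do it is with git filter-branch and an "--env-filter". The following is a sketch run against a throwaway repo so you can try it safely; the corrected name and email are placeholders, and newer versions of git print a warning suggesting alternative tools for history rewriting.

```shell
# Throwaway repo with one commit carrying the SVN-style author shown above.
OLD_EMAIL="bernard@87bffff6-b7f0-bf49-a188-06524d5e88c0"
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q .
echo "hello" > file.txt
git add file.txt
GIT_AUTHOR_NAME="bernard" GIT_AUTHOR_EMAIL="$OLD_EMAIL" \
GIT_COMMITTER_NAME="bernard" GIT_COMMITTER_EMAIL="$OLD_EMAIL" \
  git commit -q -m "Imported from SVN"

# Placeholders: replace with your real git identity.
export OLD_EMAIL
export CORRECT_NAME="Bernard Example"
export CORRECT_EMAIL="bernard@example.com"

# Rewrite every commit whose author or committer email matches OLD_EMAIL.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --env-filter '
if [ "$GIT_COMMITTER_EMAIL" = "$OLD_EMAIL" ]; then
  export GIT_COMMITTER_NAME="$CORRECT_NAME"
  export GIT_COMMITTER_EMAIL="$CORRECT_EMAIL"
fi
if [ "$GIT_AUTHOR_EMAIL" = "$OLD_EMAIL" ]; then
  export GIT_AUTHOR_NAME="$CORRECT_NAME"
  export GIT_AUTHOR_EMAIL="$CORRECT_EMAIL"
fi
' --tag-name-filter cat -- --branches --tags

# Show the rewritten author.
git log --format='%an <%ae>'
```

This rewrites commit hashes, so it is best done before you push the repo anywhere public.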

Get the new git repo on to GitHub


Anyway, the next step is to get my new git repo hosted on GitHub. The best advice I can offer here is to pick up section (4) of Troy Hunt's useful and succinct blog post on the same subject (his case is different from mine until this point in that he has a properly formed SVN repo): https://www.troyhunt.com/migrating-from-subversion-to-git-with/

Essentially once you have your git repo, complete with history brought across from SVN, you do the following to get your code onto GitHub:

  1. Create a new GitHub repo to push your code into.
  2. Add the GitHub repo as your new (migrated) repo's remote (GitHub provides good documentation for managing this process - https://help.github.com/articles/adding-a-remote/).
  3. Push to GitHub.
Note that if you have elected to add a README and/or a .gitignore file on GitHub, that will need to be pulled and merged with your local repo before you can push it all to GitHub.
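The steps above can be sketched as follows. So that the sketch runs anywhere, a local bare repo stands in for GitHub and a one-commit repo stands in for the migrated codebase; in real use REMOTE_URL would be the URL of the repo you created in step 1.

```shell
# Stand-in for the GitHub repo from step 1 (normally a GitHub URL).
work="$(mktemp -d)"
REMOTE_URL="$work/remote.git"
git init -q --bare "$REMOTE_URL"

# Stand-in for the migrated repo produced by svn2git.
git init -q "$work/migrated"
cd "$work/migrated"
echo "hello" > file.txt
git add file.txt
git -c user.name="Demo" -c user.email="demo@example.com" \
  commit -q -m "Imported from SVN"

# Step 2: register the remote under the conventional name "origin".
git remote add origin "$REMOTE_URL"

# If the GitHub repo already has a README or .gitignore, pull and merge
# it first (needs Git 2.9+ for the flag; the branch name is an assumption):
#   git pull origin master --allow-unrelated-histories

# Step 3: push the current branch and set it to track the remote.
git push -q -u origin HEAD
```

The pull is only needed when the GitHub repo was created with files in it; an empty GitHub repo can be pushed to directly.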

You're done!


I've retrospectively changed this blog post to make it into what are hopefully independently applicable sections. I've built this procedure up over a little while; it's essentially a piecemeal template for open-sourcing historical codebases on to GitHub (or any public git repo) using a few well-established tools. Have fun!

