From: Jonathan Nieder <jrnieder@gmail.com>
To: Stephen Bash <bash@genarts.com>
Cc: Matt Stump <mstump@goatyak.com>,
git@vger.kernel.org, David Barr <david.barr@cordelta.com>,
Tomas Carnecky <tom@dbservice.com>,
Sverre Rabbelier <srabbelier@gmail.com>,
Ramkumar Ramachandra <artagnon@gmail.com>
Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
Date: Thu, 14 Oct 2010 11:34:41 -0500 [thread overview]
Message-ID: <20101014163441.GD16500@burratino> (raw)
In-Reply-To: <12137268.486377.1287073355267.JavaMail.root@mail.hq.genarts.com>
[just cc-ing David, Tom, Sverre, Ram, who might be interested.]
Stephen Bash wrote:
>> I hate making more work for people but I would love a copy of your
>> notes.
>
> Okay, here we go! I've uploaded the applicable scripts to
> https://gist.github.com/f6902cb4e3534f07ba48
>
> If you (or anyone) finds I describe something here that isn't on github, let
> me know and I'll add to it. I did a cursory pass through the scripts to
> remove a lot of the specific-to-our-repo stuff, so I'm not even sure these
> scripts will run as is... But most errors should be pretty minor (typos in
> variable names, etc) the overall procedure is unchanged. (And please be
> gentle, these are not anything approaching production-ready)
>
> As always, these scripts come with ABSOLUTELY NO WARRANTEE, use at your own
> risk, your mileage may vary, etc.
>
> Converting to Git using svn-fe
> ------------------------------
> Most people who have tried using git-svn to convert a medium to large
> Subversion repository have found it's a slow process. When I asked the Git
> mailing list about this problem in June 2010, I was pointed to David Barr's
> svn-dump-fast-export tool:
> http://github.com/barrbrain/svn-dump-fast-export
> svn-fe (as the executable is called) converts an entire svn repository to git
> very quickly (our repository took about 20 minutes), but the entire svn file
> system is one branch. I developed the following process to reproduce the svn
> history in git.
>
> Initial Thoughts
> ----------------
> 1) Our SVN repository was approximately 20k commits, about 7k files in HEAD,
> a little less than 400 tags, and about 100-150 branches. It was organized
> /trunk/project rather than /project/trunk. Branches were
> /branches/branchName where the branchName directory was a copy of the entire
> trunk (so /branches/branchName/project is what a user would checkout). This
> does affect the scripts, but I think it should be relatively easy to modify
> (no guarantee though).
>
> 2) Our SVN repository originated from cvs2svn, so there are some artifacts
> from that conversion that affect this conversion.
>
> 3) I make very little use of Git.pm because while I was developing I ran into
> a bunch of problems with it (none of which I remember now). Instead I make
> use of perl's system call to send commands to Git (where possible I avoid
> invoking the shell, see perldoc -f exec). I don't want to imply Git.pm
> doesn't work, but at the time it didn't work for me (and I was more focused
> on making my scripts work than improving Git.pm. Sorry!).
>
> 4) The vast majority of our history was before SVN introduced merge-info, so
> I made no attempt to capture SVN merges in Git. Rather I kept all branch
> heads, but moved most of them to a "hidden" namespace (see hideFromGit.pl for
> details). This does mean for a couple merges post-conversion I've had to add
> temporary grafts to make the merge work, but I haven't bothered making those
> grafts permanent (hopefully this isn't a problem?)
>
> 5) I performed this entire process using a local mirror of our SVN repository
> in about 4 hours. It is mostly automated, but does require some human
> monitoring (maybe I'm just paranoid). Since svn-fe runs off a SVN dump file,
> creating the local mirror was a trivial additional step.
>
> 6) To keep what follows a *little* shorter, I'm going to assume you can read
> Perl to extract the details of what's going on. I'll try to keep the prose
> to a high level...
>
> Extracting SVN's History
> ------------------------
> First we want to understand SVN's branching/tagging history. Modify
> buildSVNTree.pl as necessary, then run
> perl buildSVNTree.pl > svnBranches.txt
>
> buildSVNTree.pl does the following steps:
> 1. Traverses the SVN history chronologically looking for copies.
> 2. Records the source path/rev and destination path/rev for (most) copies
> (see script for details)
> 3. Once all copies are collected, further filters copies based on:
> * source path is a directory
> * source and destination are not in trunk
> * source and destination are not in the same branch or tag
> * source path is not /vendor (an artifact of cvs2svn)
> 4. Checks that source path is "shortest" path from it's rev (protect against
> subdirectories that get added in the same commit)
> 5. Checks the source and destination paths match globs for expected paths
> (non-matching copies that make it this far are printed to STDERR)
> 6. Creates a Git branch name for destination (note that svn tags are closer to git branches than git tags)
> 7. Search history for the last commit that actually changed the source path
> 8. Find a parent path from the source path (mostly recurse up the SVN tree to a known branch)
> 9. Use the parent path to determine the parent git branch name
> 10. Record parent/child relationships
> 11. Dump output to STDOUT (which you should redirect to a file for later use)
>
> I did run into one place where two SVN branches had the same name but
> different SVN paths (it's complicated). In this case I just manually edited
> the git branch name in svnBranches.txt. As long as you do that before
> continuing, everything should be okay.
>
> There's also some logic in buildSVNTree to determine if a branch/tag is
> deleted in the SVN head. That information is used by hideFromGit.
>
> Create the Single Branch Git Repo
> ---------------------------------
> Use svn-fe for what it's designed:
> 1. svnadmin dump /path/to/svn/repo > svn-dump.txt
> 2. git init /path/to/initial/git/repo
> 3. cd /path/to/initial/git/repo
> 4. cat /path/to/svn-dump.txt | svn-fe svnRepoName | git fast-import
>
> svnRepoName in step 4 can be anything you want, but it has to be specified so
> that svn-fe appends the git-svn style "git-svn-id: svnRepoName@svnRevNum
> svnRepoUUID" line to each commit message. This line is required later to map
> SVN revs to Git commits.
>
> Create Git Branches and Tags
> ----------------------------
> Now comes the next script, filterBranch.pl. filterBranch will create Git
> branches and tags out of the single branch repo by creating a ton of clones
> and filtering each one. While it's doing this, it also changes the SVN user
> names to proper Git user IDs (name + email). fetchSVNNames.pl can be used to
> get all the svn users, then you can edit $authorScript in filterBranch to
> modify names appropriately ($authorScript is a git-filter-branch
> --env-filter, so it gets eval'ed by git). Per the git-filter-branch manpage,
> you'll want to create/use a RAM disk for temporary files (see $tempdir). And
> you'll need to set various paths like $parentRepo (this is the repo created
> in step 2 above), etc.
>
> Then the script should be (?) relatively automated:
> perl filterBranch.pl svnBranches.txt
>
> The fancy logic here is probably figuring out which Git refs go to which Git
> commit, but I'll leave that as an exercise to the reader... Ah, I should
> probably mention: svn-fe can produce "empty" commits, and filterBranch does
> nothing to remove them. By "empty" I mean there will be a commit object
> without any content changes. So creating a branch/tag in SVN creates a
> commit, but doesn't change content. That commit will be part of the new Git
> history. Similarly, filterBranch will create git tags from svn tags, but
> they point to one of these "empty" commits rather than the branch they are
> tagged from. It's not very git-ish, but it seems to work...
>
> filterBranch is probably the longest step of the process; there's a lot of
> filtering going on. It will be very verbose on STDOUT, so I recommend
> tee'ing to a file or a terminal with infinite scroll back. It also involves
> a lot of disk hits (somewhat reduced if $tempdir is a RAM disk), and
> potentially a lot of space (it will create a git repo for every branch/tag in
> your subversion history). For our repository this step took about 1.5-2
> hours IIRC.
>
> Create SVN/Git Revmaps
> ----------------------
> Next step is to create a map that goes from SVN rev to Git commit object.
> genRevmap.pl and genJointRevmap.pl will be helpful here:
> 1. cd $cleanDir (from filterBranch)
> 2. find . -type d -name "*.git" -exec genRevmap.pl '{}' svnRepoName destDir ';'
> 3. cd destDir
> 4. find . -name "*.revmap" -exec grep . '{}' + | genJointRevMap.pl > jointRevmap.revmap
>
> genRevmap will respect the directory hierarchy created by filterBranch, and
> destDir must have a similar structure (doesn't require the individual Git
> repos, but any directory that contains a git repo must exist in destDir).
> genJointRevMap takes individual revmaps and creates a big revmap for all the
> repositories. These scripts aren't doing any real magic, just parsing the
> Git log messages for commit ID and the git-svn-id line to get the SVN rev the
> commit corresponds to. Note that SVN rev to Git commit can be one to many!
> (genRevmap just lists the same rev twice if it has more than one git commit
> associated with it, genJointRevMap flags those revs specially and lists all
> commit IDs on a single line).
>
> Assembling the Final Git Repo
> -----------------------------
> Now we need to combine all the small git repos into one repo that represents
> the SVN history. Similar to filterBranch, you'll need to edit paths in
> repoFusion.pl to make sure it finds everything. Then simply:
> perl repoFusion.pl svnBranches.txt jointRevmap.revmap
>
> At a high level, repoFusion:
> 1. Clones the trunk repository, this will become the new master branch
> 2. Performs a git-fetch on every other repository created by filterBranch to
> retrieve the git branch/tags contained there
> 3. Creates grafts to match up git branches with their parents using the revmap
> 4. If manual grafts are required, it will pause so the user can edit the
> grafts file (search for '*', the message there might be a little cryptic, but
> using svn log and git log in combination, hopefully you can figure out what
> the correct SHA is to insert)
> 5. Runs filter-branch one more time to make the grafts permanent.
>
> This is a bit faster than filterBranch, but still takes on the order of an
> hour for our repository. It also produces a lot of stuff on STDOUT, but I
> think it's a little easier on the disk. At the end of the filter branch, I
> found it useful to scan the output for refs that weren't updated... That
> usually indicates a graft didn't get created correctly (although due to SVN
> conventions, it's unlikely the master ref will change) At this point it's
> also possible to get some branch/tag name clashes (I did), so those may
> require clean up.
>
> Hiding 'Deleted' Branches
> -------------------------
> hideFromGit.pl will use the svnBranches.txt file to move any git refs
> associated with deleted SVN paths to refs/hidden in the new repository. This
> keeps the objects associated with those refs from getting garbage collected,
> but hides them from most user commands. This is entirely a personal
> preference. (Just like the other scripts, you'll probably have to edit the
> paths in the script itself)
>
> 'Validating' the Conversion
> ---------------------------
> gitValidation.pl is a script I wrote to randomly select revs from SVN and try
> to compare the SVN diffs to the Git diffs. It uses git-patch-id to compute a
> SHA of the changes in each repository, and reports if something doesn't match
> up. It's not particularly polished, and does find "errors" in our Git repo,
> but after investigating all the discrepancies I'm pretty happy that nothing
> vital is wrong.
>
> Closing Thoughts
> ----------------
> Do I have any? This is quite the brain dump, so I'm sure I've been
> incomplete and probably somewhat confusing... I'm happy to answer questions
> as I can, but again, this is entirely based on my experience with our local
> repo. YMMV!
next prev parent reply other threads:[~2010-10-14 16:38 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-10-13 15:44 Speeding up the initial git-svn fetch Matt Stump
2010-10-13 16:02 ` Stephen Bash
2010-10-13 17:47 ` Matt Stump
2010-10-13 18:18 ` Stephen Bash
2010-10-14 16:22 ` Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch) Stephen Bash
2010-10-14 16:34 ` Jonathan Nieder [this message]
2010-10-14 20:07 ` Sverre Rabbelier
2010-10-15 14:50 ` Stephen Bash
2010-10-15 23:39 ` Sverre Rabbelier
2010-10-16 0:16 ` Stephen Bash
2010-10-17 2:25 ` Sverre Rabbelier
2010-10-17 3:33 ` David Michael Barr
2010-10-18 5:17 ` Ramkumar Ramachandra
2010-10-18 7:31 ` Jonathan Nieder
2010-10-18 16:38 ` Ramkumar Ramachandra
2010-10-18 16:46 ` Sverre Rabbelier
2010-10-18 16:56 ` Jonathan Nieder
2010-10-18 17:16 ` Ramkumar Ramachandra
2010-10-18 17:18 ` Sverre Rabbelier
2010-10-18 17:28 ` Jonathan Nieder
2010-10-18 18:10 ` Sverre Rabbelier
2010-10-18 18:13 ` Jonathan Nieder
2010-10-18 18:20 ` Sverre Rabbelier
2010-10-18 18:25 ` Jonathan Nieder
2010-10-18 18:35 ` Sverre Rabbelier
2010-10-18 19:33 ` Jonathan Nieder
2010-10-19 3:08 ` Ramkumar Ramachandra
2010-10-19 0:40 ` Stephen Bash
2010-10-19 1:42 ` Stephen Bash
2010-10-19 6:42 ` Ramkumar Ramachandra
2010-10-19 13:33 ` Stephen Bash
2010-10-19 14:28 ` David Michael Barr
2010-10-19 14:57 ` Stephen Bash
2010-10-20 8:39 ` Will Palmer
2010-10-20 11:59 ` Jakub Narebski
2010-10-20 13:42 ` Will Palmer
2010-10-20 20:44 ` Jakub Narebski
2010-10-21 1:54 ` mrevilgnome
2010-10-21 8:16 ` Jakub Narebski
2010-10-21 13:49 ` Stephen Bash
2010-10-21 9:08 ` Will Palmer
2010-10-21 14:00 ` Stephen Bash
2010-10-21 18:37 ` Jakub Narebski
2010-10-21 21:27 ` Stephen Bash
2010-10-21 22:49 ` Jakub Narebski
2010-10-21 23:26 ` Stephen Bash
2010-10-22 10:38 ` Jakub Narebski
2010-10-21 15:52 ` Jakub Narebski
2010-10-21 16:16 ` Jonathan Nieder
2010-10-20 14:05 ` Ramkumar Ramachandra
2010-10-20 14:21 ` Stephen Bash
2010-10-20 16:56 ` Ramkumar Ramachandra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101014163441.GD16500@burratino \
--to=jrnieder@gmail.com \
--cc=artagnon@gmail.com \
--cc=bash@genarts.com \
--cc=david.barr@cordelta.com \
--cc=git@vger.kernel.org \
--cc=mstump@goatyak.com \
--cc=srabbelier@gmail.com \
--cc=tom@dbservice.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).