git.vger.kernel.org archive mirror
From: Jonathan Nieder <jrnieder@gmail.com>
To: Stephen Bash <bash@genarts.com>
Cc: Matt Stump <mstump@goatyak.com>,
	git@vger.kernel.org, David Barr <david.barr@cordelta.com>,
	Tomas Carnecky <tom@dbservice.com>,
	Sverre Rabbelier <srabbelier@gmail.com>,
	Ramkumar Ramachandra <artagnon@gmail.com>
Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
Date: Thu, 14 Oct 2010 11:34:41 -0500
Message-ID: <20101014163441.GD16500@burratino>
In-Reply-To: <12137268.486377.1287073355267.JavaMail.root@mail.hq.genarts.com>

[just cc-ing David, Tom, Sverre, Ram, who might be interested.]

Stephen Bash wrote:

>> I hate making more work for people but I would love a copy of your
>> notes. 
>
> Okay, here we go!  I've uploaded the applicable scripts to 
>    https://gist.github.com/f6902cb4e3534f07ba48
> 
> If you (or anyone) find that I describe something here that isn't on github,
> let me know and I'll add it.  I did a cursory pass through the scripts to
> remove a lot of the specific-to-our-repo stuff, so I'm not even sure these
> scripts will run as is...  But most errors should be pretty minor (typos in
> variable names, etc.); the overall procedure is unchanged.  (And please be
> gentle, these are not anything approaching production-ready.)
> 
> As always, these scripts come with ABSOLUTELY NO WARRANTY, use at your own
> risk, your mileage may vary, etc.
> 
> Converting to Git using svn-fe
> ------------------------------
> Most people who have tried using git-svn to convert a medium to large
> Subversion repository have found it's a slow process.  When I asked the Git
> mailing list about this problem in June 2010, I was pointed to David Barr's
> svn-dump-fast-export tool:
>    http://github.com/barrbrain/svn-dump-fast-export
> svn-fe (as the executable is called) converts an entire svn repository to git
> very quickly (our repository took about 20 minutes), but the entire svn file
> system is one branch.  I developed the following process to reproduce the svn
> history in git.
>
> Initial Thoughts
> ----------------
> 1) Our SVN repository was approximately 20k commits, about 7k files in HEAD,
> a little less than 400 tags, and about 100-150 branches.  It was organized
> /trunk/project rather than /project/trunk.  Branches were
> /branches/branchName where the branchName directory was a copy of the entire
> trunk (so /branches/branchName/project is what a user would check out).  This
> does affect the scripts, but I think it should be relatively easy to modify
> (no guarantee though).
> 
> 2) Our SVN repository originated from cvs2svn, so there are some artifacts
> from that conversion that affect this conversion.
> 
> 3) I make very little use of Git.pm because while I was developing I ran into
> a bunch of problems with it (none of which I remember now).  Instead I make
> use of perl's system call to send commands to Git (where possible I avoid
> invoking the shell, see perldoc -f exec).  I don't want to imply Git.pm
> doesn't work, but at the time it didn't work for me (and I was more focused
> on making my scripts work than improving Git.pm. Sorry!).
> 
> 4) The vast majority of our history was before SVN introduced merge-info, so
> I made no attempt to capture SVN merges in Git.  Rather I kept all branch
> heads, but moved most of them to a "hidden" namespace (see hideFromGit.pl for
> details).  This does mean for a couple merges post-conversion I've had to add
> temporary grafts to make the merge work, but I haven't bothered making those
> grafts permanent (hopefully this isn't a problem?)
> 
> 5) I performed this entire process using a local mirror of our SVN repository
> in about 4 hours.  It is mostly automated, but does require some human
> monitoring (maybe I'm just paranoid).  Since svn-fe runs off an SVN dump file,
> creating the local mirror was a trivial additional step.
> 
> 6) To keep what follows a *little* shorter, I'm going to assume you can read
> Perl to extract the details of what's going on.  I'll try to keep the prose
> to a high level...
> 
> Extracting SVN's History
> ------------------------
> First we want to understand SVN's branching/tagging history.  Modify
> buildSVNTree.pl as necessary, then run
>    perl buildSVNTree.pl > svnBranches.txt
> 
> buildSVNTree.pl does the following steps:
> 1. Traverses the SVN history chronologically looking for copies.
> 2. Records the source path/rev and destination path/rev for (most) copies
>    (see script for details)
> 3. Once all copies are collected, further filters copies based on:
>    * source path is a directory
>    * source and destination are not in trunk
>    * source and destination are not in the same branch or tag
>    * source path is not /vendor (an artifact of cvs2svn)
> 4. Checks that the source path is the "shortest" path from its rev (protects
>    against subdirectories that get added in the same commit)
> 5. Checks that the source and destination paths match globs for expected
>    paths (non-matching copies that make it this far are printed to STDERR)
> 6. Creates a Git branch name for the destination (note that svn tags are
>    closer to git branches than git tags)
> 7. Searches history for the last commit that actually changed the source path
> 8. Finds a parent path from the source path (mostly by recursing up the SVN
>    tree to a known branch)
> 9. Uses the parent path to determine the parent git branch name
> 10. Records parent/child relationships
> 11. Dumps output to STDOUT (which you should redirect to a file for later use)
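
One possible reading of step 4, sketched in Python for illustration.  This
is not taken from buildSVNTree.pl: the function name and the representation
of copies as (rev, source_path) tuples are assumptions.  Among the copy
sources seen in one revision, it keeps only those that are not nested under
another source from the same revision.

```python
from collections import defaultdict


def shortest_sources(copies):
    """Keep only copy sources not nested under another source in the same rev.

    `copies` is a list of (rev, source_path) tuples (hypothetical layout).
    """
    by_rev = defaultdict(list)
    for rev, path in copies:
        by_rev[rev].append(path.rstrip("/"))

    kept = []
    for rev in sorted(by_rev):
        roots = []
        # Visit shorter paths first so parents are seen before their children.
        for path in sorted(set(by_rev[rev]), key=len):
            # Nested means: an already-kept root is a prefix ending at a
            # directory boundary.
            if not any(path.startswith(root + "/") for root in roots):
                roots.append(path)
        kept.extend((rev, root) for root in roots)
    return kept
```

Comparing against root + "/" rather than the bare prefix keeps
/branches/foobar from being treated as a subdirectory of /branches/foo.
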
> 
> I did run into one place where two SVN branches had the same name but
> different SVN paths (it's complicated).  In this case I just manually edited
> the git branch name in svnBranches.txt.  As long as you do that before
> continuing, everything should be okay.
> 
> There's also some logic in buildSVNTree to determine if a branch/tag is
> deleted in the SVN head.  That information is used by hideFromGit.
> 
> Create the Single Branch Git Repo
> ---------------------------------
> Use svn-fe for what it was designed to do:
> 1. svnadmin dump /path/to/svn/repo > svn-dump.txt
> 2. git init /path/to/initial/git/repo
> 3. cd /path/to/initial/git/repo
> 4. cat /path/to/svn-dump.txt | svn-fe svnRepoName | git fast-import
> 
> svnRepoName in step 4 can be anything you want, but it has to be specified so
> that svn-fe appends the git-svn style "git-svn-id: svnRepoName@svnRevNum
> svnRepoUUID" line to each commit message.  This line is required later to map
> SVN revs to Git commits.
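
To illustrate how that line can be mapped back to an SVN revision, here is
a minimal Python sketch.  It assumes only the line format quoted above
("git-svn-id: svnRepoName@svnRevNum svnRepoUUID"); the function name is
hypothetical and this is not the code used by the scripts in the gist.

```python
import re

# Matches a whole line of the form "git-svn-id: <repo>@<rev> <uuid>".
GIT_SVN_ID = re.compile(
    r"^git-svn-id: (?P<repo>\S+)@(?P<rev>\d+) (?P<uuid>\S+)\s*$",
    re.MULTILINE,
)


def svn_rev_from_message(message):
    """Return the SVN revision recorded in a commit message, or None."""
    matches = list(GIT_SVN_ID.finditer(message))
    if not matches:
        return None
    # If several lines match, trust the last one: the line is appended to
    # the end of the message.
    return int(matches[-1].group("rev"))
```
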
> 
> Create Git Branches and Tags
> ----------------------------
> Now comes the next script, filterBranch.pl.  filterBranch will create Git
> branches and tags out of the single branch repo by creating a ton of clones
> and filtering each one.  While it's doing this, it also changes the SVN user
> names to proper Git user IDs (name + email).  fetchSVNNames.pl can be used to
> get all the svn users, then you can edit $authorScript in filterBranch to
> modify names appropriately ($authorScript is a git-filter-branch
> --env-filter, so it gets eval'ed by git).  Per the git-filter-branch manpage,
> you'll want to create/use a RAM disk for temporary files (see $tempdir).  And
> you'll need to set various paths like $parentRepo (this is the repo created
> in step 2 above), etc.
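
As an illustration of what such an --env-filter might contain, the Python
sketch below generates shell text from a user table.  It is not the
$authorScript from filterBranch.pl.  The function name, the shape of the
table, and the assumption that the SVN user name arrives in
GIT_AUTHOR_NAME are all hypothetical; committer fields would need the same
treatment.

```python
import shlex


def build_env_filter(authors):
    """Return shell text usable as a `git filter-branch --env-filter` script.

    `authors` maps an SVN user name to a (full name, email) pair.
    """
    lines = ['case "$GIT_AUTHOR_NAME" in']
    for svn_user, (name, email) in sorted(authors.items()):
        # shlex.quote keeps names with spaces or quotes safe for the shell.
        lines.append(f"  {shlex.quote(svn_user)})")
        lines.append(f"    GIT_AUTHOR_NAME={shlex.quote(name)}")
        lines.append(f"    GIT_AUTHOR_EMAIL={shlex.quote(email)}")
        lines.append("    ;;")
    lines.append("esac")
    # The filter is eval'ed, so changed variables must be re-exported.
    lines.append("export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL")
    return "\n".join(lines)
```

The returned string would be passed as the single argument following
--env-filter.
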
> 
> Then the script should be (?) relatively automated:
>    perl filterBranch.pl svnBranches.txt
> 
> The fancy logic here is probably figuring out which Git refs go to which Git
> commit, but I'll leave that as an exercise to the reader...  Ah, I should
> probably mention: svn-fe can produce "empty" commits, and filterBranch does
> nothing to remove them.  By "empty" I mean there will be a commit object
> without any content changes.  So creating a branch/tag in SVN creates a
> commit, but doesn't change content.  That commit will be part of the new Git
> history.  Similarly, filterBranch will create git tags from svn tags, but
> they point to one of these "empty" commits rather than the branch they are
> tagged from.  It's not very git-ish, but it seems to work...
> 
> filterBranch is probably the longest step of the process; there's a lot of
> filtering going on.  It will be very verbose on STDOUT, so I recommend
> tee'ing to a file or a terminal with infinite scroll back.  It also involves
> a lot of disk hits (somewhat reduced if $tempdir is a RAM disk), and
> potentially a lot of space (it will create a git repo for every branch/tag in
> your subversion history).  For our repository this step took about 1.5-2
> hours IIRC.
> 
> Create SVN/Git Revmaps
> ----------------------
> Next step is to create a map that goes from SVN rev to Git commit object.
> genRevmap.pl and genJointRevmap.pl will be helpful here:
> 1. cd $cleanDir (from filterBranch)
> 2. find . -type d -name "*.git" -exec genRevmap.pl '{}' svnRepoName destDir ';'
> 3. cd destDir
> 4. find . -name "*.revmap" -exec grep . '{}' + | genJointRevMap.pl > jointRevmap.revmap
> 
> genRevmap will respect the directory hierarchy created by filterBranch, and
> destDir must have a similar structure (doesn't require the individual Git
> repos, but any directory that contains a git repo must exist in destDir).
> genJointRevMap takes individual revmaps and creates a big revmap for all the
> repositories.  These scripts aren't doing any real magic, just parsing the
> Git log messages for commit ID and the git-svn-id line to get the SVN rev the
> commit corresponds to.  Note that SVN rev to Git commit can be one to many!
> (genRevmap just lists the same rev twice if it has more than one git commit
> associated with it, genJointRevMap flags those revs specially and lists all
> commit IDs on a single line).
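
The merging step can be pictured with a short Python sketch.  It makes no
claim about the on-disk format used by genRevmap.pl or genJointRevMap.pl:
here each per-repository revmap is assumed to be an in-memory list of
(svn_rev, commit_id) pairs, and both function names are hypothetical.

```python
from collections import defaultdict


def join_revmaps(revmaps):
    """Merge per-repository revmaps into one {svn_rev: [commit_id, ...]} map."""
    joint = defaultdict(list)
    for pairs in revmaps:
        for rev, commit_id in pairs:
            # The same commit can appear in several clones; list it once.
            if commit_id not in joint[rev]:
                joint[rev].append(commit_id)
    return dict(joint)


def ambiguous_revs(joint):
    """Return the SVN revs that map to more than one Git commit."""
    return sorted(rev for rev, ids in joint.items() if len(ids) > 1)
```

Revs returned by ambiguous_revs are the one-to-many cases that need special
handling later.
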
> 
> Assembling the Final Git Repo
> -----------------------------
> Now we need to combine all the small git repos into one repo that represents
> the SVN history.  Similar to filterBranch, you'll need to edit paths in
> repoFusion.pl to make sure it finds everything.  Then simply:
>    perl repoFusion.pl svnBranches.txt jointRevmap.revmap
> 
> At a high level, repoFusion:
> 1. Clones the trunk repository; this will become the new master branch
> 2. Performs a git-fetch on every other repository created by filterBranch to
> retrieve the git branch/tags contained there
> 3. Creates grafts to match up git branches with their parents using the revmap
> 4. If manual grafts are required, it will pause so the user can edit the
> grafts file (search for '*', the message there might be a little cryptic, but
> using svn log and git log in combination, hopefully you can figure out what
> the correct SHA is to insert)
> 5. Runs filter-branch one more time to make the grafts permanent.
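
For readers unfamiliar with grafts: each line of .git/info/grafts names a
commit followed by the parents it should appear to have.  The Python sketch
below shows how a graft line could be derived from a joint revmap.  It is
not the logic of repoFusion.pl; the function name, the revmap shape
({svn_rev: [commit_id, ...]}), and the use of '*' as a "needs manual
editing" marker are assumptions made for illustration.

```python
def graft_line(child_root, parent_rev, joint_revmap):
    """Build one grafts-file line: "<commit> <parent>".

    `child_root` is the first commit of a branch and `parent_rev` is the
    SVN rev it was copied from.
    """
    candidates = joint_revmap.get(parent_rev, [])
    if len(candidates) == 1:
        return f"{child_root} {candidates[0]}"
    # Zero or several candidate parents: leave a marker for a human to fix.
    return f"{child_root} *"
```
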
> 
> This is a bit faster than filterBranch, but still takes on the order of an
> hour for our repository.  It also produces a lot of stuff on STDOUT, but I
> think it's a little easier on the disk.  At the end of the filter-branch run,
> I found it useful to scan the output for refs that weren't updated...  That
> usually indicates a graft didn't get created correctly (although due to SVN
> conventions, it's unlikely the master ref will change).  At this point it's
> also possible to get some branch/tag name clashes (I did), so those may
> require cleanup.
> 
> Hiding 'Deleted' Branches
> -------------------------
> hideFromGit.pl will use the svnBranches.txt file to move any git refs
> associated with deleted SVN paths to refs/hidden in the new repository.  This
> keeps the objects associated with those refs from getting garbage collected,
> but hides them from most user commands.  This is entirely a personal
> preference.  (Just like the other scripts, you'll probably have to edit the
> paths in the script itself.)
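
Moving a ref amounts to creating it under the new name and deleting the old
one, which git update-ref can do.  The Python sketch below only builds the
command lines.  It is not hideFromGit.pl: the function name and the choice
to keep the heads/ or tags/ component under refs/hidden are assumptions.

```python
def hide_ref_commands(ref, commit_id, hidden_prefix="refs/hidden"):
    """Return the git commands that move `ref` under `hidden_prefix`.

    Example: refs/heads/old-branch -> refs/hidden/heads/old-branch
    """
    if not ref.startswith("refs/"):
        raise ValueError(f"not a full ref name: {ref}")
    hidden = f"{hidden_prefix}/{ref[len('refs/'):]}"
    return [
        # Create the hidden ref first so the objects stay reachable.
        ["git", "update-ref", hidden, commit_id],
        # Delete the old ref, but only if it still points at commit_id.
        ["git", "update-ref", "-d", ref, commit_id],
    ]
```

Each command is an argument list, so it can be run with
subprocess.run(cmd, check=True) without going through the shell.
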
> 
> 'Validating' the Conversion
> ---------------------------
> gitValidation.pl is a script I wrote to randomly select revs from SVN and try
> to compare the SVN diffs to the Git diffs.  It uses git-patch-id to compute a
> SHA of the changes in each repository, and reports if something doesn't match
> up.  It's not particularly polished, and does find "errors" in our Git repo,
> but after investigating all the discrepancies I'm pretty happy that nothing
> vital is wrong.
> 
> Closing Thoughts
> ----------------
> Do I have any?  This is quite the brain dump, so I'm sure I've been
> incomplete and probably somewhat confusing...  I'm happy to answer questions
> as I can, but again, this is entirely based on my experience with our local
> repo.  YMMV!
