git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ramkumar Ramachandra <artagnon@gmail.com>
To: Stephen Bash <bash@genarts.com>
Cc: Matt Stump <mstump@goatyak.com>,
	git@vger.kernel.org, Jonathan Nieder <jrnieder@gmail.com>,
	David Michael Barr <david.barr@cordelta.com>,
	Sverre Rabbelier <srabbelier@gmail.com>,
	Tomas Carnecky <tom@dbservice.com>
Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
Date: Tue, 19 Oct 2010 12:12:15 +0530	[thread overview]
Message-ID: <20101019064210.GA14309@kytes> (raw)
In-Reply-To: <8043579.526738.1287452576766.JavaMail.root@mail.hq.genarts.com>

Hi Stephen,

Stephen Bash writes:
> > From: "Ramkumar Ramachandra" <artagnon@gmail.com>
> > Stephen Bash writes:
> > > Extracting SVN's History
> > > ------------------------
> > > First we want to understand SVN's branching/tagging history. Modify
> > > buildSVNTree.pl as necessary, then run
> > >    perl buildSVNTree.pl > svnBranches.txt
> > 
> > > ...
> >
> > Unnecessary
> 
> I'm going to collapse all these comments because I think we're
> coming at this from different angles.  I agree, discovering the
> copies in git is "easy" (albeit an n^2 operation), and git will
> correctly identify file content.  But when I was asked to preserve
> the SVN history, I decided to extract a DAG from SVN and migrate
> that DAG to Git.  Thus the history itself is preserved (sans
> merges), not just the contents of the files.  This is the purpose of
> buildSVNTree.  I can elaborate further if requested.

Yep, they're certainly two different ways to approach the problem: I'd
be interested in investigating why it will produce different
results. Since we both agree that it's easier (and faster) to do it in
Git-land, I'm looking into the the areas where it falls short.

Yes, I understand your script (although I can't actually read Perl
:p), but the differences are still not very clear to me.

> > > Ah, I should probably mention: svn-fe can produce "empty"
> > > commits, and filterBranch does nothing to remove them. By "empty" I
> > > mean there will be a commit object without any content changes. So
> > > creating a branch/tag in SVN creates a commit, but doesn't change
> > > content. That commit will be part of the new Git history.
> > > Similarly, filterBranch will create git tags from svn tags, but they
> > > point to one of these "empty" commits rather than the branch they
> > > are tagged from. It's not very git-ish, but it seems to work...
> > 
> > Oh, I didn't realize that fast-import allows the creation of empty
> > commits. We should probably fix this?
> 
> To be precise: svn-fe creates commits where
>   git diff-tree treeA treeB
> is empty with treeA being the tree object of /trunk/project and
> treeB being the tree of /branches/foo/project.  This version of my
> tools does not squash these commits, a future version probably will
> (this may cause problems with two-way communication?).

Right, that IS expected behavior. Don't they correspond to separate
SVN revisions anyway? Why would you want to squash them?

[Ignore this; see later in the email]

> > > filterBranch is probably the longest step of the process; there's a
> > > lot of filtering going on. It will be very verbose on STDOUT, so I
> > > recommend tee'ing to a file or a terminal with infinite scroll back.
> > > It also involves a lot of disk hits (somewhat reduced if $tempdir is
> > > a RAM disk), and potentially a lot of space (it will create a git
> > > repo for every branch/tag in your subversion history). For our
> > > repository this step took about 1.5-2 hours IIRC.
> > 
> > Wow, this really brute-force.
> 
> Yes it is.  If I get around to writing a new version, I'll at least
> advance to a single pass using commit-tree.  Beyond that I'm
> probably into the fast-import code, which I'll happily leave to the
> rest of you :)

*nod*

> > > Note that SVN rev to Git commit can be one to many!
> >
> > Unless there's a one-to-one mapping between Git revisions and SVN
> > revisions, a two-way bridge will become very difficult to build. Can
> > you think of any scenarios where a one-to-one mapping doesn't make
> > sense?
> 
> I have 32 SVN revs in my history that touch multiple Git commit
> objects.  The simplest example is
>   svn mv svn://svnrepo/branches/badBranchName svn://svnrepo/branches/goodBranchName
> which creates a single SVN commit that touches two branches
> (badBranchName will have all it's contents deleted, goodBranchName
> will have an "empty commit" as described above).  The more devious
> version is the SVN rev where a developer checked out / (yes, I'm not
> kidding) and proceeded to modify a single file on all branches in
> one commit.  In our case, that one SVN rev touches 23 git commit
> objects.  And while the latter is somewhat a corner case, the former
> is common and probably needs to be dealt with appropriately (it's
> kind of a stupid operation in Git-land, so maybe it can just be
> squashed).

Ouch! Thanks for the illustrative example- I understand now. We have
to bend backwards to perform a one-to-one mapping. It's finally struck
me- one-to-one mapping is nearly impossible to achieve, and I don't
know if it makes sense to strive for it anymore. Looks like Jonathan
got it earlier.

> > Grafts and filter-branch. db-svn-filter-root does this more elegantly.
> 
> I found a 'db-svn-filter-root' branch, but it was not entirely
> obvious to me what code I should be looking at...

Um, there's just one commit that deviates from the branch it's based
on (but you don't know that, and I should have been clearer): look at
contrib/svn-fe/svn-filter-root.py

It's just a minimalistic mapper, but it's fast and done nicely. You
can use ideas from it when you're building yours.

> > > Hiding 'Deleted' Branches
> > > -------------------------
> > 
> > Hm. You didn't include the history of deleted branches in the main
> > repository. Why? 
> 
> The commit objects are still there, I simply moved the refs to
> refs/hidden/{heads,tags}.  Because my goal was to maintain the full
> SVN history I needed to somehow protect the objects from garbage
> collection.  At the time I didn't know about "git merge -s ours", so
> this strategy achieved my goal of protecting the objects.  In this
> case, the refs are not cloned, but are fetch-able, so I found it to
> be a reasonable solution.

Oh.

> > Does it make sense to provide the user an option to
> > exclude some (deleted) branches in the SVN history? It'll make the
> > two-way mapping extremely difficult.
> 
> I think there are cases where a user could say "I don't care about
> dead development branches".  In my current system, all branches,
> even those that do not contribute back to the trunk are saved in the
> hidden namespace.  But I could see users that don't care about some
> or all extraneous branches and would be happy to not convert them or
> to let them be garbage collected.

When I made this comment, I was thinking of the one-to-one mapping. It
makes much more sense now.

> > Thanks for the interesting and insightful read :)
> 
> I'm glad it's stimulating conversation.  I'm beginning to wonder if
> there might be competing design goals for one-way vs. two-way
> compatibility...  Performance is one place where opinions probably
> greatly differ (I didn't mind taking an extra 30 minutes to mirror
> my SVN repo because it probably saved more than that in
> communication overhead later in the process, but that mirror
> operation is very taxing on your timeline); my exhaustive search of
> all SVN copies is another (I wanted to be *extremely* certain I knew
> about all the misplaced branches/tags, but it's inefficient for a
> casual developer who just wants to interact with an SVN server).
> It's all just food for thought, and I'm happy to carry on the
> conversation from my different point-of-view :)

Ok, I still don't get this part- why mirror at all? Can't all the
information be mined out of the in-memory tree that svn-fe builds
while parsing the dumpfile? From the SVN-side, all that's required is
a streaming dumpfile like the one that `svnrdump dump` produces.

-- Ram

  reply	other threads:[~2010-10-19  6:43 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-10-13 15:44 Speeding up the initial git-svn fetch Matt Stump
2010-10-13 16:02 ` Stephen Bash
2010-10-13 17:47   ` Matt Stump
2010-10-13 18:18     ` Stephen Bash
2010-10-14 16:22     ` Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch) Stephen Bash
2010-10-14 16:34       ` Jonathan Nieder
2010-10-14 20:07         ` Sverre Rabbelier
2010-10-15 14:50           ` Stephen Bash
2010-10-15 23:39             ` Sverre Rabbelier
2010-10-16  0:16               ` Stephen Bash
2010-10-17  2:25                 ` Sverre Rabbelier
2010-10-17  3:33                   ` David Michael Barr
2010-10-18  5:17       ` Ramkumar Ramachandra
2010-10-18  7:31         ` Jonathan Nieder
2010-10-18 16:38           ` Ramkumar Ramachandra
2010-10-18 16:46             ` Sverre Rabbelier
2010-10-18 16:56               ` Jonathan Nieder
2010-10-18 17:16                 ` Ramkumar Ramachandra
2010-10-18 17:18                 ` Sverre Rabbelier
2010-10-18 17:28                   ` Jonathan Nieder
2010-10-18 18:10                     ` Sverre Rabbelier
2010-10-18 18:13                       ` Jonathan Nieder
2010-10-18 18:20                         ` Sverre Rabbelier
2010-10-18 18:25                           ` Jonathan Nieder
2010-10-18 18:35                             ` Sverre Rabbelier
2010-10-18 19:33                               ` Jonathan Nieder
2010-10-19  3:08                             ` Ramkumar Ramachandra
2010-10-19  0:40                           ` Stephen Bash
2010-10-19  1:42         ` Stephen Bash
2010-10-19  6:42           ` Ramkumar Ramachandra [this message]
2010-10-19 13:33             ` Stephen Bash
2010-10-19 14:28               ` David Michael Barr
2010-10-19 14:57                 ` Stephen Bash
2010-10-20  8:39             ` Will Palmer
2010-10-20 11:59               ` Jakub Narebski
2010-10-20 13:42                 ` Will Palmer
2010-10-20 20:44                   ` Jakub Narebski
2010-10-21  1:54                     ` mrevilgnome
2010-10-21  8:16                       ` Jakub Narebski
2010-10-21 13:49                         ` Stephen Bash
2010-10-21  9:08                     ` Will Palmer
2010-10-21 14:00                       ` Stephen Bash
2010-10-21 18:37                         ` Jakub Narebski
2010-10-21 21:27                           ` Stephen Bash
2010-10-21 22:49                             ` Jakub Narebski
2010-10-21 23:26                               ` Stephen Bash
2010-10-22 10:38                                 ` Jakub Narebski
2010-10-21 15:52                       ` Jakub Narebski
2010-10-21 16:16                         ` Jonathan Nieder
2010-10-20 14:05               ` Ramkumar Ramachandra
2010-10-20 14:21               ` Stephen Bash
2010-10-20 16:56                 ` Ramkumar Ramachandra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101019064210.GA14309@kytes \
    --to=artagnon@gmail.com \
    --cc=bash@genarts.com \
    --cc=david.barr@cordelta.com \
    --cc=git@vger.kernel.org \
    --cc=jrnieder@gmail.com \
    --cc=mstump@goatyak.com \
    --cc=srabbelier@gmail.com \
    --cc=tom@dbservice.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).