* Speeding up the initial git-svn fetch
@ 2010-10-13 15:44 Matt Stump
2010-10-13 16:02 ` Stephen Bash
0 siblings, 1 reply; 52+ messages in thread
From: Matt Stump @ 2010-10-13 15:44 UTC (permalink / raw)
To: git
I have a big repository, 100,000+ revisions with a very high branching
factor. The initial fetch of the full SVN repository using git-svn has
been running for around 2 months and it's only up to revision 60,000.
Is there any way to speed this thing up?
I'm already regularly killing and restarting the fetch due to git-svn
leaking memory like a sieve. The transfer is occurring over the local
LAN, so link speed shouldn't be an issue. The repository is on a
dedicated machine backed by dedicated fiber channel arrays so the
server should have plenty of oomph. The only other thing that I can
think of is to do the clone from a local copy of the SVN repository.
What have other people done in similar circumstances?
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Speeding up the initial git-svn fetch
2010-10-13 15:44 Speeding up the initial git-svn fetch Matt Stump
@ 2010-10-13 16:02 ` Stephen Bash
2010-10-13 17:47 ` Matt Stump
0 siblings, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-13 16:02 UTC (permalink / raw)
To: Matt Stump; +Cc: git
> What have other people done in similar circumstances?
Based on suggestions from this list, I sidestepped git-svn and used svn-fe [1] and git-fast-import. It imports the entire Subversion tree in a single git branch, but using git's tools that's workable. At an extremely high level I used git-filter-branch to split up into branches and git grafts to stitch the various branches together to represent the SVN history.
The real devil is in extracting the SVN history, but there are a few gotchas in the filtering/recombining. I haven't written up a complete summary for the list because I thought the GSoC project would supersede my process rather quickly... If there's interest I can transpose my internal documentation for public use.
As a benchmark, our SVN repository was about 20k commits, ~400 tags, ~100 branches, HEAD contained ~7k files. git-svn took several weeks (and never finished), svn-fe and git-fast-import took ~20 minutes (my entire process takes about 4 hours).
[1] http://github.com/barrbrain/svn-dump-fast-export
HTH,
Stephen
* Re: Speeding up the initial git-svn fetch
2010-10-13 16:02 ` Stephen Bash
@ 2010-10-13 17:47 ` Matt Stump
2010-10-13 18:18 ` Stephen Bash
2010-10-14 16:22 ` Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch) Stephen Bash
0 siblings, 2 replies; 52+ messages in thread
From: Matt Stump @ 2010-10-13 17:47 UTC (permalink / raw)
To: git
I hate making more work for people but I would love a copy of your
notes. Getting a full clone of our SVN repository is probably the
biggest hurdle to having a git insurgency take root. Also, which GSoC
project were you referring to?
* Re: Speeding up the initial git-svn fetch
2010-10-13 17:47 ` Matt Stump
@ 2010-10-13 18:18 ` Stephen Bash
2010-10-14 16:22 ` Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch) Stephen Bash
1 sibling, 0 replies; 52+ messages in thread
From: Stephen Bash @ 2010-10-13 18:18 UTC (permalink / raw)
To: Matt Stump; +Cc: git
----- Original Message -----
> From: "Matt Stump" <mstump@goatyak.com>
> To: git@vger.kernel.org
> Sent: Wednesday, October 13, 2010 1:47:46 PM
> Subject: Re: Speeding up the initial git-svn fetch
>
> I hate making more work for people but I would love a copy of your
> notes. Getting a full clone of our SVN repository is probably the
> biggest hurdle to having a git insurgency take root. Also, which GSoC
> project were you referring to?
This one:
https://git.wiki.kernel.org/index.php/SoC2010Projects#Native_SVN_support_in_git
My notes are currently in our internal wiki, so those should be pretty easy to transpose to e-mail. I'll need to do a little sanitization on the scripts though, and I'll look into posting those somewhere as links (all Perl, just under 1400 lines all told).
Stephen
* Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-13 17:47 ` Matt Stump
2010-10-13 18:18 ` Stephen Bash
@ 2010-10-14 16:22 ` Stephen Bash
2010-10-14 16:34 ` Jonathan Nieder
2010-10-18 5:17 ` Ramkumar Ramachandra
1 sibling, 2 replies; 52+ messages in thread
From: Stephen Bash @ 2010-10-14 16:22 UTC (permalink / raw)
To: Matt Stump; +Cc: git
> I hate making more work for people but I would love a copy of your
> notes.
Okay, here we go! I've uploaded the applicable scripts to
https://gist.github.com/f6902cb4e3534f07ba48
If you (or anyone) find I describe something here that isn't on github, let me know and I'll add it. I did a cursory pass through the scripts to remove a lot of the specific-to-our-repo stuff, so I'm not even sure these scripts will run as is... But most errors should be pretty minor (typos in variable names, etc.); the overall procedure is unchanged. (And please be gentle, these are not anything approaching production-ready.)
As always, these scripts come with ABSOLUTELY NO WARRANTY, use at your own risk, your mileage may vary, etc.
Converting to Git using svn-fe
------------------------------
Most people who have tried using git-svn to convert a medium to large Subversion repository have found it's a slow process. When I asked the Git mailing list about this problem in June 2010, I was pointed to David Barr's svn-dump-fast-export tool:
http://github.com/barrbrain/svn-dump-fast-export
svn-fe (as the executable is called) converts an entire svn repository to git very quickly (our repository took about 20 minutes), but the entire svn file system is one branch. I developed the following process to reproduce the svn history in git.
Initial Thoughts
----------------
1) Our SVN repository was approximately 20k commits, about 7k files in HEAD, a little less than 400 tags, and about 100-150 branches. It was organized /trunk/project rather than /project/trunk. Branches were /branches/branchName where the branchName directory was a copy of the entire trunk (so /branches/branchName/project is what a user would check out). This does affect the scripts, but I think it should be relatively easy to modify (no guarantee though).
2) Our SVN repository originated from cvs2svn, so there are some artifacts from that conversion that affect this conversion.
3) I make very little use of Git.pm because while I was developing I ran into a bunch of problems with it (none of which I remember now). Instead I make use of perl's system call to send commands to Git (where possible I avoid invoking the shell, see perldoc -f exec). I don't want to imply Git.pm doesn't work, but at the time it didn't work for me (and I was more focused on making my scripts work than improving Git.pm. Sorry!).
4) The vast majority of our history was before SVN introduced merge-info, so I made no attempt to capture SVN merges in Git. Rather I kept all branch heads, but moved most of them to a "hidden" namespace (see hideFromGit.pl for details). This does mean for a couple merges post-conversion I've had to add temporary grafts to make the merge work, but I haven't bothered making those grafts permanent (hopefully this isn't a problem?)
5) I performed this entire process using a local mirror of our SVN repository in about 4 hours. It is mostly automated, but does require some human monitoring (maybe I'm just paranoid). Since svn-fe runs off an SVN dump file, creating the local mirror was a trivial additional step.
6) To keep what follows a *little* shorter, I'm going to assume you can read Perl to extract the details of what's going on. I'll try to keep the prose to a high level...
Extracting SVN's History
------------------------
First we want to understand SVN's branching/tagging history. Modify buildSVNTree.pl as necessary, then run
    perl buildSVNTree.pl > svnBranches.txt
buildSVNTree.pl does the following steps:
1. Traverses the SVN history chronologically looking for copies.
2. Records the source path/rev and destination path/rev for (most) copies (see script for details).
3. Once all copies are collected, further filters copies based on:
   * source path is a directory
   * source and destination are not in trunk
   * source and destination are not in the same branch or tag
   * source path is not /vendor (an artifact of cvs2svn)
4. Checks that the source path is the "shortest" path from its rev (protects against subdirectories that get added in the same commit).
5. Checks that the source and destination paths match globs for expected paths (non-matching copies that make it this far are printed to STDERR).
6. Creates a Git branch name for the destination (note that svn tags are closer to git branches than git tags).
7. Searches history for the last commit that actually changed the source path.
8. Finds a parent path from the source path (mostly recursing up the SVN tree to a known branch).
9. Uses the parent path to determine the parent git branch name.
10. Records parent/child relationships.
11. Dumps output to STDOUT (which you should redirect to a file for later use).
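As a rough illustration of the filter in step 3 (not taken from buildSVNTree.pl, which is Perl), here is a minimal Python sketch. The Copy record, the branch_root helper, and the /tags/<name> layout are assumptions made for the example, and "not in trunk" is read as "source and destination are not both inside trunk".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One SVN copy operation (hypothetical record, for illustration only)."""
    src_path: str
    src_rev: int
    dst_path: str
    dst_rev: int
    src_is_dir: bool


def branch_root(path):
    """Return the trunk/branch/tag root a path lives under, or None."""
    parts = path.strip("/").split("/")
    if parts[0] == "trunk":
        return "/trunk"
    if parts[0] in ("branches", "tags") and len(parts) >= 2:
        return "/" + "/".join(parts[:2])
    return None


def keep_copy(copy):
    """Apply the step-3 filters; True means the copy may start a branch/tag."""
    if not copy.src_is_dir:
        return False  # file copies never create a branch
    if copy.src_path.strip("/").split("/")[0] == "vendor":
        return False  # cvs2svn artifact
    src_root = branch_root(copy.src_path)
    dst_root = branch_root(copy.dst_path)
    if src_root is not None and src_root == dst_root:
        return False  # copy inside trunk, or inside one branch/tag
    return True
```

A copy from /trunk to /branches/rel1 passes, while a copy between two subdirectories of trunk, or anything sourced from /vendor, is dropped.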
I did run into one place where two SVN branches had the same name but different SVN paths (it's complicated). In this case I just manually edited the git branch name in svnBranches.txt. As long as you do that before continuing, everything should be okay.
There's also some logic in buildSVNTree to determine if a branch/tag is deleted in the SVN head. That information is used by hideFromGit.
Create the Single Branch Git Repo
---------------------------------
Use svn-fe for what it was designed to do:
1. svnadmin dump /path/to/svn/repo > /path/to/svn-dump.txt
2. git init /path/to/initial/git/repo
3. cd /path/to/initial/git/repo
4. cat /path/to/svn-dump.txt | svn-fe svnRepoName | git fast-import
svnRepoName in step 4 can be anything you want, but it has to be specified so that svn-fe appends the git-svn style "git-svn-id: svnRepoName@svnRevNum svnRepoUUID" line to each commit message. This line is required later to map SVN revs to Git commits.
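To make the mapping concrete, here is a minimal Python sketch of pulling the SVN revision back out of a commit message. It assumes the trailer has exactly the "git-svn-id: svnRepoName@svnRevNum svnRepoUUID" shape described above; the function name is made up for the example.

```python
import re

# Matches one "git-svn-id: <repo>@<rev> <uuid>" line anywhere in the message.
GIT_SVN_ID = re.compile(
    r"^git-svn-id: (?P<repo>\S+)@(?P<rev>\d+) (?P<uuid>\S+)$",
    re.MULTILINE,
)


def svn_rev_from_message(message, repo_name):
    """Return the SVN revision recorded for repo_name, or None if absent."""
    matches = [
        m for m in GIT_SVN_ID.finditer(message) if m.group("repo") == repo_name
    ]
    if not matches:
        return None
    # If the trailer somehow appears more than once, trust the last one.
    return int(matches[-1].group("rev"))
```

Commits whose message lacks the trailer (or carries a different svnRepoName) yield None, so they can be reported instead of being mapped silently.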
Create Git Branches and Tags
----------------------------
Now comes the next script, filterBranch.pl. filterBranch will create Git branches and tags out of the single branch repo by creating a ton of clones and filtering each one. While it's doing this, it also changes the SVN user names to proper Git user IDs (name + email). fetchSVNNames.pl can be used to get all the svn users, then you can edit $authorScript in filterBranch to modify names appropriately ($authorScript is a git-filter-branch --env-filter, so it gets eval'ed by git). Per the git-filter-branch manpage, you'll want to create/use a RAM disk for temporary files (see $tempdir). And you'll need to set various paths like $parentRepo (this is the repo created in step 2 above), etc.
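For readers who have not written an --env-filter before, the following Python sketch generates a small shell snippet of the kind $authorScript might contain. It is not the script from the gist: the build_env_filter name and the author table are hypothetical, and it assumes the SVN user name shows up in GIT_AUTHOR_NAME / GIT_COMMITTER_NAME after the svn-fe import.

```python
import shlex


def build_env_filter(authors):
    """Build a shell snippet for `git filter-branch --env-filter`.

    authors maps an SVN user name to a (full name, email) pair.
    """
    lines = []
    for role in ("AUTHOR", "COMMITTER"):
        lines.append(f'case "$GIT_{role}_NAME" in')
        for svn_user, (name, email) in sorted(authors.items()):
            lines.append(f"  {shlex.quote(svn_user)})")
            lines.append(f"    GIT_{role}_NAME={shlex.quote(name)}")
            lines.append(f"    GIT_{role}_EMAIL={shlex.quote(email)}")
            lines.append("    ;;")
        lines.append("esac")
        lines.append(f"export GIT_{role}_NAME GIT_{role}_EMAIL")
    return "\n".join(lines)
```

Unknown users fall through the case statement unchanged, which makes them easy to spot in git log afterwards.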
Then the script should be (?) relatively automated:
    perl filterBranch.pl svnBranches.txt
The fancy logic here is probably figuring out which Git refs go to which Git commit, but I'll leave that as an exercise to the reader... Ah, I should probably mention: svn-fe can produce "empty" commits, and filterBranch does nothing to remove them. By "empty" I mean there will be a commit object without any content changes. So creating a branch/tag in SVN creates a commit, but doesn't change content. That commit will be part of the new Git history. Similarly, filterBranch will create git tags from svn tags, but they point to one of these "empty" commits rather than the branch they are tagged from. It's not very git-ish, but it seems to work...
filterBranch is probably the longest step of the process; there's a lot of filtering going on. It will be very verbose on STDOUT, so I recommend tee'ing to a file or a terminal with infinite scrollback. It also involves a lot of disk hits (somewhat reduced if $tempdir is a RAM disk), and potentially a lot of space (it will create a git repo for every branch/tag in your subversion history). For our repository this step took about 1.5-2 hours IIRC.
Create SVN/Git Revmaps
----------------------
Next step is to create a map that goes from SVN rev to Git commit object. genRevmap.pl and genJointRevmap.pl will be helpful here:
1. cd $cleanDir (from filterBranch)
2. find . -type d -name "*.git" -exec genRevmap.pl '{}' svnRepoName destDir ';'
3. cd destDir
4. find . -name "*.revmap" -exec grep . '{}' + | genJointRevMap.pl > jointRevmap.revmap
genRevmap will respect the directory hierarchy created by filterBranch, and destDir must have a similar structure (doesn't require the individual Git repos, but any directory that contains a git repo must exist in destDir). genJointRevMap takes individual revmaps and creates a big revmap for all the repositories. These scripts aren't doing any real magic, just parsing the Git log messages for commit ID and the git-svn-id line to get the SVN rev the commit corresponds to. Note that SVN rev to Git commit can be one to many! (genRevmap just lists the same rev twice if it has more than one git commit associated with it, genJointRevMap flags those revs specially and lists all commit IDs on a single line).
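The one-to-many case is easier to see in code. This Python sketch merges (SVN rev, Git commit) pairs from all repositories into one map and marks revs with more than one commit. The input pair format and the "* " marker are assumptions for the example; the real genRevmap.pl / genJointRevMap.pl file formats may differ.

```python
from collections import defaultdict


def build_joint_revmap(entries):
    """Merge (svn_rev, git_sha) pairs into {svn_rev: [git_sha, ...]}."""
    joint = defaultdict(list)
    for rev, sha in entries:
        if sha not in joint[rev]:  # skip exact duplicates across repos
            joint[rev].append(sha)
    return dict(joint)


def format_joint_revmap(joint):
    """One line per rev; revs with several commits get a leading '* '."""
    lines = []
    for rev in sorted(joint):
        shas = joint[rev]
        prefix = "* " if len(shas) > 1 else ""
        lines.append(f"{prefix}{rev} {' '.join(shas)}")
    return "\n".join(lines)
```

Keeping every commit ID for a rev, rather than picking one, leaves the decision to whichever later step knows which branch it is looking at.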
Assembling the Final Git Repo
-----------------------------
Now we need to combine all the small git repos into one repo that represents the SVN history. Similar to filterBranch, you'll need to edit paths in repoFusion.pl to make sure it finds everything. Then simply:
    perl repoFusion.pl svnBranches.txt jointRevmap.revmap
At a high level, repoFusion:
1. Clones the trunk repository; this will become the new master branch.
2. Performs a git-fetch on every other repository created by filterBranch to retrieve the git branches/tags contained there.
3. Creates grafts to match up git branches with their parents using the revmap.
4. If manual grafts are required, pauses so the user can edit the grafts file (search for '*'; the message there might be a little cryptic, but using svn log and git log in combination, hopefully you can figure out the correct SHA to insert).
5. Runs filter-branch one more time to make the grafts permanent.
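Steps 3 and 4 can be sketched as follows. Each line of .git/info/grafts has the form "<commit> <parent> [<parent> ...]", so a graft for a branch is its first commit followed by the commit it should hang off. The graft_lines function and its inputs are hypothetical; the sketch only shows how an ambiguous or missing revmap entry ends up needing manual attention.

```python
def graft_lines(children, joint_revmap):
    """Build graft lines from (child_root_sha, parent_svn_rev) pairs.

    joint_revmap maps an SVN rev to the list of Git commits created from it.
    Returns (lines, needs_review): graft lines that could be resolved
    automatically, and the entries a human has to look at.
    """
    lines = []
    needs_review = []
    for child_sha, parent_rev in children:
        candidates = joint_revmap.get(parent_rev, [])
        if len(candidates) == 1:
            lines.append(f"{child_sha} {candidates[0]}")
        else:
            # Zero or several candidate parents: leave it for manual editing.
            needs_review.append((child_sha, parent_rev, candidates))
    return lines, needs_review
```

Only revs that map to exactly one commit are grafted automatically; everything else is returned for review rather than guessed.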
This is a bit faster than filterBranch, but still takes on the order of an hour for our repository. It also produces a lot of stuff on STDOUT, but I think it's a little easier on the disk. At the end of the filter-branch run, I found it useful to scan the output for refs that weren't updated... That usually indicates a graft didn't get created correctly (although due to SVN conventions, it's unlikely the master ref will change). At this point it's also possible to get some branch/tag name clashes (I did), so those may require cleanup.
Hiding 'Deleted' Branches
-------------------------
hideFromGit.pl will use the svnBranches.txt file to move any git refs associated with deleted SVN paths to refs/hidden in the new repository. This keeps the objects associated with those refs from getting garbage collected, but hides them from most user commands. This is entirely a personal preference. (Just like the other scripts, you'll probably have to edit the paths in the script itself.)
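A minimal sketch of the ref move, in Python rather than the Perl of hideFromGit.pl. It only builds the git update-ref command lines (create the hidden ref, then delete the visible one) and does not run them. The refs/hidden/heads/... and refs/hidden/tags/... layout is an assumption for the example.

```python
def hide_ref_commands(refs, deleted_names):
    """Return git command argument lists that move deleted refs out of sight.

    refs maps a full ref name (e.g. "refs/heads/old") to the SHA it points at;
    deleted_names is the set of short branch/tag names deleted in SVN.
    """
    commands = []
    for ref, sha in sorted(refs.items()):
        for prefix in ("refs/heads/", "refs/tags/"):
            if ref.startswith(prefix) and ref[len(prefix):] in deleted_names:
                hidden = "refs/hidden/" + ref[len("refs/"):]
                commands.append(["git", "update-ref", hidden, sha])
                commands.append(["git", "update-ref", "-d", ref])
    return commands
```

Each list can be handed to subprocess.run without going through a shell, in the same spirit as the scripts avoiding the shell where possible.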
'Validating' the Conversion
---------------------------
gitValidation.pl is a script I wrote to randomly select revs from SVN and try to compare the SVN diffs to the Git diffs. It uses git-patch-id to compute a SHA of the changes in each repository, and reports if something doesn't match up. It's not particularly polished, and does find "errors" in our Git repo, but after investigating all the discrepancies I'm pretty happy that nothing vital is wrong.
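The sampling-and-compare idea can be sketched like this. The two callables are hypothetical stand-ins: svn_patch_id(rev) would return the patch ID computed from the SVN diff of that rev, and git_patch_ids(rev) the patch IDs of every Git commit mapped to it (for instance by piping each diff through git patch-id). How gitValidation.pl actually obtains them is not shown here.

```python
import random


def find_mismatches(revs, svn_patch_id, git_patch_ids, sample_size, seed=None):
    """Spot-check a random sample of SVN revs against their Git commits.

    Returns the sampled revs whose SVN patch ID matches none of the
    patch IDs of the corresponding Git commits.
    """
    rng = random.Random(seed)
    pool = sorted(revs)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    mismatches = []
    for rev in sorted(sample):
        if svn_patch_id(rev) not in git_patch_ids(rev):
            mismatches.append(rev)
    return mismatches
```

A mismatch is only a prompt to look closer; as noted above, some reported "errors" turn out to be harmless once investigated.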
Closing Thoughts
----------------
Do I have any? This is quite the brain dump, so I'm sure I've been incomplete and probably somewhat confusing... I'm happy to answer questions as I can, but again, this is entirely based on my experience with our local repo. YMMV!
Thanks,
Stephen
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-14 16:22 ` Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch) Stephen Bash
@ 2010-10-14 16:34 ` Jonathan Nieder
2010-10-14 20:07 ` Sverre Rabbelier
2010-10-18 5:17 ` Ramkumar Ramachandra
1 sibling, 1 reply; 52+ messages in thread
From: Jonathan Nieder @ 2010-10-14 16:34 UTC (permalink / raw)
To: Stephen Bash
Cc: Matt Stump, git, David Barr, Tomas Carnecky, Sverre Rabbelier,
Ramkumar Ramachandra
[just cc-ing David, Tom, Sverre, Ram, who might be interested.]
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-14 16:34 ` Jonathan Nieder
@ 2010-10-14 20:07 ` Sverre Rabbelier
2010-10-15 14:50 ` Stephen Bash
0 siblings, 1 reply; 52+ messages in thread
From: Sverre Rabbelier @ 2010-10-14 20:07 UTC (permalink / raw)
To: Stephen Bash, Jonathan Nieder
Cc: Matt Stump, git, David Barr, Tomas Carnecky, Ramkumar Ramachandra
Heya,
On Thu, Oct 14, 2010 at 18:34, Jonathan Nieder <jrnieder@gmail.com> wrote:
> [just cc-ing David, Tom, Sverre, Ram, who might be interested.]
Thanks, I'm definitely interested.
> Stephen Bash wrote:
>>> I hate making more work for people but I would love a copy of your
>>> notes.
>>
>> Okay, here we go!
Thanks for the very interesting read. It seems like a (very) long
pipeline though, I wonder how we can make this not only easier, but
also more streamlined for git-remote-svn. Do you have any suggestions
on how you would prefer this to be done in git-remote-svn? (Main
advantage for git-remote-svn might be that we can use git notes to
store commit conversion information, instead of having to mine commit
messages.)
--
Cheers,
Sverre Rabbelier
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-14 20:07 ` Sverre Rabbelier
@ 2010-10-15 14:50 ` Stephen Bash
2010-10-15 23:39 ` Sverre Rabbelier
0 siblings, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-15 14:50 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Matt Stump, git, David Barr, Tomas Carnecky, Ramkumar Ramachandra,
Jonathan Nieder
> Thanks for the very interesting read. It seems like a (very) long
> pipeline though, I wonder how we can make this not only easier, but
> also more streamlined for git-remote-svn.
The process can certainly be streamlined. As is often the case, this process was created via the "just make it work" mentality (and a barely passable knowledge of git). Now that I'm a little more comfortable with git and its basic objects, I think I could probably create a new process that does a single pass through the svn-fe created repository and creates a new repository with the correct history (and some other nice features that come with any 2.0).
But I'm also looking at this from a one-time conversion view. I had a couple of conversations with Ram that showed me my point of view is very narrow compared to the larger git-remote-svn effort...
> Do you have any suggestions
> on how you would prefer this to be done in git-remote-svn? (Main
> advantage for git-remote-svn might be that we can use git notes to
> store commit conversion information, instead of having to mine commit
> messages.)
I think using notes is a better way to associate conversion information with commits, but I would probably still end up mining the notes to create some sort of svn to git mapping... Correct me if I'm wrong, but I don't see how notes would help me get from an svn rev to a git sha (a common practice for tickets and wiki links in our organization). The latter is more a job for tags, and while that would be possible, that more than doubles the number of objects in the repository (I have a good percentage of SVN revs that turned into multiple git commit objects).
But otherwise, my suggestions are (unfortunately) rather naive. "Make it work like git-svn, but faster" :) I can offer the warning to watch out for cross-branch (subdirectory/file) copies; we had a lot of those in our SVN repository, and I still don't know if there's any way in Git to represent that operation... And obviously even if I did have/use the svn merge information, svn merges don't map directly to git merges... but I'm guessing I'm not saying anything you haven't already thought about.
I guess after that I should add that I'm happy to help, I'm just not sure where my experience maps to the ongoing effort.
Thanks,
Stephen
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-15 14:50 ` Stephen Bash
@ 2010-10-15 23:39 ` Sverre Rabbelier
2010-10-16 0:16 ` Stephen Bash
0 siblings, 1 reply; 52+ messages in thread
From: Sverre Rabbelier @ 2010-10-15 23:39 UTC (permalink / raw)
To: Stephen Bash
Cc: Matt Stump, git, David Barr, Tomas Carnecky, Ramkumar Ramachandra,
Jonathan Nieder
Heya,
On Fri, Oct 15, 2010 at 09:50, Stephen Bash <bash@genarts.com> wrote:
> The process can certainly be streamlined. As is often the case, this
> process was created via the "just make it work" mentality (and a barely
> passable knowledge of git).
Fair enough :)
> But I'm also looking at this from a one-time conversion view. I had a
> couple of conversations with Ram that showed me my point of view is
> very narrow compared to the larger git-remote-svn effort...
Yeah, we not only want 'git-remote-svn' to be able to do incremental
imports, but we also want it to be able to push back to svn.
> I think using notes is a better way to associate conversion information
> with commits, but I would probably still end up mining the notes to create
> some sort of svn to git mapping... Correct me if I'm wrong, but I don't see
> how notes would help me get from an svn rev to a git sha (a common
> practice for tickets and wiki links in our organization).
Ah, hmm, that is a good point. Couldn't you just tag object
0000000000000000143 for svn revision 143?
> I guess after that I should add that I'm happy to help, I'm just not sure
> where my experience maps to the ongoing effort.
Just general feedback, sanity checking, and if you're interested,
"beta testing" I think would be very useful :).
--
Cheers,
Sverre Rabbelier
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-15 23:39 ` Sverre Rabbelier
@ 2010-10-16 0:16 ` Stephen Bash
2010-10-17 2:25 ` Sverre Rabbelier
0 siblings, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-16 0:16 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Matt Stump, git, David Barr, Tomas Carnecky, Ramkumar Ramachandra,
Jonathan Nieder
----- Original Message -----
> From: "Sverre Rabbelier" <srabbelier@gmail.com>
> To: "Stephen Bash" <bash@genarts.com>
> Sent: Friday, October 15, 2010 7:39:09 PM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> On Fri, Oct 15, 2010 at 09:50, Stephen Bash <bash@genarts.com> wrote:
> > I think using notes is a better way to associate conversion
> > information
> > with commits, but I would probably still end up mining the notes to
> > create
> > some sort of svn to git mapping... Correct me if I'm wrong, but I
> > don't see
> > how notes would help me get from an svn rev to a git sha (a common
> > practice for tickets and wiki links in our organization).
>
> Ah, hmm, that is a good point. Couldn't you just tag object
> 0000000000000000143 for svn revision 143?
Yeah, I actually thought about that as I was writing the comment... And after a completely unrelated conversation about tags at $WORK this afternoon, I'm half tempted to make a refs/tags/svn directory with lightweight tags named for SVN revs that point to the appropriate git object. That idea, along with a bunch of others, is now brewing into a 2.0 in my head since I started revisiting this process. We'll see if I have a productive weekend or not...
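A minimal sketch of the refs/tags/svn idea, in Python for illustration only. The `r<rev>` naming and the numbered suffix for revs that turned into several commits are assumptions, not something settled in this thread; the function names are hypothetical. It only renders `git update-ref` command lines and does not touch a repository.

```python
from collections import defaultdict


def svn_tag_refs(pairs):
    """Map (svn_rev, git_sha) pairs to lightweight tag ref names.

    A rev with one commit gets refs/tags/svn/r<rev>. A rev that became
    several commits gets one numbered ref per commit (assumed scheme).
    """
    by_rev = defaultdict(list)
    for rev, sha in pairs:
        by_rev[rev].append(sha)

    refs = {}
    for rev in sorted(by_rev):
        shas = by_rev[rev]
        if len(shas) == 1:
            refs[f"refs/tags/svn/r{rev}"] = shas[0]
        else:
            for i, sha in enumerate(shas, start=1):
                refs[f"refs/tags/svn/r{rev}.{i}"] = sha
    return refs


def update_ref_commands(refs):
    """Render one `git update-ref <ref> <sha>` command line per tag."""
    return [f"git update-ref {ref} {sha}" for ref, sha in refs.items()]
```

With refs like these, a ticket that mentions r143 could be resolved with `git rev-parse refs/tags/svn/r143`, at the cost of one ref per SVN rev.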
> > I guess after that I should add that I'm happy to help, I'm just not
> > sure
> > where my experience maps to the ongoing effort.
>
> Just general feedback, sanity checking, and if you're interested,
> "beta testing" I think would be very useful :).
I will do my best.
Thanks,
Stephen
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-16 0:16 ` Stephen Bash
@ 2010-10-17 2:25 ` Sverre Rabbelier
2010-10-17 3:33 ` David Michael Barr
0 siblings, 1 reply; 52+ messages in thread
From: Sverre Rabbelier @ 2010-10-17 2:25 UTC (permalink / raw)
To: Stephen Bash
Cc: Matt Stump, git, David Barr, Tomas Carnecky, Ramkumar Ramachandra,
Jonathan Nieder
Heya,
On Fri, Oct 15, 2010 at 19:16, Stephen Bash <bash@genarts.com> wrote:
> That idea along with a bunch of others are now brewing a 2.0 in
> my head since I started revisiting this process. We'll see if I have a
> productive weekend or not...
Looking forward to seeing what you come up with :).
--
Cheers,
Sverre Rabbelier
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-17 2:25 ` Sverre Rabbelier
@ 2010-10-17 3:33 ` David Michael Barr
0 siblings, 0 replies; 52+ messages in thread
From: David Michael Barr @ 2010-10-17 3:33 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Stephen Bash, Matt Stump, git, Tomas Carnecky,
Ramkumar Ramachandra, Jonathan Nieder
Hi all,
>> That idea along with a bunch of others are now brewing a 2.0 in
>> my head since I started revisiting this process. We'll see if I have a
>> productive weekend or not...
>
> Looking forward to seeing what you come up with :).
I don't know how much it will help, but I started work on a simplistic
mapping strategy with speed in mind. [1]
I am curious to read through the details of your heuristics and work
out how to streamline the process.
--
David Barr
[1] http://thread.gmane.org/gmane.comp.version-control.git/158375
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-14 16:22 ` Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch) Stephen Bash
2010-10-14 16:34 ` Jonathan Nieder
@ 2010-10-18 5:17 ` Ramkumar Ramachandra
2010-10-18 7:31 ` Jonathan Nieder
2010-10-19 1:42 ` Stephen Bash
1 sibling, 2 replies; 52+ messages in thread
From: Ramkumar Ramachandra @ 2010-10-18 5:17 UTC (permalink / raw)
To: Stephen Bash
Cc: Matt Stump, git, Jonathan Nieder, David Michael Barr,
Sverre Rabbelier, Tomas Carnecky
Hi Stephen,
[sorry about the delayed reply; was ill]
Stephen Bash writes:
> Okay, here we go! I've uploaded the applicable scripts to
> https://gist.github.com/f6902cb4e3534f07ba48
>
> If you (or anyone) finds I describe something here that isn't on
> github, let me know and I'll add to it. I did a cursory pass
> through the scripts to remove a lot of the specific-to-our-repo
> stuff, so I'm not even sure these scripts will run as is... But
> most errors should be pretty minor (typos in variable names, etc)
> the overall procedure is unchanged. (And please be gentle, these
> are not anything approaching production-ready)
Maybe you'd like to fork off the repository we're working on and add
the scripts there for convenience? We'll start working on the mapping
as soon as svn-fe3 is finished.
> As always, these scripts come with ABSOLUTELY NO WARRANTY, use at
> your own risk, your mileage may vary, etc.
>
> Converting to Git using svn-fe
> ------------------------------
> Most people who have tried using git-svn to convert a medium to
> large Subversion repository have found it's a slow process. When I
> asked the Git mailing list about this problem in June 2010, I was
> pointed to David Barr's svn-dump-fast-export tool:
> http://github.com/barrbrain/svn-dump-fast-export
> svn-fe (as the executable is called) converts an entire svn
> repository to git very quickly (our repository took about 20
> minutes), but the entire svn file system is one branch. I developed
> the following process to reproduce the svn history in git.
So you used the version that supports dumpfile v2 that's merged into
git.git `master`.
> Initial Thoughts
> ----------------
> 1) Our SVN repository was approximately 20k commits, about 7k files
> in HEAD, a little less than 400 tags, and about 100-150 branches.
> It was organized /trunk/project rather than /project/trunk.
> Branches were /branches/branchName where the branchName directory
> was a copy of the entire trunk (so /branches/branchName/project
> is what a user would checkout). This does affect the scripts,
> but I think it should be relatively easy to modify (no guarantee
> though).
This is a relatively simple scenario. When building the actual mapper,
we must also take into account moves like /project -> /trunk/project.
> 2) Our SVN repository originated from cvs2svn, so there are some
> artifacts from that conversion that affect this conversion.
>
> 3) I make very little use of Git.pm because while I was developing I
> ran into a bunch of problems with it (none of which I remember
> now). Instead I make use of perl's system call to send commands
> to Git (where possible I avoid invoking the shell, see perldoc -f
> exec). I don't want to imply Git.pm doesn't work, but at the
> time it didn't work for me (and I was more focused on making my
> scripts work than improving Git.pm. Sorry!).
I see that you've invoked a lot of Git porcelain commands in your
scripts. We must factor these out eventually.
> 4) The vast majority of our history was before SVN introduced
> merge-info, so I made no attempt to capture SVN merges in Git.
> Rather I kept all branch heads, but moved most of them to a
> "hidden" namespace (see hideFromGit.pl for details). This does
> mean for a couple merges post-conversion I've had to add
> temporary grafts to make the merge work, but I haven't bothered
> making those grafts permanent (hopefully this isn't a problem?)
>
> 5) I performed this entire process using a local mirror of our SVN
> repository in about 4 hours. It is mostly automated, but does
> require some human monitoring (maybe I'm just paranoid). Since
> svn-fe runs off a SVN dump file, creating the local mirror was a
> trivial additional step.
The mirroring process is painfully slow :(
> 6) To keep what follows a *little* shorter, I'm going to assume you
> can read Perl to extract the details of what's going on. I'll
> try to keep the prose to a high level...
>
> Extracting SVN's History
> ------------------------
> First we want to understand SVN's branching/tagging history. Modify
> buildSVNTree.pl as necessary, then run
> perl buildSVNTree.pl > svnBranches.txt
This looks like the heart of the mapper. You've used the libsvn Perl
bindings to mine data from a local mirror of the repository -- this is
not the way to go. Our current approach is to first get everything
into Git-land and then perform the mapping.
Fundamentally, we get our dumpfile parser (svn-fe) to dump all the
data to a Git object store blindly and then post-process (see
db-svn-filter-root). I'm currently investigating if the dumpfile
parser knows something more that we need to note down before
performing the actual mapping.
Also, since we're aiming for a two-way mapping, it's going to be
significantly more challenging: we will need a mapping function that
can be inverted perfectly.
> buildSVNTree.pl does the following steps:
> 1. Traverses the SVN history chronologically looking for copies.
Unnecessary: the object store doesn't care about being told about
copies explicitly. Then again, we can't produce the exact same
Copyfrom information when we export it back to a dumpfile. The aim of
the two-way mapping:
SVN repository 1 -> dumpfile -> Git repository
Git repository -> dumpfile' -> SVN repository 2
I don't care if dumpfile and dumpfile' don't match. The SVN
repositories must, that's all.
> 2. Records the source path/rev and destination path/rev for (most)
> copies (see script for details)
Unnecessary again.
> 3. Once all copies are collected, further filters copies based on:
> * source path is a directory
> * source and destination are not in trunk
> * source and destination are not in the same branch or tag
> * source path is not /vendor (an artifact of cvs2svn)
Interesting. I can't figure out why this is necessary yet (I'll
probably find out later in the email).
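The four filters in step 3 can be sketched as a single predicate. This is a Python illustration, not the Perl in buildSVNTree.pl; the `Copy` record and `branch_root` helper are hypothetical, the layout (/trunk, /branches/name, /tags/name) is taken from the notes above, and "not in trunk" is read here as "not a copy entirely inside trunk", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    src_path: str
    src_rev: int
    dst_path: str
    dst_rev: int
    src_is_dir: bool


def branch_root(path):
    """Return the branch or tag root containing path, or None.

    Assumes the layout /trunk, /branches/<name>, /tags/<name>.
    """
    parts = path.strip("/").split("/")
    if parts[0] == "trunk":
        return "/trunk"
    if parts[0] in ("branches", "tags") and len(parts) >= 2:
        return "/" + "/".join(parts[:2])
    return None


def is_branch_copy(copy):
    """Apply the step 3 filters; True means the copy looks like a branch or tag."""
    src_root = branch_root(copy.src_path)
    dst_root = branch_root(copy.dst_path)
    if not copy.src_is_dir:
        return False  # file copies never create a branch
    if src_root == "/trunk" and dst_root == "/trunk":
        return False  # copy entirely inside trunk
    if src_root is not None and src_root == dst_root:
        return False  # copy inside one branch or tag
    if copy.src_path == "/vendor" or copy.src_path.startswith("/vendor/"):
        return False  # cvs2svn vendor artifact
    return True
```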
> 4. Checks that source path is "shortest" path from its rev (protect
> against subdirectories that get added in the same commit)
>
> 5. Checks the source and destination paths match globs for expected
> paths (non-matching copies that make it this far are printed to
> STDERR)
Easier in Git-land.
> 6. Creates a Git branch name for destination (note that svn tags are
> closer to git branches than git tags)
In future, we should give users the flexibility to choose - I'm
thinking of a language for specifying the mapping. And this
information should be persistent: maybe put it in a Git note?
> 7. Search history for the last commit that actually changed the source path
> 8. Find a parent path from the source path (mostly recurse up the SVN tree to a known branch)
> 9. Use the parent path to determine the parent git branch name
> 10. Record parent/child relationships
> 11. Dump output to STDOUT (which you should redirect to a file for later use)
Simplified in Git-land.
> I did run into one place where two SVN branches had the same name
> but different SVN paths (it's complicated). In this case I just
> manually edited the git branch name in svnBranches.txt. As long as
> you do that before continuing, everything should be okay.
Another interesting point. I'm thinking of a tool that assumes some
layout and produces something similar to svnBranches.txt that
end-users can edit before the actual mapping occurs.
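A rough sketch of how such a tool could flag the name clash Stephen describes before anyone edits the file by hand. The naming scheme (last path component for branches, a `tags/` prefix for tags, `master` for trunk) is an assumption made for illustration, and both function names are hypothetical.

```python
from collections import defaultdict


def git_branch_name(svn_path):
    """Derive a Git branch name from an SVN branch root (assumed scheme)."""
    parts = svn_path.strip("/").split("/")
    if parts == ["trunk"]:
        return "master"
    if parts[0] == "branches" and len(parts) >= 2:
        return parts[-1]
    if parts[0] == "tags" and len(parts) >= 2:
        return "tags/" + parts[-1]
    raise ValueError(f"unexpected path: {svn_path}")


def find_name_clashes(svn_paths):
    """Return {git_name: [svn_paths]} for names claimed by more than one path."""
    by_name = defaultdict(list)
    for path in svn_paths:
        by_name[git_branch_name(path)].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}
```

Anything reported by `find_name_clashes` would need a manual rename in the generated file before the mapping runs.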
> There's also some logic in buildSVNTree to determine if a branch/tag
> is deleted in the SVN head. That information is used by
> hideFromGit.
It'll be in the revision history in Git anyway- it doesn't require
special handling.
> Create the Single Branch Git Repo
> ---------------------------------
> Use svn-fe for what it's designed:
> 1. svnadmin dump /path/to/svn/repo > svn-dump.txt
> 2. git init /path/to/initial/git/repo
> 3. cd /path/to/initial/git/repo
> 4. cat /path/to/svn-dump.txt | svn-fe svnRepoName | git fast-import
>
> svnRepoName in step 4 can be anything you want, but it has to be
> specified so that svn-fe appends the git-svn style "git-svn-id:
> svnRepoName@svnRevNum svnRepoUUID" line to each commit message.
> This line is required later to map SVN revs to Git commits.
>
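For reference, the trailer format quoted above ("git-svn-id: svnRepoName@svnRevNum svnRepoUUID") can be picked apart with a small regular expression. This is a Python sketch for illustration; the function name is hypothetical, and it takes the last matching line in case a commit message happens to quote an older trailer.

```python
import re

# One trailer line: "git-svn-id: <repo>@<rev> <uuid>"
GIT_SVN_ID = re.compile(
    r"^git-svn-id: (?P<repo>\S+)@(?P<rev>\d+) (?P<uuid>\S+)$",
    re.MULTILINE,
)


def parse_git_svn_id(message):
    """Return (repo, rev, uuid) from a commit message, or None if absent."""
    matches = list(GIT_SVN_ID.finditer(message))
    if not matches:
        return None
    last = matches[-1]
    return last.group("repo"), int(last.group("rev")), last.group("uuid")
```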
> Create Git Branches and Tags
> ----------------------------
> Now comes the next script, filterBranch.pl. filterBranch will
> create Git branches and tags out of the single branch repo by
> creating a ton of clones and filtering each one. While it's doing
> this, it also changes the SVN user names to proper Git user IDs
> (name + email). fetchSVNNames.pl can be used to get all the svn
> users, then you can edit $authorScript in filterBranch to modify
> names appropriately ($authorScript is a git-filter-branch
> --env-filter, so it gets eval'ed by git). Per the git-filter-branch
> manpage, you'll want to create/use a RAM disk for temporary files
> (see $tempdir). And you'll need to set various paths like
> $parentRepo (this is the repo created in step 2 above), etc.
Hm, we need to maintain a persistent author map too -- another note
object? Also, as Tom's earlier notes point out, we need a persistent
timestamp mapping too because Git's concept of commit timestamp is
different from SVN's concept.
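One property a persistent author map would need for the two-way case is that it can be inverted: no two SVN usernames may map to the same Git identity. A minimal Python check, assuming the map is a plain dict from SVN username to a (name, email) pair; the function name and data shape are hypothetical.

```python
def invert_author_map(author_map):
    """Invert an SVN-username -> (name, email) map, or fail if ambiguous."""
    inverse = {}
    for svn_user, identity in author_map.items():
        if identity in inverse:
            raise ValueError(
                f"{svn_user!r} and {inverse[identity]!r} share {identity!r}"
            )
        inverse[identity] = svn_user
    return inverse
```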
> Then the script should be (?) relatively automated:
> perl filterBranch.pl svnBranches.txt
>
> The fancy logic here is probably figuring out which Git refs go to
> which Git commit, but I'll leave that as an exercise to the
> reader... Ah, I should probably mention: svn-fe can produce "empty"
> commits, and filterBranch does nothing to remove them. By "empty" I
> mean there will be a commit object without any content changes. So
> creating a branch/tag in SVN creates a commit, but doesn't change
> content. That commit will be part of the new Git history.
> Similarly, filterBranch will create git tags from svn tags, but they
> point to one of these "empty" commits rather than the branch they
> are tagged from. It's not very git-ish, but it seems to work...
Oh, I didn't realize that fast-import allows the creation of empty
commits. We should probably fix this?
> filterBranch is probably the longest step of the process; there's a
> lot of filtering going on. It will be very verbose on STDOUT, so I
> recommend tee'ing to a file or a terminal with infinite scroll back.
> It also involves a lot of disk hits (somewhat reduced if $tempdir is
> a RAM disk), and potentially a lot of space (it will create a git
> repo for every branch/tag in your subversion history). For our
> repository this step took about 1.5-2 hours IIRC.
Wow, this is really brute-force.
> Create SVN/Git Revmaps
> ----------------------
> Next step is to create a map that goes from SVN rev to Git commit
> object. genRevmap.pl and genJointRevmap.pl will be helpful here:
> 1. cd $cleanDir (from filterBranch)
> 2. find . -type d -name "*.git" -exec genRevmap.pl '{}' svnRepoName
> destDir ';'
> 3. cd destDir
> 4. find . -name "*.revmap" -exec grep . '{}' + | genJointRevMap.pl >
> jointRevmap.revmap
>
> genRevmap will respect the directory hierarchy created by
> filterBranch, and destDir must have a similar structure (doesn't
> require the individual Git repos, but any directory that contains a
> git repo must exist in destDir). genJointRevMap takes individual
> revmaps and creates a big revmap for all the repositories. These
> scripts aren't doing any real magic, just parsing the Git log
> messages for commit ID and the git-svn-id line to get the SVN rev
> the commit corresponds to. Note that SVN rev to Git commit can be
> one to many! (genRevmap just lists the same rev twice if it has
> more than one git commit associated with it, genJointRevMap flags
> those revs specially and lists all commit IDs on a single line).
Unless there's a one-to-one mapping between Git revisions and SVN
revisions, a two-way bridge will become very difficult to build. Can
you think of any scenarios where a one-to-one mapping doesn't make
sense?
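To make the one-to-many case concrete, here is a small Python sketch of merging per-repository revmaps into a joint one. The "! " prefix for revs with several commits is a made-up marker; the thread only says genJointRevMap flags such revs specially and lists all commit IDs on one line.

```python
from collections import defaultdict


def joint_revmap(entries):
    """Merge (svn_rev, git_sha) entries from many revmaps into one dict."""
    merged = defaultdict(list)
    for rev, sha in entries:
        if sha not in merged[rev]:  # the same commit can appear in several clones
            merged[rev].append(sha)
    return dict(merged)


def format_joint_revmap(revmap):
    """One line per rev; revs with several commits get a leading '! '."""
    lines = []
    for rev in sorted(revmap):
        line = f"{rev} {' '.join(revmap[rev])}"
        if len(revmap[rev]) > 1:
            line = "! " + line
        lines.append(line)
    return lines
```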
> Assembling the Final Git Repo
> -----------------------------
> Now we need to combine all the small git repos into one repo that
> represents the SVN history. Similar to filterBranch, you'll need to
> edit paths in repoFusion.pl to make sure it finds everything. Then
> simply:
> perl repoFusion.pl svnBranches.txt jointRevmap.revmap
>
> At a high level, repoFusion:
> 1. Clones the trunk repository, this will become the new master branch
> 2. Performs a git-fetch on every other repository created by
> filterBranch to retrieve the git branch/tags contained there
> 3. Creates grafts to match up git branches with their parents using
> the revmap
> 4. If manual grafts are required, it will pause so the user can edit
> the grafts file (search for '*', the message there might be a
> little cryptic, but using svn log and git log in combination,
> hopefully you can figure out what the correct SHA is to insert)
> 5. Runs filter-branch one more time to make the grafts permanent.
>
> This is a bit faster than filterBranch, but still takes on the order
> of an hour for our repository. It also produces a lot of stuff on
> STDOUT, but I think it's a little easier on the disk. At the end of
> the filter branch, I found it useful to scan the output for refs
> that weren't updated... That usually indicates a graft didn't get
> created correctly (although due to SVN conventions, it's unlikely
> the master ref will change). At this point it's also possible to get
> some branch/tag name clashes (I did), so those may require clean up.
Grafts and filter-branch. db-svn-filter-root does this more elegantly.
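For readers unfamiliar with grafts: each line of the grafts file is a commit ID followed by the parent IDs it should appear to have. Steps 3 and 4 of repoFusion can be sketched roughly as below, in Python; the function name and the revmap shape (rev to list of SHAs) are assumptions, and the '*' placeholder mirrors the manual-edit marker mentioned in step 4.

```python
def graft_line(child_root_sha, parent_rev, revmap):
    """Return one grafts-file line: '<commit> <parent>'.

    When the parent rev maps to zero or several commits, emit '*' in
    place of the parent so a human can fill in the right SHA.
    """
    candidates = revmap.get(parent_rev, [])
    parent = candidates[0] if len(candidates) == 1 else "*"
    return f"{child_root_sha} {parent}"
```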
> Hiding 'Deleted' Branches
> -------------------------
> hideFromGit.pl will use the svnBranches.txt file to move any git
> refs associated with deleted SVN paths to refs/hidden in the new
> repository. This keeps the objects associated with those refs from
> getting garbage collected, but hides them from most user commands.
> This is entirely a personal preference. (Just like the other
> scripts, you'll probably have to edit the paths in the script
> itself)
Hm. You didn't include the history of deleted branches in the main
repository. Why? Does it make sense to provide the user an option to
exclude some (deleted) branches in the SVN history? It'll make the
two-way mapping extremely difficult.
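For context, the hiding step amounts to renaming refs. A Python sketch that only renders the `git update-ref` command lines; the layout under refs/hidden (keeping the heads/ and tags/ components) is an assumption, and the function names are hypothetical.

```python
def hidden_ref(ref):
    """Map refs/heads/x or refs/tags/x to refs/hidden/heads/x and so on."""
    prefix = "refs/"
    if not ref.startswith(prefix):
        raise ValueError(f"not a full ref name: {ref}")
    return "refs/hidden/" + ref[len(prefix):]


def hide_commands(deleted_refs):
    """deleted_refs maps a full ref name to its SHA for deleted SVN paths."""
    cmds = []
    for ref, sha in deleted_refs.items():
        cmds.append(f"git update-ref {hidden_ref(ref)} {sha}")  # keep objects reachable
        cmds.append(f"git update-ref -d {ref}")  # drop the visible ref
    return cmds
```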
> 'Validating' the Conversion
> ---------------------------
> gitValidation.pl is a script I wrote to randomly select revs from
> SVN and try to compare the SVN diffs to the Git diffs. It uses
> git-patch-id to compute a SHA of the changes in each repository, and
> reports if something doesn't match up. It's not particularly
> polished, and does find "errors" in our Git repo, but after
> investigating all the discrepancies I'm pretty happy that nothing
> vital is wrong.
Until the Git -> SVN bridge is complete, I don't suppose we can do
much better than this.
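The spot check described above boils down to sampling revs and comparing one patch ID per rev from each side. A Python sketch of just that comparison; how the patch IDs are computed (for example by piping each diff through git patch-id) is left out, and the function names and dict inputs are hypothetical.

```python
import random


def sample_revs(revs, count, seed=None):
    """Pick up to `count` distinct revs at random, returned in sorted order."""
    rng = random.Random(seed)
    pool = sorted(revs)
    return sorted(rng.sample(pool, min(count, len(pool))))


def mismatched_revs(svn_patch_ids, git_patch_ids, revs):
    """Return revs whose patch IDs differ or are missing on either side."""
    return [
        rev
        for rev in revs
        if svn_patch_ids.get(rev) is None
        or svn_patch_ids.get(rev) != git_patch_ids.get(rev)
    ]
```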
> Closing Thoughts
> ----------------
> Do I have any? This is quite the brain dump, so I'm sure I've been
> incomplete and probably somewhat confusing... I'm happy to answer
> questions as I can, but again, this is entirely based on my
> experience with our local repo. YMMV!
Thanks for the interesting and insightful read :)
Do discuss a little more with us on #git-devel or otherwise.
-- Ram
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 5:17 ` Ramkumar Ramachandra
@ 2010-10-18 7:31 ` Jonathan Nieder
2010-10-18 16:38 ` Ramkumar Ramachandra
2010-10-19 1:42 ` Stephen Bash
1 sibling, 1 reply; 52+ messages in thread
From: Jonathan Nieder @ 2010-10-18 7:31 UTC (permalink / raw)
To: Ramkumar Ramachandra
Cc: Stephen Bash, Matt Stump, git, David Michael Barr,
Sverre Rabbelier, Tomas Carnecky
Ramkumar Ramachandra wrote:
> Also, since we're aiming for a two-way mapping, it's going to be
> significantly more challenging: we will need a mapping function that
> can be inverted perfectly.
Sounds interesting! Let's see how much I can narrow scope/dash hopes.
:)
First of dreams is the possibility of using git as a replacement for
svnsync, to get semantically identical SVN repositories like so:
[...]
> SVN repository 1 -> dumpfile -> Git repository
> Git repository -> dumpfile' -> SVN repository 2
in a way that svn tools can look at repo 2 as a basically perfect
replacement for repo 1. This means copying svnsync properties,
rename tracking info, svn properties, etc.
I. Some people might want that, and I wouldn't want to stop them
trying (maybe using notes, perhaps even the mythical tree-based
form) but I'm not interested in it at all. Is it a goal for you?
Second would be the possibility of using an SVN repository as a
conduit for communication between git repositories:
Git repository 1 -> fast-export stream -> SVN repository
SVN repository -> dumpfile -> Git repository 2
II. It would be super cool to be able to transport arbitrary git
objects via svn (maybe using custom properties and fabricated
temporary branches named after the first commit after a fork
point). Perhaps some people could host git projects on Google
Code this way. Is that a goal?
Git 1 -> SVN 1 -> Git 2 -> SVN 2 -> Git 3
III. Perhaps only the subset of git objects with certain properties
should be considered safe to transport via an SVN repository
(e.g.:
- author matches committer
- timestamps are New York time
- author address is of the format username <username>
- filenames are valid UTF-8
). And maybe any existing git repository can be painlessly
transformed to consist only of such commits. Is that a model
to strive for?
SVN 1 -> Git 1 -> SVN 2 -> Git 2 -> SVN 3
IV. Maybe only some svn changes would be considered safe to
transport via git: no weird properties, no tracked renames
not involved in branches/merges, all branches named after the
git commit id of the first rev after the fork point, ...
And maybe any existing svn repository can be painlessly
transformed to consist only of such revisions. Is that a goal?
(As you might have guessed, my answers are "no, no, no, and no, at
least at first, but it is fun to imagine how a person would go about
achieving these things anyway").
Hope that clarifies something,
Jonathan
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 7:31 ` Jonathan Nieder
@ 2010-10-18 16:38 ` Ramkumar Ramachandra
2010-10-18 16:46 ` Sverre Rabbelier
0 siblings, 1 reply; 52+ messages in thread
From: Ramkumar Ramachandra @ 2010-10-18 16:38 UTC (permalink / raw)
To: Jonathan Nieder
Cc: Stephen Bash, Matt Stump, git, David Michael Barr,
Sverre Rabbelier, Tomas Carnecky
Hi Jonathan,
Jonathan Nieder writes:
> Ramkumar Ramachandra wrote:
>
> > Also, since we're aiming for a two-way mapping, it's going to be
> > significantly more challenging: we will need a mapping function that
> > can be inverted perfectly.
>
> Sounds interesting! Let's see how much I can narrow scope/dash hopes.
> :)
>
> First of dreams is the possibility of using git as a replacement for
> svnsync, to get semantically identical SVN repositories like so:
>
> [...]
> > SVN repository 1 -> dumpfile -> Git repository
> > Git repository -> dumpfile' -> SVN repository 2
>
> in a way that svn tools can look at repo 2 as a basically perfect
> replacement for repo 1. This means copying svnsync properties,
> rename tracking info, svn properties, etc.
>
> I. Some people might want that, and I wouldn't want to stop them
> trying (maybe using notes, perhaps even the mythical tree-based
> form) but I'm not interested in it at all. Is it a goal for you?
Hm. I didn't imagine that it would be *that* difficult. The challenge
is to design an invertible mapping function by encapsulating
incompatibilities (or inconsistencies) bit-by-bit using hacks like
notes for the additional information. I'll think about this a little
more and get back to it in a few days.
> Second would be the possibility of using an SVN repository as a
> conduit for communication between git repositories:
>
> Git repository 1 -> fast-export stream -> SVN repository
> SVN repository -> dumpfile -> Git repository 2
Interesting, but I don't necessarily see why this is useful.
> II. It would be super cool to be able to transport arbitrary git
> objects via svn (maybe using custom properties and fabricated
> temporary branches named after the first commit after a fork
> point). Perhaps some people could host git projects on Google
> Code this way. Is that a goal?
>
> Git 1 -> SVN 1 -> Git 2 -> SVN 2 -> Git 3
Wow. That IS super-cool, but I'd have to stretch my imagination quite
a bit to find a usecase for this. I actually find this inelegant (and
probably even grotesque) on many levels, so no- absolutely not
interested in this.
> III. Perhaps only the subset of git objects with certain properties
> should be considered safe to transport via an SVN repository
> (e.g.:
>
> - author matches committer
> - timestamps are New York time
> - author address is of the format username <username>
> - filenames are valid UTF-8
>
> ). And maybe any existing git repository can be painlessly
> transformed to consist only of such commits. Is that a model
> to strive for?
>
> SVN 1 -> Git 1 -> SVN 2 -> Git 2 -> SVN 3
Dunno, and I don't like this.
> IV. Maybe only some svn changes would be considered safe to
> transport via git: no weird properties, no tracked renames
> not involved in branches/merges, all branches named after the
> git commit id of the first rev after the fork point, ...
> And maybe any existing svn repository can be painlessly
> transformed to consist only of such revisions. Is that a goal?
Again, no usecase. I'm not looking for making SVN do Git wizardry-
there's always Git for that. SVN is a simple book-keeping system, and
I want to keep it that way.
> (As you might have guessed, my answers are "no, no, no, and no, at
> least at first, but it is fun to imagine how a person would go about
> achieving these things anyway").
Let me guess: you're targeting git-svn like functionality with all the
dcommit/ rebase ugliness? I'm looking for a slightly nicer way, not
too much more; (I) is just a sort of "ideal" target- it's just nice to
think about it that way. It needn't be entirely realistic.
-- Ram
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 16:38 ` Ramkumar Ramachandra
@ 2010-10-18 16:46 ` Sverre Rabbelier
2010-10-18 16:56 ` Jonathan Nieder
0 siblings, 1 reply; 52+ messages in thread
From: Sverre Rabbelier @ 2010-10-18 16:46 UTC (permalink / raw)
To: Ramkumar Ramachandra
Cc: Jonathan Nieder, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Heya,
On Mon, Oct 18, 2010 at 11:38, Ramkumar Ramachandra <artagnon@gmail.com> wrote:
> Let me guess: you're targeting git-svn like functionality with all the
> dcommit/ rebase ugliness? I'm looking for a slightly nicer way, not
> too much more; (I) is just a sort of "ideal" target- it's just nice to
> think about it that way. It needn't be entirely realistic.
I'm thinking we can just refuse to let through a commit that is
non-linear, as if there's a hook on the server side that rejects such
a history. Since we're representing the svn remote as a regular
remote, the user can just do 'git rebase @{u}' themselves if they end
up with a non-linear history.
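Roughly, the check such a pretend hook would perform looks like this. It is a Python sketch over an in-memory parent table, not actual remote-helper code; the function name and the `parents` dict are hypothetical.

```python
def is_linear_on_top_of(tip, upstream, parents):
    """True if following single parents from tip reaches upstream.

    parents maps each commit ID to the list of its parent IDs.
    """
    commit = tip
    while commit != upstream:
        ps = parents.get(commit, [])
        if len(ps) != 1:
            return False  # merge commit, root commit, or unknown commit
        commit = ps[0]
    return True
```

A push would be refused whenever this returns False, which is the point at which the user rebases onto @{u} and tries again.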
--
Cheers,
Sverre Rabbelier
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 16:46 ` Sverre Rabbelier
@ 2010-10-18 16:56 ` Jonathan Nieder
2010-10-18 17:16 ` Ramkumar Ramachandra
2010-10-18 17:18 ` Sverre Rabbelier
0 siblings, 2 replies; 52+ messages in thread
From: Jonathan Nieder @ 2010-10-18 16:56 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Sverre Rabbelier wrote:
> I'm thinking we can just refuse to let through a commit that is
> non-linear, as if there's a hook on the server side that rejects such
> a history. Since we're representing the svn remote as a regular
> remote, the user can just do 'git rebase @{u}' themselves if they end
> up with a non-linear history.
Sounds good to me!
FWIW I just wanted to make sure people don't forget about the
incompatible object models. The pretend-upstream-has-a-vicious-update-hook
approach sounds like a sane way to deal with this for pushing from
git to svn (like (III) but making the user do more of the work).
Pulling from svn is a harder problem but luckily the single-upstream
case is the usual case (so object model mismatches are easier to cope
with as long as one can find the corresponding svn rev number for a
given git object easily).
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 16:56 ` Jonathan Nieder
@ 2010-10-18 17:16 ` Ramkumar Ramachandra
2010-10-18 17:18 ` Sverre Rabbelier
1 sibling, 0 replies; 52+ messages in thread
From: Ramkumar Ramachandra @ 2010-10-18 17:16 UTC (permalink / raw)
To: Jonathan Nieder
Cc: Sverre Rabbelier, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Jonathan Nieder writes:
> Sverre Rabbelier wrote:
> > I'm thinking we can just refuse to let through a commit that is
> > non-linear, as if there's a hook on the server side that rejects such
> > a history. Since we're representing the svn remote as a regular
> > remote, the user can just do 'git rebase @{u}' themselves if they end
> > up with a non-linear history.
>
> Sounds good to me!
Of course. I can't think of a sane way to deal with commits that aren't
based on upstream. We can't expect the user to rewrite the history,
push and expect it to work. I'm only looking at perfect two-way
mapping for a restricted set of operations on the Git-side.
> FWIW I just wanted to make sure people don't forget about the
> incompatible object models. The pretend-upstream-has-a-vicious-update-hook
> approach sounds like a sane way to deal with this for pushing from
> git to svn (like (III) but making the user do more of the work).
>
> Pulling from svn is a harder problem but luckily the single-upstream
> case is the usual case (so object model mismatches are easier to cope
> with as long as one can find the corresponding svn rev number for a
> given git object easily).
Yeah, I'm only looking at single-upstream SVN with stable revision
numbers, timestamps etc.
-- Ram
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 16:56 ` Jonathan Nieder
2010-10-18 17:16 ` Ramkumar Ramachandra
@ 2010-10-18 17:18 ` Sverre Rabbelier
2010-10-18 17:28 ` Jonathan Nieder
1 sibling, 1 reply; 52+ messages in thread
From: Sverre Rabbelier @ 2010-10-18 17:18 UTC (permalink / raw)
To: Jonathan Nieder
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Heya,
On Mon, Oct 18, 2010 at 11:56, Jonathan Nieder <jrnieder@gmail.com> wrote:
> FWIW I just wanted to make sure people don't forget about the
> incompatible object models.
> Pulling from svn is a harder problem but luckily the single-upstream
> case is the usual case (so object model mismatches are easier to cope
> with as long as one can find the corresponding svn rev number for a
> given git object easily).
I think I'm missing something. What do you mean by this?
--
Cheers,
Sverre Rabbelier
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 17:18 ` Sverre Rabbelier
@ 2010-10-18 17:28 ` Jonathan Nieder
2010-10-18 18:10 ` Sverre Rabbelier
0 siblings, 1 reply; 52+ messages in thread
From: Jonathan Nieder @ 2010-10-18 17:28 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Sverre Rabbelier wrote:
> On Mon, Oct 18, 2010 at 11:56, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> FWIW I just wanted to make sure people don't forget about the
>> incompatible object models.
>
>> Pulling from svn is a harder problem but luckily the single-upstream
>> case is the usual case (so object model mismatches are easier to cope
>> with as long as one can find the corresponding svn rev number for a
>> given git object easily).
>
> I think I'm missing something. What do you mean by this?
I mean that rejecting a fetch because upstream has weird history
would make no one happy.
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 17:28 ` Jonathan Nieder
@ 2010-10-18 18:10 ` Sverre Rabbelier
2010-10-18 18:13 ` Jonathan Nieder
0 siblings, 1 reply; 52+ messages in thread
From: Sverre Rabbelier @ 2010-10-18 18:10 UTC (permalink / raw)
To: Jonathan Nieder
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Heya,
On Mon, Oct 18, 2010 at 12:28, Jonathan Nieder <jrnieder@gmail.com> wrote:
> I mean that rejecting a fetch because upstream has weird history
> would make no one happy.
Agreed. What about when the remote's history is rewritten, do we want
to just transplant the new history, or do we do a forced update of the
remote?
--
Cheers,
Sverre Rabbelier
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 18:10 ` Sverre Rabbelier
@ 2010-10-18 18:13 ` Jonathan Nieder
2010-10-18 18:20 ` Sverre Rabbelier
0 siblings, 1 reply; 52+ messages in thread
From: Jonathan Nieder @ 2010-10-18 18:13 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Sverre Rabbelier wrote:
> On Mon, Oct 18, 2010 at 12:28, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> I mean that rejecting a fetch because upstream has weird history
>> would make no one happy.
>
> Agreed. What about when the remote's history is rewritten, do we want
> to just transplant the new history, or do we do a forced update of the
> remote?
I think treating it as a usual non-fast-forward update makes sense.
Log messages could be an annoying special case, though, since people
edit those a lot. Does svn store the original log message somewhere?
(Please forgive my ignorance). If not, I suppose downstream can
publish refs produced by "git replace" to cope.
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 18:13 ` Jonathan Nieder
@ 2010-10-18 18:20 ` Sverre Rabbelier
2010-10-18 18:25 ` Jonathan Nieder
2010-10-19 0:40 ` Stephen Bash
0 siblings, 2 replies; 52+ messages in thread
From: Sverre Rabbelier @ 2010-10-18 18:20 UTC (permalink / raw)
To: Jonathan Nieder
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Heya,
On Mon, Oct 18, 2010 at 13:13, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Log messages could be an annoying special case, though, since people
> edit those a lot. Does svn store the original log message somewhere?
> (Please forgive my ignorance). If not, I suppose downstream can
> publish refs produced by "git replace" to cope.
From what I've heard basically all meta-data about a commit (including
author and date!) is mutable. Previously suggested was to stub out the
commit message and user data with placeholders, and drop in the real
information using git notes. I like your suggestion (of using git
replace instead) better. How would we know to use git replace though?
Does the replay API somehow indicate that a revision changed since
last time you looked?
--
Cheers,
Sverre Rabbelier
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 18:20 ` Sverre Rabbelier
@ 2010-10-18 18:25 ` Jonathan Nieder
2010-10-18 18:35 ` Sverre Rabbelier
2010-10-19 3:08 ` Ramkumar Ramachandra
2010-10-19 0:40 ` Stephen Bash
1 sibling, 2 replies; 52+ messages in thread
From: Jonathan Nieder @ 2010-10-18 18:25 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Sverre Rabbelier wrote:
> Does the replay API somehow indicate that a revision changed since
> last time you looked?
Good question. Ram, I think there was some discussion of this
recently in connection with svnrdump, right? IIRC the suggested
method was to use hooks or mine a commits@ mailing list. :(
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 18:25 ` Jonathan Nieder
@ 2010-10-18 18:35 ` Sverre Rabbelier
2010-10-18 19:33 ` Jonathan Nieder
2010-10-19 3:08 ` Ramkumar Ramachandra
1 sibling, 1 reply; 52+ messages in thread
From: Sverre Rabbelier @ 2010-10-18 18:35 UTC (permalink / raw)
To: Jonathan Nieder
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Heya,
On Mon, Oct 18, 2010 at 13:25, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Good question. Ram, I think there was some discussion of this
> recently in connection with svnrdump, right? IIRC the suggested
> method was to use hooks or mine a commits@ mailing list. :(
Hmmm, in that case perhaps we should instead just ignore changed
history? Anyone collaborating on history can clone the repository
they're collaborating from, including all git-remote-svn meta-data. It
seems nearly impossible to guarantee that two people that clone the
same repository at different times get the same git hashes (unless we
stub out all mutable data, which is ugly and a pita).
--
Cheers,
Sverre Rabbelier
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 18:35 ` Sverre Rabbelier
@ 2010-10-18 19:33 ` Jonathan Nieder
0 siblings, 0 replies; 52+ messages in thread
From: Jonathan Nieder @ 2010-10-18 19:33 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Sverre Rabbelier wrote:
> On Mon, Oct 18, 2010 at 13:25, Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Good question. Ram, I think there was some discussion of this
>> recently in connection with svnrdump, right? IIRC the suggested
>> method was to use hooks or mine a commits@ mailing list. :(
>
> Hmmm, in that case perhaps we should instead just ignore changed
> history?
Yeah. It's unpleasant to imagine that
    git clone svn://whatever
    ... sneakily change svn repo ...
    ... add some new revs on top ...
    cd whatever && git fetch origin
would produce an origin/trunk that does not match any clone of the
svn repo at all, but in practice it is not so different from coping
with any other upstream that is incurably willing to rewrite history.
Example: downstream tracking an unstable branch
-----------------------------------------------
Suppose I maintain a patchset in the long term, based, for whatever
reason, on git's "next" branch. Occasionally there is a need to
merge from upstream. What can one do?
Simple use of "git merge" produces history that is difficult to
follow. Time flowing from left to right, "u" denotes upstream
commits:
    u --- u --- u [next-2005-01-03]
    |\           \
    | \           A --- o - o ----- B
     \ \                   /       /
      \ u --- u --- u --- u [next-2006-03-27]
       \                         /
        u --- u --- u --- u --- u [next-2009-11-27]
If a person wants to find what changed downstream between A and B,
a simple "git log A..B ^origin/next" will unfortunately include the
commits from next-2006-03-27 as well.
One option is to rebase whenever upstream does, but that is
dangerous because it prevents users from tracking changes in the
project long-term.
Another option is to use a "rebasing merge" [1]. The history can be
followed without too much trouble if you set up "git log" commands
appropriately. Naïve use of "git log" will list (and git will store)
multiple copies of every commit, though.
And lastly, one can say "screw upstream" and produce a long-term
"next" branch to build on. :) Like this:
1. git branch long-term-next next-2005-01-03
2. When "next" is rebased to clean out cruft, advance long-term-next
   to the pre-rebase state. Luckily such rebases leave a "before" in
   long-term-next and "after" in next with identical content. Add a
   replace ref to make history easy to follow.
       git diff <after> <before>; # confirm that they really match
       git replace <after> <before>
3. To advance long-term-next, rewrite commits from upstream.
       git checkout origin/next
       git filter-branch HEAD
       git diff origin/next; # should match
       git push . HEAD:long-term-next; # should be fast-forward
4. Only merge long-term-next into downstream branches.
5. Publish the latest replace ref so others can follow the history
   easily.
[1] http://thread.gmane.org/gmane.comp.version-control.msysgit/10264
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 18:20 ` Sverre Rabbelier
2010-10-18 18:25 ` Jonathan Nieder
@ 2010-10-19 0:40 ` Stephen Bash
1 sibling, 0 replies; 52+ messages in thread
From: Stephen Bash @ 2010-10-19 0:40 UTC (permalink / raw)
To: Sverre Rabbelier
Cc: Ramkumar Ramachandra, Matt Stump, git, David Michael Barr,
Tomas Carnecky, Jonathan Nieder
----- Original Message -----
> From: "Sverre Rabbelier" <srabbelier@gmail.com>
> To: "Jonathan Nieder" <jrnieder@gmail.com>
> Sent: Monday, October 18, 2010 2:20:43 PM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> On Mon, Oct 18, 2010 at 13:13, Jonathan Nieder <jrnieder@gmail.com>
> wrote:
> > Log messages could be an annoying special case, though, since people
> > edit those a lot. Does svn store the original log message somewhere?
> > (Please forgive my ignorance). If not, I suppose downstream can
> > publish refs produced by "git replace" to cope.
>
> From what I've heard basically all meta-data about a commit (including
> author and date!) is mutable.
The default repository configuration does not allow changes to revision properties. But if the repository administrator sets up a pre-revprop-change hook script that exits zero then users with commit access are allowed to modify revision properties. In the general case, it's probably best to assume all properties are mutable.
http://svnbook.red-bean.com/en/1.5/svn.ref.reposhooks.pre-revprop-change.html
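As a rough illustration of the hook mechanism, here is a sketch of a pre-revprop-change hook written in Python. It assumes the argument order documented for Subversion 1.5 and later (REPOS-PATH, REV, USER, PROPNAME, ACTION) and a policy that permits only modifications of svn:log; that policy is an assumption chosen for the example, not something stated in this thread. An administrator who wanted every property to be mutable would simply return 0 unconditionally.

```python
import sys

# Hypothetical policy: only log-message modifications are allowed.
# ACTION is "A" (added), "M" (modified) or "D" (deleted).
ALLOWED = {("svn:log", "M")}


def allow_revprop_change(propname: str, action: str) -> bool:
    """Decide whether this revision-property change may go through."""
    return (propname, action) in ALLOWED


def main(argv) -> int:
    """Return the hook's exit status: 0 permits the change, non-zero refuses it."""
    # argv: [hook-path, REPOS-PATH, REV, USER, PROPNAME, ACTION]
    if len(argv) < 6:
        return 1
    _repos, _rev, _user, propname, action = argv[1:6]
    return 0 if allow_revprop_change(propname, action) else 1


# When installed as hooks/pre-revprop-change, the script would end with:
#     sys.exit(main(sys.argv))
```

Whatever the policy, a converter cannot tell from the outside which properties a given server lets users change, which is why assuming everything is mutable is the safe default.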
Thanks,
Stephen
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 5:17 ` Ramkumar Ramachandra
2010-10-18 7:31 ` Jonathan Nieder
@ 2010-10-19 1:42 ` Stephen Bash
2010-10-19 6:42 ` Ramkumar Ramachandra
1 sibling, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-19 1:42 UTC (permalink / raw)
To: Ramkumar Ramachandra
Cc: Matt Stump, git, Jonathan Nieder, David Michael Barr,
Sverre Rabbelier, Tomas Carnecky
----- Original Message -----
> From: "Ramkumar Ramachandra" <artagnon@gmail.com>
> To: "Stephen Bash" <bash@genarts.com>
> Sent: Monday, October 18, 2010 1:17:05 AM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> [sorry about the delayed reply; was ill]
No problem! It's taken me more than 12 hours to actually compose a response (literally, I hit "Reply All" over 12 hours ago!), I don't think I can complain :)
> Stephen Bash writes:
> > Converting to Git using svn-fe
> > ------------------------------
> > I was
> > pointed to David Barr's svn-dump-fast-export tool:
> > http://github.com/barrbrain/svn-dump-fast-export
>
> So you used the version that supports dumpfile v2 that's merged into
> git.git `master`.
Yes, thanks for the clarification.
> > Extracting SVN's History
> > ------------------------
> > First we want to understand SVN's branching/tagging history. Modify
> > buildSVNTree.pl as necessary, then run
> > perl buildSVNTree.pl > svnBranches.txt
>
> > ...
>
> Unnecessary
I'm going to collapse all these comments because I think we're coming at this from different angles. I agree, discovering the copies in git is "easy" (albeit an n^2 operation), and git will correctly identify file content. But when I was asked to preserve the SVN history, I decided to extract a DAG from SVN and migrate that DAG to Git. Thus the history itself is preserved (sans merges), not just the contents of the files. This is the purpose of buildSVNTree. I can elaborate further if requested.
> > There's also some logic in buildSVNTree to determine if a branch/tag
> > is deleted in the SVN head. That information is used by
> > hideFromGit.
>
> It'll be in the revision history in Git anyway- it doesn't require
> special handling.
See below.
> > Ah, I should probably mention: svn-fe can produce "empty"
> > commits, and filterBranch does nothing to remove them. By "empty" I
> > mean there will be a commit object without any content changes. So
> > creating a branch/tag in SVN creates a commit, but doesn't change
> > content. That commit will be part of the new Git history.
> > Similarly, filterBranch will create git tags from svn tags, but they
> > point to one of these "empty" commits rather than the branch they
> > are tagged from. It's not very git-ish, but it seems to work...
>
> Oh, I didn't realize that fast-import allows the creation of empty
> commits. We should probably fix this?
To be precise: svn-fe creates commits where
git diff-tree treeA treeB
is empty with treeA being the tree object of /trunk/project and treeB being the tree of /branches/foo/project. This version of my tools does not squash these commits, a future version probably will (this may cause problems with two-way communication?).
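To make the notion of an "empty" commit concrete, here is a small sketch that flags commits whose tree is identical to their parent's tree. It assumes the history has already been reduced to (commit id, parent id, tree id) triples, for instance from something like `git log --format='%H %P %T'` on a single-parent history; the function name and the tuple layout are hypothetical, and this is not how filterBranch works.

```python
from typing import List, Optional, Tuple

Commit = Tuple[str, Optional[str], str]  # (commit id, parent id, tree id)


def empty_commits(history: List[Commit]) -> List[str]:
    """Return ids of commits whose tree equals their parent's tree."""
    tree_of = {cid: tree for cid, _parent, tree in history}
    return [
        cid
        for cid, parent, tree in history
        if parent is not None and tree_of.get(parent) == tree
    ]


# c3 stands in for an "svn cp trunk branches/foo" revision: a new
# commit object whose content is unchanged from c2.
history = [
    ("c1", None, "t1"),
    ("c2", "c1", "t2"),
    ("c3", "c2", "t2"),
    ("c4", "c3", "t3"),
]
print(empty_commits(history))
```

Whether such commits should be squashed is a separate question: dropping them loses the SVN revision that created the branch or tag.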
> > filterBranch is probably the longest step of the process; there's a
> > lot of filtering going on. It will be very verbose on STDOUT, so I
> > recommend tee'ing to a file or a terminal with infinite scroll back.
> > It also involves a lot of disk hits (somewhat reduced if $tempdir is
> > a RAM disk), and potentially a lot of space (it will create a git
> > repo for every branch/tag in your subversion history). For our
> > repository this step took about 1.5-2 hours IIRC.
>
> Wow, this is really brute-force.
Yes it is. If I get around to writing a new version, I'll at least advance to a single pass using commit-tree. Beyond that I'm probably into the fast-import code, which I'll happily leave to the rest of you :)
> > Note that SVN rev to Git commit can be one to many!
>
> Unless there's a one-to-one mapping between Git revisions and SVN
> revisions, a two-way bridge will become very difficult to build. Can
> you think of any scenarios where a one-to-one mapping doesn't make
> sense?
I have 32 SVN revs in my history that touch multiple Git commit objects. The simplest example is
svn mv svn://svnrepo/branches/badBranchName svn://svnrepo/branches/goodBranchName
which creates a single SVN commit that touches two branches (badBranchName will have all its contents deleted, goodBranchName will have an "empty commit" as described above). The more devious version is the SVN rev where a developer checked out / (yes, I'm not kidding) and proceeded to modify a single file on all branches in one commit. In our case, that one SVN rev touches 23 git commit objects. And while the latter is somewhat a corner case, the former is common and probably needs to be dealt with appropriately (it's kind of a stupid operation in Git-land, so maybe it can just be squashed).
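The one-to-many relationship can be sketched by grouping the paths changed in one SVN revision by the branch they belong to. This assumes the standard trunk/branches/tags layout; the helper names are hypothetical and the path lists are made up to mirror the two cases above.

```python
from typing import Dict, List, Optional


def branch_of(path: str) -> Optional[str]:
    """Map an SVN path to its branch under the standard layout."""
    parts = path.strip("/").split("/")
    if parts[0] == "trunk":
        return "trunk"
    if parts[0] in ("branches", "tags") and len(parts) >= 2:
        return parts[0] + "/" + parts[1]
    return None  # outside the standard layout


def branches_touched(changed_paths: List[str]) -> Dict[str, List[str]]:
    """Group one revision's changed paths by branch.

    Each key would need its own Git commit on a separate ref.
    """
    grouped: Dict[str, List[str]] = {}
    for path in changed_paths:
        branch = branch_of(path)
        if branch is not None:
            grouped.setdefault(branch, []).append(path)
    return grouped


# A branch rename: one SVN revision, two branches affected.
rename = ["/branches/badBranchName", "/branches/goodBranchName"]
print(sorted(branches_touched(rename)))

# A commit made from a checkout of /: one file changed everywhere.
everywhere = ["/trunk/a.c", "/branches/foo/a.c", "/tags/v1/a.c"]
print(len(branches_touched(everywhere)))
```

Any revision for which this grouping has more than one key cannot map to a single Git commit on a single ref.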
> Grafts and filter-branch. db-svn-filter-root does this more elegantly.
I found a 'db-svn-filter-root' branch, but it was not entirely obvious to me what code I should be looking at...
> > Hiding 'Deleted' Branches
> > -------------------------
>
> Hm. You didn't include the history of deleted branches in the main
> repository. Why?
The commit objects are still there, I simply moved the refs to refs/hidden/{heads,tags}. Because my goal was to maintain the full SVN history I needed to somehow protect the objects from garbage collection. At the time I didn't know about "git merge -s ours", so this strategy achieved my goal of protecting the objects. In this case, the refs are not cloned, but are fetch-able, so I found it to be a reasonable solution.
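A sketch of the ref move, assuming hideFromGit does something equivalent to a pair of `git update-ref` calls per deleted branch or tag (the actual script may differ). The helper only builds the command strings; its name is hypothetical.

```python
from typing import List

# Visible namespace -> hidden namespace, as described above.
PREFIXES = {
    "refs/heads/": "refs/hidden/heads/",
    "refs/tags/": "refs/hidden/tags/",
}


def hide_ref_commands(ref: str, sha: str) -> List[str]:
    """Return git commands that move `ref` into refs/hidden/."""
    for visible, hidden in PREFIXES.items():
        if ref.startswith(visible):
            new_ref = hidden + ref[len(visible):]
            return [
                # Create the hidden ref first so the objects are
                # never left unreferenced.
                "git update-ref {} {}".format(new_ref, sha),
                "git update-ref -d {}".format(ref),
            ]
    raise ValueError("not a branch or tag ref: " + ref)


for command in hide_ref_commands("refs/heads/deadBranch", "1234abcd"):
    print(command)
```

Because a default clone only copies refs/heads and refs/tags, the hidden refs stay out of the way, but anyone who wants them can still fetch them with an explicit refspec such as 'refs/hidden/*:refs/hidden/*'.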
> Does it make sense to provide the user an option to
> exclude some (deleted) branches in the SVN history? It'll make the
> two-way mapping extremely difficult.
I think there are cases where a user could say "I don't care about dead development branches". In my current system, all branches, even those that do not contribute back to the trunk are saved in the hidden namespace. But I could see users that don't care about some or all extraneous branches and would be happy to not convert them or to let them be garbage collected.
> Thanks for the interesting and insightful read :)
I'm glad it's stimulating conversation. I'm beginning to wonder if there might be competing design goals for one-way vs. two-way compatibility... Performance is one place where opinions probably greatly differ (I didn't mind taking an extra 30 minutes to mirror my SVN repo because it probably saved more than that in communication overhead later in the process, but that mirror operation is very taxing on your timeline); my exhaustive search of all SVN copies is another (I wanted to be *extremely* certain I knew about all the misplaced branches/tags, but it's inefficient for a casual developer who just wants to interact with an SVN server). It's all just food for thought, and I'm happy to carry on the conversation from my different point-of-view :)
Thanks,
Stephen
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-18 18:25 ` Jonathan Nieder
2010-10-18 18:35 ` Sverre Rabbelier
@ 2010-10-19 3:08 ` Ramkumar Ramachandra
1 sibling, 0 replies; 52+ messages in thread
From: Ramkumar Ramachandra @ 2010-10-19 3:08 UTC (permalink / raw)
To: Jonathan Nieder
Cc: Sverre Rabbelier, Stephen Bash, Matt Stump, git,
David Michael Barr, Tomas Carnecky
Jonathan Nieder writes:
> Sverre Rabbelier wrote:
>
> > Does the replay API somehow indicate that a revision changed since
> > last time you looked?
>
> Good question. Ram, I think there was some discussion of this
> recently in connection with svnrdump, right? IIRC the suggested
> method was to use hooks or mine a commits@ mailing list. :(
Yep. There's really no way to determine if a revision changed. At least
we can be happy that it's just the revprops that change- replace refs
are a great solution when we can tell if something changed. Frankly, I
haven't thought about how to solve this yet -- I'll comment after
looking at the later emails in the thread.
-- Ram
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-19 1:42 ` Stephen Bash
@ 2010-10-19 6:42 ` Ramkumar Ramachandra
2010-10-19 13:33 ` Stephen Bash
2010-10-20 8:39 ` Will Palmer
0 siblings, 2 replies; 52+ messages in thread
From: Ramkumar Ramachandra @ 2010-10-19 6:42 UTC (permalink / raw)
To: Stephen Bash
Cc: Matt Stump, git, Jonathan Nieder, David Michael Barr,
Sverre Rabbelier, Tomas Carnecky
Hi Stephen,
Stephen Bash writes:
> > From: "Ramkumar Ramachandra" <artagnon@gmail.com>
> > Stephen Bash writes:
> > > Extracting SVN's History
> > > ------------------------
> > > First we want to understand SVN's branching/tagging history. Modify
> > > buildSVNTree.pl as necessary, then run
> > > perl buildSVNTree.pl > svnBranches.txt
> >
> > > ...
> >
> > Unnecessary
>
> I'm going to collapse all these comments because I think we're
> coming at this from different angles. I agree, discovering the
> copies in git is "easy" (albeit an n^2 operation), and git will
> correctly identify file content. But when I was asked to preserve
> the SVN history, I decided to extract a DAG from SVN and migrate
> that DAG to Git. Thus the history itself is preserved (sans
> merges), not just the contents of the files. This is the purpose of
> buildSVNTree. I can elaborate further if requested.
Yep, they're certainly two different ways to approach the problem: I'd
be interested in investigating why it will produce different
results. Since we both agree that it's easier (and faster) to do it in
Git-land, I'm looking into the areas where it falls short.
Yes, I understand your script (although I can't actually read Perl
:p), but the differences are still not very clear to me.
> > > Ah, I should probably mention: svn-fe can produce "empty"
> > > commits, and filterBranch does nothing to remove them. By "empty" I
> > > mean there will be a commit object without any content changes. So
> > > creating a branch/tag in SVN creates a commit, but doesn't change
> > > content. That commit will be part of the new Git history.
> > > Similarly, filterBranch will create git tags from svn tags, but they
> > > point to one of these "empty" commits rather than the branch they
> > > are tagged from. It's not very git-ish, but it seems to work...
> >
> > Oh, I didn't realize that fast-import allows the creation of empty
> > commits. We should probably fix this?
>
> To be precise: svn-fe creates commits where
> git diff-tree treeA treeB
> is empty with treeA being the tree object of /trunk/project and
> treeB being the tree of /branches/foo/project. This version of my
> tools does not squash these commits, a future version probably will
> (this may cause problems with two-way communication?).
Right, that IS expected behavior. Don't they correspond to separate
SVN revisions anyway? Why would you want to squash them?
[Ignore this; see later in the email]
> > > filterBranch is probably the longest step of the process; there's a
> > > lot of filtering going on. It will be very verbose on STDOUT, so I
> > > recommend tee'ing to a file or a terminal with infinite scroll back.
> > > It also involves a lot of disk hits (somewhat reduced if $tempdir is
> > > a RAM disk), and potentially a lot of space (it will create a git
> > > repo for every branch/tag in your subversion history). For our
> > > repository this step took about 1.5-2 hours IIRC.
> >
> > > Wow, this is really brute-force.
>
> Yes it is. If I get around to writing a new version, I'll at least
> advance to a single pass using commit-tree. Beyond that I'm
> probably into the fast-import code, which I'll happily leave to the
> rest of you :)
*nod*
> > > Note that SVN rev to Git commit can be one to many!
> >
> > Unless there's a one-to-one mapping between Git revisions and SVN
> > revisions, a two-way bridge will become very difficult to build. Can
> > you think of any scenarios where a one-to-one mapping doesn't make
> > sense?
>
> I have 32 SVN revs in my history that touch multiple Git commit
> objects. The simplest example is
> svn mv svn://svnrepo/branches/badBranchName svn://svnrepo/branches/goodBranchName
> which creates a single SVN commit that touches two branches
> (badBranchName will have all its contents deleted, goodBranchName
> will have an "empty commit" as described above). The more devious
> version is the SVN rev where a developer checked out / (yes, I'm not
> kidding) and proceeded to modify a single file on all branches in
> one commit. In our case, that one SVN rev touches 23 git commit
> objects. And while the latter is somewhat a corner case, the former
> is common and probably needs to be dealt with appropriately (it's
> kind of a stupid operation in Git-land, so maybe it can just be
> squashed).
Ouch! Thanks for the illustrative example- I understand now. We have
to bend over backwards to perform a one-to-one mapping. It's finally struck
me- one-to-one mapping is nearly impossible to achieve, and I don't
know if it makes sense to strive for it anymore. Looks like Jonathan
got it earlier.
> > Grafts and filter-branch. db-svn-filter-root does this more elegantly.
>
> I found a 'db-svn-filter-root' branch, but it was not entirely
> obvious to me what code I should be looking at...
Um, there's just one commit that deviates from the branch it's based
on (but you don't know that, and I should have been clearer): look at
contrib/svn-fe/svn-filter-root.py
It's just a minimalistic mapper, but it's fast and done nicely. You
can use ideas from it when you're building yours.
> > > Hiding 'Deleted' Branches
> > > -------------------------
> >
> > Hm. You didn't include the history of deleted branches in the main
> > repository. Why?
>
> The commit objects are still there, I simply moved the refs to
> refs/hidden/{heads,tags}. Because my goal was to maintain the full
> SVN history I needed to somehow protect the objects from garbage
> collection. At the time I didn't know about "git merge -s ours", so
> this strategy achieved my goal of protecting the objects. In this
> case, the refs are not cloned, but are fetch-able, so I found it to
> be a reasonable solution.
Oh.
> > Does it make sense to provide the user an option to
> > exclude some (deleted) branches in the SVN history? It'll make the
> > two-way mapping extremely difficult.
>
> I think there are cases where a user could say "I don't care about
> dead development branches". In my current system, all branches,
> even those that do not contribute back to the trunk are saved in the
> hidden namespace. But I could see users that don't care about some
> or all extraneous branches and would be happy to not convert them or
> to let them be garbage collected.
When I made this comment, I was thinking of the one-to-one mapping. It
makes much more sense now.
> > Thanks for the interesting and insightful read :)
>
> I'm glad it's stimulating conversation. I'm beginning to wonder if
> there might be competing design goals for one-way vs. two-way
> compatibility... Performance is one place where opinions probably
> greatly differ (I didn't mind taking an extra 30 minutes to mirror
> my SVN repo because it probably saved more than that in
> communication overhead later in the process, but that mirror
> operation is very taxing on your timeline); my exhaustive search of
> all SVN copies is another (I wanted to be *extremely* certain I knew
> about all the misplaced branches/tags, but it's inefficient for a
> casual developer who just wants to interact with an SVN server).
> It's all just food for thought, and I'm happy to carry on the
> conversation from my different point-of-view :)
Ok, I still don't get this part- why mirror at all? Can't all the
information be mined out of the in-memory tree that svn-fe builds
while parsing the dumpfile? From the SVN-side, all that's required is
a streaming dumpfile like the one that `svnrdump dump` produces.
-- Ram
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-19 6:42 ` Ramkumar Ramachandra
@ 2010-10-19 13:33 ` Stephen Bash
2010-10-19 14:28 ` David Michael Barr
2010-10-20 8:39 ` Will Palmer
1 sibling, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-19 13:33 UTC (permalink / raw)
To: Ramkumar Ramachandra
Cc: Matt Stump, git, Jonathan Nieder, David Michael Barr,
Sverre Rabbelier, Tomas Carnecky
----- Original Message -----
> From: "Ramkumar Ramachandra" <artagnon@gmail.com>
> To: "Stephen Bash" <bash@genarts.com>
> Sent: Tuesday, October 19, 2010 2:42:15 AM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> Stephen Bash writes:
> > I'm going to collapse all these comments because I think we're
> > coming at this from different angles. I agree, discovering the
> > copies in git is "easy" (albeit an n^2 operation), and git will
> > correctly identify file content. But when I was asked to preserve
> > the SVN history, I decided to extract a DAG from SVN and migrate
> > that DAG to Git. Thus the history itself is preserved (sans
> > merges), not just the contents of the files. This is the purpose of
> > buildSVNTree. I can elaborate further if requested.
>
> Yep, they're certainly two different ways to approach the problem: I'd
> be interested in investigating why it will produce different
> results. Since we both agree that it's easier (and faster) to do it in
> Git-land, I'm looking into the areas where it falls short.
Ack! I left my example at home this morning... I'll explain it here, but perhaps I can actually send out a test script tonight or tomorrow (if there's need). The basic premise is git's copy detection finds files with the same content, not necessarily the source of an SVN copy.
It's also possible you can do this in svn-fe or in fast-import -- there may be more information there. I was looking strictly pre-svn-fe or post-fast-import...
Here's how I created a discrepancy between SVN and Git:
1) Create a new svn repo
2) Create the standard layout (trunk, branches, tags)
3) Create multiple files on the trunk
4) Create a branch (svn cp trunk branches/branchName)
5) Edit a file on the branch (leave some of the others alone)
6) (optional) edit a file on the trunk
7) Merge the branch back to the trunk
8) Create a tag from the trunk (svn cp trunk tags/tagName)
9) git fast-import the repo
Now "svn log -v svn://svnrepo/tags/tagName" will show something like
A /tags/tagName (from /trunk:rev)
OTOH "git log --name-status --find-copies-harder" will show something like
C100 /tags/tagName/foo (from /trunk/foo)
C100 /tags/tagName/bar (from /branches/branchName/bar)
C100 /tags/tagName/baz (from /trunk/baz)
assuming bar is the file edited on the branch and then merged back to the trunk (this is all from memory, so please forgive me if the output isn't quite right). I think from Git's point-of-view, this copy information is correct, but it doesn't describe SVN's history -- and I'm not entirely sure how a Git-only solution could identify precisely what's going on there... (hopefully I'm just being naive)
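To make the ambiguity concrete, here is a toy Python model. It is not git's actual copy-detection code; the tree contents and the helper names blob_id and find_copy_sources are made up for illustration. It identifies files purely by content, which is the premise above, and shows that the tag's copy of bar has two equally good sources:

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Content-only identifier, standing in for a Git blob hash."""
    return hashlib.sha1(content).hexdigest()


def find_copy_sources(tree: dict, new_path: str) -> list:
    """Return every other path whose content is identical to new_path's."""
    target = blob_id(tree[new_path])
    return sorted(
        path for path, data in tree.items()
        if path != new_path and blob_id(data) == target
    )


# Hypothetical state of the single imported tree right after step 8.
tree = {
    "trunk/foo": b"foo v1\n",
    "trunk/bar": b"bar edited on branch\n",  # merged back in step 7
    "branches/branchName/foo": b"foo v1\n",
    "branches/branchName/bar": b"bar edited on branch\n",
    "tags/tagName/foo": b"foo v1\n",
    "tags/tagName/bar": b"bar edited on branch\n",
}

# Both the trunk and the branch hold byte-identical content, so content
# alone cannot say which one the SVN copy actually came from.
candidates = find_copy_sources(tree, "tags/tagName/bar")
print(candidates)
```

SVN recorded a single directory copy from /trunk, but a content-based search has no reason to prefer trunk/bar over branches/branchName/bar.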
> > I found a 'db-svn-filter-root' branch, but it was not entirely
> > obvious to me what code I should be looking at...
>
> Um, there's just one commit that deviates from the branch it's based
> on (but you don't know that, and I should have been clearer): look at
> contrib/svn-fe/svn-filter-root.py
>
> It's just a minimalistic mapper, but it's fast and done nicely. You
> can use ideas from it when you're building yours.
Okay, David pointed me to that earlier, but I haven't dug into it yet. I'll take a look.
> > I'm glad it's stimulating conversation. I'm beginning to wonder if
> > there might be competing design goals for one-way vs. two-way
> > compatibility... Performance is one place where opinions probably
> > greatly differ (I didn't mind taking an extra 30 minutes to mirror
> > my SVN repo because it probably saved more than that in
> > communication overhead later in the process, but that mirror
> > operation is very taxing on your timeline); my exhaustive search of
> > all SVN copies is another (I wanted to be *extremely* certain I knew
> > about all the misplaced branches/tags, but it's inefficient for a
> > casual developer who just wants to interact with an SVN server).
> > It's all just food for thought, and I'm happy to carry on the
> > conversation from my different point-of-view :)
>
> Ok, I still don't get this part- why mirror at all? Can't all the
> information be mined out of the in-memory tree that svn-fe builds
> while parsing the dumpfile? From the SVN-side, all that's required is
> a streaming dumpfile like the one that `svnrdump dump` produces.
Oh, from that point of view the svn mirror is a bystander. I was developing these tools at the same time as svnrdump (or at least prior to a stable version of svnrdump). So when I found that running "svnadmin dump | svn-fe | git fast-import" on the server was taxing the system, I decided it was better to create a dump file, copy it to my local machine, and run svn-fe and fast-import locally. Once I had the dump file, the local mirror sped up the SVN::Ra calls in buildSVNTree, and made any "did that really happen in svn?!" questions a little easier to answer.
Thanks,
Stephen
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-19 13:33 ` Stephen Bash
@ 2010-10-19 14:28 ` David Michael Barr
2010-10-19 14:57 ` Stephen Bash
0 siblings, 1 reply; 52+ messages in thread
From: David Michael Barr @ 2010-10-19 14:28 UTC (permalink / raw)
To: Stephen Bash
Cc: Ramkumar Ramachandra, Matt Stump, git, Jonathan Nieder,
Sverre Rabbelier, Tomas Carnecky
Hi,
> Oh, from that point of view the svn mirror is a bystander. I was developing these tools at the same time as svnrdump (or at least prior to a stable version of svnrdump). So when I found that running "svnadmin dump | svn-fe | git fast-import" on the server was taxing the system, I decided it was better to create a dump file, copy it to my local machine, and run svn-fe and fast-import locally. Once I had the dump file, the local mirror sped up the SVN::Ra calls in buildSVNTree, and made any "did that really happen in svn?!" questions a little easier to answer.
So, I think there are two valuable nuggets per commit that svn-fe omits at the moment.
Firstly, the longest common root between all paths in the commit, which can be computed efficiently.
Secondly, the copyfrom_rev and copyfrom_path for the copy operation that targets the common root.
The second nugget can be noted while computing the first.
From my reading of buildSVNTree.pl, these two nuggets drive the mapping logic.
The first nugget can be computed in git-land fairly easily.
The second requires information not embedded in the git commit graph.
I suggest that svn-fe be extended to annotate the commits with this information.
Implementation-wise, the revision context should be extended to include:
* longest common path
* source revision of copy operation targeting longest common path
* source path of copy operation targeting longest common path
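As a rough sketch of what that extended revision context could look like (in Python rather than svn-fe's C, and with made-up names: longest_common_path, revision_context and the node dictionaries are illustrative assumptions, not svn-fe's data structures):

```python
def longest_common_path(paths):
    """Deepest path that is an ancestor of (or equal to) every given path."""
    split = [p.strip("/").split("/") for p in paths]
    common = []
    # Compare component by component so /trunk2 never matches /trunk.
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        common.append(parts[0])
    return "/" + "/".join(common)


def revision_context(nodes):
    """Summarise one revision: its common root and any copy targeting it."""
    root = longest_common_path([n["path"] for n in nodes])
    context = {"common_path": root, "copyfrom_rev": None, "copyfrom_path": None}
    for node in nodes:
        is_root = "/" + node["path"].strip("/") == root
        if is_root and "copyfrom_path" in node:
            context["copyfrom_rev"] = node["copyfrom_rev"]
            context["copyfrom_path"] = node["copyfrom_path"]
    return context


# A tag created with "svn cp trunk tags/tagName" (revision number invented).
tag = revision_context(
    [{"path": "/tags/tagName", "copyfrom_path": "/trunk", "copyfrom_rev": 41}]
)
# An ordinary edit of two files on trunk: no copy targets the common root.
edit = revision_context([{"path": "/trunk/foo"}, {"path": "/trunk/bar"}])
print(tag)
print(edit)
```

The copy information is picked up in the same pass over the nodes that yields the common root, which is the "noted while computing the first" point above.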
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-19 14:28 ` David Michael Barr
@ 2010-10-19 14:57 ` Stephen Bash
0 siblings, 0 replies; 52+ messages in thread
From: Stephen Bash @ 2010-10-19 14:57 UTC (permalink / raw)
To: David Michael Barr
Cc: Ramkumar Ramachandra, Matt Stump, git, Jonathan Nieder,
Sverre Rabbelier, Tomas Carnecky
----- Original Message -----
> From: "David Michael Barr" <david.barr@cordelta.com>
> To: "Stephen Bash" <bash@genarts.com>
> Sent: Tuesday, October 19, 2010 10:28:03 AM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> So, I think there's two valuable nuggets per commit omitted at the
> moment in svn-fe.
> Firstly, the longest common root between all paths in the commit,
> which can be computed efficiently.
> Secondly, the copyfrom_rev and copyfrom_path for the copy operation
> that targets the common root.
> The second nugget can be noted while computing the first.
> From my reading of buildSVNTree.pl, these two nuggets drive the
> mapping logic.
Yep, they're the triggers, then the heuristics just filter out the noise SVN encourages because of light copies (or cruft from cvs2svn).
Just watch out for svn mv operations. They produce a single commit with an Add (with copyfrom_* set) and a Delete. So in the /project -> /trunk/project case, your common path is /. I didn't have that case, but I did have a /trunk cp-> /branches/tagName (oops!) mv-> /tags/tagName and a /trunk cp-> /branchName (oops!) mv-> /branches/branchName. (Honestly, I much preferred the cases where the user deleted the wrong location and then created a new copy in the right place; there are a ton of those, and for them I didn't bother to capture the misstep in the middle.)
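A small Python sketch of the svn mv gotcha, assuming each revision is available as a list of node records with action, path and optional copyfrom_path (those field names and find_moves are my own invention for illustration, and the revision number is made up):

```python
import posixpath


def find_moves(nodes):
    """Pair each Add-with-copyfrom with a Delete of its source path."""
    deleted = {n["path"] for n in nodes if n["action"] == "delete"}
    return [
        (n["copyfrom_path"], n["path"])
        for n in nodes
        if n["action"] == "add" and n.get("copyfrom_path") in deleted
    ]


# "svn mv /branches/tagName /tags/tagName" as it shows up in one revision.
rename = [
    {"action": "add", "path": "/tags/tagName",
     "copyfrom_path": "/branches/tagName", "copyfrom_rev": 57},
    {"action": "delete", "path": "/branches/tagName"},
]

moves = find_moves(rename)
# The Add and the Delete sit under different top-level directories, so the
# common path of the revision collapses all the way to the repository root.
common = posixpath.commonpath([n["path"] for n in rename])
print(moves, common)
```

So a mapper keyed only on "copy that targets the common path" would miss this revision, because nothing is copied onto /.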
Thanks,
Stephen
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-19 6:42 ` Ramkumar Ramachandra
2010-10-19 13:33 ` Stephen Bash
@ 2010-10-20 8:39 ` Will Palmer
2010-10-20 11:59 ` Jakub Narebski
` (2 more replies)
1 sibling, 3 replies; 52+ messages in thread
From: Will Palmer @ 2010-10-20 8:39 UTC (permalink / raw)
To: Ramkumar Ramachandra
Cc: Stephen Bash, Matt Stump, git, Jonathan Nieder,
David Michael Barr, Sverre Rabbelier, Tomas Carnecky
On Tue, 2010-10-19 at 12:12 +0530, Ramkumar Ramachandra wrote:
> Hi Stephen,
>
> Stephen Bash writes:
...
> >
> > I have 32 SVN revs in my history that touch multiple Git commit
> > objects. The simplest example is
> > svn mv svn://svnrepo/branches/badBranchName svn://svnrepo/branches/goodBranchName
> > which creates a single SVN commit that touches two branches
> > (badBranchName will have all its contents deleted, goodBranchName
> > will have an "empty commit" as described above). The more devious
> > version is the SVN rev where a developer checked out / (yes, I'm not
> > kidding) and proceeded to modify a single file on all branches in
> > one commit. In our case, that one SVN rev touches 23 git commit
> > objects. And while the latter is somewhat a corner case, the former
> > is common and probably needs to be dealt with appropriately (it's
> > kind of a stupid operation in Git-land, so maybe it can just be
> > squashed).
>
> Ouch! Thanks for the illustrative example- I understand now. We have
> to bend backwards to perform a one-to-one mapping. It's finally struck
> me- one-to-one mapping is nearly impossible to achieve, and I don't
> know if it makes sense to strive for it anymore. Looks like Jonathan
> got it earlier.
It's been a while since I was involved in this discussion, so maybe the
design has changed by now, but I was under the impression that there
would be one "one-to-one" mapping branch (which would never be checked
out), containing the history of /, and that the "real" git branches,
tags, etc, would be based on the trees originally referenced by the root
checkout, with git-notes (or similar) being used to track the weirdness
in mappings. How does the "multiple branches touched in a single commit"
complicate anything other than the heuristics for automatic branch
detection (which I assume nobody is at the stage of talking about yet).
I suppose we wouldn't be talking, technically, about a one-to-one
mapping in that case, as we would be turning "one" svn revision into
"many" git branches, but in the conceptual sense of "one svn repository
equals one git repository", I don't see this as being impossible, or so
difficult that it shouldn't be striven-for.
Something else which is at least semi-common in svn is to treat a folder
both as a "directory" and a "branch", which the "checking out /" example
would just be an extreme example of. Think in terms of git branches
being a "view" of the history, with some mapper sitting between each
view and "root" checkout.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-20 8:39 ` Will Palmer
@ 2010-10-20 11:59 ` Jakub Narebski
2010-10-20 13:42 ` Will Palmer
2010-10-20 14:05 ` Ramkumar Ramachandra
2010-10-20 14:21 ` Stephen Bash
2 siblings, 1 reply; 52+ messages in thread
From: Jakub Narebski @ 2010-10-20 11:59 UTC (permalink / raw)
To: Will Palmer
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
Will Palmer <wmpalmer@gmail.com> writes:
> On Tue, 2010-10-19 at 12:12 +0530, Ramkumar Ramachandra wrote:
> > Stephen Bash writes:
> ...
> > >
> > > I have 32 SVN revs in my history that touch multiple Git commit
> > > objects. The simplest example is
> > > svn mv svn://svnrepo/branches/badBranchName svn://svnrepo/branches/goodBranchName
> > > which creates a single SVN commit that touches two branches
> > > (badBranchName will have all its contents deleted, goodBranchName
> > > will have an "empty commit" as described above). The more devious
> > > version is the SVN rev where a developer checked out / (yes, I'm not
> > > kidding) and proceeded to modify a single file on all branches in
> > > one commit. In our case, that one SVN rev touches 23 git commit
> > > objects. And while the latter is somewhat a corner case, the former
> > > is common and probably needs to be dealt with appropriately (it's
> > > kind of a stupid operation in Git-land, so maybe it can just be
> > > squashed).
> >
> > Ouch! Thanks for the illustrative example- I understand now. We have
> > to bend backwards to perform a one-to-one mapping. It's finally struck
> > me- one-to-one mapping is nearly impossible to achieve, and I don't
> > know if it makes sense to strive for it anymore. Looks like Jonathan
> > got it earlier.
>
> It's been a while since I was involved in this discussion, so maybe the
> design has changed by now, but I was under the impression that there
> would be one "one-to-one" mapping branch (which would never be checked
> out), containing the history of /, and that the "real" git branches,
> tags, etc, would be based on the trees originally referenced by the root
> checkout, with git-notes (or similar) being used to track the weirdness
> in mappings. How does the "multiple branches touched in a single commit"
> complicate anything other than the heuristics for automatic branch
> detection (which I assume nobody is at the stage of talking about yet).
I think there might be a problem in that in git commit is defined by
its parents and its final state, while revision in Subversion is IIRC
defined by change. Isn't it?
--
Jakub Narebski
Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-20 11:59 ` Jakub Narebski
@ 2010-10-20 13:42 ` Will Palmer
2010-10-20 20:44 ` Jakub Narebski
0 siblings, 1 reply; 52+ messages in thread
From: Will Palmer @ 2010-10-20 13:42 UTC (permalink / raw)
To: Jakub Narebski
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
On Wed, 2010-10-20 at 04:59 -0700, Jakub Narebski wrote:
> Will Palmer <wmpalmer@gmail.com> writes:
> > On Tue, 2010-10-19 at 12:12 +0530, Ramkumar Ramachandra wrote:
> > > Stephen Bash writes:
> > ...
> > > >
> > > > I have 32 SVN revs in my history that touch multiple Git commit
> > > > objects. The simplest example is
> > > > svn mv svn://svnrepo/branches/badBranchName svn://svnrepo/branches/goodBranchName
> > > > which creates a single SVN commit that touches two branches
> > > > (badBranchName will have all its contents deleted, goodBranchName
> > > > will have an "empty commit" as described above). The more devious
> > > > version is the SVN rev where a developer checked out / (yes, I'm not
> > > > kidding) and proceeded to modify a single file on all branches in
> > > > one commit. In our case, that one SVN rev touches 23 git commit
> > > > objects. And while the latter is somewhat a corner case, the former
> > > > is common and probably needs to be dealt with appropriately (it's
> > > > kind of a stupid operation in Git-land, so maybe it can just be
> > > > squashed).
> > >
> > > Ouch! Thanks for the illustrative example- I understand now. We have
> > > to bend backwards to perform a one-to-one mapping. It's finally struck
> > > me- one-to-one mapping is nearly impossible to achieve, and I don't
> > > know if it makes sense to strive for it anymore. Looks like Jonathan
> > > got it earlier.
> >
> > It's been a while since I was involved in this discussion, so maybe the
> > design has changed by now, but I was under the impression that there
> > would be one "one-to-one" mapping branch (which would never be checked
> > out), containing the history of /, and that the "real" git branches,
> > tags, etc, would be based on the trees originally referenced by the root
> > checkout, with git-notes (or similar) being used to track the weirdness
> > in mappings. How does the "multiple branches touched in a single commit"
> > complicate anything other than the heuristics for automatic branch
> > detection (which I assume nobody is at the stage of talking about yet).
>
> I think there might be a problem in that in git commit is defined by
> its parents and its final state, while revision in Subversion is IIRC
> defined by change. Isn't it?
>
A "change" is a delta between one state and another, so each revision is
dependent on those which came before it just as much as a git commit
is. An svn "revision" is a snapshot, regardless of how it is stored, ie,
the "svn stores changes, git stores snapshots" is an implementation
detail. It's a detail which makes a lot of things easier/faster in git
than they would be in svn, but a mere detail none the less.
The difference of course is that the "name" of an svn revision stays the
same even if aspects of that revision (for example, the commit message)
are changed, while the "name" of a git commit is dependent on everything
that makes up a commit. In git terms, changing a commit message is
considered to be history rewriting, whereas in svn terms it is merely
something which happens occasionally as part of a regularly maintained
repository.
The Git philosophy is ingrained in its object model: if you change
something which led to a state, you change the state itself. I don't
think there should be an attempt to work around that philosophy when
talking to external repositories. That is to say: if a commit message
(or other revprop) in history changes, we want to treat it as if we were
recovering from an upstream rebase. Of course, a problem in that could
very well be "how would we know about it?", which is a good question,
but one not directly related to [revision+directory]<->[commit]
mappings, afaik ;)
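For anyone who wants to see the "name depends on everything" point in action, here is a minimal Python sketch. It follows the commit object layout as I understand it (a "commit <size>" header, a NUL byte, then the tree/parent/author/committer lines and the message, hashed with SHA-1); the commit_id helper and the author string are made up for illustration:

```python
import hashlib

# Well-known id of Git's empty tree.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_id(tree, parents, author, message):
    """Hash a commit the way Git names objects: sha1(header + body)."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + author, "", message]
    body = "\n".join(lines).encode()
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


who = "A U Thor <author@example.com> 1287581760 +0000"

original = commit_id(EMPTY_TREE, [], who, "Fix typo\n")
reworded = commit_id(EMPTY_TREE, [], who, "Fix a typo\n")
child_of_original = commit_id(EMPTY_TREE, [original], who, "Next\n")
child_of_reworded = commit_id(EMPTY_TREE, [reworded], who, "Next\n")

# Same tree, same author, same timestamp: only the message differs, yet the
# id changes, and so does the id of every commit built on top of it.
print(original != reworded, child_of_original != child_of_reworded)
```

An SVN revision number, by contrast, would stay the same after the same edit to svn:log.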
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-20 8:39 ` Will Palmer
2010-10-20 11:59 ` Jakub Narebski
@ 2010-10-20 14:05 ` Ramkumar Ramachandra
2010-10-20 14:21 ` Stephen Bash
2 siblings, 0 replies; 52+ messages in thread
From: Ramkumar Ramachandra @ 2010-10-20 14:05 UTC (permalink / raw)
To: Will Palmer
Cc: Stephen Bash, Matt Stump, git, Jonathan Nieder,
David Michael Barr, Sverre Rabbelier, Tomas Carnecky
Hi Will,
Will Palmer writes:
> It's been a while since I was involved in this discussion, so maybe the
> design has changed by now,
Yep, and I'm to blame for that- sorry I didn't CC you earlier. I got
confused between "Tomas Carnecky" and "Will Palmer". To avoid this
confusion in future, I'd request everyone to display the names they
use on the list in the IRC whois information (unless it's a privacy
issue).
> but I was under the impression that there
> would be one "one-to-one" mapping branch (which would never be checked
> out), containing the history of /, and that the "real" git branches,
> tags, etc, would be based on the trees originally referenced by the root
> checkout, with git-notes (or similar) being used to track the weirdness
> in mappings. How does the "multiple branches touched in a single commit"
> complicate anything other than the heuristics for automatic branch
> detection (which I assume nobody is at the stage of talking about yet).
Yeah, that was my plan too originally, but I clearly haven't thought
about it enough. I'm currently noting down the various scenarios that
the others are quoting -- there are quite a few I hadn't thought about
earlier.
[...]
-- Ram
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-20 8:39 ` Will Palmer
2010-10-20 11:59 ` Jakub Narebski
2010-10-20 14:05 ` Ramkumar Ramachandra
@ 2010-10-20 14:21 ` Stephen Bash
2010-10-20 16:56 ` Ramkumar Ramachandra
2 siblings, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-20 14:21 UTC (permalink / raw)
To: Will Palmer
Cc: Matt Stump, git, Jonathan Nieder, David Michael Barr,
Sverre Rabbelier, Tomas Carnecky, Ramkumar Ramachandra
----- Original Message -----
> From: "Will Palmer" <wmpalmer@gmail.com>
> To: "Ramkumar Ramachandra" <artagnon@gmail.com>
> Sent: Wednesday, October 20, 2010 4:39:30 AM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> I was under the impression that there
> would be one "one-to-one" mapping branch (which would never be checked
> out), containing the history of /, and that the "real" git branches,
> tags, etc, would be based on the trees originally referenced by the root
> checkout, with git-notes (or similar) being used to track the weirdness
> in mappings.
Admittedly I'm not in the inner circle, but this is the first time I've heard the idea. It's certainly intriguing. In this case would the one-to-one branch include the full SVN repository history (all projects), or would svn-fe/git-fast-import filter down to subdirectories of interest?
Along those lines I can contribute the following data point: my initial fast-import repository weighs in at 1.3G, while after my scripts run the final product is 659M (and no, they are not hard linking to each other). Unfortunately I don't have a good accounting of the size difference (obviously some is filtering down to a single SVN project).
Thanks,
Stephen
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-20 14:21 ` Stephen Bash
@ 2010-10-20 16:56 ` Ramkumar Ramachandra
0 siblings, 0 replies; 52+ messages in thread
From: Ramkumar Ramachandra @ 2010-10-20 16:56 UTC (permalink / raw)
To: Stephen Bash
Cc: Will Palmer, Matt Stump, git, Jonathan Nieder, David Michael Barr,
Sverre Rabbelier, Tomas Carnecky
Hi Stephen,
Stephen Bash writes:
> > From: "Will Palmer" <wmpalmer@gmail.com>
> > I was under the impression that there
> > would be one "one-to-one" mapping branch (which would never be checked
> > out), containing the history of /, and that the "real" git branches,
> > tags, etc, would be based on the trees originally referenced by the root
> > checkout, with git-notes (or similar) being used to track the weirdness
> > in mappings.
>
> Admittedly I'm not in the inner circle, but this is the first time
> I've heard the idea.
Do hang out on the development channel - a lot of stuff cooks there :)
> It's certainly intriguing. In this case would
> the one-to-one branch include the full SVN repository history (all
> projects), or would svn-fe/git-fast-import filter down to
> subdirectories of interest?
Full history. At least that's what I was thinking about some time ago.
> Along those lines I can contribute the following data point: my
> initial fast-import repository weighs in at 1.3G, while after my
> scripts run the final product is 659M (and no, they are not hard
> linking to each other). Unfortunately I don't have a good
> accounting of the size difference (obviously some is filtering down
> to a single SVN project).
Yeah, David reported similar statistics after repacking the ASF
repository.
-- Ram
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-20 13:42 ` Will Palmer
@ 2010-10-20 20:44 ` Jakub Narebski
2010-10-21 1:54 ` mrevilgnome
2010-10-21 9:08 ` Will Palmer
0 siblings, 2 replies; 52+ messages in thread
From: Jakub Narebski @ 2010-10-20 20:44 UTC (permalink / raw)
To: Will Palmer
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
On Wed, 20 Oct 2010, Will Palmer wrote:
> On Wed, 2010-10-20 at 04:59 -0700, Jakub Narebski wrote:
>> Will Palmer <wmpalmer@gmail.com> writes:
>>> On Tue, 2010-10-19 at 12:12 +0530, Ramkumar Ramachandra wrote:
>>>> Stephen Bash writes:
>>> ...
>>>>>
>>>>> I have 32 SVN revs in my history that touch multiple Git commit
>>>>> objects. The simplest example is
>>>>> svn mv svn://svnrepo/branches/badBranchName svn://svnrepo/branches/goodBranchName
>>>>> which creates a single SVN commit that touches two branches
>>>>> (badBranchName will have all its contents deleted, goodBranchName
>>>>> will have an "empty commit" as described above). The more devious
>>>>> version is the SVN rev where a developer checked out / (yes, I'm not
>>>>> kidding) and proceeded to modify a single file on all branches in
>>>>> one commit. In our case, that one SVN rev touches 23 git commit
>>>>> objects. And while the latter is somewhat a corner case, the former
>>>>> is common and probably needs to be dealt with appropriately (it's
>>>>> kind of a stupid operation in Git-land, so maybe it can just be
>>>>> squashed).
>>>>
>>>> Ouch! Thanks for the illustrative example- I understand now. We have
>>>> to bend backwards to perform a one-to-one mapping. It's finally struck
>>>> me- one-to-one mapping is nearly impossible to achieve, and I don't
>>>> know if it makes sense to strive for it anymore. Looks like Jonathan
>>>> got it earlier.
>>>
>>> It's been a while since I was involved in this discussion, so maybe the
>>> design has changed by now, but I was under the impression that there
>>> would be one "one-to-one" mapping branch (which would never be checked
>>> out), containing the history of /, and that the "real" git branches,
>>> tags, etc, would be based on the trees originally referenced by the root
>>> checkout, with git-notes (or similar) being used to track the weirdness
>>> in mappings. How does the "multiple branches touched in a single commit"
>>> complicate anything other than the heuristics for automatic branch
>>> detection (which I assume nobody is at the stage of talking about yet).
>>
>> I think there might be a problem in that in git commit is defined by
>> its parents and its final state, while revision in Subversion is IIRC
>> defined by change. Isn't it?
>
> A "change" is a delta between one state and another, so each revision is
> dependent on those which came before it just as much as a git commit
> is. An svn "revision" is a snapshot, regardless of how it is stored, ie,
> the "svn stores changes, git stores snapshots" is an implementation
> detail. It's a detail which makes a lot of things easier/faster in git
> than they would be in svn, but a mere detail none the less.
Thanks for the correction, and for explanation.
The problem with one-to-one [SVN revision]<->[Git commit] mapping in the
situation of Subversion mishandling described by Stephen Bash persists,
though. The cause is not "svn stores changes, git stores snapshots",
but the widely different models of branches.
Subversion uses the inter-file branching model (Wikipedia says it was
"borrowed" from Perforce) to handle branches and tags. It uses the
"branches are copies (folders)" paradigm, and technically it doesn't have
a separate namespace for branches; instead projects, branches, and
projects' filesystem hierarchy are mixed together, and which part of a
path is the branch name is defined by convention only. This model makes
it easy to mess up a repository (because there are no technological
barriers to going against conventions, like the mentioned all-branches
change, or changing tags, or a reversed hierarchy of branches and
projects).
Because (from what I understand) revisions in Subversion are whole
project all-branches snapshots, and because revision identifiers are
monotonically incrementing numbers, there is no inherent notion of
the _parent_ of a commit, like there is in Git. (I think that was the
reason why merge tracking was absent from Subversion until version 1.5,
and why mergeinfo is a per-file rather than per-commit/per-revision
property.)
In Git, commits store a snapshot of the top level of a project (contrary
to revisions in Subversion, which are snapshots of the top level of the
repository tree, with all branches and tags in it). Each commit in Git
also stores its parent or parents. Those commit-to-parent links make up
the DAG (Directed Acyclic Graph) of revisions. Branches in Git reside in
a separate namespace, and are live pointers (like e.g. the top pointer
in stack implementations) to commits; the commit that a branch points to
(the tip of the branch) marks out a subset of the DAG of revisions: that
commit and all of its ancestors. This forms a line of development, i.e.
a branch.
What is important here is that a commit is defined by the snapshot of
its top tree, and by its parents. A different top tree and/or different
parent(s) means that the commit must be different.
Now take a look at the situation described by Stephen Bash. Let's
assume that our Subversion repository has branches 'foo' and 'bar'
that diverged at revision number 1, that revision 2 was only on
branch 'foo', revision 3 was only on branch 'bar', and that revision
4 is a mishandled edit of a file across all branches.
Let's try to draw it on an ASCII-art diagram (fixed-width font required).
--- [1]-----[2]---|||---[ ]----|||----|||---[7] <=== foo
\ [4]
\----|||---[3]---[ ]----[5]----[6] <=== bar
Here '|||' marks that the given revision doesn't change anything
on the given branch (in the given subdirectory of the repository tree).
Now, from what I understand of the Subversion model, when one asks for
history of branch 'foo' in Subversion, it would return all revisions
that modify 'project/branch/foo' or 'branch/foo/project', and only
those that modify it (similarly to how path limiting in
`git log <path>` works). For branch 'foo' it would be revisions
1, 2, 4, 7; for branch 'bar' it would be revisions 1, 3, 4, 5, 6.
Am I understanding it correctly?
Now in Git we don't have 'project/branch/foo/xxx', we have only
top tree of a project. Therefore we cannot represent revision
4 as single git commit. To have similar situation, i.e. commits
1, 2, 4', 7 on branch 'foo', and commits 1, 3, 4'', 5, 6 on branch
'bar', we would have to have the following graph of revisions
--- 1--<--2--<--4'---<--7 <=== foo
\
-<--3--<--4''--<--5--<--6 <=== bar
I use --<-- here to denote that it is an actual directed link.
Commits 4' and 4'' can have different trees, and have different
parents.
So to have the same results for 'svn log' when on branches 'foo' and
'bar' (however you switch branches in Subversion), or for
'svn log <foo URL>' and 'svn log <bar URL>', as for 'git log foo'
and 'git log bar' in the [mishandling] situation described above,
you have to map the single all-branches revision 4 in Subversion into
two commits 4' and 4'' in Git.
Please correct me if I am wrong about the Subversion model.
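The splitting described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: branch roots are known up front, the file names are invented, and the shared revision 1 is left out for brevity; branch_of and split_by_branch are hypothetical helpers, not part of any existing tool.

```python
from collections import defaultdict


def branch_of(path, branch_roots):
    """Return the branch root that contains path, or None."""
    for root in branch_roots:
        if path == root or path.startswith(root + "/"):
            return root
    return None


def split_by_branch(revisions, branch_roots):
    """Map each branch to the revisions that touch it (one entry per branch)."""
    logs = defaultdict(list)
    for rev, paths in sorted(revisions.items()):
        touched = {branch_of(p, branch_roots) for p in paths}
        for branch in sorted(b for b in touched if b is not None):
            logs[branch].append(rev)
    return dict(logs)


# Revisions 2..7 from the diagram, with made-up changed paths.
revisions = {
    2: ["branches/foo/a.c"],
    3: ["branches/bar/a.c"],
    4: ["branches/foo/README", "branches/bar/README"],  # the all-branches edit
    5: ["branches/bar/b.c"],
    6: ["branches/bar/a.c"],
    7: ["branches/foo/b.c"],
}

logs = split_by_branch(revisions, ["branches/foo", "branches/bar"])
print(logs)
```

Revision 4 lands in both lists, which is exactly where a converter would have to emit two separate Git commits (4' and 4'') with different parents.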
>
> The difference of course is that the "name" of an svn revision stays the
> same even if aspects of that revision (for example, the commit message)
> are changed, while the "name" of a git commit is dependent on everything
> that makes up a commit. In git terms, changing a commit message is
> considered to be history rewriting, whereas in svn terms it is merely
> something which happens occasionally as part of a regularly maintained
> repository.
>
> the git Philosophy is ingrained in its object model: If you change
> something which led to a state, you change the state itself. I don't
> think there should be an attempt to work-around that philosophy when
> talking to external repositories. That is to say: if a commit message
> (or other revprop) in history changes, we want to treat it as if we were
> recovering from an upstream rebase. Of course, a problem in that could
> very well be "how would we know about it?", which is a good question,
> but one not directly related to [revision+directory]<->[commit]
> mappings, afaik ;)
A better solution, actually proposed in a separate subthread, is to make
use of the new 'git replace' / 'refs/replace/*' feature in Git, creating
a replacement for a revision which changed some property retroactively...
...if Subversion actually offers any way to ask for changed properties.
Thankfully, from what I understand from comments in this thread, this
feature of being able to change revision properties like the commit
message or authorship is turned off by default in Subversion.
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-20 20:44 ` Jakub Narebski
@ 2010-10-21 1:54 ` mrevilgnome
2010-10-21 8:16 ` Jakub Narebski
2010-10-21 9:08 ` Will Palmer
1 sibling, 1 reply; 52+ messages in thread
From: mrevilgnome @ 2010-10-21 1:54 UTC (permalink / raw)
To: Jakub Narebski
Cc: Will Palmer, Ramkumar Ramachandra, Stephen Bash, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
I agree. The repository that I'm interested in converting has
branches all over the place /sandbox/, /sandbox/<username>/*,
/stable/MAIN/*, /stable/Features/*, /features/*, /branches/*, etc...
Because Subversion didn't enforce the convention, it was all too easy to
ignore when our questionable branching strategy was created. Instead
of expecting sub-folders of a particular path to be a branch, is there
something that we can key off of in the dumpfile? Are copy operations
notated in some fashion?
> Subversion uses the inter-file branching model (Wikipedia says it was
> "borrowed" from Perforce) to handle branches and tags. It uses "branches
> are copies (folders)" paradigm, and technically it doesn't have separate
> namespace for branches but have projects, branches, and projects'
> filesystem hierarchy mixed together; what part of path is branch name
> is defined by convention only. This model makes it easy to mess up
> repository (because there are no technological barriers for going
> against conventions, like mentioned all-branches change, or changing
> tags, or reversed hierarchy or branches and projects).
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 1:54 ` mrevilgnome
@ 2010-10-21 8:16 ` Jakub Narebski
2010-10-21 13:49 ` Stephen Bash
0 siblings, 1 reply; 52+ messages in thread
From: Jakub Narebski @ 2010-10-21 8:16 UTC (permalink / raw)
To: mrevilgnome
Cc: Will Palmer, Ramkumar Ramachandra, Stephen Bash, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
On Thu, 21 Oct 2010, mrevilgnome wrote:
> Jakub Narebski wrote:
> > Subversion uses the inter-file branching model (Wikipedia says it was
> > "borrowed" from Perforce) to handle branches and tags. It uses "branches
> > are copies (folders)" paradigm, and technically it doesn't have separate
> > namespace for branches but have projects, branches, and projects'
> > filesystem hierarchy mixed together; what part of path is branch name
> > is defined by convention only. This model makes it easy to mess up
> > repository (because there are no technological barriers for going
> > against conventions, like mentioned all-branches change, or changing
> > tags, or reversed hierarchy or branches and projects).
>
> I agree. The repository that I'm interested in converting has
> branches all over the place /sandbox/, /sandbox/<username>/*,
> /stable/MAIN/*, /stable/Features/*, /features/*, /branches/*, etc...
> Because subversion didn't enforce the convention it was all to easy to
> ignore when our questionable branching strategy was created. Instead
> of expecting sub-folders of a particular path to be a branch is there
> something that we can key off of in the dumpfile? Are copy operations
> notated in some fashion?
Actually it shouldn't be that hard to implement, if it isn't already
implemented in svn-fe.
We don't need to have copy operations notated in some fashion; it should
be enough to tell svn-fe where the top directory of the project is in
the repository tree hierarchy (e.g. that it is at /stable/MAIN/* at
revision 1). svn-fe could then use the 'tree' movement detection that
the 'subtree' merge strategy uses.
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-20 20:44 ` Jakub Narebski
2010-10-21 1:54 ` mrevilgnome
@ 2010-10-21 9:08 ` Will Palmer
2010-10-21 14:00 ` Stephen Bash
2010-10-21 15:52 ` Jakub Narebski
1 sibling, 2 replies; 52+ messages in thread
From: Will Palmer @ 2010-10-21 9:08 UTC (permalink / raw)
To: Jakub Narebski
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
On Wed, 2010-10-20 at 22:44 +0200, Jakub Narebski wrote:
> Because (from what I understand) revisions in Subversion are whole
> project all-branches snapshots, and because revision identifiers are
> monotonically incrementing numbers, there is no inherent notion of
> _parent_ of commit, like there is in Git. (I think that was the reason
> why merge tracking was absent from Subversion until version 1.5, and
> why mergeinfo is per-file rather than per-commit/per-revision property).
>
To clarify, I was saying that there is a "parent" of each SVN commit, in
the top-level sense. This can be easily converted into a "whole
repository" ("svnroot") tree in git. Of course, this isn't useful for
actual work, but it's a good middle-layer, from which other more-useful
things can be derived.
In terms of converting the svnroot git history into actual branches,
there are several options for mapping things. Ignoring merges for a
moment, we could (for example) notice when two trees (as in tree
objects) are very similar at some point in history, and decide that
those are probably branches. It's tedious, but still fairly simple, to
walk the history and build a new history consisting only of edits to a
subtree (even if the commit messages don't always make sense out of
context). It really doesn't matter one lick whether a single svn commit
touched multiple generated git commits.
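The "very similar trees" idea can be sketched as follows. This is only an illustration: the directory names, the blob ids, the Jaccard measure and the 0.5 threshold are all assumptions made for the example, not anything svn-fe or git actually does.

```python
def tree_similarity(tree_a, tree_b):
    """Jaccard similarity of two subtrees given as {relative_path: blob_id}."""
    entries_a = set(tree_a.items())
    entries_b = set(tree_b.items())
    if not entries_a and not entries_b:
        return 1.0
    return len(entries_a & entries_b) / len(entries_a | entries_b)


def probable_branches(subtrees, threshold=0.5):
    """Return pairs of directories whose contents look alike (hypothetical heuristic)."""
    names = sorted(subtrees)
    pairs = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            if tree_similarity(subtrees[first], subtrees[second]) >= threshold:
                pairs.append((first, second))
    return pairs


# Made-up svnroot snapshot: directory -> {path inside it: blob id}.
svnroot = {
    "stable/MAIN": {"Makefile": "b1", "src/main.c": "b2", "README": "b3"},
    "sandbox/alice": {"Makefile": "b1", "src/main.c": "b9", "README": "b3"},
    "docs": {"manual.txt": "b7"},
}
print(probable_branches(svnroot))
```

A real importer would compare git tree objects rather than dictionaries, and would probably want rename detection as well.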
Of course, "ignoring merges" is temporary and a total cop-out, but I
wouldn't for a moment pretend that converting svn branches into git
branches is difficult.
>
> In Git commits store snapshot of top level of a project (contrary to
> revisions in Subversion being snapshot of top level of repository tree,
> all branches and tags in it). Each commit in Git also stores its parent
> or parents. Those commit-to-parent links make up DAG (Directed Acyclic
> Graph) of revisions. Branches in Git reside in separate namespace,
> and are live pointers (like e.g. top pointer in stack implementations)
> to commits; commit that branch points to (the tip of branch) marks out
> subset of DAG of revisions: all ancestors of given commit - this forms
> a line of development i.e. branch.
>
> What is important here is that commit is defined by the snapshot of
> top tree, and by its parents. Different top tree and/or different
> parent(s) means that commit must be different.
>
>
> Now take a look at the situation described by Stephen Bash. Let's
> assume that in our Subversion repository we have branches
> 'foo' and 'bar' that diverged at revision number 1, that revision
> 2 was only on branch 'foo', revision 3 was only on branch 'bar',
> and that revision 4 is mishandled edit of file across all branches.
>
> Let's try to draw it on ASCII-art diagram (fixed-width font required).
>
> --- [1]-----[2]---|||---[ ]----|||----|||---[7]  <=== foo
>       \                 [4]
>        \----|||---[3]---[ ]----[5]----[6]        <=== bar
>
> I marked by '|||' here that given revision doesn't change anything
> on given branch (in given subdirectory of repository tree).
>
> Now, from what I understand of Subversion model, when one asks for
> history of branch 'foo' in Subversion, it would return all revisions
> that modify 'project/branch/foo' or 'branch/foo/project', and only
> those that modify it (similarly to how path limiting in
> `git log <path>` works). For branch 'foo' it would be revisions
> 1, 2, 4, 7; for branch 'bar' it would be revisions 1, 3, 4, 5, 6.
>
> Am I understanding it correctly?
Sounds right to me
>
>
> Now in Git we don't have 'project/branch/foo/xxx', we have only
> top tree of a project. Therefore we cannot represent revision
> 4 as single git commit. To have similar situation, i.e. commits
> 1, 2, 4', 7 on branch 'foo', and commits 1, 3, 4'', 5, 6 on branch
> 'bar', we would have to have the following graph of revisions
>
>
> --- 1--<--2--<--4'---<--7        <=== foo
>      \
>       -<--3--<--4''--<--5--<--6  <=== bar
>
> I uses --<-- here to denote that it is actual directed link.
>
> Commits 4' and 4'' can have different trees, and have different
> parents.
>
>
> So to have the same results for 'svn log' when on branchs 'foo' and
> 'bar' (however you switch branches in subversion), or
> 'svn log <foo URL>' and 'svn log <bar URL>' like for 'git log foo'
> and 'git log bar' in the [mishandling] situation described above
> you have to map single all-branches revision 4 in Subversion into
> two commits 4' and 4'' in Git.
>
>
> Please correct me if I am wrong about Subversion model.
Also correct. One SVN commit would logically map to several git commits.
It's best to think in terms of:
([svn commit] + [svn path]) -> [git commit] (or git tag, if we can get
the heuristics right)
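A minimal sketch of that ([svn commit] + [svn path]) mapping, using the foo/bar example from above. The branch roots, file names and the changed-path table are invented for illustration; a real importer would read them from the dump and from user-supplied configuration.

```python
from collections import defaultdict

# Assumed, user-supplied list of directories that count as branches.
BRANCH_ROOTS = ["branches/foo", "branches/bar"]


def branch_of(path, roots=BRANCH_ROOTS):
    """Return the branch root that contains `path`, or None."""
    for root in roots:
        if path == root or path.startswith(root + "/"):
            return root
    return None


def split_revisions(changed_paths_by_rev):
    """Map (revision, branch) -> changed paths, plus per-branch revision lists."""
    commits = defaultdict(list)
    for rev in sorted(changed_paths_by_rev):
        for path in changed_paths_by_rev[rev]:
            branch = branch_of(path)
            if branch is not None:
                commits[(rev, branch)].append(path)
    history = defaultdict(list)
    for rev, branch in sorted(commits):
        history[branch].append(rev)
    return dict(commits), dict(history)


# Revision 4 is the "mishandled" commit touching both branches.
revisions = {
    1: ["branches/foo/file.c", "branches/bar/file.c"],
    2: ["branches/foo/file.c"],
    3: ["branches/bar/file.c"],
    4: ["branches/foo/COPYING", "branches/bar/COPYING"],
    5: ["branches/bar/file.c"],
    6: ["branches/bar/file.c"],
    7: ["branches/foo/file.c"],
}
commits, history = split_revisions(revisions)
print(history["branches/foo"])  # revisions seen on foo
print(history["branches/bar"])  # revisions seen on bar
```

Each key of `commits` would become one git commit, so revision 4 yields two of them (the 4' and 4'' of the diagram).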
>
> >
> > The difference of course is that the "name" of an svn revision stays the
> > same even if aspects of that revision (for example, the commit message)
> > are changed, while the "name" of a git commit is dependent on everything
> > that makes up a commit. In git terms, changing a commit message is
> > considered to be history rewriting, whereas in svn terms it is merely
> > something which happens occasionally as part of regularly maintained
> > repository.
> >
> > the git Philosophy is ingrained in its object model: If you change
> > something which led to a state, you change the state itself. I don't
> > think there should be an attempt to work-around that philosophy when
> > talking to external repositories. That is to say: if a commit message
> > (or other revprop) in history changes, we want to treat it as if we were
> > recovering from an upstream rebase. Of course, a problem in that could
> > very well be "how would we know about it?", which is a good question,
> > but one not directly related to [revision+directory]<->[commit]
> > mappings, afaik ;)
>
> Better solution, actually proposed in separate subthread, is to make use
> of new 'git replace' / 'refs/replaces/*' feature in Git, creating
> replacement for revision which changed some property retroactively...
I'm not entirely familiar with the git replace mechanism, but wouldn't
that mean that repository git-A (cloned from SVN before the property
change) and repository git-B (cloned from SVN after the property change)
would be unable to merge with each-other?
In my mind, if it would be a rebase when it happens in git-land, it
should be a rebase when it happens in
mechanism-to-make-external-repository-act-just-like-git land.
>
> ...if Subversion actually offer any way to ask for changed properties.
> Thankfully from what I understand from comments in this thread this
> feature of being able to change revision properties like commit message
> or authorship is by default turned off in Subversion.
>
Any sufficiently large SVN-tracked project will use all of SVN's
features, whether the maintainer remembers or not ;)
Certainly it could be a "few and far between" thing, which doesn't need
to be handled to get going / usable (especially since creating a fresh
clone is so much faster than with git-svn). I don't know the internals
of SVN beyond what was mentioned in the manual 5 or so years ago, but I
assume you'd need to pretty much iterate over the entire history in
either a slow, git-svn like manner, or a wasteful, "download everything
to check a few things" manner, just in order to check that your
properties are up-to-date. Perhaps I'm thinking of these things wrongly,
and there's actually a simple log-based mechanism for checking such
things which would be fast enough to work into regular git-gc-ish
maintenance.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 8:16 ` Jakub Narebski
@ 2010-10-21 13:49 ` Stephen Bash
0 siblings, 0 replies; 52+ messages in thread
From: Stephen Bash @ 2010-10-21 13:49 UTC (permalink / raw)
To: Jakub Narebski
Cc: Will Palmer, Ramkumar Ramachandra, git, Jonathan Nieder,
David Michael Barr, Sverre Rabbelier, Tomas Carnecky, mrevilgnome
----- Original Message -----
> From: "Jakub Narebski" <jnareb@gmail.com>
> To: "mrevilgnome" <mrevilgnome@gmail.com>
> Sent: Thursday, October 21, 2010 4:16:46 AM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> > Are copy operations notated in some fashion?
Yes, copies create a special pair of properties in SVN: copyfrom_rev and copyfrom_path. Unfortunately SVN users use the copy operation for non-branching purposes, so some amount of filtering is required. I posted a script earlier in this thread that used one set of heuristics based on my local SVN repository, but I'm not claiming it will work for everyone.
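For illustration, here is a rough sketch of pulling copy records out of a dumpfile, where they show up as Node-copyfrom-rev and Node-copyfrom-path headers on a node. It is deliberately naive: it only looks at header lines and assumes file contents never contain text that looks like a header. The looks_like_branch() heuristic and its parent directory names are hypothetical, and this is not the script posted earlier in the thread.

```python
def find_copies(dump_text):
    """Return (revision, dest_path, src_rev, src_path) for each copied node."""
    copies = []
    revision = None
    node = {}

    def flush():
        # Record the node collected so far if it was created by a copy.
        if "Node-copyfrom-path" in node:
            copies.append((revision, node["Node-path"],
                           int(node["Node-copyfrom-rev"]),
                           node["Node-copyfrom-path"]))

    for line in dump_text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue
        if key == "Revision-number":
            flush()
            node = {}
            revision = int(value)
        elif key == "Node-path":
            flush()
            node = {"Node-path": value}
        elif key.startswith("Node-") and node:
            node[key] = value
    flush()
    return copies


def looks_like_branch(dest_path, parents=("branches", "sandbox")):
    """Hypothetical filter: a copy landing directly under a branch directory."""
    head, _, tail = dest_path.rpartition("/")
    return head in parents and bool(tail)


SAMPLE = """\
Revision-number: 2
Node-path: branches/foo
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 1
Node-copyfrom-path: trunk

Revision-number: 3
Node-path: trunk/util2.c
Node-kind: file
Node-action: add
Node-copyfrom-rev: 2
Node-copyfrom-path: trunk/util.c
"""
branch_copies = [c for c in find_copies(SAMPLE) if looks_like_branch(c[1])]
print(branch_copies)
```

The second copy in the sample is an ordinary file copy, which is the kind of thing the filtering has to weed out.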
> Actually it shouldn't be that hard to implement, if it isn't already
> implemented in svn-fe.
David just brought up teaching svn-fe to emit the copyfrom properties to help git do branch mapping. I think there's still a lot of effort in creating a good mapping algorithm, but the pieces are coming together.
Stephen
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 9:08 ` Will Palmer
@ 2010-10-21 14:00 ` Stephen Bash
2010-10-21 18:37 ` Jakub Narebski
2010-10-21 15:52 ` Jakub Narebski
1 sibling, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-21 14:00 UTC (permalink / raw)
To: Will Palmer
Cc: Ramkumar Ramachandra, Matt Stump, git, Jonathan Nieder,
David Michael Barr, Sverre Rabbelier, Tomas Carnecky,
Jakub Narebski
----- Original Message -----
> From: "Will Palmer" <wmpalmer@gmail.com>
> To: "Jakub Narebski" <jnareb@gmail.com>
> Sent: Thursday, October 21, 2010 5:08:17 AM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
> On Wed, 2010-10-20 at 22:44 +0200, Jakub Narebski wrote:
>
> It's tedious, but still fairly simple, to
> walk the history and build a new history consisting only of edits to a
> subtree (even if the commit messages don't always make sense out of
> context). It really doesn't matter one lick whether a single svn
> commit touched multiple generated git commits.
After reading your first entry into this thread, this is certainly what I was envisioning. I'm still a little worried about initial clone size (over the weekend I had to wait over an hour for my full-svn-history git repository to clone from work to home), but it is certainly an intriguing idea. With appropriate mapping information (copyfrom properties and some user input) I think you could create a very convincing Git history by creating commit objects using subtrees of the SVN-imported history. I had been thinking about a very similar solution, but I was planning on pruning the original SVN-imported commit objects to save space...
> Of course, "ignoring merges" is temporary and a total cop-out
This is still bugging me... Even with svn mergeinfo (which I think is a small percentage of the SVN revisions in the world), IMO an SVN merge is *not* a Git merge. I think of it as a git cherry-pick (someone correct me if this mental model is wrong). The key point in my mind is SVN merge doesn't have to merge the entire branch history. Perhaps some heuristics can be applied in Git to decide if an SVN merge is a "true merge" or a cherry-pick? But I have a nagging feeling that in the end the model mismatch is going to be very hard to overcome.
Thanks,
Stephen
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 9:08 ` Will Palmer
2010-10-21 14:00 ` Stephen Bash
@ 2010-10-21 15:52 ` Jakub Narebski
2010-10-21 16:16 ` Jonathan Nieder
1 sibling, 1 reply; 52+ messages in thread
From: Jakub Narebski @ 2010-10-21 15:52 UTC (permalink / raw)
To: Will Palmer
Cc: Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
On Thu, 21 Oct 2010, Will Palmer wrote:
> On Wed, 2010-10-20 at 22:44 +0200, Jakub Narebski wrote:
> > Because (from what I understand) revisions in Subversion are whole
> > project all-branches snapshots, and because revision identifiers are
> > monotonically incrementing numbers, there is no inherent notion of
> > _parent_ of commit, like there is in Git. (I think that was the reason
> > why merge tracking was absent from Subversion until version 1.5, and
> > why mergeinfo is per-file rather than per-commit/per-revision property).
>
> To clarify, I was saying that there is a "parent" of each SVN commit, in
> the top-level sense. This can be easily converted into a "whole
> repository" ("svnroot") tree in git. Of course, this isn't useful for
> actual work, but it's a good middle-layer, from which other more-useful
> things can be derived.
"Whole repository hierarchy (svnroot) snapshots" are useless without
extra work; Git needs "whole project" snapshots for its commits.
But the whole long description of "branching" model in Subversion was
meant as intro for explanation why there can be mishandled commits
in Subversion, which make it impossible to have 1-to-1 SVN revision to
Git commit mapping.
> In terms of converting the svnroot git history into actual branches,
> there are several options for mapping things. Ignoring merges for a
> moment, we could (for example) notice when two trees (as in tree
> objects) are very similar at some point in history, and decide that
> those are probably branches.
Actually, as Stephen Bash wrote in his response, creating branches in
Subversion generates 'copy' operations in the svn dump... we have to
filter out 'copy' operations which do not create new branches, though.
> It's tedious, but still fairly simple, to
> walk the history and build a new history consisting only of edits to a
> subtree (even if the commit messages don't always make sense out of
> context). It really doesn't matter one lick whether a single svn commit
> touched multiple generated git commits.
We would have to ensure that commits in Git in branch 'foo' are the same
as history of 'project/branches/foo' subtree in svnroot in Subversion.
Otherwise we would either have different history in Git and in Subversion,
or we would have screwed up DAG of revisions in Git.
> Of course, "ignoring merges" is temporary and a total cop-out, but I
> wouldn't for a moment pretend that converting svn branches into git
> branches is difficult.
I don't think the most common "sane" Subversion merge case would be
difficult to translate into merge commit in Git: the svn:mergeinfo
property would have common revisions for all affected files/directories.
The problem is that like it is possible to mishandle commit like described
by Stephen Bash by creating all-branches revision, it is also possible
to mishandle merge in Subversion, creating revision where different files
are merged from different branches: such thing does not have easy
translation to Git commit-level rather than file-level merge tracking.
[...]
> > So to have the same results for 'svn log' when on branchs 'foo' and
> > 'bar' (however you switch branches in subversion), or
> > 'svn log <foo URL>' and 'svn log <bar URL>' like for 'git log foo'
> > and 'git log bar' in the [mishandling] situation described above
> > you have to map single all-branches revision 4 in Subversion into
> > two commits 4' and 4'' in Git.
> >
> >
> > Please correct me if I am wrong about Subversion model.
>
> Also correct. One SVN commit would logically map to several git commits.
> It's best to think in terms of:
> ([svn commit] + [svn path]) -> [git commit] (or git tag, if we can get
> the heuristics right)
If I remember correctly, some of the discussion was about whether there
can truly be an irrecoverable situation where a single SVN revision
*must* be mapped into more than one Git commit (one-to-many mapping).
> > > The difference of course is that the "name" of an svn revision stays the
> > > same even if aspects of that revision (for example, the commit message)
> > > are changed, while the "name" of a git commit is dependent on everything
> > > that makes up a commit. In git terms, changing a commit message is
> > > considered to be history rewriting, whereas in svn terms it is merely
> > > something which happens occasionally as part of regularly maintained
> > > repository.
> > >
> > > the git Philosophy is ingrained in its object model: If you change
> > > something which led to a state, you change the state itself. I don't
> > > think there should be an attempt to work-around that philosophy when
> > > talking to external repositories. That is to say: if a commit message
> > > (or other revprop) in history changes, we want to treat it as if we were
> > > recovering from an upstream rebase. Of course, a problem in that could
> > > very well be "how would we know about it?", which is a good question,
> > > but one not directly related to [revision+directory]<->[commit]
> > > mappings, afaik ;)
> >
> > Better solution, actually proposed in separate subthread, is to make use
> > of new 'git replace' / 'refs/replaces/*' feature in Git, creating
> > replacement for revision which changed some property retroactively...
>
> I'm not entirely familiar with the git replace mechanism, but wouldn't
> that mean that repository git-A (cloned from SVN before the property
> change) and repository git-B (cloned from SVN after the property change)
> would be unable to merge with each-other?
> In my mind, if it would be a rebase when it happens in git-land, it
> should be a rebase when it happens in
> mechanism-to-make-external-repository-act-just-like-git land.
Note that possible changes to the svn:log, svn:author and svn:date
revision properties are a problem only when there is ongoing interaction
between the Subversion repository (or mirror) and the Git repository (or
mirror). There is no such problem when doing a one-shot conversion.
The major problem is that svn:log etc. are _unversioned_ properties (see
http://svnbook.red-bean.com/en/1.5/svn.ref.properties.html), so I am not
sure if there is a way for Subversion server to tell that some svn:log
properties changed. Perhaps there is a log, even if properties are
unversioned... otherwise we would have to detect somehow that properties
changed.
But let's assume that we have a way of notifying or noticing that e.g.
svn:log property changed.
Say that the svn:log property for revision 'n' was A at the time Git
fetched from the SVN repository, and SVN revision 'n' is mapped to commit
AA with commit message A.
Later we fetch again from SVN repository, and besides new revisions to
be converted we notice somehow that svn:log property for revision 'n'
changed from A to B.
We now create replacement commit BB in Git, with the same Git parent
as commit AA, and with the commit message changed to B. Then we add
commit BB as replacement for AA:
$ git replace -f AA BB
(or its low level equivalent, or its batch equivalent when it exists).
This replacement is saved as a ref in the 'refs/replace/*' namespace. All
git commands (except some plumbing perhaps, and unless you pass the
'--no-replace-objects' option to the git wrapper) would then work as if
commit AA was replaced by commit BB; in particular 'git show AA' and
'git log' would show the BB version.
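To make the AA/BB story concrete, here is a toy model, assuming a SHA-1 repository. commit_id() follows the way git names a commit object (SHA-1 over a "commit <size>" header plus the commit text), so changing only the message gives a different id. resolve() is a made-up stand-in for the lookup that refs/replace/<id> triggers; the author line and the all-zero parent id are placeholders.

```python
import hashlib


def commit_id(tree, parents, author, message):
    """Object name of a commit: SHA-1 over a header plus the commit text."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]
    lines += ["author " + author, "committer " + author, "", message]
    body = "\n".join(lines).encode() + b"\n"
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def resolve(object_id, replace_refs, honor_replacements=True):
    """Toy lookup: return the replacement object if one is registered."""
    if honor_replacements:
        return replace_refs.get(object_id, object_id)
    return object_id


TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # the empty tree
PARENT = "0" * 40                                   # placeholder parent id
WHO = "A U Thor <author@example.com> 1287676800 +0000"

aa = commit_id(TREE, [PARENT], WHO, "A")  # as fetched the first time
bb = commit_id(TREE, [PARENT], WHO, "B")  # after svn:log changed to B

replace_refs = {aa: bb}                   # what 'git replace -f AA BB' records
print(aa != bb)
print(resolve(aa, replace_refs) == bb)
print(resolve(aa, replace_refs, honor_replacements=False) == aa)
```

The last line corresponds to running git with --no-replace-objects: the original AA is still there, it is just hidden by default.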
Because replacements are stored as refs in the 'refs/replace/*' namespace,
it is simple to transfer them. Each repository that fetches those refs
(+refs/replace/*:refs/replace/*) would see the replaced contents. Those
that do not fetch them would see the old contents (and perhaps would have
problems like interacting with the SVN repository).
Alternate solution, though not as natively nice, would be to have empty
or placeholder commit, and store true commit message in notes for commit
AA, i.e. the message A would be in git note for AA. Changing commit
message would mean changing note: after change commit AA would have a
commit-message note with contents B.
If changes to unversioned revision properties are rare, then the
replacement technique is much superior to using notes, which generates
an unnatural git repository. When changes to commit messages (svn:log)
and the like are frequent, which would result in a great many
replacements, the notes technique could be better for performance reasons.
> > ...if Subversion actually offer any way to ask for changed properties.
> > Thankfully from what I understand from comments in this thread this
> > feature of being able to change revision properties like commit message
> > or authorship is by default turned off in Subversion.
>
> Any sufficiently large SVN-tracked project will use all of SVN's
> features, whether the maintainer remembers or not ;)
Heh.
> Certainly it could be a "few and far between" thing, which doesn't need
> to be handled to get going / usable (especially since creating a fresh
> clone is so much faster than with git-svn). I don't know the internals
> of SVN beyond what was mentioned in the manual 5 or so years ago, but I
> assume you'd need to pretty much iterate over the entire history in
> either a slow, git-svn like manner, or a wasteful, "download everything
> to check a few things" manner, just in order to check that your
> properties are up-to-date. Perhaps I'm thinking of these things wrongly,
> and there's actually a simple log-based mechanism for checking such
> things which would be fast enough to work into regular git-gc-ish
> maintenance.
Again: svn:log, svn:author and svn:date are Unversioned Properties, but
perhaps the Subversion repository stores a log of changes somewhere
(similarly to git reflog, though hopefully not expired too early).
P.S. The further into this thread we get, the more I see how utterly
wrong the Subversion model of version control is (branches, tags, merges).
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 15:52 ` Jakub Narebski
@ 2010-10-21 16:16 ` Jonathan Nieder
0 siblings, 0 replies; 52+ messages in thread
From: Jonathan Nieder @ 2010-10-21 16:16 UTC (permalink / raw)
To: Jakub Narebski
Cc: Will Palmer, Ramkumar Ramachandra, Stephen Bash, Matt Stump, git,
David Michael Barr, Sverre Rabbelier, Tomas Carnecky
Jakub Narebski wrote:
> The major problem is that svn:log etc. are _unversioned_ properties (see
> http://svnbook.red-bean.com/en/1.5/svn.ref.properties.html), so I am not
> sure if there is a way for Subversion server to tell that some svn:log
> properties changed. Perhaps there is a log, even if properties are
> unversioned... otherwise we would have to detect somehow that properties
> changed.
There has been brief discussion of that possibility on the Subversion
list [1]:
"What we might need is an RA call that has
the server provide the N last revisions to have undergone revprop edits..."
I'm guessing that there is not such a log now but the developers might
be open to a patch adding such a log (for the sake of svnsync and
similar use cases, like this one).
> Later we fetch again from SVN repository, and besides new revisions to
> be converted we notice somehow that svn:log property for revision 'n'
> changed from A to B.
>
> We now create replacement commit BB in Git, with the same Git parent
> as commit AA, and with commit message changed to BB. Then we add
> commit BB as replacement for AA:
>
> $ git replace -f AA BB
Yes, exactly. In some cases, this "git replace" step would have to be
accomplished by a separate command (or even "by hand") to get the job
done:
alice> git clone svn://svn.example.com/
upstream> svnadmin setlog ...
bob> git clone svn://svn.example.com/
In this situation, alice and bob have diverging histories, just as
if upstream had rewritten history (because, well, upstream has).
Now if alice fetches from bob and notices that, then she must do
alice> git replace AA BB
(or its user-friendly equivalent, or a batch equivalent to search for
and handle cases like this).
[...]
> If changes to unversioned revision properties are rare, then replacement
> technique is much superior to using notes, which generates unnatural git
> repository. When changing commit messages (svn:log) and the like are
> common and often, which would result in great many replacements, the
> notes technique could be better because of performance reasons.
Exactly. Well, one can mitigate the performance problems by running
"git filter-branch" every once in a while. :)
Regards,
Jonathan
[1] http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122840/focus=122944
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 14:00 ` Stephen Bash
@ 2010-10-21 18:37 ` Jakub Narebski
2010-10-21 21:27 ` Stephen Bash
0 siblings, 1 reply; 52+ messages in thread
From: Jakub Narebski @ 2010-10-21 18:37 UTC (permalink / raw)
To: Stephen Bash
Cc: Will Palmer, Ramkumar Ramachandra, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
On Thu, 21 Oct 2010, Stephen Bash wrote:
> Will Palmer <wmpalmer@gmail.com> wrote:
> > Of course, "ignoring merges" is temporary and a total cop-out
>
> This is still bugging me... Even with svn mergeinfo (which I think
> is a small percentage of the SVN revisions in the world),
From what I understand, to have svn:mergeinfo you need version 1.5 or
later of Subversion installed on the server, and you also need to use
a 1.5 or later client.
> IMO an SVN merge is *not* a Git merge. I think of it as a git
> cherry-pick (someone correct me if this mental model is wrong).
> The key point in my mind is SVN merge doesn't have to merge the entire
> branch history. Perhaps some heuristics can be applied in Git to
> decide if an SVN merge is a "true merge" or a cherry-pick? But I have
> a nagging feeling that in the end the model mismatch is going to be
> very hard to overcome.
Hopefully in most common situations (i.e. SVN repository is not
mishandled) the svn:mergeinfo would be _only_ on branch folders
("branches/<branchname>") and inherited downwards. This should be
fairly easy, I think, to translate to git merges (merge commits).
But because Subversion doesn't impose strict separation between branch
namespace and in-repository paths, somebody somewhere would certainly
at some time screw this up. And only then we would have to rely on
subtree merge / git-subtree split similarity detection.
BTW. Subversion doesn't have "svn cherry-pick", nor an equivalent of
"git revert" (a cherry-pick applied in reverse)... well, at least I
don't think it has.
......................................................................
Warning! Rant ahead!
<rant skip="if needed">
I have read some documentation about svn:mergeinfo property:
http://svnbook.red-bean.com/en/1.5/svn.branchmerge.basicmerging.html
http://www.collab.net/community/subversion/articles/merge-info.html
I see how "branches are folders" model, without a concept of version
included in (belonging to) some branch and without the concept of
'previous version in the same line of development' leads to such
strange, bizarre things.
First, svn:mergeinfo is not about tracking which commits (which parents)
were involved in creating given version, like in Git (where merge
commit that was result of merging branch 'bar' into 'foo' has commits
which were then tips of 'bar' and of 'foo' as two parents of commit
representing result of merge).
No, svn:mergeinfo is ass-backwards solution to the problem that "merge
tracking" solves, namely that of repeated merging. Let's take a look
at the following situation:
---1---B---2---3---M1--4---5---M2  <-- foo
        \         /           /
         \-a---b-/-----c---d-/     <-- bar
B is branching point, M1 and M2 are merge commits.
In Git, and I assume that also in Subversion, when doing merge M1, the
VCS notices that from revision B branches 'foo' and 'bar' have common
commits (in git we say that merge base of 'foo' and 'bar' at the point
of doing merge M1 is commit B). The VCS then knows that it has to
integrate changes that were made on branch 'bar' since branching point B,
i.e. changes brought by revisions 'a' and 'b', with changes made on
branch 'foo' since B, i.e. changes brought by revisions '2' and '3'.
Git does that by running 3-way merge (same as rcsmerge / diff3 merge)
with '3' as ours aka mine version, 'b' as theirs aka yours version,
and 'B' as ancestor aka older version; I assume that Subversion does
the same thing, or equivalent.
Now here is where things begin to be different in Git and in Subversion.
In Git, commit 'M1' with merge resolution has simply two parents: '3'
and 'b'.
In Subversion there is no such thing as a parent of a revision. Instead
of this SVN records that it integrated changes brought by revisions 'a'
and 'b' into 'M1', which means that from revision 'M1' the branch
folder ("project/branches/foo") acquires svn:mergeinfo property with
the contents '/branches/bar:B-b' (B-b is a:b, i.e. range from B to b,
excluding B). PLEASE CORRECT ME IF I AM MISTAKEN.
Note branch info in svn:mergeinfo property. Note the revision range
instead of just its endpoint 'b'. Note lack of reference to what would
be first parent in git merge commit, i.e. '3'.
Let's take a look what happens at point M2 (i.e. second merge) in Git
and in Subversion.
In Git it is easy. Git calculates merge base by travelling parentage
links (which include all parents of merge commits), and notices that at
point 'M2' merge base, i.e. first common ancestor of branches 'foo' and
'bar' is commit 'b'. It then runs 3-way merge with '5' as ours, 'd' as
theirs, and 'b' as ancestor, and records merge commit with parents '5'
and 'd'.
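To make the merge-base reasoning above concrete, here is a minimal
Python sketch. It is an illustration only, not how git implements it:
the PARENTS table is hand-copied from the diagram, and the helper names
(ancestors, merge_bases) are made up for this example.

```python
# Parent links for the foo/bar diagram (commit -> list of parents).
# M2 is left out because it is the merge being computed.
PARENTS = {
    "1": [], "B": ["1"], "2": ["B"], "3": ["2"],
    "a": ["B"], "b": ["a"],
    "M1": ["3", "b"], "4": ["M1"], "5": ["4"],
    "c": ["b"], "d": ["c"],
}


def ancestors(commit):
    """Return the commit itself plus everything reachable via parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def merge_bases(ours, theirs):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(ours) & ancestors(theirs)
    return {c for c in common
            if not any(c != other and c in ancestors(other) for other in common)}


print(merge_bases("3", "b"))  # merge M1 -> {'B'}
print(merge_bases("5", "d"))  # merge M2 -> {'b'}
```

The second call finds 'b' only because M1 records 'b' as a parent; with
that link removed from the table, the answer falls back to 'B'.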
Subversion instead examines svn:mergeinfo property to check what it
already merged in, and somehow notices that it has to integrate changes
c+d made on branch 'bar' (but not a+b+c+d, as a+b were already
integrated) with changes 4+5 on branch 'foo'. Probably it somehow
notices that 'b' is common ancestor. But you can see how this
mechanism is fraught with peril and can break easily in more complex
situations ("The 1.5 release of merge tracking has basic support for
common scenarios; we will be extending the feature in upcoming
releases."... they hope!).
Subversion then updates svn:mergeinfo property at branch 'foo'.
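The "what is left to merge" bookkeeping can be sketched roughly as
follows. This is a simplified model under stated assumptions, not
Subversion's implementation: the revision numbers (a=3, b=4, c=8, d=9)
are invented, the property is assumed to hold one "path:ranges" entry
per line with inclusive ranges, and the function names are hypothetical.
Inheritance and other details of the real property are ignored.

```python
def parse_mergeinfo(value):
    """Parse 'path:ranges' lines into {path: set of merged revisions}."""
    merged = {}
    for line in value.splitlines():
        path, _, ranges = line.rpartition(":")
        revs = merged.setdefault(path, set())
        for item in ranges.split(","):
            start, _, end = item.strip().partition("-")
            # A bare number such as "9" is a one-revision range.
            revs.update(range(int(start), int(end or start) + 1))
    return merged


def eligible(branch_revs, mergeinfo, source):
    """Revisions on the source branch that are not yet recorded as merged."""
    merged = parse_mergeinfo(mergeinfo).get(source, set())
    return [rev for rev in branch_revs if rev not in merged]


# Assumed numbering: a=3, b=4, c=8, d=9 on /branches/bar.
bar_revs = [3, 4, 8, 9]
after_m1 = "/branches/bar:3-4"   # what 'foo' would carry after M1
print(eligible(bar_revs, after_m1, "/branches/bar"))  # [8, 9], i.e. c and d
```

Note that nothing here records '3' (the other side of M1), which is the
point made above about the missing first parent.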
</rant>
--
Jakub Narebski
Poland
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 18:37 ` Jakub Narebski
@ 2010-10-21 21:27 ` Stephen Bash
2010-10-21 22:49 ` Jakub Narebski
0 siblings, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-21 21:27 UTC (permalink / raw)
To: Jakub Narebski
Cc: Will Palmer, Ramkumar Ramachandra, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
----- Original Message -----
> From: "Jakub Narebski" <jnareb@gmail.com>
> To: "Stephen Bash" <bash@genarts.com>
> Sent: Thursday, October 21, 2010 2:37:07 PM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> > > Of course, "ignoring merges" is temporary and a total cop-out
> >
> > This is still bugging me... Even with svn mergeinfo (which I think
> > is a small percentage of the SVN revisions in the world),
>
> From what I understand to have svn:mergeinfo you have to have version
> >= 1.5 of Subversion installed on server, and to use it also >= 1.5
> client.
Correct. I can't find a release date for 1.5, but my impression is a lot of history in SVN repositories pre-dates 1.5 (especially since it required *both* the client and the server to be updated). That impression is mostly based on my own experience... Using Subversion heavily from 2003 to late 2009 my memory is mostly of 1.3 and 1.4 -- I probably only upgraded if I was setting up a new machine or some fancy new tool I was using required the newest version.
> But because Subversion doesn't impose strict separation between branch
> namespace and in-repository paths, somebody somewhere would certainly
> at some time screw this up. And only then we would have to rely on
> subtree merge / git-subtree split similarity detection.
I don't have much experience with subtree merge... It's possible that will improve the situation.
> BTW. Subversion doesn't have "svn cherry-pick", nor equivalent to
> "git revert" == "git cherry-pick -R"... well, at least I don't think it
> has.
See below...
> I have read some documentation about svn:mergeinfo property:
> http://svnbook.red-bean.com/en/1.5/svn.branchmerge.basicmerging.html
I guess this is the first time I've read the 1.5 version of the SVN Book. This has consequences below...
> ---1---B---2---3---M1--4---5---M2   <-- foo
>         \         /           /
>          \-a---b-/-----c---d-/      <-- bar
>
> B is branching point, M1 and M2 are merge commits.
>
> In Git, and I assume that also in Subversion, when doing merge M1, the
> VCS notices that from revision B branches 'foo' and 'bar' have common
> commits (in git we say that merge base of 'foo' and 'bar' at the point
> of doing merge M1 is commit B).
I'm going to take a little liberty with SVN revisions because I've always thought of SVN revisions as before and after the change, so a:b in SVN is the change introduced in b, but since we're on the Git list, in the following examples I will use a:b to mean the changes introduced in both a and b. (Since it was introduced, I've always read "svn diff -c rev" as "svn diff -r rev-1:rev")
Back to the task at hand... having read the 1.5 SVN docs, I have no idea how this works now (big caveat!!!), but prior to 1.5 M1 would have been
svn switch svn://path/to/foo
svn merge -ra:b svn://path/to/bar destination-path
which is "Take the changes introduced in revisions a through b, and apply them to the destination-path". This is why I think of SVN merges as cherry-picks -- I was allowed to specify exactly what changesets I wanted merge to work on. To truly illustrate this, consider a' is in between a and b:
---1---B---2---3-------M1--4---5---M2   <-- foo
        \              /           /
         \-a---a'---b-/-----c---d-/     <-- bar
I could
svn switch svn://path/to/foo
svn merge -ra':b svn://path/to/bar destination-path
and "a" would never be merged back to foo. The concept of *not* specifying revision numbers to merge is new in 1.5. See
http://svnbook.red-bean.com/en/1.4/svn.branchmerge.copychanges.html
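As a small worked example of the range arithmetic, here is a Python
sketch. The revision numbers (a=10, a'=11, b=12) and function names are
invented for illustration. It uses real Subversion semantics, where
"-r N:M" applies the changes made after N up to and including M, and
"-c N" is shorthand for "-r N-1:N", rather than the inclusive a:b
shorthand used in this message.

```python
def change_to_range(rev):
    """'-c rev' is shorthand for '-r rev-1:rev'."""
    return (rev - 1, rev)


def revisions_applied(start, end):
    """Revisions whose changes a forward 'svn merge -r start:end' applies."""
    if start > end:
        raise ValueError("reverse merges are not covered by this sketch")
    return list(range(start + 1, end + 1))


# Assumed numbering on branch 'bar': a=10, a'=11, b=12.
A, A_PRIME, B = 10, 11, 12

print(revisions_applied(A - 1, B))                   # [10, 11, 12]: a, a', b
print(revisions_applied(A, B))                       # [11, 12]: a', b (a skipped)
print(revisions_applied(*change_to_range(A_PRIME)))  # [11]: a' only
```

The second call is the "a is never merged back to foo" case: nothing in
the resulting revision on foo says that revision 10 was left out.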
This is what scares me about mapping SVN merges to Git merges. It seems post-1.5 merges have a lot more in common with Git than pre-1.5 (though mergeinfo is still brain damaged -- easy branching and merging is why I switched!), but I think we still need to support pre-1.5.
Thanks,
Stephen
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 21:27 ` Stephen Bash
@ 2010-10-21 22:49 ` Jakub Narebski
2010-10-21 23:26 ` Stephen Bash
0 siblings, 1 reply; 52+ messages in thread
From: Jakub Narebski @ 2010-10-21 22:49 UTC (permalink / raw)
To: Stephen Bash
Cc: Will Palmer, Ramkumar Ramachandra, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
On Thu, 21 Oct 2010, Stephen Bash wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
> > But because Subversion doesn't impose strict separation between branch
> > namespace and in-repository paths, somebody somewhere would certainly
> > at some time screw this up. And only then we would have to rely on
> > subtree merge / git-subtree split similarity detection.
>
> I don't have much experience with subtree merge... It's possible
> that will improve the situation.
I mean here the method used by the "subtree" merge strategy, not subtree
merge itself, i.e. the mechanism which makes git apply changes to a subtree
merged subproject at the correct place.
> > BTW. Subversion doesn't have "svn cherry-pick", nor equivalent to
> > "git revert" == "git cherry-pick -R"... well, at least I don't think it
> > has.
>
> See below...
Ah, I understand now that 'svn merge' (which is rather like 'cvs update')
can be used for cherry picking.
Sidenote: in Git cherry picking picks up change and applies it on top
of current branch as one would apply a patch. This is quite different
from merge, where you find common ancestor and then perform 3-way merge
(ours, theirs, ancestor). Is merging in Subversion using 3-way merge
(like 'cvs update -j ... -j ...' is), or re-applying changes?
> > I have read some documentation about svn:mergeinfo property:
> > http://svnbook.red-bean.com/en/1.5/svn.branchmerge.basicmerging.html
>
> I guess this is the first time I've read the 1.5 version of the SVN Book.
> This has consequences below...
Errr... what consequences? a:b vs a-b being closed (inclusive) or open
(exclusive) from one or other end?
> > ---1---B---2---3---M1--4---5---M2   <-- foo
> >         \         /           /
> >          \-a---b-/-----c---d-/      <-- bar
> >
> > B is branching point, M1 and M2 are merge commits.
> >
> > In Git, and I assume that also in Subversion, when doing merge M1, the
> > VCS notices that from revision B branches 'foo' and 'bar' have common
> > commits (in git we say that merge base of 'foo' and 'bar' at the point
> > of doing merge M1 is commit B).
>
> I'm going to take a little liberty with SVN revisions because I've
> always thought of SVN revisions as before and after the change, so a:b
> in SVN is the change introduced in b, but since we're on the Git list,
> in the following examples I will use a:b to mean the changes
> introduced in both a and b. (Since it was introduced, I've always
> read "svn diff -c rev" as "svn diff -r rev-1:rev")
"git show rev" always shows changes relative to the parent, i.e. the same as
"git diff rev^ rev" (rev^ ~= rev-1, if rev is not merge commit).
> Back to the task at hand... having read the 1.5 SVN docs, I have no
> idea how this works now (big caveat!!!), but prior to 1.5 M1 would
> have been
>
> svn switch svn://path/to/foo
> svn merge -ra:b svn://path/to/bar destination-path
>
> which is "Take the changes introduced in revisions a through b, and
> apply them to the destination-path". This is why I think of SVN
> merges as cherry-picks -- I was allowed to specify exactly what
> changesets I wanted merge to work on.
On the one hand you "were allowed to specify exactly what changesets
you wanted to merge to work on", on the other hand you *had* to
specify what changesets etc.
So it was "make branching easy and O(1)"... and they forgot that
branching standalone doesn't make much sense, and that easy *merging*
is also required. Merging in pre 1.5 times is as bad as in CVS.
> To truly illustrate this, consider a' is in between a and b:
>
> ---1---B---2---3-------M1--4---5---M2   <-- foo
>         \              /           /
>          \-a---a'---b-/-----c---d-/     <-- bar
>
> I could
>
> svn switch svn://path/to/foo
> svn merge -ra':b svn://path/to/bar destination-path
>
> and "a" would never be merged back to foo.
Such merge would be hard to represent in Git, I think.
> The concept of *not* specifying revision numbers to merge is new
> in 1.5. See
>
> http://svnbook.red-bean.com/en/1.4/svn.branchmerge.copychanges.html
>
> This is what scares me about mapping SVN merges to Git merges. It
> seems post-1.5 merges have a lot more in common with Git than pre-1.5
> (though mergeinfo is still brain damaged -- easy branching and merging
> is why I switched!), but I think we still need to support pre-1.5.
--
Jakub Narebski
Poland
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 22:49 ` Jakub Narebski
@ 2010-10-21 23:26 ` Stephen Bash
2010-10-22 10:38 ` Jakub Narebski
0 siblings, 1 reply; 52+ messages in thread
From: Stephen Bash @ 2010-10-21 23:26 UTC (permalink / raw)
To: Jakub Narebski
Cc: Will Palmer, Ramkumar Ramachandra, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
----- Original Message -----
> From: "Jakub Narebski" <jnareb@gmail.com>
> To: "Stephen Bash" <bash@genarts.com>
> Sent: Thursday, October 21, 2010 6:49:32 PM
> Subject: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
>
> Ah, I understand now that 'svn merge' (which is rather like 'cvs
> update')
> can be used for cherry picking.
>
> Sidenote: in Git cherry picking picks up change and applies it on top
> of current branch as one would apply a patch.
Yes.
> This is quite different
> from merge, where you find common ancestor and then perform 3-way merge
> (ours, theirs, ancestor).
Yes.
> Is merging in Subversion using 3-way merge
> (like 'cvs update -j ... -j ...' is), or re-applying changes?
Appears to be a 3-way merge if I'm reading the SVN archives correctly:
"It's a basic diff3 algorithm. 'man diff3' to learn about it and play
with GNU's implementation of diff3."
http://svn.haxx.se/users/archive-2005-03/1232.shtml
So my *guess* is they derive a common ancestor from their copy information, but I'm sure someone else more knowledgeable could say more about that.
> > > I have read some documentation about svn:mergeinfo property:
> > > http://svnbook.red-bean.com/en/1.5/svn.branchmerge.basicmerging.html
> >
> > I guess this is the first time I've read the 1.5 version of the SVN
> > Book.
> > This has consequences below...
>
> Errr... what consequences? a:b vs a-b being closed (inclusive) or open
> (exclusive) from one or other end?
No, just that post-1.5 merges do actually start to look more like Git merges.
> > Back to the task at hand... having read the 1.5 SVN docs, I have no
> > idea how this works now (big caveat!!!), but prior to 1.5 M1 would
> > have been
> >
> > svn switch svn://path/to/foo
> > svn merge -ra:b svn://path/to/bar destination-path
> >
> > which is "Take the changes introduced in revisions a through b, and
> > apply them to the destination-path". This is why I think of SVN
> > merges as cherry-picks -- I was allowed to specify exactly what
> > changesets I wanted merge to work on.
>
> On the one hand you "were allowed to specify exactly what changesets
> you wanted to merge to work on", on the other hand you *had* to
> specify what changesets etc.
My point is because the user was required to specify the revisions to merge, I don't think an automated tool (i.e. the mapper) can make assumptions about what was actually merged in any given revision.
> > To truly illustrate this, consider a' is in between a and b:
> >
> > ---1---B---2---3-------M1--4---5---M2   <-- foo
> >         \              /           /
> >          \-a---a'---b-/-----c---d-/     <-- bar
> >
> > I could
> >
> > svn switch svn://path/to/foo
> > svn merge -ra':b svn://path/to/bar destination-path
> >
> > and "a" would never be merged back to foo.
>
> Such merge would be hard to represent in Git, I think.
I agree.
Thanks,
Stephen
* Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)
2010-10-21 23:26 ` Stephen Bash
@ 2010-10-22 10:38 ` Jakub Narebski
0 siblings, 0 replies; 52+ messages in thread
From: Jakub Narebski @ 2010-10-22 10:38 UTC (permalink / raw)
To: Stephen Bash
Cc: Will Palmer, Ramkumar Ramachandra, Matt Stump, git,
Jonathan Nieder, David Michael Barr, Sverre Rabbelier,
Tomas Carnecky
On Fri, 22 Oct 2010, Stephen Bash wrote:
> ----- Original Message -----
> > From: "Jakub Narebski" <jnareb@gmail.com>
> > Ah, I understand now that 'svn merge' (which is rather like 'cvs
> > update') can be used for cherry picking.
> >
> > Sidenote: in Git cherry picking picks up change and applies it on top
> > of current branch as one would apply a patch.
>
> Yes.
>
> > This is quite different
> > from merge, where you find common ancestor and then perform 3-way merge
> > (ours, theirs, ancestor).
>
> Yes.
Well, I guess that 'svn merge -rN' (merging in a single revision) works
similarly to how git-cherry-pick works.
> > Is merging in Subversion using 3-way merge
> > (like 'cvs update -j ... -j ...' is), or re-applying changes?
>
> Appears to be a 3-way merge if I'm reading the SVN archives correctly:
> "It's a basic diff3 algorithm. 'man diff3' to learn about it and play
> with GNU's implementation of diff3."
> http://svn.haxx.se/users/archive-2005-03/1232.shtml
>
> So my *guess* is they derive a common ancestor from their copy
> information, but I'm sure someone else more knowledgeable could say
> more about that.
I guess that in Subversion <= 1.4 it takes N in 'svn merge -rN:M' as an
ancestor version for 3-way merge, and that in Subversion >= 1.5 it takes
last merged in state (from 'svn:mergeinfo' property[1]) if branch is
merged subsequent time, or first common revision for both branches[2]
if it is first merge.
[1] The 'svn:mergeinfo' is about "merged-in tracking" rather than about
"merge tracking". Though change in 'svn:mergeinfo' indicates a
merge commit.
[2] I guess that the need to find such a common ancestor (common
revision) on first merge is the reason why merging branch into trunk
.---B---.---.---.---M---.
     \             /
      \---.---.---/
and merging trunk into branch
.---B---.---.---.---.---.
     \           \
      \---.---.---M---.
need to be distinguished manually (by way of the '--reintegrate' option).
> > > > I have read some documentation about svn:mergeinfo property:
> > > > http://svnbook.red-bean.com/en/1.5/svn.branchmerge.basicmerging.html
> > >
> > > I guess this is the first time I've read the 1.5 version of the SVN
> > > Book.
> > > This has consequences below...
> >
> > Errr... what consequences? a:b vs a-b being closed (inclusive) or open
> > (exclusive) from one or other end?
>
> No, just that post-1.5 merges do actually start to look more like Git
> merges.
Well, at least they can be unambiguously detected, instead of relying on
parsing commit message of merge commit.
> > > Back to the task at hand... having read the 1.5 SVN docs, I have no
> > > idea how this works now (big caveat!!!), but prior to 1.5 M1 would
> > > have been
> > >
> > > svn switch svn://path/to/foo
> > > svn merge -ra:b svn://path/to/bar destination-path
> > >
> > > which is "Take the changes introduced in revisions a through b, and
> > > apply them to the destination-path". This is why I think of SVN
> > > merges as cherry-picks -- I was allowed to specify exactly what
> > > changesets I wanted merge to work on.
> >
> > On the one hand you "were allowed to specify exactly what changesets
> > you wanted to merge to work on", on the other hand you *had* to
> > specify what changesets etc.
>
> My point is because the user was required to specify the revisions
> to merge, I don't think an automated tool (i.e. the mapper) can make
> assumptions about what was actually merged in any given revision.
The problem is even detecting that it was a merge and not an ordinary
commit (well, unless some commit convention was used for merge commits,
but how likely is it that it was applied thoroughly, consistently, and
without mistakes that would trip up the parser of a merge detector?).
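To show how brittle such a detector is, here is a rough Python sketch of
one. Everything in it is hypothetical: the message convention ("Merged
r10:12 from branches/bar"), the regular expression, and the function
name are invented for illustration and do not come from any real
repository or tool.

```python
import re

# Hypothetical convention: "Merged r10:12 from branches/bar"
# or "Merge -r 10:12 from branches/bar".
MERGE_RE = re.compile(
    r"\bmerged?\b.*?\br?(\d+)\s*[:-]\s*r?(\d+)\b.*?\b(?:from|of)\s+(\S+)",
    re.IGNORECASE,
)


def guess_merge(message):
    """Return (start, end, source) if the log message looks like a merge."""
    match = MERGE_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


print(guess_merge("Merged r10:12 from branches/bar"))  # (10, 12, 'branches/bar')
print(guess_merge("merged bar into foo"))              # None: a merge, but missed
print(guess_merge("Fix typo in merge code"))           # None: correctly ignored
```

The second message is a real merge written without revision numbers, so
the heuristic misses it. That is the failure mode described above.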
> > > To truly illustrate this, consider a' is in between a and b:
> > >
> > > ---1---B---2---3-------M1--4---5---M2   <-- foo
> > >         \              /           /
> > >          \-a---a'---b-/-----c---d-/     <-- bar
> > >
> > > I could
> > >
> > > svn switch svn://path/to/foo
> > > svn merge -ra':b svn://path/to/bar destination-path
> > >
> > > and "a" would never be merged back to foo.
> >
> > Such merge would be hard to represent in Git, I think.
>
> I agree.
Well, at least in a way that merge in git would consider the same
revisions as already applied as Subversion would when merging.
--
Jakub Narebski
Poland
end of thread, other threads:[~2010-10-22 10:38 UTC | newest]
Thread overview: 52+ messages
2010-10-13 15:44 Speeding up the initial git-svn fetch Matt Stump
2010-10-13 16:02 ` Stephen Bash
2010-10-13 17:47 ` Matt Stump
2010-10-13 18:18 ` Stephen Bash
2010-10-14 16:22 ` Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch) Stephen Bash
2010-10-14 16:34 ` Jonathan Nieder
2010-10-14 20:07 ` Sverre Rabbelier
2010-10-15 14:50 ` Stephen Bash
2010-10-15 23:39 ` Sverre Rabbelier
2010-10-16 0:16 ` Stephen Bash
2010-10-17 2:25 ` Sverre Rabbelier
2010-10-17 3:33 ` David Michael Barr
2010-10-18 5:17 ` Ramkumar Ramachandra
2010-10-18 7:31 ` Jonathan Nieder
2010-10-18 16:38 ` Ramkumar Ramachandra
2010-10-18 16:46 ` Sverre Rabbelier
2010-10-18 16:56 ` Jonathan Nieder
2010-10-18 17:16 ` Ramkumar Ramachandra
2010-10-18 17:18 ` Sverre Rabbelier
2010-10-18 17:28 ` Jonathan Nieder
2010-10-18 18:10 ` Sverre Rabbelier
2010-10-18 18:13 ` Jonathan Nieder
2010-10-18 18:20 ` Sverre Rabbelier
2010-10-18 18:25 ` Jonathan Nieder
2010-10-18 18:35 ` Sverre Rabbelier
2010-10-18 19:33 ` Jonathan Nieder
2010-10-19 3:08 ` Ramkumar Ramachandra
2010-10-19 0:40 ` Stephen Bash
2010-10-19 1:42 ` Stephen Bash
2010-10-19 6:42 ` Ramkumar Ramachandra
2010-10-19 13:33 ` Stephen Bash
2010-10-19 14:28 ` David Michael Barr
2010-10-19 14:57 ` Stephen Bash
2010-10-20 8:39 ` Will Palmer
2010-10-20 11:59 ` Jakub Narebski
2010-10-20 13:42 ` Will Palmer
2010-10-20 20:44 ` Jakub Narebski
2010-10-21 1:54 ` mrevilgnome
2010-10-21 8:16 ` Jakub Narebski
2010-10-21 13:49 ` Stephen Bash
2010-10-21 9:08 ` Will Palmer
2010-10-21 14:00 ` Stephen Bash
2010-10-21 18:37 ` Jakub Narebski
2010-10-21 21:27 ` Stephen Bash
2010-10-21 22:49 ` Jakub Narebski
2010-10-21 23:26 ` Stephen Bash
2010-10-22 10:38 ` Jakub Narebski
2010-10-21 15:52 ` Jakub Narebski
2010-10-21 16:16 ` Jonathan Nieder
2010-10-20 14:05 ` Ramkumar Ramachandra
2010-10-20 14:21 ` Stephen Bash
2010-10-20 16:56 ` Ramkumar Ramachandra