* [PATCH] contrib/svn-fe: Fast script to remap svn history
@ 2010-10-07 6:06 David Barr
2010-10-07 6:29 ` Sverre Rabbelier
2010-11-21 5:17 ` Jonathan Nieder
0 siblings, 2 replies; 7+ messages in thread
From: David Barr @ 2010-10-07 6:06 UTC
To: Git Mailing List
Cc: Jonathan Nieder, Sverre Rabbelier, Ramkumar Ramachandra,
David Barr
This python script walks the commit sequence imported by svn-fe.
For each commit, it tries to identify the branch that was changed.
Commits are rewritten to be rooted according to the standard layout.
A basic heuristic of matching trees is used to find parents for the
first commit in a branch and for tags.
Signed-off-by: David Barr <david.barr@cordelta.com>
---
contrib/svn-fe/svn-filter-root.py | 107 +++++++++++++++++++++++++++++++++++++
fast-import.c | 9 +++
2 files changed, 116 insertions(+), 0 deletions(-)
create mode 100755 contrib/svn-fe/svn-filter-root.py
diff --git a/contrib/svn-fe/svn-filter-root.py b/contrib/svn-fe/svn-filter-root.py
new file mode 100755
index 0000000..72d248f
--- /dev/null
+++ b/contrib/svn-fe/svn-filter-root.py
@@ -0,0 +1,107 @@
+#!/usr/bin/python
+from subprocess import *
+import re
+import os
+
+subroot_re = re.compile("^trunk|^branches/[^/]*|^tags/[^/]*")
+
+tree_re = re.compile("^tree ([0-9a-f]{40})", flags=re.MULTILINE)
+parent_re = re.compile("^parent ([0-9a-f]{40})", flags=re.MULTILINE)
+author_re = re.compile("^author (.*)$", flags=re.MULTILINE)
+committer_re = re.compile("^committer (.*)$", flags=re.MULTILINE)
+
+git_svn_id_re = re.compile("^git-svn-id[^@]*", flags=re.MULTILINE)
+
+ref_commit = {}
+tree_commit = {}
+count = 1
+
+# Open a cat-file process for subtree lookups
+subtree_process = Popen(["git","cat-file","--batch-check"], stdin=PIPE, stdout=PIPE)
+
+# Iterate over commits from subversion imported with svn-fe
+revlist = Popen(["git","rev-list","--reverse","--topo-order","--default","HEAD"], stdout=PIPE)
+cat_file = Popen(["git","cat-file","--batch"], stdin=revlist.stdout, stdout=PIPE)
+object_header = cat_file.stdout.readline().strip().split(" ");
+while len(object_header) == 3:
+ object_body = cat_file.stdout.read(int(object_header[2]))
+ cat_file.stdout.read(1)
+ git_commit = object_header[0]
+ (commit_header, blank_line, commit_message) = object_body.partition("\n\n")
+ object_header = cat_file.stdout.readline().strip().split(" ");
+
+ author = author_re.search(commit_header).group()
+ committer = committer_re.search(commit_header).group()
+
+ # Diff against the empty tree if no parent
+ match = parent_re.search(commit_header)
+ if match:
+ parent = match.group(1)
+ else:
+ parent = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
+
+ # Find a common path prefix in the changes for the revision
+ subroot = ""
+ changes = Popen(["git","diff","--name-only",parent,git_commit], stdout=PIPE)
+ for path in changes.stdout:
+ match = subroot_re.match(path)
+ if match:
+ subroot = match.group()
+ changes.terminate()
+ break
+
+ # Attempt to rewrite the commit on top of the matching branch
+ if subroot == "":
+ print "progress Weird commit - no subroot."
+ else:
+ # Rewrite git-svn-id in the log to point to the subtree
+ commit_message = git_svn_id_re.sub('\g<0>/'+subroot, commit_message)
+ subtree_process.stdin.write(git_commit+":"+subroot+"\n")
+ subtree_process.stdin.flush()
+ subtree_line = subtree_process.stdout.readline()
+ if re.match("^.*missing$", subtree_line):
+ print "progress Weird commit - invalid subroot"
+ continue
+ subtree = subtree_line[0:40]
+ # Map the svn tag/branch name to a git-friendly one
+ ref = "refs/heads/" + re.sub(" ", "%20", subroot)
+ # Choose a parent for the rewritten commit
+ if ref in ref_commit:
+ parent = ref_commit[ref]
+ elif subtree in tree_commit:
+ parent = tree_commit[subtree]
+ else:
+ parent = ""
+ # Update tags if necessary
+ if re.match("^refs/heads/tags/", ref):
+ if parent == "":
+ print "progress Weird tag - no matching commit."
+ else:
+ tagname = ref[16:]
+ print "tag "+tagname
+ print "from "+parent
+ print "tagger "+committer[10:]
+ print "data "+str(len(commit_message))
+ print commit_message
+ else:
+ # Default to trunk if the branch is new
+ if parent == "" and "refs/heads/trunk" in ref_commit:
+ parent = ref_commit["refs/heads/trunk"]
+ print "commit "+ref
+ print "mark :"+str(count)
+ print author
+ print committer
+ print "data "+str(len(commit_message))
+ print commit_message
+ if parent != "":
+ print "from "+parent
+ print "M 040000 "+subtree+" \"\""
+ commit = ":"+str(count)
+ # Advance the matching branch
+ ref_commit[ref] = commit
+ # Update latest commit by tree to drive parent matching
+ tree_commit[subtree] = commit
+ print "progress " + str(count)
+ count = count + 1
+
+subtree_process.terminate()
diff --git a/fast-import.c b/fast-import.c
index 2317b0f..8f68a89 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1454,6 +1454,15 @@ static int tree_content_set(
n = slash1 - p;
else
n = strlen(p);
+ if (!slash1 && !n) {
+ if (!S_ISDIR(mode))
+ die("Root cannot be a non-directory");
+ hashcpy(root->versions[1].sha1, sha1);
+ if (root->tree)
+ release_tree_content_recursive(root->tree);
+ root->tree = subtree;
+ return 1;
+ }
if (!n)
die("Empty path component found in input");
if (!slash1 && !S_ISDIR(mode) && subtree)
--
1.7.3.4.g45608.dirty
* Re: [PATCH] contrib/svn-fe: Fast script to remap svn history
2010-10-07 6:06 [PATCH] contrib/svn-fe: Fast script to remap svn history David Barr
@ 2010-10-07 6:29 ` Sverre Rabbelier
2010-10-07 7:17 ` David Michael Barr
2010-10-07 8:28 ` Jonathan Nieder
2010-11-21 5:17 ` Jonathan Nieder
1 sibling, 2 replies; 7+ messages in thread
From: Sverre Rabbelier @ 2010-10-07 6:29 UTC
To: David Barr; +Cc: Git Mailing List, Jonathan Nieder, Ramkumar Ramachandra
Heya,
On Thu, Oct 7, 2010 at 08:06, David Barr <david.barr@cordelta.com> wrote:
> This python script walks the commit sequence imported by svn-fe.
> For each commit, it tries to identify the branch that was changed.
> Commits are rewritten to be rooted according to the standard layout.
> A basic heuristic of matching trees is used to find parents for the
> first commit in a branch and for tags.
Nice, how easy would it be to extend it to deal with other layouts?
> diff --git a/fast-import.c b/fast-import.c
> index 2317b0f..8f68a89 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -1454,6 +1454,15 @@ static int tree_content_set(
> n = slash1 - p;
> else
> n = strlen(p);
> + if (!slash1 && !n) {
> + if (!S_ISDIR(mode))
> + die("Root cannot be a non-directory");
> + hashcpy(root->versions[1].sha1, sha1);
> + if (root->tree)
> + release_tree_content_recursive(root->tree);
> + root->tree = subtree;
> + return 1;
> + }
> if (!n)
> die("Empty path component found in input");
> if (!slash1 && !S_ISDIR(mode) && subtree)
What is this hunk about?
--
Cheers,
Sverre Rabbelier
* Re: [PATCH] contrib/svn-fe: Fast script to remap svn history
2010-10-07 6:29 ` Sverre Rabbelier
@ 2010-10-07 7:17 ` David Michael Barr
2010-10-07 8:28 ` Jonathan Nieder
1 sibling, 0 replies; 7+ messages in thread
From: David Michael Barr @ 2010-10-07 7:17 UTC
To: Sverre Rabbelier; +Cc: Git Mailing List, Jonathan Nieder, Ramkumar Ramachandra
Hi,
>> This python script walks the commit sequence imported by svn-fe.
>> For each commit, it tries to identify the branch that was changed.
>> Commits are rewritten to be rooted according to the standard layout.
>> A basic heuristic of matching trees is used to find parents for the
>> first commit in a branch and for tags.
>
> Nice, how easy would it be to extend it to deal with other layouts?
I think it's just a matter of adjusting the regular expression to match roots
and the mapping from roots to refs.
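As a rough illustration of what that adjustment could look like, here is a small Python sketch that bundles a root-matching pattern with a root-to-ref mapping. The helper `make_layout` and the per-project layout are hypothetical and not part of the patch; only the first pattern and the `%20` substitution are taken from the script.

```python
import re

def make_layout(pattern, to_ref):
    """Pair a root-matching regex with a root-to-ref mapping (hypothetical helper)."""
    regex = re.compile(pattern)

    def ref_for(path):
        # Return the git ref for a changed path, or None if no root matches
        match = regex.match(path)
        return to_ref(match.group()) if match else None

    return ref_for

def heads_ref(root):
    # Same mapping as the script: spaces are not allowed in ref names
    return "refs/heads/" + root.replace(" ", "%20")

# Standard layout, using the pattern from the script
standard = make_layout(r"^trunk|^branches/[^/]*|^tags/[^/]*", heads_ref)

# Assumed alternative layout: <project>/trunk, <project>/branches/<name>, ...
per_project = make_layout(
    r"^[^/]+/(?:trunk|branches/[^/]*|tags/[^/]*)", heads_ref)
```

With these definitions, `per_project("gui/branches/1.0/main.c")` maps to `refs/heads/gui/branches/1.0`, while a path outside any root maps to `None`.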
>> diff --git a/fast-import.c b/fast-import.c
>> index 2317b0f..8f68a89 100644
>> --- a/fast-import.c
>> +++ b/fast-import.c
>> @@ -1454,6 +1454,15 @@ static int tree_content_set(
>> n = slash1 - p;
>> else
>> n = strlen(p);
>> + if (!slash1 && !n) {
>> + if (!S_ISDIR(mode))
>> + die("Root cannot be a non-directory");
>> + hashcpy(root->versions[1].sha1, sha1);
>> + if (root->tree)
>> + release_tree_content_recursive(root->tree);
>> + root->tree = subtree;
>> + return 1;
>> + }
>> if (!n)
>> die("Empty path component found in input");
>> if (!slash1 && !S_ISDIR(mode) && subtree)
>
> What is this hunk about?
My bad, that belongs in a separate commit. I'll break it out after review.
The subject would read: "fast-import: Allow filemodify to set the root".
--
David Barr
* Re: [PATCH] contrib/svn-fe: Fast script to remap svn history
2010-10-07 6:29 ` Sverre Rabbelier
2010-10-07 7:17 ` David Michael Barr
@ 2010-10-07 8:28 ` Jonathan Nieder
1 sibling, 0 replies; 7+ messages in thread
From: Jonathan Nieder @ 2010-10-07 8:28 UTC
To: Sverre Rabbelier; +Cc: David Barr, Git Mailing List, Ramkumar Ramachandra
Sverre Rabbelier wrote:
> On Thu, Oct 7, 2010 at 08:06, David Barr <david.barr@cordelta.com> wrote:
>> --- a/fast-import.c
>> +++ b/fast-import.c
>> @@ -1454,6 +1454,15 @@ static int tree_content_set(
>> n = slash1 - p;
>> else
>> n = strlen(p);
>> + if (!slash1 && !n) {
>> + if (!S_ISDIR(mode))
>> + die("Root cannot be a non-directory");
>> + hashcpy(root->versions[1].sha1, sha1);
>> + if (root->tree)
>> + release_tree_content_recursive(root->tree);
>> + root->tree = subtree;
>> + return 1;
>> + }
>> if (!n)
>> die("Empty path component found in input");
>> if (!slash1 && !S_ISDIR(mode) && subtree)
>
> What is this hunk about?
Ooh, ack for this part (though I agree with you that it ought to be
explained in the log message).
Most git commands do their writing to the object db via the index and
loose objects. When you just have a pile of trees you want to convert
into commits, this is wasteful; for performance-critical operations
like filter-branch --subdirectory-filter, one might want a sort of
hash-object --batch-to-pack to write a pack directly.
Fortunately we have fast-import (which is one of the only git commands
that will write to a pack directly) but there is not an advertised way
to tell fast-import to use a given tree for its commits. So in
current git, one has the unpleasant choice of writing loose objects
without parsing the trees or writing straight to pack but having to
parse trees to do it.
This patch changes that, by allowing
M 040000 <tree id> ""
as a filemodify line in a commit to reset to a particular tree without
any need to unpack it. For example,
M 040000 4b825dc642cb6eb9a060e54bf8d69288fbee4904 ""
is a synonym for the deleteall command.
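To make the shape of such a stream concrete, here is a minimal Python sketch that formats one commit whose whole tree is set by a single filemodify line. The helper name `format_root_commit` and its parameters are illustrative only; the sketch assumes a fast-import that accepts the empty path, as this patch proposes.

```python
def format_root_commit(ref, mark, committer, message, tree, parent=None):
    """Format one fast-import commit whose root is replaced by `tree`.

    Hypothetical helper for illustration. `committer` is the text that
    follows the "committer" keyword: "Name <email> <timestamp> <tz>".
    """
    body = message.encode("utf-8")
    lines = [
        "commit %s" % ref,
        "mark :%d" % mark,
        "committer %s" % committer,
        # fast-import counts the payload in bytes, not characters
        "data %d" % len(body),
        message,
    ]
    if parent:
        # "from" must precede the file commands
        lines.append("from %s" % parent)
    # Empty quoted path: set the root tree directly (needs this patch)
    lines.append('M 040000 %s ""' % tree)
    return "\n".join(lines) + "\n"
```

The design point is that the frontend only ever names a tree id; it never has to list the tree's entries one by one.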
* Re: [PATCH] contrib/svn-fe: Fast script to remap svn history
2010-10-07 6:06 [PATCH] contrib/svn-fe: Fast script to remap svn history David Barr
2010-10-07 6:29 ` Sverre Rabbelier
@ 2010-11-21 5:17 ` Jonathan Nieder
2010-11-22 14:01 ` Stephen Bash
1 sibling, 1 reply; 7+ messages in thread
From: Jonathan Nieder @ 2010-11-21 5:17 UTC
To: David Barr
Cc: Git Mailing List, Sverre Rabbelier, Ramkumar Ramachandra,
Eric Wong
Hi David,
David Barr wrote:
> This python script walks the commit sequence imported by svn-fe.
> For each commit, it tries to identify the branch that was changed.
> Commits are rewritten to be rooted according to the standard layout.
I like the idea and especially that the heuristics are simple.
Maybe this could be made git-agnostic using the new ls-tree command
you are introducing in fast-import? Though it would need to get a
revision list from somewhere. Alternatively, do you think it would
make sense for something like this to be implemented as a filter or
observer of the fast-import stream as it is generated during an
import?
> A basic heuristic of matching trees is used to find parents for the
> first commit in a branch and for tags.
More precisely, the rule used is:
> +    # Find a common path prefix in the changes for the revision
> +    subroot = ""
> +    changes = Popen(["git","diff","--name-only",parent,git_commit], stdout=PIPE)
> +    for path in changes.stdout:
> +        match = subroot_re.match(path)
> +        if match:
> +            subroot = match.group()
> +            changes.terminate()
> +            break
The first change lying in one of
trunk
branches/*
tags/*
determines the branch. When a branch is renamed, this has a 50/50
chance of choosing the right branch.
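The rule can be isolated as a small function, which also shows why a rename is a coin toss. This is a sketch, not the script itself: the name `find_subroot` is made up here, and it takes an already-collected list of changed paths instead of reading them from `git diff --name-only`.

```python
import re

# Same pattern as the script: trunk, branches/<name> or tags/<name>
SUBROOT_RE = re.compile(r"^trunk|^branches/[^/]*|^tags/[^/]*")

def find_subroot(changed_paths):
    """Return the subroot of the first changed path that has one, else ""."""
    for path in changed_paths:
        match = SUBROOT_RE.match(path)
        if match:
            return match.group()
    return ""
```

For a revision that renames branches/old to branches/new, both prefixes appear among the changed paths, so the answer depends only on which one happens to be listed first: `find_subroot(["branches/new/a.c", "branches/old/a.c"])` picks `branches/new`, and the reversed list picks `branches/old`.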
> +        # Choose a parent for the rewritten commit
> +        if ref in ref_commit:
> +            parent = ref_commit[ref]
> +        elif subtree in tree_commit:
> +            parent = tree_commit[subtree]
> +        else:
> +            parent = ""
If this is a live branch, the parent is the last commit from that
branch. Otherwise, we take the last commit whose resulting tree
looked like this one. Or...
> +            # Default to trunk if the branch is new
> +            if parent == "" and "refs/heads/trunk" in ref_commit:
> +                parent = ref_commit["refs/heads/trunk"]
... if all else fails, we take the tip commit on the trunk.
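Put together, the three fallbacks for a branch commit amount to something like the following sketch. The function name `choose_parent` and the `trunk` parameter are assumptions made for illustration; in the script the trunk fallback applies only to branches, and tags stop after the second step.

```python
def choose_parent(ref, subtree, ref_commit, tree_commit,
                  trunk="refs/heads/trunk"):
    """Pick a parent for a rewritten branch commit; "" means no parent.

    ref_commit maps ref name -> latest commit on that ref,
    tree_commit maps tree id -> latest commit that produced that tree.
    """
    if ref in ref_commit:
        # Live branch: continue from its tip
        return ref_commit[ref]
    if subtree in tree_commit:
        # New branch: most recent commit with an identical tree
        return tree_commit[subtree]
    # Last resort: the trunk tip, if trunk exists yet
    return ref_commit.get(trunk, "")
```

Note that the tree match only helps when the first commit on the new branch leaves the copied tree unchanged; a copy that is modified in the same revision falls through to the trunk tip.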
For comparison, here's the git-svn rule:
> # look for a parent from another branch:
> my @b_path_components = split m#/#, $self->{path};
Among the paths above this commit's base directory [if this is
branches/foo, examine first branches/foo, then branches, then /]:
> while (@b_path_components) {
> $i = $paths->{'/'.join('/', @b_path_components)};
> last if $i && defined $i->{copyfrom_path};
> unshift(@a_path_components, pop(@b_path_components));
> }
> return undef unless defined $i && defined $i->{copyfrom_path};
Find the first one with copyfrom information (i.e., that was
renamed or copied from another rev in this revision).
> my $branch_from = $i->{copyfrom_path};
> if (@a_path_components) {
> print STDERR "branch_from: $branch_from => ";
> $branch_from .= '/'.join('/', @a_path_components);
> print STDERR $branch_from, "\n";
> }
Build back up the URL (so if branches was renamed to Branches but
branches/foo had no copyfrom information, we look for Branches/foo).
[...]
> my $gs = $self->other_gs($new_url, $url,
> $branch_from, $r, $self->{ref_id});
> my ($r0, $parent) = $gs->find_rev_before($r, 1);
Find the last revision that changed that path and record it.
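For readers who do not speak Perl, the path-walking part of that rule can be transliterated roughly as follows. This is a sketch of the quoted loop only, with assumed names: `paths` stands in for the changed-paths table of one revision (keys with a leading slash, values holding an optional `copyfrom_path`), and the final find_rev_before lookup is left out.

```python
def find_branch_from(path, paths):
    """Return the copy source for `path`, or None if no ancestor was copied.

    Walks from `path` up towards the root, looking for the nearest
    directory that carries copyfrom information in `paths`.
    """
    before = path.split("/")   # components still to be looked up
    after = []                 # components already stripped off the end
    while before:
        info = paths.get("/" + "/".join(before))
        if info and info.get("copyfrom_path") is not None:
            # Re-append the stripped components to the copy source
            return "/".join([info["copyfrom_path"]] + after)
        after.insert(0, before.pop())
    return None
```

For example, with `paths = {"/branches": {"copyfrom_path": "/Branches"}}`, looking up `branches/foo` yields `/Branches/foo`.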
Maybe we could benefit from including the copyfrom information in the
fast-import stream output by svn-fe somehow? The simplest way to do
this would be some specially formatted comments. An alternative (in
the spirit of Sam's earlier suggestions) might be to represent it in
the tree svn-fe creates, for example by introducing dummy
foo.copiedfrom
symlinks.
Thanks, that was interesting.
Jonathan
* Re: [PATCH] contrib/svn-fe: Fast script to remap svn history
2010-11-21 5:17 ` Jonathan Nieder
@ 2010-11-22 14:01 ` Stephen Bash
2010-11-22 17:42 ` Jonathan Nieder
0 siblings, 1 reply; 7+ messages in thread
From: Stephen Bash @ 2010-11-22 14:01 UTC
To: Jonathan Nieder
Cc: Git Mailing List, Sverre Rabbelier, Ramkumar Ramachandra,
Eric Wong, David Barr
----- Original Message -----
> From: "Jonathan Nieder" <jrnieder@gmail.com>
> Sent: Sunday, November 21, 2010 12:17:34 AM
> Subject: Re: [PATCH] contrib/svn-fe: Fast script to remap svn history
>
> Maybe we could benefit from including the copyfrom information in the
> fast-import stream output by svn-fe somehow?
This has been discussed (and IMO it is essentially required to achieve high accuracy in the mapping):
http://thread.gmane.org/gmane.comp.version-control.git/158940/focus=159331
Thanks,
Stephen
* Re: [PATCH] contrib/svn-fe: Fast script to remap svn history
2010-11-22 14:01 ` Stephen Bash
@ 2010-11-22 17:42 ` Jonathan Nieder
0 siblings, 0 replies; 7+ messages in thread
From: Jonathan Nieder @ 2010-11-22 17:42 UTC
To: Stephen Bash
Cc: Git Mailing List, Sverre Rabbelier, Ramkumar Ramachandra,
Eric Wong, David Barr, Sam Vilain
Stephen Bash wrote:
> This has been discussed (and IMO it is essentially required to achieve high accuracy in the mapping):
>
> http://thread.gmane.org/gmane.comp.version-control.git/158940/focus=159331
I think the suggestion of that thread was (tweaked a little)
something like this:
- List of directories with copyfrom information.
Prune them so no listed directory is an ancestor of another.
The result would usually be a single directory name.
- Record that directory's (or those directories') copyfrom
information in the log message.
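The pruning step might look like the sketch below. It assumes that "prune" means keeping the outermost directories and dropping anything nested inside them, which is one reading of the suggestion; the function name `prune_nested` is made up.

```python
def prune_nested(dirs):
    """Drop every directory that lies inside another listed directory."""
    kept = []
    # A directory sorts before everything beneath it, so each ancestor
    # is considered before its descendants.
    for d in sorted(set(p.rstrip("/") for p in dirs)):
        if not any(d.startswith(k + "/") for k in kept):
            kept.append(d)
    return kept
```

The trailing slash in the prefix test matters: `branches/foo-bar` is not inside `branches/foo` and must survive.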
In general, I don't like limiting the information accessible to branch
mappers this way. Maybe a branch mapper would like to look at the
copyfrom information for files instead of directories. But this does
have the advantages of being simple and of not littering imported
trees with spurious files.
It also leaves open the question of how we would record unhandled node
properties (like svn:ignore and svn:eol-style) and empty directories, if at
all.
Probably in the end we will have to give up and provide multiple
options to choose between. :)
Jonathan