* [GSoC update extra!] git-remote-svn: Week 8
From: Ramkumar Ramachandra @ 2010-06-24 13:33 UTC (permalink / raw)
  To: Git Mailing List
  Cc: David Michael Barr, Jonathan Nieder, Sverre Rabbelier,
	Shawn O. Pearce, Daniel Shahaf

Hi,

I built a client that generates a deltified dumpfile, but David's
exporter can't build full text from it. We got an idea, brought Sverre
into the loop and discussed it. David and I felt that the chatlog was
valuable enough to go on the Git mailing list for later reference, so
here it is.

This proposal will take a lot of time to implement. The current plan
is to get my client to dump full text (rather inefficiently), and get
something working while Sverre (and possibly others) work on this
proposal in the meantime.

Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
---
 notes |  117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 117 insertions(+), 0 deletions(-)
 create mode 100644 notes

diff --git a/notes b/notes
new file mode 100644
index 0000000..224a5be
--- /dev/null
+++ b/notes
@@ -0,0 +1,117 @@
+14:19 *** SRabbelier JOIN
+14:19 <barrbrain> artagnon: backstory please :)
+14:20 <barrbrain> SRabbelier: welcome
+14:20 <SRabbelier> sup :)
+14:20 <artagnon> SRabbelier: Background -- I generated a delta dumpfile fine. barrbrain's exporter can't build full text from that, and I can't do it in my client either. We are looking for some kind of filesystem -- the Git
+                 store is the answer. Can we import revisions one by one through fast-import and ask Git to generate the full text?
+14:22 <artagnon> SRabbelier: Note that barrbrain's recent work on persistence makes it support incremental dumps, so throwing it into fast-import is not the problem at all.
+14:22 <SRabbelier> artagnon: well, that kind of goes against fast-import's nature, since if you do one revision at a time you lose the benefits of using it (low memory profile, etc)
+14:22 <artagnon> Hm.
+14:23 <artagnon> Is there some option to tweak it to create just loose objects and not pack them for every revision?
+14:23 <artagnon> And then pack them all together after we produce a full text in the end?
+14:23 <SRabbelier> artagnon: it doesn't create a packfile until you call 'checkpoint'
+14:24 <artagnon> Excellent. Then what's the problem? I won't call checkpoint until we get the full text out.
+14:25 <SRabbelier> perhaps I'm not quite understanding the problem then
+14:25 <SRabbelier> you can definitely call 'git cat-file' after each exported file
+14:25 <SRabbelier> and fast-import would just hang waiting for more input I guess
+14:26 <artagnon> barrbrain: Does that ^^ sound good?
+14:26 <barrbrain> artagnon: sounds good
+14:27 <artagnon> SRabbelier: Thanks, that's what we wanted to know :)
+14:27 <barrbrain> SRabbelier: how does the exporter know when the checkpoint has completed?
+14:27 <SRabbelier> barrbrain: it doesn't matter
+14:27 <SRabbelier> barrbrain: the object is available regardless of whether you checkpoint
+14:28 <barrbrain> oh really?
+14:28 <SRabbelier> barrbrain: it doesn't matter if the object is loose or in a pack
+14:28 <SRabbelier> barrbrain: it's transparent to whoever is calling git
+14:28 <artagnon> loose object and pack are equivalent to Git infrastructure
+14:28 <SRabbelier> right, only packs are faster
+14:28 <barrbrain> even in the middle of a fast-import?
+14:28 [artagnon nods]
+14:29 <barrbrain> the catch is that the only reference we have for a blob is a mark
+14:29 <artagnon> You have to keep all the marks until the fulltext is built?
+14:30 <SRabbelier> barrbrain: ah, well, that is a problem
+14:30 <SRabbelier> also, remind me why you need the fulltext?
+14:30 <barrbrain> the only way out of that is to hash them as we go
+14:31 <barrbrain> input stream is in deltas
+14:32 <artagnon> barrbrain: ... also, what about text-content-length? You need that to parse the dumpfile, right?
+14:32 <barrbrain> yeah
+14:32 <artagnon> Erm.
+14:32 <artagnon> It'll be horrible if replay doesn't supply that.
+14:32 <barrbrain> indeed
+14:33 <artagnon> I told SRabbelier about that too... loading the whole thing into memory is a terrible option.
+14:34 <SRabbelier> artagnon: well, git cat-file-ing it will require it to be loaded too?
+14:34 <artagnon> Oh, damn.
+14:35 <artagnon> SRabbelier: Is there a way to dynamically write it from a stream? I don't want git-cat-file then.
+14:35 <SRabbelier> artagnon: oh like that
+14:35 <SRabbelier> artagnon: uhm, I assume it streams it
+14:35 <SRabbelier> artagnon: but you'd have to double check that
+14:35 [artagnon checks]
+14:40 <barrbrain> so, if you have pipes to fast-import and cat-file --batch, can you read-write from the repo safely?
+14:40 <SRabbelier> barrbrain: yes, git is concurrency safe
+14:41 <barrbrain> what about the timing issue - if I write to fast import the blob might be delayed in a buffer somewhere
+14:41 <SRabbelier> barrbrain: I think what we'll need to do is to extend fast-import to also write the object names to stdout
+14:42 <SRabbelier> barrbrain: as soon as it's done writing the object
+14:42 <barrbrain> nvm, that's what the checkpoint/progress comment is all about
+14:42 <SRabbelier> barrbrain: and then you can wait till you get the name on stdout, and then you'll be sure that it's safe to git cat-file that object
+14:42 <SRabbelier> barrbrain: no
+14:42 <SRabbelier> barrbrain: checkpoint is only about packing
+14:42 <barrbrain> except no need for all that work
+14:42 <SRabbelier> barrbrain: you don't want to do checkpoint after each commit though
+14:43 <barrbrain> SRabbelier: I totally agree
+14:43 <SRabbelier> barrbrain: but, you can read the marks from the export file after a checkpoint, that is true
+14:43 <SRabbelier> so what I guess you could do
+14:43 <SRabbelier> is to keep feeding it commits till you need to get the hashes
+14:44 <SRabbelier> then do a checkpoint and a progress
+14:44 <SRabbelier> watch its stdout till you see the progress message
+14:44 <barrbrain> it'd be nice to have a command to write the hash of a mark to stdout
+14:44 <SRabbelier> (so that you know that the checkpoint has completed)
+14:44 <SRabbelier> and then you can read the markfile
+14:44 <SRabbelier> barrbrain: if it would make things easier I could probably fairly easily hack fast-import to do that always
+14:45 <SRabbelier> barrbrain: so the other workflow I described, where it prints the hash of each mark to stdout as it processes it
+14:45 <barrbrain> that's potentially a lot of detail in the output though :)
+14:45 <SRabbelier> barrbrain: yes, so it would be guarded by a flag
+14:45 <barrbrain> SRabbelier: oh, nice
+14:45 <SRabbelier> barrbrain: --print-marks
+14:45 <SRabbelier> or of course
+14:45 <SRabbelier> barrbrain: in the stream you could say
+14:46 <SRabbelier> barrbrain: option git print-marks
+14:46 <artagnon> Better.
+14:46 <barrbrain> this is why we talk :)
+14:46 <artagnon> IRC is superb for some discussions :)
+14:46 <SRabbelier> ok, so you'd have to capture fast-import's stdout
+14:47 <SRabbelier> and you'd have to wait till you got the mark line
+14:47 <SRabbelier> and then you could feed that to the `git cat-file --batch` pipe
+14:47 <artagnon> SRabbelier: I hope it doesn't break the workflow of the remote helper.
+14:47 <SRabbelier> artagnon: ah, hmmm
+14:48 <SRabbelier> that's a problem
+14:48 <artagnon> I thought so.
+14:48 <artagnon> :|
+14:48 <SRabbelier> since in that case git starts the fast-import stream
+14:48 <artagnon> Exactly.
+14:48 <SRabbelier> **the fast-import process
+14:49 <SRabbelier> well, I guess we could add a capability to the remote-helper
+14:49 <SRabbelier> that would make git just forward those marks to the remote-helper process?
+14:49 <SRabbelier> (on its stdin)
+14:50 <SRabbelier> artagnon, barrbrain: that would work, no?
+14:50 [artagnon is thinking]
+14:51 <barrbrain> I think I've lost track of which pipe goes where
+14:51 <SRabbelier> artagnon: the remote-helper could just read from its stdin, instead of from fast-import's stdout
+14:51 <artagnon> Right, got it.
+14:51 <artagnon> Yeah, a new capability should work.
+14:51 <SRabbelier> barrbrain: git-remote-svn would be hooked up both stdin and stdout to git
+14:51 <SRabbelier> barrbrain: when it starts, it tells git through a capability that it needs to have the marks hashes
+14:52 <SRabbelier> barrbrain: when git tells it to start exporting, git-remote-svn will write its stream to stdout
+14:52 <SRabbelier> barrbrain: after each mark, git will read from the stdout of the fast-import process that it starts, and write that mark back to git-remote-svn's stdin
+14:53 <SRabbelier> barrbrain: so git-remote-svn can then read from its stdin, and write that hash back to git cat-file --batch
+14:53 <barrbrain> right, can it then request blobs to be received on stdin later?
+14:53 <SRabbelier> barrbrain: yeah, that would work too, but I think it'd be better to keep it generic
+14:53 <SRabbelier> barrbrain: and instead have git write the hashes to git-remote-svn's stdin
+14:53 <SRabbelier> barrbrain: since then you can do whatever you want with those hashes
+14:54 <SRabbelier> barrbrain: for example, you might want to create a tag for those hashes, or such, in which case a hash is useful, and the contents of the blob isn't
+14:54 <barrbrain> or it just gets hashes and opens its own cat-file process?
+14:54 <SRabbelier> barrbrain: right
+14:54 <SRabbelier> barrbrain: that
+14:55 <SRabbelier> artagnon, barrbrain: anyway, if this is what we want, please mail the git list (especially Shawn), and propose this
+14:55 <SRabbelier> see what he thinks of it
+14:55 <SRabbelier> ask Shawn if he's ok with having git-fast-import learn a new '--print-marks' flag
+14:55 <SRabbelier> if so, I'll get on that :)
-- 
1.7.1
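
A rough sketch of the plumbing discussed in the log above, in Python and
purely illustrative: `git cat-file --batch` already behaves as shown, while
the mark-to-SHA-1 feedback channel is the part being proposed, so
lookup_mark() and apply_delta() below are stand-ins for it and for the
svndiff machinery.

  # Illustrative only: reconstruct a full text by reading the delta base
  # back out of the object store through a long-running cat-file pipe.
  import subprocess

  cat_file = subprocess.Popen(["git", "cat-file", "--batch"],
                              stdin=subprocess.PIPE, stdout=subprocess.PIPE)

  def read_blob(sha1):
      """Fetch an object's full text via `git cat-file --batch`."""
      cat_file.stdin.write(sha1.encode() + b"\n")
      cat_file.stdin.flush()
      header = cat_file.stdout.readline()      # "<sha1> <type> <size>\n"
      size = int(header.split()[2])
      data = cat_file.stdout.read(size)
      cat_file.stdout.read(1)                  # trailing newline after the body
      return data

  def full_text(base_mark, svn_delta, lookup_mark, apply_delta):
      """Resolve the base mark to a SHA-1 (the proposed, not-yet-existing
      channel), read the base back, and apply the SVN delta to it."""
      return apply_delta(read_blob(lookup_mark(base_mark)), svn_delta)

The point of the arrangement is that the helper never has to hold full texts
itself; it streams its dump into fast-import and reads base texts back out of
the object store whenever it needs to apply a delta.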


* Re: [GSoC update extra!] git-remote-svn: Week 8
From: Jonathan Nieder @ 2010-06-24 17:39 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Git Mailing List, David Michael Barr, Sverre Rabbelier,
	Shawn O. Pearce, Daniel Shahaf

Ramkumar Ramachandra wrote:

> David and I felt that the chatlog was
> valuable enough to go on the Git mailing list for later reference, so
> here it is.

Thanks. :)

> <artagnon> SRabbelier: Background -- I generated a delta dumpfile
> fine. barrbrain's exporter can't build full text from that, and I
> can't do it in my client either. We are looking for some kind of
> filesystem -- the Git store is the answer. Can we import revisions
> one by one through fast-import and ask Git to generate the full
> text?

Ah, this is something I was worried about with respect to persistence.
Git has all the blobs and all the trees, so except for the mapping
between marks, subversion revs, and git revs, svn-fe does not need to
persist much data at all.

Of course, that requires that the fast-import stream is going directly
to git.  fast-import streams can be used by other VCSes, too, but that
problem can be addressed later, I think.

> <barrbrain> so, if you have pipes to fast-import and cat-file
> --batch, can you read-write from the repo safely?
> <SRabbelier> barrbrain: yes, git is concurrency safe
> <barrbrain> what about the timing issue - if I write to fast import
> the blob might be delayed in a buffer somewhere
> <SRabbelier> barrbrain: I think what we'll need to do is to extend
> fast-import to also write the object names to stdout
> <SRabbelier> barrbrain: as soon as it's done writing the object

FWIW, I like the idea.

If you want to keep stdout unpolluted, this could work like

	git fast-import --print-marks=<fd>

We would have to make sure output to the fd is always flushed to
prevent deadlock.
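
To make the flushing concern concrete, here is a minimal caller-side sketch
in Python, assuming the proposed flag existed and reported each mark in the
marks-file format (":<mark> <sha1>"); the flag and that output format are
both assumptions at this point.

  # Illustrative only: give fast-import a dedicated fd for mark reports.
  import os
  import subprocess

  read_fd, write_fd = os.pipe()                 # channel for mark reports

  fi = subprocess.Popen(
      ["git", "fast-import", "--print-marks=%d" % write_fd],  # proposed flag
      stdin=subprocess.PIPE,
      pass_fds=(write_fd,))                     # child inherits the write end
  os.close(write_fd)                            # the parent only reads

  marks = {}
  with os.fdopen(read_fd) as reports:
      # ... feed the fast-import stream on fi.stdin here ...
      for line in reports:                      # blocks until the child flushes,
          mark, sha1 = line.split()             # hence the flushing requirement
          marks[mark.lstrip(":")] = sha1        # now safe to cat-file this object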

> <barrbrain> I think I've lost track of which pipe goes where

Yeah.

> <SRabbelier> ask Shawn if he's ok with having git-fast-import learn a new '--print-marks' flag
> <SRabbelier> if so, I'll get on that :)

Thanks for the insights.

Regards,
Jonathan


* Re: [GSoC update extra!] git-remote-svn: Week 8
From: Jonathan Nieder @ 2010-06-24 18:07 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Git Mailing List, David Michael Barr, Sverre Rabbelier,
	Shawn O. Pearce, Daniel Shahaf, Eric Wong

Jonathan Nieder wrote:

> Git has all the blobs and all the trees, so except for the mapping
> between marks, subversion revs, and git revs, svn-fe does not need to
> persist much data at all.
> 
> Of course, that requires that the fast-import stream is going directly
> to git.

One more thought.

If we are tracking the history of separate subversion branches separately,
then reading back trees includes an oddity:

Suppose someone tries to reimplement git-svn on top of svn-fe[1].

 $ git svn --fe clone --stdlayout http://path/to/some/svn/root

Behind the scenes, git-svn processes the fast-import stream it is
receiving and writes its _own_ fast-import stream with paths munged
and commits split up into separate commits on each branch.  Good.

Now the oddity: suppose that in the repository, svn-fe finds an

 svn copy branches@r11 branches-old

operation.  In other words, it needs the tree for
http://path/to/some/svn/root/branches@r11.  This does not correspond
to a single git tree, since the content of each branch has been given
its own commit.

However, this does not seem to be fatal: one could just make
‘git svn --fe’ build a branch with the full history at the same time
as it builds the other branches.  Ugly, but I don’t see another way
around it without making svn-fe and ‘git svn --fe’ know more about
each other than I would like.

Jonathan

[1] Eric, we are discussing the remote-svn series[2] and especially
Ram, Sverre, and David’s recent comments[3].  Apologies for not
keeping you in the loop sooner; your insights have always been
helpful in the past.

As for the idea of reimplementing git-svn on top of svn-fe: yes, the
fast-import stream would need more information to support
--follow-parent, but that piece is not so hard to add AFAICT.  Of
course, I am mentioning this not because it is important to keep the
git-svn interface but because the --stdlayout feature is very useful
and we may want to port it over some day.

[2] http://thread.gmane.org/gmane.comp.version-control.git/149571
[3] http://thread.gmane.org/gmane.comp.version-control.git/149594


* Re: [GSoC update extra!] git-remote-svn: Week 8
From: Eric Wong @ 2010-06-24 21:32 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ramkumar Ramachandra, Git Mailing List, David Michael Barr,
	Sverre Rabbelier, Shawn O. Pearce, Daniel Shahaf

Jonathan Nieder <jrnieder@gmail.com> wrote:
> [1] Eric, we are discussing the remote-svn series[2] and especially
> Ram, Sverre, and David’s recent comments[3].  Apologies for not
> keeping you in the loop sooner; your insights have always been
> helpful in the past.

No worries Jonathan, I've been preoccupied with several other projects,
none of which require dealing with SVN :)

> As for the idea of reimplementing git-svn on top of svn-fe: yes, the
> fast-import stream would need more information to support
> --follow-parent, but that piece is not so hard to add AFAICT.  Of
> course, I am mentioning this not because it is important to keep the
> git-svn interface but because the --stdlayout feature is very useful
> and we may want to port it over some day.

Yes, I've always found --stdlayout very convenient and hope it
lives on.

-- 
Eric Wong


* Re: [GSoC update extra!] git-remote-svn: Week 8
From: Sam Vilain @ 2010-06-30  1:51 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ramkumar Ramachandra, Git Mailing List, David Michael Barr,
	Sverre Rabbelier, Shawn O. Pearce, Daniel Shahaf, Eric Wong

On Thu, 2010-06-24 at 13:07 -0500, Jonathan Nieder wrote:
> operation.  In other words, it needs the tree for
> http://path/to/some/svn/root/branches@r11.  This does not correspond
> to a single git tree, since the content of each branch has been given
> its own commit.

I wrote at length about this near the beginning of the project;
essentially, figuring out whether particular paths are branch roots or
not is not well defined, as SVN does not distinguish between them (a
misfeature cargo-culted from Perforce).  It becomes a data mining
problem: you have this scattered data, and you have to find a history
inside it.

As I recommended before, it probably makes more sense to keep a "remote
tracking" branch which mirrors the *entire* repository, and sort out
efficient ways to convert SVN revision paths like the above into tree
IDs.
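
For instance, with such a mirror branch the lookup reduces to resolving
<commit>:<path>; a minimal sketch in Python, assuming the importer keeps a
revision-to-commit table (the names here are illustrative, not an existing
interface):

  import subprocess

  rev_to_commit = {}    # filled in by the importer, e.g. from a marks file

  def svn_tree_id(path, rev):
      """Map e.g. ("branches", 11) to the git object ID of branches@r11."""
      spec = "%s:%s" % (rev_to_commit[rev], path)
      return subprocess.check_output(["git", "rev-parse", spec]).strip().decode()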

I consider it very important to separate the data import and tracking
stage from the data mining stage.

Once the data mining stage is well solved, then it makes sense to look
at ways that a tracking branch which only tracks a part of the
Subversion repository can be achieved.  In the simple case, where no
repository re-organisation or cross-project renames have occurred it is
relatively simple.  But in general I think this is a harder problem,
which cannot always be solved without intervention - and so not
necessary to be solved in short-term milestones.  As you are
discovering, it is a can of worms which you avoid if you know you always
have the complete SVN repository available.

Sam


* Re: [GSoC update extra!] git-remote-svn: Week 8
From: Sam Vilain @ 2010-06-30  2:20 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ramkumar Ramachandra, Git Mailing List, David Michael Barr,
	Sverre Rabbelier, Shawn O. Pearce, Daniel Shahaf

On Thu, 2010-06-24 at 12:39 -0500, Jonathan Nieder wrote:
> FWIW, I like the idea.
> 
> If you want to keep stdout unpolluted, this could work like
> 
> 	git fast-import --print-marks=<fd>
> 
> We would have to make sure output to the fd is always flushed to
> prevent deadlock.

That works; it could default to --print-marks=0 if STDIN happens to be a
socket.  Works for FastCGI :-)

Sam


* Re: [GSoC update extra!] git-remote-svn: Week 8
From: Ramkumar Ramachandra @ 2010-06-30 12:45 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Jonathan Nieder, Git Mailing List, David Michael Barr,
	Sverre Rabbelier, Shawn O. Pearce, Daniel Shahaf, Eric Wong

Hi Sam,

Sam Vilain writes:
> On Thu, 2010-06-24 at 13:07 -0500, Jonathan Nieder wrote:
> > operation.  In other words, it needs the tree for
> > http://path/to/some/svn/root/branches@r11.  This does not correspond
> > to a single git tree, since the content of each branch has been given
> > its own commit.
> 
> I wrote at length about this near the beginning of the project;
> essentially, figuring out whether particular paths are branch roots or
> not is not well defined, as SVN does not distinguish between them (a
> misfeature cargo-culted from Perforce).  It becomes a data mining
> problem: you have this scattered data, and you have to find a history
> inside it.

Right. Implementing git-svn on top of git-remote-svn might not be a
bad idea.

> As I recommended before, it probably makes more sense to keep a "remote
> tracking" branch which mirrors the *entire* repository, and sort out
> efficient ways to convert SVN revision paths like the above into tree
> IDs.
> 
> I consider it very important to separate the data import and tracking
> stage from the data mining stage.

We're following this approach. At the moment, we're just focusing on
getting all the data directly from SVN into the Git store. Instead of
building trees for each SVN revision, we've found a way to do it
inside the Git object store: we're currently ironing out the details,
and I'll post an update about this shortly.

> Once the data mining stage is well solved, then it makes sense to look
> at ways that a tracking branch which only tracks a part of the
> Subversion repository can be achieved.  In the simple case, where no
> repository re-organisation or cross-project renames have occurred it is
> relatively simple.  But in general I think this is a harder problem,
> which cannot always be solved without intervention - and so not
> necessary to be solved in short-term milestones.  As you are
> discovering, it is a can of worms which you avoid if you know you always
> have the complete SVN repository available.

Right. I'm not convinced that it necessarily requires user
intervention though: can you show, with an example, that enough
information simply isn't available without user intervention? Or is the
information there, but simply too difficult (and not worth the effort)
to mine out?

-- Ram


* Re: [GSoC update extra!] git-remote-svn: Week 8
From: Sam Vilain @ 2010-07-01  3:38 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Jonathan Nieder, Git Mailing List, David Michael Barr,
	Sverre Rabbelier, Shawn O. Pearce, Daniel Shahaf, Eric Wong

On Wed, 2010-06-30 at 14:45 +0200, Ramkumar Ramachandra wrote: 
> > I wrote at length about this near the beginning of the project;
> > essentially, figuring out whether particular paths are branch roots or
> > not is not well defined, as SVN does not distinguish between them (a
> > misfeature cargo-culted from Perforce).  It becomes a data mining
> > problem: you have this scattered data, and you have to find a history
> > inside it.
> 
> Right. Implementing git-svn on top of git-remote-svn might not be a
> bad idea.

That's a good way to look at it, yes.  Probably git-svn has more
svn-specific code than import rules, so just the interface like
--stdlayout etc is worth keeping, as well as checking that the svn
import data miner could do all the same things as git-svn.

> > I consider it very important to separate the data import and tracking
> > stage from the data mining stage.
> 
> We're following this approach. At the moment, we're just focusing on
> getting all the data directly from SVN into the Git store. Instead of
> building trees for each SVN revision, we've found a way to do it
> inside the Git object store: we're currently ironing out the details,
> and I'll post an update about this shortly.

Of course no working copy need exist with these contents within; that
would hardly be 'cheap copy' would it?  :)  But it's probably worth
sticking to the standard tree/blob/commit object convention for ease of
debugging etc.

> > Once the data mining stage is well solved, then it makes sense to look
> > at ways that a tracking branch which only tracks a part of the
> > Subversion repository can be achieved.  In the simple case, where no
> > repository re-organisation or cross-project renames have occurred it is
> > relatively simple.  But in general I think this is a harder problem,
> > which cannot always be solved without intervention - and so not
> > necessary to be solved in short-term milestones.  As you are
> > discovering, it is a can of worms which you avoid if you know you always
> > have the complete SVN repository available.
> 
> Right. I'm not convinced that it necessarily requires user
> intervention though: can you show, with an example, that enough
> information simply isn't available without user intervention? Or is the
> information there, but simply too difficult (and not worth the effort)
> to mine out?

Sure, well all you really need to do is try it with a few real-world
repositories.

But I can give you a few examples of where all attempts at heuristics
will fail.

The first is where someone puts a file, perhaps a README.txt or
something, somewhere in the repository outside the regular location.

  r1:
  add /README.txt

Then, someone comes along and starts making their project:

  r2:
  add /trunk/README.txt

How do you know that the first commit is not part of any project, but
some out-of-band notes to people working with the repository?

The way I approached all this in my Perforce converter (remember,
Perforce is like SVN in almost every way) was to progressively scan the
history and build up two tables which trace the "mined" history.

You can see the table definitions at
http://utsl.gen.nz/gitweb/?p=git-p4raw;a=blob;f=tables.sql;h=259c243;hb=7e4fc4a#l205

The first, change_branches, records that a logical branch exists at a
given revision and path.

  (branchpath, change)

(you might want another 'column' in your conceptual data model: the
project name; I was dealing with a single project).

There are also cases where someone does something dumb, and then it is
repaired on the next commit.

eg

  /trunk/ProjectA
  /trunk/ProjectB
  /branches/ProjectA/foo
  /branches/ProjectB/bar

Someone comes along and does something like:

  rm /trunk
  mv /branches/ProjectA/foo /trunk

Whoops!  The /trunk path just got wiped.  How do we fix it?  In a hurry,
the system administrator checks out the old revision, tars them all up,
then uses 'svn add' to put them back.

  rm -r /trunk/*
  add /trunk/ProjectA
  add /trunk/ProjectB

After this, people working on it realise the mistake: the disconnected
history won't merge, etc.  But the change is permanent, and they work
around this error in the history.  They don't want to do the more
correct thing, which is to restore the history from the broken commit:

  rm -r /trunk/*
  cp /trunk@42 /trunk

They don't want to do this because SVN has taught them that version
control is a fragile thing, and you don't want to monkey with it.
Because it can break and then your whole world is changed as the
precious black box which all your work is going into doesn't work quite
as before.  Because there is no "undo".  Because it has all these opaque
files inside it no-one can understand.  What happened before with the
rename upset and embarrassed you, and you don't want to risk making it
worse.

This sort of thing does actually happen.  The lesson is that you can't
trust heuristics, or the revision control breadcrumbs (copied-from etc)
to be perfect.  They are invisible - impossible to inspect directly
using the SVN command-line API, and impossible to revise once they are
there.  By contrast, with git we have grafts, refs/replace/OBJID,
filter-branch, rebase, etc.  We have visualization tools, git-add -p,
git gui.  We have an object store which is robust, simple and widely
understood.  We have a simple data model, so that the actual information
can be understood by people and not just buggy software.

Of course with SVN you have the fact that for the entirety of its life
as a relevant version control system, it did not support merge tracking.
So, most history being imported will not have any reliable merge
information.  If you read early versions of the SVN manual, they
actually advocate recording, in natural language, a human-readable
description of the work done, in the commit message.  I've seen people
working around this lack of functionality by developing their own
systems, sometimes not even being able to reconstruct what was merged
where (eg, in parrot SVN).

Yet another situation is partial merging; unlike SVN, Perforce had
detailed merge tracking from the very beginning.  With Perforce it
worked on a per-file level only, so it is slightly different in that
respect.  But what you find is that sometimes, people will merge only a
part of another tree into their "trunk" at a time.

  r45: merged
  /branches/ProjectA/src -> /trunk/ProjectA/dest

  r46: merged
  /branches/ProjectB/doc -> /trunk/ProjectB/doc

What I normally find in this case is that there is no useful history
recorded in those intermediate commits; they were just committing to
save their intermediate work from being lost.  This doesn't happen quite
so much in Perforce, because it has a concept of an "index", missing
entirely from (the user API of) Subversion.  In that case, it makes more
sense to leave out the intermediate commits and simply record a single
merge.

To work around parenting mistakes - both those caused by misuse and
those caused by a lack of SVN functionality - you need to be able to
readily revise the parent information.

To do this, the second important table in git-p4raw recorded parents;

  (branchpath, change, parent_branchpath, parent_change)
or, if I was stitching on a pre-perforce or otherwise manually converted
history;
  (branchpath, change, parent_sha1)

So, in the Perl code, I wrote a command called "find_branches", which
correlates the information already there in the database with the
changes for that revision, and progressively looks for new revisions.
It also creates provisional parent information based on the integration
breadcrumbs.
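
A toy, in-memory restatement of those two tables and the correlation step
(the real tool keeps them in Postgres, and this is not git-p4raw's actual
logic, only the shape of the idea):

  change_branches = set()   # {(branchpath, change)}: a branch exists here
  parents = []              # [(branchpath, change, parent_branchpath, parent_change)]
  grafts = []               # [(branchpath, change, parent_sha1)] for stitched history

  def find_branches(rev, changes):
      """Illustrative correlation: a path copied from a path already known
      to be a branch becomes a new branch, with the copy source recorded
      as its provisional parent."""
      known = {p for p, _ in change_branches}
      for path, copyfrom_path, copyfrom_rev in changes:
          if copyfrom_path in known:
              change_branches.add((path, rev))
              parents.append((path, rev, copyfrom_path, copyfrom_rev))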

What I would then do is look at the result in 'gitk', and if there were
problems, they could usually be fixed by fiddling with the parent
information, rewinding the export (see 'unexport_commits') and
re-running it.  Sometimes this meant adding a missing merge parent;
sometimes my fuzzy logic for guessing the merge parents guessed badly.
Obviously, I was also developing the importer along the way, so as well
as data errors there were bugfixes to make, etc.

This was not arduous; the speed of postgres' query evaluation (with some
tuning) and git fast-import meant I was typically exporting at several
*hundred* commits per second.

As I had a facility to graft manually converted history using git-p4raw
(above, that's where the "parent" is a commit SHA1, not a Perforce
revision number and path), I even went back and found various changes in
Perforce that looked more like incremental Perl releases, and ran the
script I had for pre-perforce history over the diff and changelog
contained within (by then, it had a personality; it was called the
Timinator: http://perl5.git.perl.org/perl.git/tag/timinator).

Anyway, with that information in place, you then have all the
information you need to do a test export.  The exporter already has all
of the blobs in the git repository; all it has to do is refer to these
in a fast-export stream.  It emits marks as it goes along; once it has
finished an export batch, it waits for the fast-import process to finish
successfully, reads all of the SHA1s corresponding to the marks it
already emitted, and then updates the database tables with the SHA1s
accordingly.  Due to extensive use of deferred check constraints, only
then will Postgres let it commit :-).  That way, when I hit "ctrl+c"
along the way, I knew everything was safe.  Restartability and
robustness in the face of crashes are very useful for this sort of tool.
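
That read-back step can be done with fast-import's existing --export-marks
option; a minimal sketch of one batch in Python (file and function names
are illustrative):

  import subprocess

  def run_batch(stream_bytes, marks_path="marks.txt"):
      """Feed one batch to fast-import, then map each mark to its SHA-1."""
      subprocess.run(["git", "fast-import", "--export-marks=%s" % marks_path],
                     input=stream_bytes, check=True)  # wait for a clean exit
      mark_to_sha1 = {}
      with open(marks_path) as f:
          for line in f:                              # each line: ":<mark> <sha1>"
              mark, sha1 = line.split()
              mark_to_sha1[mark.lstrip(":")] = sha1
      return mark_to_sha1                             # then update the DB rows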

Another strange case, which affects some of the largest repositories in
the world, is one I don't have an answer for, but I suspect it can be
represented by either subtree merging or by submodules:

  mv /trunk/ProjectA /trunk/ProjectB/lib/ProjectA

"ProjectA" is now included in "ProjectB" - what is the intent of this?
The first possibility is a subtree merge; the second is that a submodule
is desired.  How to represent it in git will depend on what happens
later.  If the directory is moved or copied elsewhere, then it is
probably going to be better to represent it as a submodule.

And here, the lesson is: people use SVN in ways which defy a single
mapping into git.  This one in particular affects the KDE project
heavily, as directories are copied around extensively.  SVN can remember
the history and produce logs, but it requires the entire repository to
be available in order to do so.  Thiago wrote a tool called
svn-fast-export-all, which hoped to parse the svnadmin dump file and
split the data into separate repositories as it went, but as it is a
very long batch job it is difficult to produce a high quality
conversion.

Important points to take from this:

  * model the source data cleanly, completely and robustly.
  * start with heuristics; hopefully they will work for people following
    the SVN guide, but allow for human input for when they don't.
  * aim for quick export/rewind, and robust operation.
  * this will make it very easy for revisionists to clean up the
    mistakes of the past.

Keep up the good work!
Sam

