* Re: cvs import [not found] <45084400.1090906@bluegap.ch> @ 2006-09-13 19:01 ` Jon Smirl 2006-09-13 20:41 ` Martin Langhoff 2006-09-13 22:52 ` Nathaniel Smith 1 sibling, 1 reply; 38+ messages in thread From: Jon Smirl @ 2006-09-13 19:01 UTC (permalink / raw) To: Markus Schiltknecht, Git Mailing List; +Cc: monotone-devel, dev Let's copy the git list too and maybe we can come up with one importer for everyone. On 9/13/06, Markus Schiltknecht <markus@bluegap.ch> wrote: > Hi, > > I've been trying to understand the cvsimport algorithm used by monotone > and wanted to adjust it to be more like the one in cvs2svn. > > I've had some problems with cvs2svn itself and began to question the > algorithm used there. It turned out that the cvs2svn people have > discussed an improved algorithm and are about to write a cvs2svn 2.0. > The main problem with the current algorithm is that it depends on the > timestamp information stored in the CVS repository. > > Instead, it would be much better to just take the dependencies of the > revisions into account, considering the timestamp an irrelevant (for the > import) attribute of the revision. > > Now, that can be used to convert from CVS to about anything else. > Obviously we were discussing subversion, but then there was git, > too. And monotone. > > I'm beginning to wonder whether one could come up with a generally useful > cleaned-and-sane-CVS-changeset-dump-format, which could then be used by > importers to all sorts of VCSes. This would make monotone's cvsimport > function dependent on cvs2svn (and therefore python). But the general > try-to-get-something-useful-from-an-insane-CVS-repository algorithm > would only have to be written once. > > On the other hand, I see that lots of the cvsimport functionality for > monotone has already been written (rcs file parsing, stuffing files, > file deltas and complete revisions into the monotone database, etc..). 
> Changing it to a better algorithm does not seem to be _that_ much work > anymore. Plus the hard part seems to be to come up with a good > algorithm, not implementing it. And we could still exchange our > experience with the general algorithm with the cvs2svn people. > > Plus, the guy who mentioned git pointed out that git needs quite a > different dump-format than subversion to do an efficient conversion. I > think coming up with a generally-usable dump format would not be that easy. > > So you see, I'm slightly favoring the second implementation approach > with a C++ implementation inside monotone. > > Thoughts or comments? > Sorry, I forgot to mention some pointers: > > Here is the thread where I've started the discussion about the cvs2svn > algorithm: > http://cvs2svn.tigris.org/servlets/ReadMsg?list=dev&msgNo=1599 > > And this is a proposal for an algorithm to do cvs imports independent of > the timestamp: > http://cvs2svn.tigris.org/servlets/ReadMsg?list=dev&msgNo=1451 > > Markus > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscribe@cvs2svn.tigris.org > For additional commands, e-mail: dev-help@cvs2svn.tigris.org > > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 38+ messages in thread
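The dependency-based approach Markus describes can be sketched in a few lines of Python. This is an illustrative toy only (the `Rev` data model and function names are hypothetical, not cvs2svn or monotone code): group per-file RCS revisions that share an author and log message into candidate changesets, then order the changesets by their per-file predecessor links rather than by timestamp.

```python
from collections import defaultdict

class Rev:
    """One per-file CVS revision: file path, revision id, predecessor
    revision id (None for the first), and commit metadata.  Timestamps
    are deliberately absent -- ordering comes from dependencies only."""
    def __init__(self, path, rev, prev, author, log):
        self.path, self.rev, self.prev = path, rev, prev
        self.author, self.log = author, log

def group_changesets(revs):
    """Group file revisions sharing (author, log) into candidate changesets."""
    sets = defaultdict(list)
    for r in revs:
        sets[(r.author, r.log)].append(r)
    return list(sets.values())

def topo_order(changesets):
    """Order changesets so that every per-file predecessor comes first."""
    owner = {}  # (path, rev) -> index of owning changeset
    for i, cs in enumerate(changesets):
        for r in cs:
            owner[(r.path, r.rev)] = i
    deps = defaultdict(set)
    for i, cs in enumerate(changesets):
        for r in cs:
            j = owner.get((r.path, r.prev))
            if j is not None and j != i:
                deps[i].add(j)  # changeset i depends on changeset j
    done, order = set(), []
    def visit(i):
        if i in done:
            return
        done.add(i)
        for j in deps[i]:
            visit(j)
        order.append(i)
    for i in range(len(changesets)):
        visit(i)
    return [changesets[i] for i in order]
```

A real converter also has to split groups whose internal dependencies form a cycle, which is where most of the hard work in cvs2svn's planned 2.0 algorithm lies.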
* Re: cvs import 2006-09-13 19:01 ` cvs import Jon Smirl @ 2006-09-13 20:41 ` Martin Langhoff 2006-09-13 21:04 ` Markus Schiltknecht 2006-09-13 21:05 ` Markus Schiltknecht 0 siblings, 2 replies; 38+ messages in thread From: Martin Langhoff @ 2006-09-13 20:41 UTC (permalink / raw) To: Jon Smirl; +Cc: Markus Schiltknecht, Git Mailing List, monotone-devel, dev On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote: > Let's copy the git list too and maybe we can come up with one importer > for everyone. It's a really good idea. cvsps has been for a while a (limited, buggy) attempt at that. One thing that bothers me in the cvs2svn algorithm is that it is not stable in its decisions about where the branching point is -- run the import twice at different times and it may tell you that the branching point has moved. This is problematic for incremental imports. If we fudged that "it's around *here*" we'd better remember we said that and not go changing our story. Git is too smart for that ;-) martin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-13 20:41 ` Martin Langhoff @ 2006-09-13 21:04 ` Markus Schiltknecht 2006-09-13 21:15 ` Oswald Buddenhagen 2006-09-13 21:16 ` Martin Langhoff 2006-09-13 21:05 ` Markus Schiltknecht 1 sibling, 2 replies; 38+ messages in thread From: Markus Schiltknecht @ 2006-09-13 21:04 UTC (permalink / raw) To: Martin Langhoff; +Cc: dev, monotone-devel, Jon Smirl, Git Mailing List Martin Langhoff wrote: > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote: >> Let's copy the git list too and maybe we can come up with one importer >> for everyone. > > It's a really good idea. cvsps has been for a while a (limited, buggy) > attempt at that. One thing that bothers me in the cvs2svn algorithm is > that is not stable in its decisions about where the branching point is > -- run the import twice at different times and it may tell you that > the branching point has moved. Huh? Really? Why is that? I don't see reasons for such a thing happening when studying the algorithm. For sure the proposed dependency-resolving algorithm which does not rely on timestamps does not have that problem. Regards Markus ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-13 21:04 ` Markus Schiltknecht @ 2006-09-13 21:15 ` Oswald Buddenhagen 2006-09-13 21:16 ` Martin Langhoff 1 sibling, 0 replies; 38+ messages in thread From: Oswald Buddenhagen @ 2006-09-13 21:15 UTC (permalink / raw) To: Markus Schiltknecht Cc: Martin Langhoff, Jon Smirl, Git Mailing List, monotone-devel, dev On Wed, Sep 13, 2006 at 11:04:13PM +0200, Markus Schiltknecht wrote: > Martin Langhoff wrote: > >One thing that bothers me in the cvs2svn algorithm is > >that is not stable in its decisions about where the branching point is > >-- run the import twice at different times and it may tell you that > >the branching point has moved. > > Huh? Really? Why is that? I don't see reasons for such a thing happening > when studying the algorithm. > that's certainly due to some hash being iterated. python intentionally randomizes this to make wrong assumptions obvious. there is actually a patch pending to improve the branch source selection drastically. maybe this is affected as well. > For sure the proposed dependency-resolving algorithm which does not rely > on timestamps does not have that problem. > i think that's unrelated. -- Hi! I'm a .signature virus! Copy me into your ~/.signature, please! -- Chaos, panic, and disorder - my work here is done. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-13 21:04 ` Markus Schiltknecht 2006-09-13 21:15 ` Oswald Buddenhagen @ 2006-09-13 21:16 ` Martin Langhoff 2006-09-14 4:17 ` Michael Haggerty 1 sibling, 1 reply; 38+ messages in thread From: Martin Langhoff @ 2006-09-13 21:16 UTC (permalink / raw) To: Markus Schiltknecht; +Cc: Jon Smirl, Git Mailing List, monotone-devel, dev On 9/14/06, Markus Schiltknecht <markus@bluegap.ch> wrote: > Martin Langhoff wrote: > > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote: > >> Let's copy the git list too and maybe we can come up with one importer > >> for everyone. > > > > It's a really good idea. cvsps has been for a while a (limited, buggy) > > attempt at that. One thing that bothers me in the cvs2svn algorithm is > > that is not stable in its decisions about where the branching point is > > -- run the import twice at different times and it may tell you that > > the branching point has moved. > > Huh? Really? Why is that? I don't see reasons for such a thing happening > when studying the algorithm. > > For sure the proposed dependency-resolving algorithm which does not rely > on timestamps does not have that problem. IIRC, it places branch tags as late as possible. I haven't looked at it in detail, but an import immediately after the first commit against the branch may yield a different branchpoint from the same import done a bit later. cheers, martin ^ permalink raw reply [flat|nested] 38+ messages in thread
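The instability Martin suspects is easy to reproduce with a toy model (a hypothetical helper, not the actual cvs2svn code): if the converter places the branch point at the latest trunk commit whose content still matches the branch, then unrelated trunk commits that arrive after the first import can move the chosen point on a later re-import.

```python
def latest_branch_point(trunk, branched_files):
    """Return the index of the latest trunk commit that still matches the
    branch's content: scan forward from commit 0 (where the branched file
    revisions were created) until some branched file is modified again.
    `trunk` is a list of sets of files changed per commit, oldest first."""
    point = 0
    for i, changed in enumerate(trunk[1:], start=1):
        if changed & branched_files:
            break  # branch content diverges here; stop
        point = i
    return point
```

Importing right after the branch is cut yields point 0; importing again after one unrelated trunk commit yields point 1, even though both candidates have identical content for the branched files. That is exactly why an incremental importer must record its earlier choice instead of recomputing it.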
* Re: cvs import 2006-09-13 21:16 ` Martin Langhoff @ 2006-09-14 4:17 ` Michael Haggerty 2006-09-14 4:34 ` Jon Smirl 2006-09-14 4:40 ` Martin Langhoff 0 siblings, 2 replies; 38+ messages in thread From: Michael Haggerty @ 2006-09-14 4:17 UTC (permalink / raw) To: Martin Langhoff Cc: Markus Schiltknecht, Jon Smirl, Git Mailing List, monotone-devel, dev Martin Langhoff wrote: > On 9/14/06, Markus Schiltknecht <markus@bluegap.ch> wrote: >> Martin Langhoff wrote: >> > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote: >> >> Let's copy the git list too and maybe we can come up with one importer >> >> for everyone. >> > >> > It's a really good idea. cvsps has been for a while a (limited, buggy) >> > attempt at that. One thing that bothers me in the cvs2svn algorithm is >> > that is not stable in its decisions about where the branching point is >> > -- run the import twice at different times and it may tell you that >> > the branching point has moved. >> >> Huh? Really? Why is that? I don't see reasons for such a thing happening >> when studying the algorithm. >> >> For sure the proposed dependency-resolving algorithm which does not rely >> on timestamps does not have that problem. > > IIRC, it places branch tags as late as possible. I haven't looked at > it in detail, but an import immediately after the first commit against > the branch may yield a different branchpoint from the same import done > a bit later. This is correct. And IMO it makes sense from the standpoint of an all-at-once conversion. But I was under the impression that this wouldn't matter for content-indexed-based SCMs. The content of all possible branching points is identical, and therefore from your point of view the topology should be the same, no? But aside from this point, I think an intrinsic part of implementing incremental conversion is "convert the subsequent changes to the CVS repository *subject to the constraints* imposed by decisions made in earlier conversion runs. 
And the real trick is that things can be done in CVS (e.g., line-end changes, manual copying of files in the repo) that (a) are unversioned and (b) have retroactive effects that go arbitrarily far back in time. This is the reason that I am pessimistic that incremental conversion will ever work robustly. Michael ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 4:17 ` Michael Haggerty @ 2006-09-14 4:34 ` Jon Smirl 2006-09-14 5:02 ` Michael Haggerty 2006-09-14 4:40 ` Martin Langhoff 1 sibling, 1 reply; 38+ messages in thread From: Jon Smirl @ 2006-09-14 4:34 UTC (permalink / raw) To: Michael Haggerty Cc: Martin Langhoff, Markus Schiltknecht, Git Mailing List, monotone-devel, dev On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote: > But aside from this point, I think an intrinsic part of implementing > incremental conversion is "convert the subsequent changes to the CVS > repository *subject to the constraints* imposed by decisions made in > earlier conversion runs. And the real trick is that things can be done > in CVS (e.g., line-end changes, manual copying of files in the repo) > that (a) are unversioned and (b) have retroactive effects that go > arbitrarily far back in time. This is the reason that I am pessimistic > that incremental conversion will ever work robustly. We don't need really robust incremental conversion. It just needs to work most of the time. Incremental conversion is usually used to track the main CVS repo with the new tool while people decide if they like the new tool. Commits will still flow to the CVS repo and get incrementally copied to the new tool so that it tracks CVS in close to real time. If the increment import messes up you can always redo a full import, but a full Mozilla import takes about 2 hours with the git tools. I would always do a full import on the day of the actual cut over. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 4:34 ` Jon Smirl @ 2006-09-14 5:02 ` Michael Haggerty 2006-09-14 5:21 ` Martin Langhoff 2006-09-14 5:30 ` Jon Smirl 0 siblings, 2 replies; 38+ messages in thread From: Michael Haggerty @ 2006-09-14 5:02 UTC (permalink / raw) To: Jon Smirl Cc: Martin Langhoff, Markus Schiltknecht, Git Mailing List, monotone-devel, dev Jon Smirl wrote: > On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote: >> But aside from this point, I think an intrinsic part of implementing >> incremental conversion is "convert the subsequent changes to the CVS >> repository *subject to the constraints* imposed by decisions made in >> earlier conversion runs. And the real trick is that things can be done >> in CVS (e.g., line-end changes, manual copying of files in the repo) >> that (a) are unversioned and (b) have retroactive effects that go >> arbitrarily far back in time. This is the reason that I am pessimistic >> that incremental conversion will ever work robustly. > > We don't need really robust incremental conversion. It just needs to > work most of the time. Incremental conversion is usually used to track > the main CVS repo with the new tool while people decide if they like > the new tool. Commits will still flow to the CVS repo and get > incrementally copied to the new tool so that it tracks CVS in close to > real time. I hadn't thought of the idea of using incremental conversion as an advertising method for switching SCM systems :-) But if changes flow back to CVS, doesn't this have to be pretty robust? In our trial period, we simply did a single conversion to SVN and let people play with this test repository. When we decided to switch over we did another full conversion and simply discarded the changes that had been made in the test SVN repository. The use cases that I had considered were: 1. 
For conversions that take days, one could do a full conversion while leaving CVS online, then take CVS offline and do only an incremental conversion to reduce SCM downtime. This is of course less of an issue if you could bring the conversion time down to a couple of hours for even the largest CVS repos. 2. Long-term continuous mirroring (backwards and forwards) between CVS and another SCM, to allow people to use their preferred tool. (I actually think that this is a silly idea, but some people seem to like it.) For both of these applications, incremental conversion would have to be robust (for 1 it would at least have to give a clear indication of unrecoverable errors). Michael ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 5:02 ` Michael Haggerty @ 2006-09-14 5:21 ` Martin Langhoff 2006-09-14 5:35 ` Michael Haggerty 2006-09-14 5:30 ` Jon Smirl 1 sibling, 1 reply; 38+ messages in thread From: Martin Langhoff @ 2006-09-14 5:21 UTC (permalink / raw) To: Michael Haggerty Cc: Jon Smirl, Markus Schiltknecht, Git Mailing List, monotone-devel, dev On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote: > 2. Long-term continuous mirroring (backwards and forwards) between CVS > and another SCM, to allow people to use their preferred tool. (I > actually think that this is a silly idea, but some people seem to like it.) Call me silly ;-) I use this all the time to track projects that use CVS or SVN, where I either - Do have write access, but often develop offline (and I have a bunch of perl/shell scripts to extract the patches and auto-commit them into CVS/SVN). - Do have write access, but want to maintain experimental work branches without making much noise in the cvs repo -- and being able to merge CVS's HEAD in repeatedly as you'd want. - Run "vendor-branch-tracking" setups for projects where I have a custom branch of a FOSS software project, and repeatedly import updates from upstream. This is the 'killer-app' of DSCMs IMHO. It is not as robust as I'd like; with CVS, the git imports eventually stray a bit from upstream and require manual fixing. But it is _good_. cheers, martin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 5:21 ` Martin Langhoff @ 2006-09-14 5:35 ` Michael Haggerty 0 siblings, 0 replies; 38+ messages in thread From: Michael Haggerty @ 2006-09-14 5:35 UTC (permalink / raw) To: Martin Langhoff Cc: Jon Smirl, Markus Schiltknecht, Git Mailing List, monotone-devel, dev Martin Langhoff wrote: > On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote: >> 2. Long-term continuous mirroring (backwards and forwards) between CVS >> and another SCM, to allow people to use their preferred tool. (I >> actually think that this is a silly idea, but some people seem to like >> it.) > > Call me silly ;-) I use this all the time to track projects that use > CVS or SVN, where I either > > [...] Sorry, I guess I was speaking as a person who prefers and is most familiar with centralized SCM. But I see from your response that the ultimate in decentralized development is that each developer decides what SCM to use :-) and that incremental conversion makes sense in that context. Michael ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 5:02 ` Michael Haggerty 2006-09-14 5:21 ` Martin Langhoff @ 2006-09-14 5:30 ` Jon Smirl 1 sibling, 0 replies; 38+ messages in thread From: Jon Smirl @ 2006-09-14 5:30 UTC (permalink / raw) To: Michael Haggerty Cc: Martin Langhoff, Markus Schiltknecht, Git Mailing List, monotone-devel, dev On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote: > Jon Smirl wrote: > > On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote: > >> But aside from this point, I think an intrinsic part of implementing > >> incremental conversion is "convert the subsequent changes to the CVS > >> repository *subject to the constraints* imposed by decisions made in > >> earlier conversion runs. And the real trick is that things can be done > >> in CVS (e.g., line-end changes, manual copying of files in the repo) > >> that (a) are unversioned and (b) have retroactive effects that go > >> arbitrarily far back in time. This is the reason that I am pessimistic > >> that incremental conversion will ever work robustly. > > > > We don't need really robust incremental conversion. It just needs to > > work most of the time. Incremental conversion is usually used to track > > the main CVS repo with the new tool while people decide if they like > > the new tool. Commits will still flow to the CVS repo and get > > incrementally copied to the new tool so that it tracks CVS in close to > > real time. > > I hadn't thought of the idea of using incremental conversion as an > advertising method for switching SCM systems :-) But if changes flow > back to CVS, doesn't this have to be pretty robust? Changes flow back to CVS by using the new tool to generate a patch, applying the patch to your CVS checkout and committing it. There are too many people working on Mozilla to get agreement to switch in a short amount of time. git may need to mirror CVS for several months. There are also other people pushing svn, monotone, perforce, etc, etc, etc. 
Bottom line, Mozilla really needs a distributed system because external companies are making large changes and want their repos in house. In my experience none of the other SCMs are up to taking on Mozilla yet. Git has the tools but I can't get a clean import. I am using this process on Mozilla right now with git. I have a script that updates my CVS tree overnight and then commits the changes into a local git repo. I can then work on Mozilla using git but my history is all messed up. When a change is ready I generate a diff against last night's checkout and apply it to my CVS tree and commit. CVS then finds any merge problems for me. > > In our trial period, we simply did a single conversion to SVN and let > people play with this test repository. When we decided to switch over > we did another full conversion and simply discarded the changes that had > been made in the test SVN repository. > > The use cases that I had considered were: > > 1. For conversions that take days, one could do a full commit while > leaving CVS online, then take CVS offline and do only an incremental > conversion to reduce SCM downtime. This is of course less of an issue > if you could bring the conversion time down to a couple hours for even > the largest CVS repos. > > 2. Long-term continuous mirroring (backwards and forwards) between CVS > and another SCM, to allow people to use their preferred tool. (I > actually think that this is a silly idea, but some people seem to like it.) > > For both of these applications, incremental conversion would have to be > robust (for 1 it would at least have to give a clear indication of > unrecoverable errors). > > > Michael > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 4:17 ` Michael Haggerty 2006-09-14 4:34 ` Jon Smirl @ 2006-09-14 4:40 ` Martin Langhoff 1 sibling, 0 replies; 38+ messages in thread From: Martin Langhoff @ 2006-09-14 4:40 UTC (permalink / raw) To: Michael Haggerty Cc: Markus Schiltknecht, Jon Smirl, Git Mailing List, monotone-devel, dev On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote: > > IIRC, it places branch tags as late as possible. I haven't looked at > > it in detail, but an import immediately after the first commit against > > the branch may yield a different branchpoint from the same import done > > a bit later. > > This is correct. And IMO it makes sense from the standpoint of an > all-at-once conversion. > > But I was under the impression that this wouldn't matter for > content-indexed-based SCMs. The content of all possible branching > points is identical, and therefore from your point of view the topology > should be the same, no? Exactly. But if you shift the branching point to later, two things change - it is possible that (in some corner cases) the content itself changes as the branching point could end up being moved a couple of commits "later". one of the downsides of cvs not being atomic. - even if the content does not change, rearranging of history in git is a no-no. git relies on history being read-only 100% > But aside from this point, I think an intrinsic part of implementing > incremental conversion is "convert the subsequent changes to the CVS > repository *subject to the constraints* imposed by decisions made in > earlier conversion runs. Yes, and that's a fundamental change in the algorithm. That's exactly why I mentioned it in this thread ;-) Any incremental importer has to make up some parts of history, and then remember what it has made up. 
So part of the process becomes - figure out history on top of the history we already parsed - check whether the cvs repo now has any 'new' history that affects already-parsed history negatively, and report those as errors hmmmmmm. > This is the reason that I am pessimistic > that incremental conversion will ever work robustly. We all are :) But for a repo that doesn't go through direct tampering, we can improve the algorithm to be more stable. martin ^ permalink raw reply [flat|nested] 38+ messages in thread
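Martin's two steps can be sketched as a consistency check (the state format here is hypothetical; no real importer stores exactly this): fingerprint the prefix of parsed CVS history that was already converted, and on the next run verify that the freshly parsed history still begins with that same prefix before converting anything new.

```python
import hashlib
import json

def fingerprint(history_prefix):
    """Stable digest of the already-converted prefix of parsed CVS history.
    history_prefix is a list of JSON-serializable records."""
    blob = json.dumps(history_prefix, sort_keys=True).encode()
    return hashlib.sha1(blob).hexdigest()

def incremental_step(saved_state, cvs_history):
    """Return the 'new' history to convert on top of the previous run, or
    raise if the already-parsed history was retroactively altered (e.g. by
    hand-editing RCS files), which makes the increment untrustworthy."""
    n = saved_state["count"]
    if fingerprint(cvs_history[:n]) != saved_state["digest"]:
        raise RuntimeError("already-parsed CVS history changed; "
                           "incremental import unsafe, full reimport needed")
    return cvs_history[n:]
```

This only detects tampering; deciding which made-up choices (branch points, changeset splits) must be pinned between runs is the harder part the thread is circling around.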
* Re: cvs import 2006-09-13 20:41 ` Martin Langhoff 2006-09-13 21:04 ` Markus Schiltknecht @ 2006-09-13 21:05 ` Markus Schiltknecht 2006-09-13 21:38 ` Jon Smirl 1 sibling, 1 reply; 38+ messages in thread From: Markus Schiltknecht @ 2006-09-13 21:05 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jon Smirl, Git Mailing List, monotone-devel, dev Martin Langhoff wrote: > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote: >> Let's copy the git list too and maybe we can come up with one importer >> for everyone. > > It's a really good idea. cvsps has been for a while a (limited, buggy) > attempt at that. BTW: good point, I always thought about cvsps. Does anybody know what 'dump' format it uses? For sure its algorithm isn't that strong. cvs2svn is better, IMHO. The proposed dependency-resolving algorithm will be even better /me thinks. Regards Markus ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-13 21:05 ` Markus Schiltknecht @ 2006-09-13 21:38 ` Jon Smirl 2006-09-14 5:36 ` Michael Haggerty 0 siblings, 1 reply; 38+ messages in thread From: Jon Smirl @ 2006-09-13 21:38 UTC (permalink / raw) To: Markus Schiltknecht Cc: Martin Langhoff, Git Mailing List, monotone-devel, dev On 9/13/06, Markus Schiltknecht <markus@bluegap.ch> wrote: > Martin Langhoff wrote: > > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote: > >> Let's copy the git list too and maybe we can come up with one importer > >> for everyone. > > > > It's a really good idea. cvsps has been for a while a (limited, buggy) > > attempt at that. > > BTW: good point, I always thought about cvsps. Does anybody know what > 'dump' format that uses? cvsps has potential but the multiple missing branch labels in the Mozilla CVS confuse it and it throws away important data. Its algorithm would need reworking too. cvs2svn is the only CVS converter that imported Mozilla CVS on the first try and mostly got things right. Patchset format for cvsps http://www.cobite.com/cvsps/README AFAIK none of the CVS converters are using the dependency algorithm. So the proposal on the table is to develop a new converter that uses the dependency data from CVS to form the change sets and then outputs this data in a form that all of the backends can consume. Of course each of the backends is going to have to write some code in order to consume this new import format. > > For sure it's algorithm isn't that strong. cvs2svn is better, IMHO. The > proposed dependency resolving algorithm will be even better /me thinks. > > Regards > > Markus > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-13 21:38 ` Jon Smirl @ 2006-09-14 5:36 ` Michael Haggerty 2006-09-14 15:50 ` Shawn Pearce 0 siblings, 1 reply; 38+ messages in thread From: Michael Haggerty @ 2006-09-14 5:36 UTC (permalink / raw) To: Jon Smirl Cc: Markus Schiltknecht, Martin Langhoff, Git Mailing List, monotone-devel, dev Jon Smirl wrote: > On 9/13/06, Markus Schiltknecht <markus@bluegap.ch> wrote: >> Martin Langhoff wrote: >> > On 9/14/06, Jon Smirl <jonsmirl@gmail.com> wrote: >> >> Let's copy the git list too and maybe we can come up with one importer >> >> for everyone. That would be great. > AFAIK none of the CVS converters are using the dependency algorithm. > So the proposal on the table is to develop a new converter that uses > the dependency data from CVS to form the change sets and then outputs > this data in a form that all of the backends can consume. Of course > each of the backends is going to have to write some code in order to > consume this new import format. Frankly, I think people are getting the priorities wrong by focusing on the format of the output of cvs2svn. Hacking a new output format onto cvs2svn is a trivial matter of a couple hours of programming. The real strength of cvs2svn (and I can say this without bragging because most of this was done before I got involved in the project) is that it handles dozens of peculiar corner cases and bizarre CVS perversions, including a good test suite containing lots of twisted little example repositories. This is 90% of the intellectual content of cvs2svn. I've spent many, many hours refactoring and reengineering cvs2svn to make it easy to modify and add new features. The main thing that I want to change is to use the dependency graph (rather than timestamps tweaked to reflect dependency ordering) to deduce changesets. But I would never think of throwing away the "old" cvs2svn and starting anew, because then I would have to add all the little corner cases again from scratch. 
It would be nice to have a universal dumpfile format, but IMO not critical. The only difference between our SCMs that might be difficult to paper over in a universal dumpfile is that SVN wants its changesets in chronological order, whereas I gather that others would prefer the data in dependency order branch by branch. I say let cvs2svn (or if you like, we can rename it to "cvs2noncvs" :-) ) reconstruct the repository's change sets, then let us build several backends that output the data in the format that is most convenient for each project. Michael ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 5:36 ` Michael Haggerty @ 2006-09-14 15:50 ` Shawn Pearce 2006-09-14 16:04 ` Jakub Narebski 2006-09-15 7:37 ` Markus Schiltknecht 0 siblings, 2 replies; 38+ messages in thread From: Shawn Pearce @ 2006-09-14 15:50 UTC (permalink / raw) To: Michael Haggerty Cc: Martin Langhoff, monotone-devel, Jon Smirl, dev, Git Mailing List Michael Haggerty <mhagger@alum.mit.edu> wrote: > The only difference between our SCMs that might be difficult > to paper over in a universal dumpfile is that SVN wants its changesets > in chronological order, whereas I gather that others would prefer the > data in dependency order branch by branch. This really isn't an issue for Git. Originally I wanted Jon Smirl to modify the cvs2svn code to emit only one branch at a time as that would be much faster than jumping around branches in chronological order. But it turned out to be too much work to change cvs2svn. So git-fast-import (the Git program that consumes the dump stream from Jon's modified cvs2svn) maintains an LRU of the branches in memory and reloads inactive branches as necessary when cvs2svn jumps around. It turns out it didn't matter if the git-fast-import maintained 5 active branches in the LRU or 60. Apparently the Mozilla repo didn't jump around more than 5 branches at a time - most of the time anyway. Branches in git-fast-import seemed to cost us only 2 MB of memory per active branch on the Mozilla repository. Holding 60 of them at once (120 MB) is peanuts on most machines today. But really only 5 (10 MB) were needed for an efficient import. I don't know how the Monotone guys feel about it but I think Git is happy with the data in any order, just so long as the dependency chains aren't fed out of order. Which I think nearly all changeset based SCMs would have an issue with. So we should be just fine with the current chronological order produced by cvs2svn. -- Shawn. ^ permalink raw reply [flat|nested] 38+ messages in thread
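The branch LRU Shawn describes can be approximated with an ordered map. This is a sketch of the idea only (git-fast-import's real implementation is C and tracks far more state); `load` stands in for whatever reloads an inactive branch's tree state from the repository.

```python
from collections import OrderedDict

class BranchLRU:
    """Keep at most `limit` branches' state in memory; evict the least
    recently used branch and reload it on demand via `load`."""
    def __init__(self, limit, load):
        self.limit, self.load = limit, load
        self.cache = OrderedDict()   # branch name -> branch state
        self.reloads = 0             # how often we had to (re)load

    def get(self, branch):
        if branch in self.cache:
            self.cache.move_to_end(branch)        # mark most recently used
        else:
            if len(self.cache) >= self.limit:
                self.cache.popitem(last=False)    # evict the LRU branch
            self.cache[branch] = self.load(branch)
            self.reloads += 1
        return self.cache[branch]
```

With such a cache, the dump stream's ordering only affects performance (reload count), not correctness, which matches Shawn's observation that 5 versus 60 active branches made little difference on the Mozilla import.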
* Re: cvs import 2006-09-14 15:50 ` Shawn Pearce @ 2006-09-14 16:04 ` Jakub Narebski 2006-09-14 16:18 ` Shawn Pearce 2006-09-14 16:27 ` Jon Smirl 1 sibling, 2 replies; 38+ messages in thread From: Jakub Narebski @ 2006-09-14 16:04 UTC (permalink / raw) To: git; +Cc: monotone-devel, dev, monotone-devel, git Shawn Pearce wrote: > Originally I wanted Jon Smirl to modify the cvs2svn (...) By the way, will cvs2git (modified cvs2svn) and git-fast-import be publicly available? -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 16:04 ` Jakub Narebski @ 2006-09-14 16:18 ` Shawn Pearce 2006-09-14 16:27 ` Jon Smirl 1 sibling, 0 replies; 38+ messages in thread From: Shawn Pearce @ 2006-09-14 16:18 UTC (permalink / raw) To: Jakub Narebski; +Cc: git, monotone-devel, dev Jakub Narebski <jnareb@gmail.com> wrote: > Shawn Pearce wrote: > > > Originally I wanted Jon Smirl to modify the cvs2svn (...) > > By the way, will cvs2git (modified cvs2svn) and git-fast-import publicly > available? Yes. I want to submit git-fast-import to the main Git project and ask Junio to bring it in. However right now I feel like the code isn't up-to-snuff and won't pass peer review on the Git mailing list. So I wanted to spend a little bit of time cleaning it up before asking Junio to carry it in the main distribution. My pack mmap window code is actually part of that cleanup. I think the goal of this thread is to try and merge the ideas behind Jon's modified cvs2svn into the core cvs2svn, possibly causing cvs2svn to be renamed to cvs2notcvs (or some such) and having a slightly more modular output format so Git, Monotone and Subversion can all benefit from the difficult-to-do-right changeset generation logic. -- Shawn. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 16:04 ` Jakub Narebski 2006-09-14 16:18 ` Shawn Pearce @ 2006-09-14 16:27 ` Jon Smirl 2006-09-14 17:01 ` Michael Haggerty 1 sibling, 1 reply; 38+ messages in thread From: Jon Smirl @ 2006-09-14 16:27 UTC (permalink / raw) To: Jakub Narebski; +Cc: git, monotone-devel, dev On 9/14/06, Jakub Narebski <jnareb@gmail.com> wrote: > Shawn Pearce wrote: > > > Originally I wanted Jon Smirl to modify the cvs2svn (...) > > By the way, will cvs2git (modified cvs2svn) and git-fast-import publicly > available? It has some unresolved problems so I wasn't spreading it around everywhere. It is based on cvs2svn from August. There has been too much change to the current cvs2svn to merge it anymore. It is going to need a significant rewrite. But cvs2svn will all change again if it converts to the dependency model. It is better to get a backend-independent interface built into cvs2svn. It is not generating an accurate repo. cvs2svn is outputting tags based on multiple revisions; git can't do that. I'm just tossing some of the tag data that git can't handle. I base the tag on the first revision, which is not correct. If the repo is missing branch tags cvs2svn may turn a single missing branch into hundreds of branches. The Mozilla repo has about 1000 extra branches because of this. Sometimes cvs2svn will partially copy from another rev to generate a new rev. Git doesn't do this, so I am tossing the copy requests. I need to figure out how to hook into the data before cvs2svn tries to copy things. cvs2svn makes no attempt to detect merges, so gitk will show 1,700 active branches when there are really only 10 currently active branches in Mozilla. That said, 99.9% of Mozilla CVS is in the output git repo, but it isn't quite right. If you still want the code I'll send it to you. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 16:27 ` Jon Smirl @ 2006-09-14 17:01 ` Michael Haggerty 2006-09-14 17:08 ` Jakub Narebski 2006-09-14 17:17 ` Jon Smirl 0 siblings, 2 replies; 38+ messages in thread From: Michael Haggerty @ 2006-09-14 17:01 UTC (permalink / raw) To: Jon Smirl; +Cc: monotone-devel, dev, git, Jakub Narebski Jon Smirl wrote: > On 9/14/06, Jakub Narebski <jnareb@gmail.com> wrote: >> Shawn Pearce wrote: >> >> > Originally I wanted Jon Smirl to modify the cvs2svn (...) >> >> By the way, will cvs2git (modified cvs2svn) and git-fast-import publicly >> available? > > It has some unresolved problems so I wasn't spreading it around everywhere. > > It is based on cvs2svn from August. There has been too much change to > the current cvs2svn to merge it anymore. [...] > > If the repo is missing branch tags cvs2svn may turn a single missing > branch into hundreds of branches. The Mozilla repo has about 1000 > extra branches because of this. [To explain to our studio audience:] Currently, if there is an actual branch in CVS but no symbol associated with it, cvs2svn generates branch labels like "unlabeled-1.2.3", where "1.2.3" is the branch revision number in CVS for the particular file. The problem is that the branch revision numbers for files in the same logical branch are usually different. That is why many extra branches are generated. Such unnamed branches cannot reasonably be accessed via CVS anyway, and somebody probably made the conscious decision to delete the branch from CVS (though without doing it correctly). Therefore such revisions are probably garbage. It would be easy to add an option to discard such revisions, and we should probably do so. (In fact, they can already be excluded with "--exclude=unlabeled-.*".) The only caveat is that it is possible for other, named branches to sprout from an unnamed branch. In this case either the second branch would have to be excluded too, or the unlabeled branch would have to be included. 
Alternatively, there was a suggestion to add heuristics to guess which files' "unlabeled" branches actually belong in the same original branch. This would be a lot of work, and the result would never be very accurate (for one thing, there is no evidence of the branch whatsoever in files that had no commits on the branch). Other ideas are welcome. Michael ^ permalink raw reply [flat|nested] 38+ messages in thread
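[Editor's sketch] As a small illustration of the option Michael mentions, the `--exclude` pattern is an ordinary regular expression matched against symbol names; the mimic below only shows which generated labels the pattern `unlabeled-.*` would discard (the named branch here is made up, and this does not invoke cvs2svn itself):

```python
import re

# The pattern from --exclude=unlabeled-.* above; cvs2svn generates
# labels like "unlabeled-1.2.3" for branches that have no symbol in CVS.
exclude = re.compile(r"unlabeled-.*")

symbols = ["unlabeled-1.2.3", "unlabeled-1.4.2", "RELEASE_1_0"]  # hypothetical
kept = [s for s in symbols if not exclude.fullmatch(s)]
```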
* Re: cvs import 2006-09-14 17:01 ` Michael Haggerty @ 2006-09-14 17:08 ` Jakub Narebski 2006-09-14 17:17 ` Jon Smirl 1 sibling, 0 replies; 38+ messages in thread From: Jakub Narebski @ 2006-09-14 17:08 UTC (permalink / raw) To: monotone-devel; +Cc: dev, git Michael Haggerty wrote: > Alternatively, there was a suggestion to add heuristics to guess which > files' "unlabeled" branches actually belong in the same original branch. > This would be a lot of work, and the result would never be very > accurate (for one thing, there is no evidence of the branch whatsoever > in files that had no commits on the branch). > > Other ideas are welcome. Interpolate the state of the repository according to timestamps, with some coarse-graininess, of course. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-14 17:01 ` Michael Haggerty 2006-09-14 17:08 ` Jakub Narebski @ 2006-09-14 17:17 ` Jon Smirl 1 sibling, 0 replies; 38+ messages in thread From: Jon Smirl @ 2006-09-14 17:17 UTC (permalink / raw) To: Michael Haggerty; +Cc: monotone-devel, dev, git, Jakub Narebski On 9/14/06, Michael Haggerty <mhagger@alum.mit.edu> wrote: > Jon Smirl wrote: > > On 9/14/06, Jakub Narebski <jnareb@gmail.com> wrote: > >> Shawn Pearce wrote: > >> > >> > Originally I wanted Jon Smirl to modify the cvs2svn (...) > >> > >> By the way, will cvs2git (modified cvs2svn) and git-fast-import publicly > >> available? > > > > It has some unresolved problems so I wasn't spreading it around everywhere. > > > > It is based on cvs2svn from August. There has been too much change to > > the current cvs2svn to merge it anymore. [...] > > > > If the repo is missing branch tags cvs2svn may turn a single missing > > branch into hundreds of branches. The Mozilla repo has about 1000 > > extra branches because of this. > > [To explain to our studio audience:] Currently, if there is an actual > branch in CVS but no symbol associated with it, cvs2svn generates branch > labels like "unlabeled-1.2.3", where "1.2.3" is the branch revision > number in CVS for the particular file. The problem is that the branch > revision numbers for files in the same logical branch are usually > different. That is why many extra branches are generated. > > Such unnamed branches cannot reasonably be accessed via CVS anyway, and > somebody probably made the conscious decision to delete the branch from > CVS (though without doing it correctly). Therefore such revisions are > probably garbage. It would be easy to add an option to discard such > revisions, and we should probably do so. (In fact, they can already be > excluded with "--exclude=unlabeled-.*".) The only caveat is that it is > possible for other, named branches to sprout from an unnamed branch. 
> In this case either the second branch would have to be excluded too, or the > unlabeled branch would have to be included. In MozCVS there are important branches where the first label has been deleted but there are subsequent branches off from the first branch. These subsequent branches are still visible in CVS. Someone else had this same problem on the cvs2svn list. This has happened twice on major branches. Manually looking at one of these, it looks like the author wanted to change the branch name. They made a branch with the wrong name, branched again with the new name, and deleted the first branch. > Alternatively, there was a suggestion to add heuristics to guess which > files' "unlabeled" branches actually belong in the same original branch. > This would be a lot of work, and the result would never be very > accurate (for one thing, there is no evidence of the branch whatsoever > in files that had no commits on the branch). You wrote up a detailed solution for this a few weeks ago on the cvs2svn list. The basic idea is to look at the change sets on the unlabeled branches. If change sets span multiple unlabeled branches, there should be one unlabeled branch instead of multiple ones. That would work to reduce the number of unlabeled branches down from 1000 to the true number, which I believe is in the 10-20 range. Would the dependency-based model make these relationships more obvious? > > Other ideas are welcome. > > Michael > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 38+ messages in thread
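[Editor's sketch] The grouping Jon recalls — merge unlabeled branches whenever one change set spans several of them — is essentially a connected-components computation. A rough union-find sketch, with a hypothetical data model (each change set is represented as the set of per-file branch labels it touched); this is not cvs2svn code:

```python
def group_unlabeled(changesets):
    """changesets: iterable of sets of per-file branch labels touched by
    one logical commit.  Labels linked by any shared change set are
    merged into one group (union-find with path halving)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for cs in changesets:
        cs = list(cs)
        for label in cs:
            find(label)                     # register every label seen
        for other in cs[1:]:
            union(cs[0], other)             # one change set links its labels

    groups = {}
    for label in parent:
        groups.setdefault(find(label), set()).add(label)
    return list(groups.values())
```

On data like Mozilla's, this is the step that would collapse the ~1000 generated labels toward the 10-20 real branches Jon estimates.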
* Re: cvs import 2006-09-14 15:50 ` Shawn Pearce 2006-09-14 16:04 ` Jakub Narebski @ 2006-09-15 7:37 ` Markus Schiltknecht 2006-09-16 3:39 ` Shawn Pearce 1 sibling, 1 reply; 38+ messages in thread From: Markus Schiltknecht @ 2006-09-15 7:37 UTC (permalink / raw) To: Shawn Pearce Cc: Michael Haggerty, Jon Smirl, Martin Langhoff, Git Mailing List, monotone-devel, dev Hi, Shawn Pearce wrote: > I don't know how the Monotone guys feel about it but I think Git > is happy with the data in any order, just so long as the dependency > chains aren't fed out of order. Which I think nearly all changeset > based SCMs would have an issue with. So we should be just fine > with the current chronological order produced by cvs2svn. I'd vote for splitting into file data (and delta / patches) import and metadata import (author, changelog, DAG). Monotone would be happiest if the file data were sent one file after another and (inside each file) in the order of each file's single history. That guarantees good import performance for monotone. I imagine it's about the same for git. And if you have to somehow cache the files anyway, subversion will benefit, too. (Well, at least the cache will thank us with good performance). After all file data has been delivered, the metadata can be delivered. As neither monotone nor git care much if they are chronological across branches, I'd vote for doing it that way. Regards Markus ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-15 7:37 ` Markus Schiltknecht @ 2006-09-16 3:39 ` Shawn Pearce 2006-09-16 6:04 ` Oswald Buddenhagen 2006-09-16 6:21 ` Nathaniel Smith 0 siblings, 2 replies; 38+ messages in thread From: Shawn Pearce @ 2006-09-16 3:39 UTC (permalink / raw) To: Markus Schiltknecht Cc: Michael Haggerty, Jon Smirl, Martin Langhoff, Git Mailing List, monotone-devel, dev Markus Schiltknecht <markus@bluegap.ch> wrote: > Shawn Pearce wrote: > >I don't know how the Monotone guys feel about it but I think Git > >is happy with the data in any order, just so long as the dependency > >chains aren't fed out of order. Which I think nearly all changeset > >based SCMs would have an issue with. So we should be just fine > >with the current chronological order produced by cvs2svn. > > I'd vote for splitting into file data (and delta / patches) import and > metadata import (author, changelog, DAG). > > Monotone would be happiest if the file data were sent one file after > another and (inside each file) in the order of each file's single > history. That guarantees good import performance for monotone. I imagine > it's about the same for git. And if you have to somehow cache the files > anyway, subversion will benefit, too. (Well, at least the cache will > thank us with good performance). > > After all file data has been delivered, the metadata can be delivered. > As neigther monotone nor git care much if they are chronological across > branches, I'd vote for doing it that way. Right. I think that one of the cvs2svn guys had the right idea here. Provide two hooks: one early during the RCS file parse which supplies a backend with each full-text file revision, and another during the very last stage which includes the "file" in the metadata stream for commit. This would give Git and Monotone a way to grab the full text for each file and stream them out up front, then include only a "token" in the metadata stream which identifies the specific revision. 
Meanwhile SVN can either cache the file revision during the early part or ignore it, then dump out the full content during the metadata. As it happens Git doesn't care what order the file revisions come in. If we don't repack the imported data we would prefer to get the revisions in newest->oldest order so we can delta the older versions against the newer versions (like RCS). This also happens to be the fastest way to extract the revision data from RCS. On the other hand from what I understand of Monotone it needs the revisions in oldest->newest order, as does SVN. Doing both orderings in cvs2noncvs is probably ugly. Doing just oldest->newest (since 2/3 backends want that) would be acceptable but would slow down Git imports as the RCS parsing overhead would be much higher. -- Shawn. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-16 3:39 ` Shawn Pearce @ 2006-09-16 6:04 ` Oswald Buddenhagen 2006-09-16 6:21 ` Nathaniel Smith 1 sibling, 0 replies; 38+ messages in thread From: Oswald Buddenhagen @ 2006-09-16 6:04 UTC (permalink / raw) To: Shawn Pearce Cc: Markus Schiltknecht, Michael Haggerty, Jon Smirl, Martin Langhoff, Git Mailing List, monotone-devel, dev On Fri, Sep 15, 2006 at 11:39:18PM -0400, Shawn Pearce wrote: > On the other hand from what I understand of Monotone it needs > the revisions in oldest->newest order, as does SVN. > > Doing both orderings in cvs2noncvs is probably ugly. > don't worry, as i know mike, he'll come up with an abstract, outright beautiful interface that makes you want to implement middle->oldnewest just for the sake of doing it. :) -- Hi! I'm a .signature virus! Copy me into your ~/.signature, please! -- Chaos, panic, and disorder - my work here is done. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Re: cvs import 2006-09-16 3:39 ` Shawn Pearce 2006-09-16 6:04 ` Oswald Buddenhagen @ 2006-09-16 6:21 ` Nathaniel Smith 1 sibling, 0 replies; 38+ messages in thread From: Nathaniel Smith @ 2006-09-16 6:21 UTC (permalink / raw) To: Shawn Pearce Cc: Martin Langhoff, monotone-devel, Jon Smirl, dev, Git Mailing List On Fri, Sep 15, 2006 at 11:39:18PM -0400, Shawn Pearce wrote: > On the other hand from what I understand of Monotone it needs > the revisions in oldest->newest order, as does SVN. Monotone stores file deltas new->old, similar to git. It should be reasonably efficient at turning them around if it has to, though -- so long as you give all the versions of a single file at a time, so there's some reasonable locality, instead of jumping all around the tree. -- Nathaniel -- "...All of this suggests that if we wished to find a modern-day model for British and American speech of the late eighteenth century, we could probably do no better than Yosemite Sam." ^ permalink raw reply [flat|nested] 38+ messages in thread
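[Editor's sketch] The turning-around Nathaniel describes is cheap precisely because of that per-file locality; a toy illustration, assuming (hypothetically) a dump stream that delivers each file's revisions contiguously, newest first:

```python
def reverse_per_file(stream):
    """stream yields (filename, revision) pairs with all of one file's
    revisions contiguous (newest->oldest); re-emit them oldest->newest
    while never buffering more than a single file's history."""
    buf, current = [], None
    for fname, rev in stream:
        if fname != current and buf:
            yield from reversed(buf)    # flush the previous file, reversed
            buf = []
        current = fname
        buf.append((fname, rev))
    yield from reversed(buf)            # flush the last file
```

If the stream instead jumped all around the tree, every file's partial history would have to stay resident at once, which is exactly the locality problem Nathaniel warns about.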
* Re: cvs import [not found] <45084400.1090906@bluegap.ch> 2006-09-13 19:01 ` cvs import Jon Smirl @ 2006-09-13 22:52 ` Nathaniel Smith 2006-09-13 23:21 ` Daniel Carosone 2006-09-13 23:42 ` Keith Packard 1 sibling, 2 replies; 38+ messages in thread From: Nathaniel Smith @ 2006-09-13 22:52 UTC (permalink / raw) To: Markus Schiltknecht; +Cc: dev, monotone-devel, Git Mailing List On Wed, Sep 13, 2006 at 07:46:40PM +0200, Markus Schiltknecht wrote: > Hi, > > I've been trying to understand the cvsimport algorithm used by monotone > and wanted to adjust that to be more like the one in cvs2svn. > > I've had some problems with cvs2svn itself and began to question the > algorithm used there. It turned out that the cvs2svn people have > discussed an improved algorithms and are about to write a cvs2svn 2.0. > The main problem with the current algorithm is that it depends on the > timestamp information stored in the CVS repository. > > Instead, it would be much better to just take the dependencies of the > revisions into account. Considering the timestamp an irrelevant (for the > import) attribute of the revision. I just read over the thread on the cvs2svn list about this -- I have a few random thoughts. Take them with a grain of salt, since I haven't actually tried writing a CVS importer myself... Regarding the basic dependency-based algorithm, the approach of throwing everything into blobs and then trying to tease them apart again seems backwards. What I'm thinking is, first we go through and build the history graph for each file. Now, advance a frontier across all of these graphs simultaneously. Your frontier is basically a map <filename -> CVS revision> that represents a tree snapshot. 
The basic loop is:

1) pick some subset of files to advance to their next revision
2) slide the frontier one CVS revision forward on each of those files
3) snapshot the new frontier (write it to the target VCS as a new tree commit)
4) go to step 1

Obviously, this will produce a target VCS history that respects the CVS dependency graph, so that's good; it puts a strict limit on how badly whatever heuristics we use can screw us over if they guess wrong about things. Also, it makes the problem much simpler -- all the heuristics are now in step 1, where we are given a bunch of possible edits, and we have to pick some subset of them to accept next. This isn't a trivial problem. I think the main thing you want to avoid is:

  1  2  3  4
  |  |  |  |
--o--o--o--o----- <-- current frontier
  |  |  |  |
  A  B  A  C
     |
     A

say you have four files named "1", "2", "3", and "4". We want to slide the frontier down, and the next edits were originally created by one of three commits, A, B, or C. In this situation, we can take commit B, or we can take commit C, but we don't want to take commit A until _after_ we have taken commit B -- because otherwise we will end up splitting A up into two different commits, A1, B, A2. There are a lot of approaches one could take here, on up to pulling out a full-on optimal constraint satisfaction system (if we can route chips, we should be able to pick a good ordering for accepting CVS edits, after all). A really simple heuristic, though, would be to just pick the file whose next commit has the earliest timestamp, then group in all the other "next commits" with the same commit message, and (maybe) a similar timestamp. I have a suspicion that this heuristic will work really, really well in practice. Also, it's cheap to apply, and worst case you accidentally split up a commit that already had wacky timestamps, and we already know that we _have_ to do that in some cases. Handling file additions could potentially be slightly tricky in this model. 
I guess it is not so bad, if you model added files as being present all along (so you never have to add whole new entries to the frontier), with each file starting out in a pre-birth state, and then addition of the file is the first edit performed on top of that, and you treat these edits like any other edits when considering how to advance the frontier. I have no particular idea on how to handle tags and branches here; I've never actually wrapped my head around CVS's model for those :-). I'm not seeing any obvious problem with handling them, though. In this approach, incremental conversion is cheap, easy, and robust -- simply remember what frontier corresponded to the final revision imported, and restart the process directly at that frontier. Regarding storing things on disk vs. in memory: we always used to stress-test monotone's cvs importer with the gcc history; just a few weeks ago someone did a test import of NetBSD's src repo (~180k commits) on a desktop with 2 gigs of RAM. It takes a pretty big history to really require disk (and for that matter, people with histories that big likely have a big enough organization that they can get access to some big iron to run the conversion on -- and probably will want to anyway, to make it run in reasonable time). 
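[Editor's sketch] Nathaniel's frontier sweep can be put into rough code. Everything here is a hypothetical simplification: each file's history is a flat list of (timestamp, message) pairs, the frontier is a map from filename to the index of the next unconsumed revision, and the step-1 heuristic groups on commit message only (a real importer would also compare timestamps and authors, and handle branches):

```python
def sweep(histories):
    """histories: dict mapping filename -> list of (timestamp, message)
    revisions in per-file order.  Returns a list of synthesized tree
    commits (timestamp, message, files_advanced)."""
    frontier = {f: 0 for f in histories}        # file -> next revision index
    commits = []
    while True:
        # Revisions sitting just below the frontier.
        pending = {f: histories[f][i] for f, i in frontier.items()
                   if i < len(histories[f])}
        if not pending:
            break
        # Heuristic: seed with the file whose next revision is oldest ...
        seed_file = min(pending, key=lambda f: pending[f][0])
        seed_ts, seed_msg = pending[seed_file]
        # ... then group in every other file whose next revision shares
        # the commit message.
        group = [f for f, (ts, msg) in pending.items() if msg == seed_msg]
        for f in group:
            frontier[f] += 1                    # slide the frontier forward
        commits.append((seed_ts, seed_msg, sorted(group)))
    return commits
```

Note how the structure guarantees the output respects per-file dependency order: a file's revision k can never be emitted before its revision k-1, no matter how badly the grouping heuristic guesses.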
> > On the other hand, I see that lots of the cvsimport functionality for > monotone has already been written (rcs file parsing, stuffing files, > file deltas and complete revisions into the monotone database, etc..). > Changing it to a better algorithm does not seem to be _that_ much work > anymore. Plus the hard part seems to be to come up with a good > algorithm, not implementing it. And we could still exchange our > experience with the general algorithm with the cvs2svn people. > > Plus, the guy who mentioned git pointed out that git needs quite a > different dump-format than subversion to do an efficient conversion. I > think coming up with a generally-usable dump format would not be that easy. Probably the biggest technical advantage of having the converter built into monotone is that it makes it easy to import the file contents. Since this data is huge (100x the repo size, maybe?), and the naive algorithm for reconstructing takes time that is quadratic in the depth of history, this is very valuable. I'm not sure what sort of dump format one could come up with that would avoid making this step very expensive. I also suspect that SVN's dump format is suboptimal at the metadata level -- we would essentially have to run a lot of branch/tag inferencing logic _again_ to go from SVN-style "one giant tree with branches described as copies, and multiple copies allowed for branches/tags that are built up over time", to monotone-style "DAG of tree snapshots". This would be substantially less annoying inferencing logic than that needed to decipher CVS in the first place, granted, and it's stuff we want to write at some point anyway to allow SVN importing, but it adds another step where information could be lost. 
I may be biased because I grok monotone better, but I suspect it would be much easier to losslessly convert a monotone-style history to an svn-style history than vice versa; possibly a generic dumping tool would want to generate output that looks more like monotone's model? The biggest stumbling block I see is if it is important to build up branches and tags by multiple copies out of trunk -- there isn't any way to represent that in monotone. A generic tool could also use some sort of hybrid model (e.g., dag-of-snapshots plus some extra annotations), if that worked better. It's also very nice that users don't need any external software to import CVS->monotone, just because it cuts down on hassle, but I would rather have a more hasslesome tool that worked than a less hasslesome tool that didn't, and I'm not the one volunteering to write the code, so :-). Even if we _do_ end up writing two implementations of the algorithm, we should share a test suite. Testing cvs importers is way harder than writing them, because there's no ground truth to compare your program's output to... in fact, having two separate implementations and testing them against each other would be useful to increase confidence in each of them. (I'm only on one of the CC'ed lists, so reply-to-all appreciated) -- Nathaniel -- "On arrival in my ward I was immediately served with lunch. `This is what you ordered yesterday.' I pointed out that I had just arrived, only to be told: `This is what your bed ordered.'" -- Letter to the Editor, The Times, September 2000 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-13 22:52 ` Nathaniel Smith @ 2006-09-13 23:21 ` Daniel Carosone 2006-09-13 23:52 ` [Monotone-devel] " Daniel Carosone 2006-09-13 23:42 ` Keith Packard 1 sibling, 1 reply; 38+ messages in thread From: Daniel Carosone @ 2006-09-13 23:21 UTC (permalink / raw) To: Markus Schiltknecht, monotone-devel, dev, Git Mailing List [-- Attachment #1.1: Type: text/plain, Size: 3325 bytes --] On Wed, Sep 13, 2006 at 03:52:00PM -0700, Nathaniel Smith wrote:

> This isn't trivial problem. I think the main thing you want to avoid
> is:
>   1  2  3  4
>   |  |  |  |
> --o--o--o--o----- <-- current frontier
>   |  |  |  |
>   A  B  A  C
>      |
>      A
> There are a lot of approaches one could take here, on up to pulling
> out a full-on optimal constraint satisfaction system (if we can route
> chips, we should be able to pick a good ordering for accepting CVS
> edits, after all). A really simple heuristic, though, would be to
> just pick the file whose next commit has the earliest timestamp, then
> group in all the other "next commits" with the same commit message,
> and (maybe) a similar timestamp.

Pick the earliest first, or more generally: take all the file commits immediately below the frontier. Find revs further below the frontier (up to some small depth or time limit) on other files that might match them, based on changelog etc (the same grouping you describe, and we do now). Eliminate any of those that are not entirely on the frontier (ie, have some other revision in the way, as with file 2). Commit the remaining set in time order. [*] If you wind up with an empty set, then you need to split revs, but at this point you have only conflicting revs on the frontier (i.e. you've already committed all the other revs you can that might have avoided this need, whereas we currently might be doing this too often). For time order, you could look at each rev as having a time window, from the first to last commit matching. If the rev windows are non-overlapping, commit them in order. 
If the rev windows overlap, at this point we already know the file changes don't overlap - we *could* commit these as parallel heads and merge them, to better model the original developer's overlapping commits. > Handling file additions could potentially be slightly tricky in this > model. I guess it is not so bad, if you model added files as being > present all along (so you never have to add add whole new entries to > the frontier), with each file starting out in a pre-birth state, and > then addition of the file is the first edit performed on top of that, > and you treat these edits like any other edits when considering how to > advance the frontier. CVS allows resurrections too.. > I have no particular idea on how to handle tags and branches here; > I've never actually wrapped my head around CVS's model for those :-). > I'm not seeing any obvious problem with handling them, though. Tags could be modelled as another 'event' in the file graph, like a commit. If your frontier advances through both revisions and a 'tag this revision' event, the same sequencing as above would work. If tags had been moved, this would wind up with a sequence whereby commits interceded with tagging, and we'd need to split the commits such that we could end up with a revision matching the tagged content. > In this approach, incremental conversion is cheap, easy, and robust -- > simply remember what frontier corresponded to the final revision > imported, and restart the process directly at that frontier. Hm. Except for the tagging idea above, because tags can be applied behind a live cvs frontier. -- Dan. [-- Attachment #1.2: Type: application/pgp-signature, Size: 186 bytes --] [-- Attachment #2: Type: text/plain, Size: 158 bytes --] _______________________________________________ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel ^ permalink raw reply [flat|nested] 38+ messages in thread
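[Editor's sketch] Daniel's time-window ordering might look like this. Everything is hypothetical: each candidate change set carries the timestamps of its first and last matching file commit, non-overlapping windows are committed in start order, and overlapping pairs are reported back so they could instead become parallel heads to be merged:

```python
def order_by_window(revs):
    """revs: list of (name, first_ts, last_ts) candidate change sets.
    Returns (ordered, overlapping): candidates sorted by window start,
    plus the adjacent pairs whose windows intersect and would therefore
    be modelled as parallel heads rather than a linear sequence."""
    ordered = sorted(revs, key=lambda r: r[1])      # sort by window start
    overlapping = []
    for (a, _, a_end), (b, b_start, _) in zip(ordered, ordered[1:]):
        if b_start <= a_end:                        # windows intersect
            overlapping.append((a, b))
    return ordered, overlapping
```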
* Re: [Monotone-devel] cvs import 2006-09-13 23:21 ` Daniel Carosone @ 2006-09-13 23:52 ` Daniel Carosone 0 siblings, 0 replies; 38+ messages in thread From: Daniel Carosone @ 2006-09-13 23:52 UTC (permalink / raw) To: Markus Schiltknecht, monotone-devel, dev, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 1478 bytes --] On Thu, Sep 14, 2006 at 09:21:39AM +1000, Daniel Carosone wrote: > > I have no particular idea on how to handle tags and branches here; > > I've never actually wrapped my head around CVS's model for those :-). > > I'm not seeing any obvious problem with handling them, though. > > Tags could be modelled as another 'event' in the file graph, like a > commit. If your frontier advances through both revisions and a 'tag > this revision' event, the same sequencing as above would work. Likewise, if we had "file branched" events in the file lifeline (based on the rcs id's), then we would be sure to always have a monotone revision that corresponded to the branching event, where we could attach the revisions in the branch. Because we can't split tags, and can't split branch events, we will end up splitting file commits (down to individual commits per file) in order to arrive at the revisions we need for those. Because tags and branches can be across subsets of the tree, we gain some scheduling flexibility about where in the reconstructed sequence they can come. Many well-managed CVS repositories will use good practices, such as having a branch base tag. If they do, then they will help this algorithm produce correct results. Once we have a branch with a base starting revision, we can pretty much treat it independently from there: make a whole new set of file lifelines along the RCS branches and a new frontier for it. -- Dan. [-- Attachment #2: Type: application/pgp-signature, Size: 186 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [Monotone-devel] cvs import 2006-09-13 22:52 ` Nathaniel Smith 2006-09-13 23:21 ` Daniel Carosone @ 2006-09-13 23:42 ` Keith Packard 2006-09-14 0:32 ` Nathaniel Smith 2006-09-14 2:35 ` Shawn Pearce 1 sibling, 2 replies; 38+ messages in thread From: Keith Packard @ 2006-09-13 23:42 UTC (permalink / raw) To: Nathaniel Smith Cc: keithp, Markus Schiltknecht, monotone-devel, dev, Git Mailing List [-- Attachment #1: Type: text/plain, Size: 919 bytes --] On Wed, 2006-09-13 at 15:52 -0700, Nathaniel Smith wrote: > Regarding the basic dependency-based algorithm, the approach of > throwing everything into blobs and then trying to tease them apart > again seems backwards. What I'm thinking is, first we go through and > build the history graph for each file. Now, advance a frontier across > the all of these graphs simultaneously. Your frontier is basically a > map <filename -> CVS revision>, that represents a tree snapshot. Parsecvs does this, except backwards from now into the past; I found it easier to identify merge points than branch points (Oh, look, these two branches are the same now, they must have merged). However, this means that parsecvs must hold the entire tree state in memory, which turned out to be its downfall with large repositories. Worked great for all of X.org, not so good with Mozilla. -- keith.packard@intel.com [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: cvs import 2006-09-13 23:42 ` Keith Packard @ 2006-09-14 0:32 ` Nathaniel Smith 2006-09-14 0:57 ` [Monotone-devel] " Jon Smirl 2006-09-14 2:35 ` Shawn Pearce 1 sibling, 1 reply; 38+ messages in thread From: Nathaniel Smith @ 2006-09-14 0:32 UTC (permalink / raw) To: Keith Packard; +Cc: dev, monotone-devel, Git Mailing List On Wed, Sep 13, 2006 at 04:42:01PM -0700, Keith Packard wrote: > However, this means that parsecvs must hold the entire tree state in > memory, which turned out to be its downfall with large repositories. > Worked great for all of X.org, not so good with Mozilla. Does anyone know how big Mozilla (or other humongous repos, like KDE) are, in terms of number of files?

A few numbers for repositories I had lying around:
  Linux kernel -- ~21,000
  gcc -- ~42,000
  NetBSD "src" repo -- ~100,000
  uClinux distro -- ~110,000

These don't seem very intimidating... even if it takes an entire kilobyte per CVS revision to store the information about it that we need to make decisions about how to move the frontier... that's only 110 megabytes for the largest of these repos. The frontier sweeping algorithm only _needs_ to have available the current frontier, and the current frontier+1. Storing information on every version of every file in memory might be worse; but since the algorithm accesses this data in a linear way, it'd be easy enough to stick those in a lookaside table on disk if really necessary, like a bdb or sqlite file or something. (Again, in practice storing all the metadata for the entire 180k revisions of the 100k files in the netbsd repo was possible on a desktop. Monotone's cvs_import does try somewhat to be frugal about memory, though, interning strings and suchlike.) 
-- Nathaniel -- When the flush of a new-born sun fell first on Eden's green and gold, Our father Adam sat under the Tree and scratched with a stick in the mould; And the first rude sketch that the world had seen was joy to his mighty heart, Till the Devil whispered behind the leaves, "It's pretty, but is it Art?" -- The Conundrum of the Workshops, Rudyard Kipling
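Nathaniel's lookaside-table suggestion (a bdb or sqlite file holding per-revision metadata, accessed roughly linearly) is easy to prototype. The sqlite sketch below is hypothetical: the schema and field names are invented here, not from any existing importer.

```python
import sqlite3

# Keep per-revision metadata on disk so the sweep only needs the
# current frontier (and frontier+1) in RAM.  Schema is illustrative.
def open_lookaside(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS revinfo ("
               "file TEXT, rev TEXT, author TEXT, log TEXT, "
               "PRIMARY KEY (file, rev))")
    return db

def store(db, fname, rev, author, log):
    # INSERT OR REPLACE makes re-runs of an import pass idempotent.
    db.execute("INSERT OR REPLACE INTO revinfo VALUES (?, ?, ?, ?)",
               (fname, rev, author, log))

def fetch(db, fname, rev):
    # Returns (author, log) or None if that revision was never stored.
    return db.execute("SELECT author, log FROM revinfo "
                      "WHERE file = ? AND rev = ?", (fname, rev)).fetchone()
```

At ~1 KB of metadata per revision, even the 1M-revision repositories discussed in this thread would fit such a table in about a gigabyte of disk, which matches Nathaniel's back-of-envelope estimate.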
* Re: [Monotone-devel] cvs import 2006-09-14 0:32 ` Nathaniel Smith @ 2006-09-14 0:57 ` Jon Smirl 2006-09-14 1:53 ` Daniel Carosone 0 siblings, 1 reply; 38+ messages in thread From: Jon Smirl @ 2006-09-14 0:57 UTC (permalink / raw) To: Keith Packard, dev, monotone-devel, Git Mailing List On 9/13/06, Nathaniel Smith <njs@pobox.com> wrote: > On Wed, Sep 13, 2006 at 04:42:01PM -0700, Keith Packard wrote: > > However, this means that parsecvs must hold the entire tree state in > > memory, which turned out to be its downfall with large repositories. > > Worked great for all of X.org, not so good with Mozilla. > > Does anyone know how big Mozilla (or other humongous repos, like KDE) > are, in terms of number of files? Mozilla is 120,000 files. The complexity comes from 10 years worth of history. A few of the files have around 1,700 revisions. There are about 1,600 branches and 1,000 tags. The branch number is inflated because cvs2svn is generating extra branches, the real number is around 700. The CVS repo takes 4.2GB disk space. cvs2svn turns this into 250,000 commits over about 1M unique revisions. > > A few numbers for repositories I had lying around: > Linux kernel -- ~21,000 > gcc -- ~42,000 > NetBSD "src" repo -- ~100,000 > uClinux distro -- ~110,000 > > These don't seem very intimidating... even if it takes an entire > kilobyte per CVS revision to store the information about it that we > need to make decisions about how to move the frontier... that's only > 110 megabytes for the largest of these repos. The frontier sweeping > algorithm only _needs_ to have available the current frontier, and the > current frontier+1. Storing information on every version of every > file in memory might be worse; but since the algorithm accesses this > data in a linear way, it'd be easy enough to stick those in a > lookaside table on disk if really necessary, like a bdb or sqlite file > or something.
> > (Again, in practice storing all the metadata for the entire 180k > revisions of the 100k files in the netbsd repo was possible on a > desktop. Monotone's cvs_import does try somewhat to be frugal about > memory, though, interning strings and suchlike.) > > -- Nathaniel -- Jon Smirl jonsmirl@gmail.com
* Re: cvs import 2006-09-14 0:57 ` [Monotone-devel] " Jon Smirl @ 2006-09-14 1:53 ` Daniel Carosone 2006-09-14 2:30 ` [Monotone-devel] " Shawn Pearce 2006-09-14 21:57 ` [Monotone-devel] " Petr Baudis 0 siblings, 2 replies; 38+ messages in thread From: Daniel Carosone @ 2006-09-14 1:53 UTC (permalink / raw) To: Jon Smirl; +Cc: dev, Keith Packard, monotone-devel, Git Mailing List On Wed, Sep 13, 2006 at 08:57:33PM -0400, Jon Smirl wrote: > Mozilla is 120,000 files. The complexity comes from 10 years worth of > history. A few of the files have around 1,700 revisions. There are > about 1,600 branches and 1,000 tags. The branch number is inflated > because cvs2svn is generating extra branches, the real number is > around 700. The CVS repo takes 4.2GB disk space. cvs2svn turns this > into 250,000 commits over about 1M unique revisions. Those numbers are pretty close to those in the NetBSD repository, and between them these probably represent just about the most extensive public CVS test data available. I've only done imports of individual top-level dirs (what used to be modules), like src and pkgsrc, because they're used independently and don't really overlap. src had about 180k commits over 1M versions of 120k files, 1000 tags and 260 branches. pkgsrc had 110k commits over about half as many files and versions thereof. We too have a few hot files, one had 13,625 revisions. xsrc adds a bunch more files and content, but not many versions; that's mostly vendor branches and only some local changes. Between them the cvs ,v files take up 4.7G covering about 13 years of history. One thing that was interesting was that "src" used to be several different modules, but we rearranged the repository at one point to match the checkout structure these modules produced (combining them all under the src dir). This doesn't seem to have upset the import at all.
Just about every other form of CVS evil has been perpetrated in this repository at some stage or other too, but always very carefully. -- Dan. _______________________________________________ Monotone-devel mailing list Monotone-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/monotone-devel
* Re: [Monotone-devel] cvs import 2006-09-14 1:53 ` Daniel Carosone @ 2006-09-14 2:30 ` Shawn Pearce 2006-09-14 3:19 ` Daniel Carosone 2006-09-14 21:57 ` [Monotone-devel] " Petr Baudis 1 sibling, 1 reply; 38+ messages in thread From: Shawn Pearce @ 2006-09-14 2:30 UTC (permalink / raw) To: Daniel Carosone Cc: Jon Smirl, Keith Packard, dev, monotone-devel, Git Mailing List Daniel Carosone <dan@geek.com.au> wrote: > On Wed, Sep 13, 2006 at 08:57:33PM -0400, Jon Smirl wrote: > > Mozilla is 120,000 files. The complexity comes from 10 years worth of > > history. A few of the files have around 1,700 revisions. There are > > about 1,600 branches and 1,000 tags. The branch number is inflated > > because cvs2svn is generating extra branches, the real number is > > around 700. The CVS repo takes 4.2GB disk space. cvs2svn turns this > > into 250,000 commits over about 1M unique revisions. > > Those numbers are pretty close to those in the NetBSD repository, and > between them these probably represent just about the most extensive > public CVS test data available. I don't know exactly how big it is but the Gentoo CVS repository is also considered to be very large (about the size of the Mozilla repository) and just as difficult to import. It's either crashed or taken about a month to process with the current Git CVS->Git tools. Since I know that the bulk of the Gentoo CVS repository is the portage tree I did a quick find|wc -l in my /usr/portage; it's about 124,500 files. It's interesting that Gentoo has almost as large a repository given that it's such a young project, compared to NetBSD and Mozilla. :-) -- Shawn.
* Re: cvs import 2006-09-14 2:30 ` [Monotone-devel] " Shawn Pearce @ 2006-09-14 3:19 ` Daniel Carosone 0 siblings, 0 replies; 38+ messages in thread From: Daniel Carosone @ 2006-09-14 3:19 UTC (permalink / raw) To: Shawn Pearce Cc: Daniel Carosone, Keith Packard, monotone-devel, Jon Smirl, dev, Git Mailing List On Wed, Sep 13, 2006 at 10:30:17PM -0400, Shawn Pearce wrote: > I don't know exactly how big it is but the Gentoo CVS repository > is also considered to be very large (about the size of the Mozilla > repository) and just as difficult to import. It's either crashed or > taken about a month to process with the current Git CVS->Git tools. Ah, thanks for the tip. > Since I know that the bulk of the Gentoo CVS repository is the > portage tree I did a quick find|wc -l in my /usr/portage; it's about > 124,500 files. > > It's interesting that Gentoo has almost as large a repository given > that it's such a young project, compared to NetBSD and Mozilla. :-) Portage uses files and thus CVS very differently, though. Each ebuild for each package revision of each version of a third-party package (like, say, monotone 0.28 and 0.29, and -r1, -r2 pkg bumps of those if they were needed) is its own file that's added, maybe edited a couple of times, and then deleted again later as new versions are added and older ones retired. These are copies and renames in the workspace, but are invisible to CVS. This uses up lots more files than a single long-lived build that gets edited each time; the Attic dirs must have huge numbers of files, way beyond the number that are live now. This lets portage keep builds around in a HEAD checkout for multiple versions at once, tagged internally with different statuses. Effectively, these tags take the place of VCS-based branches and releases, and are more flexible for end users tracking their favourite applications while keeping the rest of their system stable.
If they had a VCS that supported file cloning and/or renaming, and used that to follow history between these ebuild files, things would be very different. There are some interesting use cases for VCS tools in supporting this behaviour nicely, too. -- Dan.
* Re: [Monotone-devel] cvs import 2006-09-14 1:53 ` Daniel Carosone 2006-09-14 2:30 ` [Monotone-devel] " Shawn Pearce @ 2006-09-14 21:57 ` Petr Baudis 2006-09-14 22:04 ` Shawn Pearce 1 sibling, 1 reply; 38+ messages in thread From: Petr Baudis @ 2006-09-14 21:57 UTC (permalink / raw) To: Jon Smirl, Keith Packard, dev, monotone-devel, Git Mailing List Dear diary, on Thu, Sep 14, 2006 at 03:53:24AM CEST, I got a letter where Daniel Carosone <dan@geek.com.au> said that... > On Wed, Sep 13, 2006 at 08:57:33PM -0400, Jon Smirl wrote: > > Mozilla is 120,000 files. The complexity comes from 10 years worth of > > history. A few of the files have around 1,700 revisions. There are > > about 1,600 branches and 1,000 tags. The branch number is inflated > > because cvs2svn is generating extra branches, the real number is > > around 700. The CVS repo takes 4.2GB disk space. cvs2svn turns this > > into 250,000 commits over about 1M unique revisions. > > Those numbers are pretty close to those in the NetBSD repository, and > between them these probably represent just about the most extensive > public CVS test data available. Don't forget OpenOffice. It's just a shame that the OpenOffice CVS tree is not available for cloning. http://wiki.services.openoffice.org/wiki/SVNMigration -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ Snow falling on Perl. White noise covering line noise. Hides all the bugs too. -- J. Putnam
* Re: [Monotone-devel] cvs import 2006-09-14 21:57 ` [Monotone-devel] " Petr Baudis @ 2006-09-14 22:04 ` Shawn Pearce 0 siblings, 0 replies; 38+ messages in thread From: Shawn Pearce @ 2006-09-14 22:04 UTC (permalink / raw) To: Petr Baudis Cc: Jon Smirl, Keith Packard, dev, monotone-devel, Git Mailing List Petr Baudis <pasky@suse.cz> wrote: > Don't forget OpenOffice. It's just a shame that the OpenOffice CVS > tree is not available for cloning. > > http://wiki.services.openoffice.org/wiki/SVNMigration Hmm, the KDE repo is even larger than Mozilla: 19 GB in CVS and 499,367 revisions. Question is, are those distinct file revisions or SVN revisions? And just what machine did they use that completed that conversion in 38 hours? -- Shawn.
* Re: [Monotone-devel] cvs import 2006-09-13 23:42 ` Keith Packard 2006-09-14 0:32 ` Nathaniel Smith @ 2006-09-14 2:35 ` Shawn Pearce 1 sibling, 0 replies; 38+ messages in thread From: Shawn Pearce @ 2006-09-14 2:35 UTC (permalink / raw) To: Keith Packard Cc: Nathaniel Smith, Markus Schiltknecht, monotone-devel, dev, Git Mailing List Keith Packard <keithp@keithp.com> wrote: > On Wed, 2006-09-13 at 15:52 -0700, Nathaniel Smith wrote: > > > Regarding the basic dependency-based algorithm, the approach of > > throwing everything into blobs and then trying to tease them apart > > again seems backwards. What I'm thinking is, first we go through and > > build the history graph for each file. Now, advance a frontier across > > all of these graphs simultaneously. Your frontier is basically a > > map <filename -> CVS revision>, that represents a tree snapshot. > > Parsecvs does this, except backwards from now into the past; I found it > easier to identify merge points than branch points (Oh, look, these two > branches are the same now, they must have merged). Why not let Git do that? If two branches are the same in CVS then shouldn't they have the same tree SHA1 in Git? Surely comparing 20 bytes of SHA1 is faster than almost any other comparison... > However, this means that parsecvs must hold the entire tree state in > memory, which turned out to be its downfall with large repositories. > Worked great for all of X.org, not so good with Mozilla. Any chance that can be paged in on demand from some sort of work file? git-fast-import hangs onto a configurable number of tree states (default of 5) but keeps them in an LRU chain and dumps the ones that aren't current. -- Shawn.
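Shawn's "same tree SHA1" trick can be illustrated without git itself: hash a canonical encoding of a tree snapshot and compare the fixed-size digests. The encoding below is an invented stand-in, not git's actual tree-object format.

```python
import hashlib

# Merged branches are detected by digest equality: identical snapshots
# always produce identical ids, so comparing two branch states is a
# 20-byte comparison instead of a full tree walk.
def tree_id(snapshot):
    """snapshot: dict mapping filename -> content identifier
    (e.g. a CVS revision number or a blob hash)."""
    h = hashlib.sha1()
    for name in sorted(snapshot):  # sort for a canonical ordering
        h.update(name.encode() + b"\0" + snapshot[name].encode() + b"\0")
    return h.hexdigest()
```

Note the one-way caveat: equal digests imply (up to hash collisions) equal trees, so this finds the "these two branches are the same now" merge points Keith describes, but distinct digests tell you nothing about how close two trees are.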