git.vger.kernel.org archive mirror
* cvsps, parsecvs, svn2git and the CVS exporter mess
@ 2012-12-22 17:36 Eric S. Raymond
  2012-12-23 20:21 ` Heiko Voigt
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Eric S. Raymond @ 2012-12-22 17:36 UTC (permalink / raw)
  To: Yann Dirson, Michael Haggerty, Heiko Voigt, Antoine Pelisse,
	Bart Massey, Keith Packard, David Mansfield, git

Wanting reposurgeon to be able to read CVS repositories has landed me
in the middle of a mess.  This note explains my thinking, what I
intend to do to fix the mess, and how others can help if they are
motivated.  I have copied all the individuals who I know have an
interest in the problem, and the git list because what I'm planning is
going to be significant for them.

My requirement is not complicated to describe. For use as a
reposurgeon front end I need a tool that is basically a
cvs-fast-export, runnable in either a CVS repository or a checkout
(either will do, both is not required) and emitting a fast-import
stream.

There are three competing tools that might fit this bill.  

* One is Michael Haggerty's cvs2git.  I had bad experiences with the
cvs2svn code it's derived from in the past, but Michael believes those
problems have been fixed and I will accept that - at least until I can
test for myself.  Its documented interface is not quite good enough
yet; as the documentation says, "The data that should be fed to git
fast-import are written to two files, which have to be loaded into git
fast-import manually."

* Another is the cvsps code formerly maintained by David Mansfield and
used by git-cvsimport; he passed the maintainer's baton to me, and I
have shipped a 3.0 with a working --fast-export option.

* A third is parsecvs, which Keith Packard and Bart Massey handed off
to me a week before David invited me to take over cvsps.  While
parsecvs does not yet have a --fast-export option, I anticipate no
great difficulty in adding one.

It is pure accident that I now maintain two of these.  Initially I
was interested in parsecvs, but it failed to build for me.  By the
time Bart Massey sent me a fix patch, I had already been handed 
cvsps, added --fast-export, and shipped 3.0.

Having three different tools for this job seems to me duplicative and
pointless; two of them should probably be let die an honorable death.
I don't actually care which of the three survives - and, in
particular, if I determine that cvs2git is doing the best job of the
three I am quite willing to declare end-of-life for cvsps and
parsecvs.  It's not like I don't have plenty of other projects to work
on.

Therefore, I think my main focus needs to be developing a really
effective test suite to triage these tools with and applying it to all
three.  I have already made a solid start on this; see
tests/cvstest.py and tests/basic.tst in the cvsps-3.0 distribution for
my test framework.

I presently know of three test suites other than mine. One was built
by Heiko to test cvsps, another lives in the git t/ directory, and the
third is cvs2git's. I haven't looked at cvs2git's yet, but the others
are not in their present form suited to where I am taking cvsps and
parsecvs.  Heiko's relies on the default human-readable cvsps format,
which I consider obsolete and uninteresting.  The git tests are
dependent on details of porcelain behavior.  I think it would be
better to test import-stream output.

Here is what I propose.  Let's build a common test suite that cvs2git,
git-cvsimport, cvsps, and parsecvs can all use, apply it rigorously,
and let the best tool win.  (This would mean, among other things, that
git can stop carrying things that are essentially cvsps tests in its
tree.)

The two people I most need to sign off on this are, I guess, Michael
Haggerty and either Junio Hamano or whoever specifically owns
git-cvsimport and its tests.  Whichever way this comes out, the back
end of git-cvsimport is going to need some work - I don't plan to put
any further effort into the output format it's presently using.

If we can agree on this, I'll start a public repo, and contribute my
Python framework - it's more capable than any of the shell harnesses
out there because it can easily drive interleaved operations on multiple 
checkout directories.

Anybody who is still interested in this problem should contribute
tests.  Heiko Voigt, I'd particularly like you in on this.  David
Mansfield, if you can spare the few minutes required to write
generators for the "funky" and "invalid" tag cases, that would be
really helpful.  Michael Haggerty, your piece would be moving the
cvs2git tests to the new framework.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

The kind of charity you can force out of people nourishes about as much as
the kind of love you can buy --- and spreads even nastier diseases.


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2012-12-22 17:36 cvsps, parsecvs, svn2git and the CVS exporter mess Eric S. Raymond
@ 2012-12-23 20:21 ` Heiko Voigt
  2012-12-23 22:45   ` Eric S. Raymond
  2013-01-03 15:37 ` Michael Haggerty
  2013-01-03 15:51 ` Martin Langhoff
  2 siblings, 1 reply; 11+ messages in thread
From: Heiko Voigt @ 2012-12-23 20:21 UTC (permalink / raw)
  To: Eric S. Raymond
  Cc: Yann Dirson, Michael Haggerty, Antoine Pelisse, Bart Massey,
	Keith Packard, David Mansfield, git

Hi,

On Sat, Dec 22, 2012 at 12:36:48PM -0500, Eric S. Raymond wrote:
> If we can agree on this, I'll start a public repo, and contribute my
> Python framework - it's more capable than any of the shell harnesses
> out there because it can easily drive interleaved operations on multiple 
> checkout directories.

Please share so we can have a look. BTW, where can I find your cvsps
code?

> Anybody who is still interested in this problem should contribute
> tests.  Heiko Voigt, I'd particularly like you in on this.

If it does not take too much effort I could port my tests to the new
framework. Since I am currently not in active need of CVS conversions
it's not of big interest to me anymore. But if it does not take too much
time I am happy to help.

From my past cvs conversion experiences my personal guess is that
cvs2svn will win this competition.

Cheers Heiko


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2012-12-23 20:21 ` Heiko Voigt
@ 2012-12-23 22:45   ` Eric S. Raymond
  0 siblings, 0 replies; 11+ messages in thread
From: Eric S. Raymond @ 2012-12-23 22:45 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Yann Dirson, Michael Haggerty, Antoine Pelisse, Bart Massey,
	Keith Packard, David Mansfield, git

Heiko Voigt <hvoigt@hvoigt.net>:
> Please share so we can have a look. BTW, where can I find your cvsps
> code?

https://gitorious.org/cvsps

Developments of the last 48 hours:

1. Andreas Schwab sent me a patch that uses commitids wherever the history
   has them - this makes all the time-skew problems go away.  I added code
   to warn if commitids aren't present, so users will get a clear indication
   of when time-skew problems might bite them versus when that is happily
   impossible.

2. I've scrapped a lot of obsolete code and options.  The repo head
   version uses what used to be called cvs-direct mode all the time
   now; it works, and the effect on performance is major.  This also
   means that cvsps doesn't need to use any local CVS commands or even
   have CVS installed where it runs.
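
The commitid point in item 1 can be illustrated with a minimal sketch. This is not the cvsps code: the `FileRev` record, the `group_changesets` function and the 300-second `fuzz` window are hypothetical names and values chosen for illustration. Revisions that carry a commitid are grouped exactly, however skewed their timestamps are; revisions without one fall back to a time-window heuristic, and the caller is told so it can warn.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class FileRev(NamedTuple):
    """One per-file CVS revision (hypothetical record layout)."""
    path: str
    revision: str
    author: str
    log: str
    timestamp: int            # seconds since the epoch
    commitid: Optional[str]   # None when the history has no commitid


def group_changesets(revs, fuzz=300):
    """Group file revisions into changesets.

    Returns (changesets, used_heuristics). The flag is True when some
    revisions lacked a commitid, so time skew could still cause trouble.
    """
    by_id = defaultdict(list)
    legacy = []
    for rev in revs:
        if rev.commitid:
            by_id[rev.commitid].append(rev)   # exact grouping, skew-proof
        else:
            legacy.append(rev)

    changesets = list(by_id.values())

    # Fallback: same author and log message, timestamps within `fuzz`
    # seconds of the previous revision in the group.
    legacy.sort(key=lambda r: (r.author, r.log, r.timestamp))
    current = []
    for rev in legacy:
        same_commit = (
            current
            and (rev.author, rev.log) == (current[-1].author, current[-1].log)
            and rev.timestamp - current[-1].timestamp <= fuzz
        )
        if same_commit:
            current.append(rev)
        else:
            if current:
                changesets.append(current)
            current = [rev]
    if current:
        changesets.append(current)

    return changesets, bool(legacy)
```

A caller would print a warning when the second return value is true, which is the "clear indication" described above.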

> From my past cvs conversion experiences my personal guess is that
> cvs2svn will win this competition.

That could be.  But right now cvsps has one significant advantage over
cvs2git (which parsecvs might share) - it's *blazingly* fast.  So fast
that I scrapped all the local-caching logic; there seems no point to it at
today's network speeds, and that's one less layer of complications to
go wrong.

I've removed a couple hundred lines of code and the program works
better and faster than it did before.  That's having a good day!
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2012-12-22 17:36 cvsps, parsecvs, svn2git and the CVS exporter mess Eric S. Raymond
  2012-12-23 20:21 ` Heiko Voigt
@ 2013-01-03 15:37 ` Michael Haggerty
  2013-01-03 20:53   ` Eric S. Raymond
  2013-01-03 15:51 ` Martin Langhoff
  2 siblings, 1 reply; 11+ messages in thread
From: Michael Haggerty @ 2013-01-03 15:37 UTC (permalink / raw)
  To: Eric S. Raymond
  Cc: Yann Dirson, Heiko Voigt, Antoine Pelisse, Bart Massey,
	Keith Packard, David Mansfield, git

On 12/22/2012 06:36 PM, Eric S. Raymond wrote:
> * One is Michael Haggerty's cvs2git.  I had bad experiences with the
> cvs2svn code it's derived from in the past, but Michael believes those
> problems have been fixed and I will accept that - at least until I can
> test for myself.  Its documented interface is not quite good enough
> yet; as the documentation says, "The data that should be fed to git
> fast-import are written to two files, which have to be loaded into git
> fast-import manually."

There are two good reasons that the output is written to two separate files:

1. The files are generated during different passes of cvs2git, and since
the cvs2git conversion is restartable pass-by-pass, the first file might
only need to be generated once even while the user is iterating on
adjustments to other conversion options.

2. The first ("blobfile") contains blob definitions for file revisions,
which are read out of the RCS files in the order they are held in the
RCS file.  This is vastly faster than reading the file revisions in the
order that they are needed for git commits because (1) all revisions for
a file can be computed from one serial read of the RCS file; (2) there
is no need to jump around from rcsfile to rcsfile.  The second
("dumpfile") stitches the blobs together into git commits by referring
to the blobs that are needed.  This file is smaller because it doesn't
contain the actual file contents.  Another advantage of this approach is
that a blob need only appear once in the blobfile even if it is used
multiple times in the git history.

Anyway, surely cat'ing two output files together is not such a difficult
problem?

A potentially bigger problem is that if you want to handle such
blob/dump output, you have to deal with git-fast-import format's "blob"
command as opposed to only handling inline blobs.  However, if that is a
problem, it is possible to configure cvs2git to write the blobs inline
with the rest of the dumpfile (this mode is supported because "hg
fast-import" doesn't support detached blobs).  You would have to create
an options file that uses GitRevisionInlineWriter, similar to what is
done in cvs2hg-example.options.
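
To make the difference between detached and inline blobs concrete, here is a small sketch that builds both forms of a one-file, one-commit fast-import stream. It is only an illustration of the stream format, not cvs2git output; the committer identity, timestamp, path and function names are made up for the example.

```python
def data(payload: bytes) -> bytes:
    """Format a fast-import 'data' command with an exact byte count."""
    return b"data %d\n" % len(payload) + payload + b"\n"


# Placeholder identity and date, purely illustrative.
COMMITTER = b"committer Example <example@example.invalid> 1356197760 +0000\n"


def detached_stream(path: bytes, content: bytes, log: bytes) -> bytes:
    """Blob defined first under a mark, then referenced by the commit."""
    blob = b"blob\nmark :1\n" + data(content)
    commit = (
        b"commit refs/heads/master\n"
        + COMMITTER
        + data(log)
        + b"M 100644 :1 " + path + b"\n"
    )
    # With a two-file layout, `blob` would go to the blobfile and
    # `commit` to the dumpfile; concatenated they form one stream.
    return blob + commit


def inline_stream(path: bytes, content: bytes, log: bytes) -> bytes:
    """File content carried inside the commit, with no 'blob' command."""
    return (
        b"commit refs/heads/master\n"
        + COMMITTER
        + data(log)
        + b"M 100644 inline " + path + b"\n"
        + data(content)
    )
```

In the detached form a blob that recurs in history is written once and referenced by mark each time; in the inline form the content is repeated wherever it is used.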

> [...]
> Having three different tools for this job seems to me duplicative and
> pointless; two of them should probably be let die an honorable death.
> I don't actually care which of the three survives - and, in
> particular, if I determine that cvs2git is doing the best job of the
> three I am quite willing to declare end-of-life for cvsps and
> parsecvs.  It's not like I don't have plenty of other projects to work
> on.

cvs2git does not currently support incremental conversions; therefore, a
cvsps-based option (if it would actually work, that is) would have at
least one advantage over cvs2git.

> I presently know of three test suites other than mine. One was built
> by Heiko to test cvsps, another lives in the git t/ directory, and the
> third is cvs2git's. I haven't looked at cvs2git's yet, but the others
> are not in their present form suited to where I am taking cvsps and
> parsecvs.  Heiko's relies on the default human-readable cvsps format,
> which I consider obsolete and uninteresting.  The git tests are
> dependent on details of porcelain behavior.  I think it would be
> better to test import-stream output.

cvs2svn has an extensive test suite which includes tests derived from
bug reports that we have received over the years.  I adapted a few of
its test repositories to create the git test suite additions that I made
in Feb 2009, but there are many more in our project.

A lot of our test suite deals with additional conversion features, like:

* Re-encoding filenames, usernames, and log messages from whatever
happens to have been used in the CVS repository into UTF-8

* Fixing CVS branches, tags, and mixed branch/tag messes according to
user wishes; renaming branches and tags

* Allowing the user to influence the choice of which branch should serve
as the source for another branch/tag (CVS records this information very
ambiguously)

* Fixing binary vs. text files, expanding/contracting CVS keywords, etc.

* Removing lots of synthetic revisions and other cruft generated by CVS
to fit within the RCS file format

* Dealing with vendor branches in a sensible way, especially considering
that very many users misuse vendor branches for initial imports

* Dealing with various common types of CVS repository corruption

See our list of features [1] for more details.  Presumably many of these
features would not be covered by your test framework, and are not
supported by the other conversion tools.

Unfortunately, our tests are mostly based on cvs2svn (i.e., not 2git);
that is, the conversion is done with cvs2svn and checked by verifying
the contents of the resulting Subversion repository.

The script contrib/verify-cvs2svn.py is another kind of test; it checks
every branch and tag out of CVS and the destination repository and
verifies that their contents are identical.  This script is intended to
be used by users to check their own conversion.  Please note that it
doesn't check the history, only the branch/tag tips.  But this script
works with both Subversion and git (at least it should; it probably
doesn't get tested much).
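
The tip-comparison idea can be sketched in a few lines. This is not contrib/verify-cvs2svn.py; it assumes the two trees have already been checked out into directories by some other means, and the function names and the ignore list are hypothetical. Like the script described above, it says nothing about history, only about the contents of one branch or tag tip.

```python
import hashlib
import os


def tree_digest(root, ignore=("CVS", ".git", ".svn")):
    """Map each file's path relative to `root` to a SHA-1 of its bytes,
    skipping version-control metadata directories."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune metadata directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in ignore]
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                digests[rel] = hashlib.sha1(f.read()).hexdigest()
    return digests


def compare_trees(cvs_checkout, converted_checkout):
    """Return (only_in_cvs, only_in_converted, differing) path lists."""
    a = tree_digest(cvs_checkout)
    b = tree_digest(converted_checkout)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differ = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, differ
```

A conversion would pass this check for a given tip when all three lists come back empty. Keyword expansion and line-ending settings would have to match on both sides for that to be meaningful.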

> Here is what I propose.  Let's build a common test suite that cvs2git,
> git-cvsimport, cvsps, and parsecvs can all use, apply it rigorously,
> and let the best tool win.  (This would mean, among other things, that
> git can stop carrying things that are essentially cvsps tests in its
> tree.)

I think it would be great to have a way to test across tools, though
please realize that the inference of the most plausible "true" CVS
history is partly objective but also often a matter of heuristics and
taste.  Moreover, the choice of how to represent the inferred history in
git, which has rather a different model than CVS/Subversion, is also
non-obvious and somewhat controversial.  I expect that there will be a
number of simple CVS repositories for which we can all agree about the
correct git output, but not far away will be a vast number for which the
"correct" answer is unclear.  Many of the interesting tests would fall
into the latter category.

> The two people I most need to sign off on this are, I guess, Michael
> Haggerty and either Junio Hamano or whoever specifically owns
> git-cvsimport and its tests.  [...]

It's not clear what you want me to sign off on.  I guess you want to
replace (or augment?) the cvs2svn test suite with one based on your
framework?  Right off the top of my head I can think of a few
considerations from the point of view of the cvs2svn project:

* We definitely want to continue testing the Subversion output of
cvs2svn.  A test suite that only tests the git output could at best be
an addition to the current test suite, not a replacement for it.  (That
being said, the addition of good tests of the 2git output would be great.)

* A test suite that tests only the easy cases wouldn't really be
interesting, because the difficult cases are where the potential
problems lie.

* It would be unfortunate if the cvs2svn test suite would grow another
run-time dependency or if we would have to invest a lot of time
synchronizing with another project, though if the gain were big enough
we could consider it.

* The licenses obviously have to be compatible to the extent required by
the level of coupling.

* I don't have a lot of time to work on the integration.  cvs2svn has
long been at a level of maturity where it doesn't need much care and
feeding, and I would like to keep it that way :-)  Nowadays I am far
more interested in working on the git project with my little available
open-sourcin' time.


Rereading this email, I realize that it is not clear to me why your new
testing project needs the "signoff" or cooperation from any of the
conversion tool projects (git-cvsimport or cvs2svn or parsecvs or ...)
in the first place.  The essence of your project will be a collection of
CVS test repositories, and code that can read the conversion output
(whether via git or as fast-import data) and verify that it matches
expectations (right?).  Presumably it will have a place where any of the
conversion tools could be plugged into it, and perhaps a bit of code
that knows how to configure and run the best-known tools (and perhaps
even to download and build them).
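
The plug-in shape described in the paragraph above might look roughly like the following sketch. Nothing here comes from an existing test suite: the `CONVERTERS` registry, `register`, `run_converter` and `check` are hypothetical names, and the only converter registered is a stub that prints a fixed line, standing in for whatever command line a real tool would need.

```python
import subprocess
import sys

# Hypothetical registry: tool name -> function that builds the argv
# used to convert one CVS test repository to a fast-import stream.
CONVERTERS = {}


def register(name):
    """Decorator that plugs an argv-building function into the registry."""
    def wrap(fn):
        CONVERTERS[name] = fn
        return fn
    return wrap


@register("stub")
def _stub(repo_path):
    # Stand-in for a real conversion tool: ignores the repository and
    # writes one fixed line to stdout.
    code = "import sys; sys.stdout.buffer.write(b'reset refs/heads/master\\n')"
    return [sys.executable, "-c", code]


def run_converter(name, repo_path):
    """Run the named converter and return its stdout as bytes."""
    argv = CONVERTERS[name](repo_path)
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout


def check(name, repo_path, expected):
    """True when the converter's stream matches the expected bytes."""
    return run_converter(name, repo_path) == expected
```

A byte-for-byte comparison is the simplest possible check and is probably too strict for real use; streams from different tools would likely need normalizing (mark numbering, ordering) before they could be compared fairly.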

It would seem natural to me that your project stops there, and stays at
arm's length from the conversion projects.  If your test suite proves
itself to be obviously better than the cvs2svn test suite, then we might
try to integrate it *then* (or not; even then it wouldn't really be
obligatory).

Michael

[1] http://cvs2svn.tigris.org/features.html

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2012-12-22 17:36 cvsps, parsecvs, svn2git and the CVS exporter mess Eric S. Raymond
  2012-12-23 20:21 ` Heiko Voigt
  2013-01-03 15:37 ` Michael Haggerty
@ 2013-01-03 15:51 ` Martin Langhoff
  2 siblings, 0 replies; 11+ messages in thread
From: Martin Langhoff @ 2013-01-03 15:51 UTC (permalink / raw)
  To: Eric S. Raymond
  Cc: Yann Dirson, Michael Haggerty, Heiko Voigt, Antoine Pelisse,
	Bart Massey, Keith Packard, David Mansfield, Git Mailing List

On Sat, Dec 22, 2012 at 12:36 PM, Eric S. Raymond <esr@thyrsus.com> wrote:
> It is pure accident that I now maintain two of these.

Maintainership is always temporary.

> Having three different tools for this job seems to me duplicative and
> pointless; two of them should probably be let die an honorable death.

Perhaps just maintain the code that serves your goals. That way, you
don't need long trolly emails nor approval from anyone.




m
--
 martin.langhoff@gmail.com
 martin@laptop.org -- Software Architect - OLPC
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2013-01-03 15:37 ` Michael Haggerty
@ 2013-01-03 20:53   ` Eric S. Raymond
  2013-01-05  8:27     ` Max Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Eric S. Raymond @ 2013-01-03 20:53 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Yann Dirson, Heiko Voigt, Antoine Pelisse, Bart Massey,
	Keith Packard, David Mansfield, git

Michael Haggerty <mhagger@alum.mit.edu>:
> There are two good reasons that the output is written to two separate files:

Those are good reasons to write to a pair of tempfiles, and I was able
to deduce in advance most of what your explanation would be from the
bare fact that you did it that way.

They are *not* good reasons for having an interface that exposes this
implementation detail to the caller - that choice I consider a failure
of interface-design judgment.  But I know how to fix this in a simple and
backward-compatible way, and will do so when I have time to write you
a patch.  Next week or the week after, most likely.

Also, the cvs2git manual page is still rather half-baked and careless,
with several fossil references to cvs2svn that shouldn't be there and
obviously incomplete feature coverage. Fixing these bugs is also on my
to-do list for sometime this month.

I'd be willing to put in this work anyway, but it is still in the back of
my mind that if cvs2git wins the test-suite competition I might
officially end-of-life both cvsps and parsecvs.  One of the features
of the new git-cvsimport is direct support for using cvs2git as a
conversion engine.
 
> A potentially bigger problem is that if you want to handle such
> blob/dump output, you have to deal with git-fast-import format's "blob"
> command as opposed to only handling inline blobs.

Not a problem.  All of the main potential consumers for this output,
including reposurgeon, handle the blob command just fine.

> cvs2git does not currently support incremental conversions; therefore, a
> cvsps-based option (if it would actually work, that is) would have at
> least one advantage over cvs2git.

Yes. The reason I didn't ship the replacement patch Junio was
expecting yesterday is that I don't have test coverage for the
incremental case.  I'm working on that now.

> cvs2svn has an extensive test suite which includes tests derived from
> bug reports that we have received over the years.  I adapted a few of
> its test repositories to create the git test suite additions that I made
> in Feb 2009, but there are many more in our project.

I've merged those into my tree.

> I think it would be great to have a way to test across tools, though
> please realize that the inference of the most plausible "true" CVS
> history is partly objective but also often a matter of heuristics and
> taste.  Moreover, the choice of how to represent the inferred history in
> git, which has rather a different model than CVS/Subversion, is also
> non-obvious and somewhat controversial.  I expect that there will be a
> number of simple CVS repositories for which we can all agree about the
> correct git output, but not far away will be a vast number for which the
> "correct" answer is unclear.  Many of the interesting tests would fall
> into the latter category.

I'm aware of the problem.  One of the interesting questions is how much
further into the weird cases everybody can agree on what correct 
translation looks like.  We won't know until we push it.
 
> It's not clear what you want me to sign off on.

If you're not willing to use the new suite, my spending the effort 
required to genericize it gets much less interesting.  I needed 
Junio's agreement because I wanted to move the old git-cvsimport
tests from the git tree to the new test suite; they're not really
tests of the wrapper script at all but of the conversion engines.

>                                               I guess you want to
> replace (or augment?) the cvs2svn test suite with one based on your
> framework? 

Augment, not replace - and just as importantly, commit to writing 
new tests into the new generic framework when they don't involve a 
tool-specific option.  It would be silly and duplicative for us *not*
to be sharing as many tests as we can.

> * We definitely want to continue testing the Subversion output of
> cvs2svn.  A test suite that only tests the git output could at best be
> an addition to the current test suite, not a replacement for it.  (That
> being said, the addition of good tests of the 2git output would be great.)

Agreed.

> * A test suite that tests only the easy cases wouldn't really be
> interesting, because the difficult cases are where the potential
> problems lie.

Yes, I know.  I'm arguing that we should be doing that exploration
jointly rather than separately.

> * It would be unfortunate if the cvs2svn test suite would grow another
> run-time dependency or if we would have to invest a lot of time
> synchronizing with another project, though if the gain were big enough
> we could consider it.

I know how to keep the friction cost low.  You'll see more about this when
I split off the test suite and announce it.

> * The licenses obviously have to be compatible to the extent required by
> the level of coupling.

I don't think this will be a problem.  You own the copyright on your tests and
I own it on mine, so we can relicense under whatever common license we choose.
I'm not fussy about what we use; ASL 2.0 would be fine by me.

> * I don't have a lot of time to work on the integration.  cvs2svn has
> long been at a level of maturity where it doesn't need much care and
> feeding, and I would like to keep it that way :-)  Nowadays I am far
> more interested in working on the git project with my little available
> open-sourcin' time.

I don't want to spend the rest of my life on the CVS-lifting problem either.
My present plans envision intense work on it for another three weeks or
so, after which I expect we'll be at a relatively stable and low-maintenance
state. 

FYI, here are my agenda items in roughly the order I expect to finish them:

1. Write test coverage for incremental imports.
2. Ship version 2 of the git-cvsimport replacement patch (with the fallback 
   option Junio requested) to the git list.
3. Get parsecvs to a non-broken state and ship a release.
4. Ship a patch for git-cvsimport that adds the option to use parsecvs 
   as a conversion engine.
5. Break the test suite out of cvsps, give it its own public repo, document
   it, and hand you the keys.
6. Fix the interface-design bug(s) in cvs2git, and its documentation.
7. Torture-test all three tools (cvsps, parsecvs, cvs2git) against the
   new suite.
8. Make a judgement about whether I should EOL cvsps or parsecvs or both.

I have other commitments, so this will take a bit longer than it might
have.  I expect to be at step 8 in roughly a month (early February).
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2013-01-03 20:53   ` Eric S. Raymond
@ 2013-01-05  8:27     ` Max Horn
  2013-01-05 15:11       ` Eric S. Raymond
       [not found]       ` <CAA6gtpky9JxFDdpLM6kY9su-9FWX8RoWHU4uptd_Zk+ZJuhrtA@mail.gmail.com>
  0 siblings, 2 replies; 11+ messages in thread
From: Max Horn @ 2013-01-05  8:27 UTC (permalink / raw)
  To: esr
  Cc: Michael Haggerty, Yann Dirson, Heiko Voigt, Antoine Pelisse,
	Bart Massey, Keith Packard, David Mansfield, git


On 03.01.2013, at 21:53, Eric S. Raymond wrote:

> Michael Haggerty <mhagger@alum.mit.edu>:
>> There are two good reasons that the output is written to two separate files:
> 
> Those are good reasons to write to a pair of tempfiles, and I was able
> to deduce in advance most of what your explanation would be from the
> bare fact that you did it that way.
> 
> They are *not* good reasons for having an interface that exposes this
> implementation detail to the caller - that choice I consider a failure
> of interface-design judgment.  But I know how to fix this in a simple and
> backward-compatible way, and will do so when I have time to write you
> a patch.  Next week or the week after, most likely.
> 
> Also, the cvs2git manual page is still rather half-baked and careless,
> with several fossil references to cvs2svn that shouldn't be there and
> obviously incomplete feature coverage. Fixing these bugs is also on my
> to-do list for sometime this month.
> 
> I'd be willing to put in this work anyway, but it is still in the back of
> my mind that if cvs2git wins the test-suite competition I might
> officially end-of-life both cvsps and parsecvs.  One of the features
> of the new git-cvsimport is direct support for using cvs2git as a
> conversion engine.
> 
>> A potentially bigger problem is that if you want to handle such
>> blob/dump output, you have to deal with git-fast-import format's "blob"
>> command as opposed to only handling inline blobs.
> 
> Not a problem.  All of the main potential consumers for this output,
> including reposurgeon, handle the blob command just fine.

Hm, you snipped this part of Michael's mail:

>> However, if that is a
>> problem, it is possible to configure cvs2git to write the blobs inline
>> with the rest of the dumpfile (this mode is supported because "hg
>> fast-import" doesn't support detached blobs).

I would call "hg fast-import" a main potential customer, given that "cvs2hg" is another part of the cvs2svn suite. So I can't quite see how you can come to your conclusion above...



Cheers,
Max


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2013-01-05  8:27     ` Max Horn
@ 2013-01-05 15:11       ` Eric S. Raymond
  2013-01-05 22:57         ` Jonathan Nieder
  2013-01-06 11:15         ` Michael Haggerty
       [not found]       ` <CAA6gtpky9JxFDdpLM6kY9su-9FWX8RoWHU4uptd_Zk+ZJuhrtA@mail.gmail.com>
  1 sibling, 2 replies; 11+ messages in thread
From: Eric S. Raymond @ 2013-01-05 15:11 UTC (permalink / raw)
  To: Max Horn
  Cc: Michael Haggerty, Yann Dirson, Heiko Voigt, Antoine Pelisse,
	Bart Massey, Keith Packard, David Mansfield, git

Max Horn <postbox@quendi.de>:
> Hm, you snipped this part of Michael's mail:
> 
> >> However, if that is a
> >> problem, it is possible to configure cvs2git to write the blobs inline
> >> with the rest of the dumpfile (this mode is supported because "hg
> >> fast-import" doesn't support detached blobs).
> 
> I would call "hg fast-import" a main potential customer, given that "cvs2hg" is another part of the cvs2svn suite. So I can't quite see how you can come to your conclusion above...

Perhaps I was unclear.  I consider the interface design error to
be not in the fact that all the blobs are written first or detached,
but rather that the implementation detail of the two separate journal
files is ever exposed.

I understand why the storage of intermediate results was done this
way, in order to decrease the tool's working set during the run, but
finishing by automatically concatenating the results and streaming
them to stdout would surely have been the right thing here.
 
The downstream cost of letting the journalling implementation be
exposed, instead, can be seen in this snippet from the new git-cvsimport
I've been working on:

    def command(self):
        "Emit the command implied by all previous options."
        return "(cvs2git --username=git-cvsimport --quiet --quiet --blobfile={0} --dumpfile={1} {2} {3} && cat {0} {1} && rm {0} {1})".format(tempfile.mkstemp()[1], tempfile.mkstemp()[1], self.opts, self.modulepath)

According to the documentation, every caller of cvs2git must go
through analogous contortions!  This is not the Unix way; if Unix
design principles had been minimally applied, that second line would
just read like this:

     return "cvs2git --username=git-cvsimport --quiet --quiet"

If Unix design principles had been thoroughly applied, the "--quiet
--quiet" part would be unnecessary too - well-behaved Unix commands
*default* to being completely quiet unless either (a) they have an
exceptional condition to report, or (b) their expected running time is
so long that tasteful silence would leave users in doubt that they're
working.

(And yes, I do think violating these principles is a lapse of taste when
git tools do it, too.)
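
For comparison, the concatenate-and-stream step argued for above can be sketched as a small wrapper around the two-file interface. This is an illustration, not a patch to cvs2git or git-cvsimport: the function names are hypothetical, only the cvs2git options already quoted in the snippet are used, and the `run` parameter exists solely so the sketch can be exercised without cvs2git installed.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def cvs2git_argv(blobfile, dumpfile, modulepath, extra_opts=()):
    """Build the command line from the options quoted in the snippet."""
    return [
        "cvs2git", "--username=git-cvsimport", "--quiet", "--quiet",
        "--blobfile=" + blobfile, "--dumpfile=" + dumpfile,
        *extra_opts, modulepath,
    ]


def stream_files(paths, out):
    """Copy each file to `out` in order, as `cat` would."""
    for path in paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)


def export(modulepath, out=None, run=subprocess.run):
    """Run the conversion and emit a single stream on `out`."""
    if out is None:
        out = sys.stdout.buffer
    # The temporary directory, and both files in it, are removed on
    # exit, so no separate `rm` step is needed.
    with tempfile.TemporaryDirectory() as tmp:
        blobfile = os.path.join(tmp, "blobfile")
        dumpfile = os.path.join(tmp, "dumpfile")
        run(cvs2git_argv(blobfile, dumpfile, modulepath), check=True)
        stream_files([blobfile, dumpfile], out)
```

Passing an argument list instead of a shell string sidesteps quoting problems with unusual module paths, and using a temporary directory avoids leaving the descriptors returned by `tempfile.mkstemp()` open.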

Michael Haggerty wants me to trust that cvs2git's analysis stage has
been fixed, but I must say that is a more difficult leap of faith when
two of the most visible things about it are still (a) a conspicuous
instance of interface misdesign, and (b) documentation that is careless and
incomplete.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
       [not found]       ` <CAA6gtpky9JxFDdpLM6kY9su-9FWX8RoWHU4uptd_Zk+ZJuhrtA@mail.gmail.com>
@ 2013-01-05 15:58         ` Eric S. Raymond
  0 siblings, 0 replies; 11+ messages in thread
From: Eric S. Raymond @ 2013-01-05 15:58 UTC (permalink / raw)
  To: Bart Massey
  Cc: Max Horn, Michael Haggerty, Yann Dirson, Heiko Voigt,
	Antoine Pelisse, Keith Packard, David Mansfield, git

Bart Massey <bart@cs.pdx.edu>:
> I don't know what Eric Raymond "officially end-of-life"-ing parsecvs means?

You and Keith handed me the maintainer's baton.  If I were to EOL it,
that would be the successor you two designated judging in public that
the code is unsalvageable or has become pointless.  If you wanted to
exclude the possibility that a successor would make that call, you
shouldn't have handed it off in a state so broken that I can't even
test it properly.

But I don't in fact think the parsecvs code is pointless. The fact that it
only needs the ,v files is nifty and means it could be used as an RCS
exporter too.  The parsing and topo-analysis stages look like really
good work, very crisp and elegant (which is no less than I'd expect
from Keith, actually).

Alas, after wrestling with it I'm beginning to wonder whether the
codebase is salvageable by anyone but Keith himself.  The tight coupling
to the git cache mechanism is the biggest problem.  So far, I can't
figure out what tree.c is actually doing in enough detail to fix it or pry
it loose - the code is opaque and internal documentation is lacking.

More generally, interfacing to the unstable API of libgit was clearly
a serious mistake, leading directly to the current brokenness.  The
tool should have emitted an import stream to begin with.  I'm trying
to fix that, but success is looking doubtful.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2013-01-05 15:11       ` Eric S. Raymond
@ 2013-01-05 22:57         ` Jonathan Nieder
  2013-01-06 11:15         ` Michael Haggerty
  1 sibling, 0 replies; 11+ messages in thread
From: Jonathan Nieder @ 2013-01-05 22:57 UTC (permalink / raw)
  To: Eric S. Raymond
  Cc: Max Horn, Michael Haggerty, Yann Dirson, Heiko Voigt,
	Antoine Pelisse, Bart Massey, Keith Packard, David Mansfield, git

Eric S. Raymond wrote:

> Michael Haggerty wants me to trust that cvs2git's analysis stage has
> been fixed, but I must say that is a more difficult leap of faith when
> two of the most visible things about it are still (a) a conspicuous
> instance of interface misdesign, and (b) documentation that is careless and
> incomplete.

For what it's worth, I use cvs2git quite often.  I've found it to work
well and its code to be clear and its developers responsive.  But I
don't mind if we disagree, and having multiple implementations explore
the design space of importers doesn't seem like a terrible outcome.

Thanks for your work,
Jonathan


* Re: cvsps, parsecvs, svn2git and the CVS exporter mess
  2013-01-05 15:11       ` Eric S. Raymond
  2013-01-05 22:57         ` Jonathan Nieder
@ 2013-01-06 11:15         ` Michael Haggerty
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Haggerty @ 2013-01-06 11:15 UTC (permalink / raw)
  To: esr
  Cc: Max Horn, Yann Dirson, Heiko Voigt, Antoine Pelisse, Bart Massey,
	Keith Packard, David Mansfield, git

On 01/05/2013 04:11 PM, Eric S. Raymond wrote:
> Perhaps I was unclear.  I consider the interface design error to
> be not in the fact that all the blobs are written first or detached,
> but rather that the implementation detail of the two separate journal
> files is ever exposed.
> 
> I understand why the storage of intermediate results was done this
> way, in order to decrease the tool's working set during the run, but
> finishing by automatically concatenating the results and streaming
> them to stdout would surely have been the right thing here.

cvs2svn/cvs2git is built to be able to handle very large CVS
repositories, not only those that can fit in RAM.  This goal influences
a lot of its design, including the pass-by-pass structure with
intermediate databases and the resumability of passes.

The blobfile necessarily contains every version of every file, with no
delta-encoding and no compression.  Its size can be a large multiple of
the on-disk size of the original CVS repository.  If the "save to
tempfiles then cat tempfiles at end of run" behavior were hard-coded
into cvs2git, then there would be no way to avoid requiring enough
temporary space to hold the whole blobfile.

Writing the blobfile into a separate file, on the other hand, means that
for example the blobfile could be written into a named pipe connected to
the standard input of "git fast-import" [1].  "git fast-import" could
even be run on a remote server.

I consider these bigger advantages than the ability to pipe the output
of cvs2git directly into another command.
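
The named-pipe arrangement described above can be sketched roughly as
follows.  This shows only the plumbing, under stated assumptions:
stream_through_fifo and its parameters are hypothetical names, and, as
footnote [1] notes, the current generate_blobs.py sometimes seeks back
in the blob file, so this would not work against cvs2git unchanged.

```python
import os
import subprocess
import tempfile
import threading


def stream_through_fifo(produce, consumer_argv):
    """Feed a consumer's stdin from a FIFO that `produce(path)` writes to.

    Nothing is stored on disk: data flows from writer to reader through
    the kernel pipe buffer.  Returns the consumer's captured stdout.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo = os.path.join(tmpdir, "blobfile.fifo")
        os.mkfifo(fifo)  # POSIX only
        # The writer runs in a thread because opening a FIFO blocks
        # until the other end has been opened as well.
        writer = threading.Thread(target=produce, args=(fifo,))
        writer.start()
        with open(fifo, "rb") as reader:
            result = subprocess.run(
                consumer_argv,
                stdin=reader,            # consumer reads straight from the FIFO
                stdout=subprocess.PIPE,
                check=True,
            )
        writer.join()
        return result.stdout
```

In a real conversion, `produce` would run cvs2git with --blobfile
pointing at the FIFO and `consumer_argv` would be
["git", "fast-import"]; neither is exercised here.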

> The downstream cost of letting the journalling implementation be
> exposed, instead, can be seen in this snippet from the new git-cvsimport
> I've been working on:
> 
>     def command(self):
>         "Emit the command implied by all previous options."
>         return "(cvs2git --username=git-cvsimport --quiet --quiet --blobfile={0} --dumpfile={1} {2} {3} && cat {0} {1} && rm {0} {1})".format(tempfile.mkstemp()[1], tempfile.mkstemp()[1], self.opts, self.modulepath)
> 
> According to the documentation, every caller of cvs2git must go
> through analogous contortions!  This is not the Unix way; if Unix
> design principles had been minimally applied, that second line would
> just read like this:
> 
>      return "cvs2git --username=git-cvsimport --quiet --quiet"

Never in my worst nightmares did I imagine that my terrible design taste
would force you to type an extra two lines of code.  Oh the humanity!

By the way, patches are welcome.  And you don't need to trumpet their
imminent arrival [2] or malign the existing code beforehand.  Moreover,
it would be adequate if you just demonstrate working code and *then* ask
for "sign-in", rather than the other way around.

> If Unix design principles had been thoroughly applied, the "--quiet
> --quiet" part would be unnecessary too - well-behaved Unix commands
> *default* to being completely quiet unless either (a) they have an
> exceptional condition to report, or (b) their expected running time is
> so long that tasteful silence would leave users in doubt that they're
> working.

cvs2git is not a command that one uses 100 times a day.  It is a tool
for one-shot conversions of CVS repositories to git.  These conversions
can take hours or even days of processing time (not to mention the time
for configuring the conversion and changing the rest of a project's
infrastructure from CVS to git).  So yes, I think we would like to
appeal to (b) and humbly ask for your permission to give the user some
feedback during the conversion.

> (And yes, I do think violating these principles is a lapse of taste when
> git tools do it, too.)
> 
> Michael Haggerty wants me to trust that cvs2git's analysis stage has
> been fixed, but I must say that is a more difficult leap of faith when
> two of the most visible things about it are still (a) a conspicuous
> instance of interface misdesign, and (b) documentation that is careless and
> incomplete.

The cvs2git documentation is lacking; I admit it (as opposed to the
cvs2svn documentation, which I think is quite complete).  And the
program itself also has a lot of rough edges, for example its inability
to convert .cvsignore files into .gitignore files.  Patches are welcome.
I haven't used cvs2svn for my own purposes in many years and I've
*never* once had a need to use cvs2git; I maintain these programs purely
as a service to the community.  Most of the community seems satisfied
with the programs as they are, and if not they usually submit courteous
and concrete bug reports or submit patches.

I request that you follow their example.  I especially ask that you
refrain from spreading public FUD about imagined problems based on
speculation.  Please do your tests and *then* report any problems that
you find.

Yours,
Michael

[1] In fact, the current implementation of generate_blobs.py sometimes
seeks back to earlier parts of the blob file when it needs the fulltext
of a revision that has already been output, but this would be easy to
change as soon as somebody needs it.

[2] http://comments.gmane.org/gmane.comp.version-control.git/212340

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


Thread overview: 11+ messages
2012-12-22 17:36 cvsps, parsecvs, svn2git and the CVS exporter mess Eric S. Raymond
2012-12-23 20:21 ` Heiko Voigt
2012-12-23 22:45   ` Eric S. Raymond
2013-01-03 15:37 ` Michael Haggerty
2013-01-03 20:53   ` Eric S. Raymond
2013-01-05  8:27     ` Max Horn
2013-01-05 15:11       ` Eric S. Raymond
2013-01-05 22:57         ` Jonathan Nieder
2013-01-06 11:15         ` Michael Haggerty
     [not found]       ` <CAA6gtpky9JxFDdpLM6kY9su-9FWX8RoWHU4uptd_Zk+ZJuhrtA@mail.gmail.com>
2013-01-05 15:58         ` Eric S. Raymond
2013-01-03 15:51 ` Martin Langhoff
