* Questions about git-fast-import for cvs2svn
From: Michael Haggerty @ 2007-07-15 14:11 UTC
To: Shawn O. Pearce; +Cc: git
I've been reading the documentation for git-fast-import (thanks for the
fine documentation!) as part of determining how much work it would be to
add a git back end to cvs2svn, and I have a few questions.
1. Is it a problem to create blobs that are never referenced? The
easiest point to create blobs is when the RCS files are originally
parsed, but later we discard some CVS revisions, meaning that the
corresponding blobs would never be needed. Would this be a problem?
2. It appears that author/committer require an email address. How
important is a valid email address here?
a. CVS commits include a username but not an email address. If an
email address is really required, then I suppose the person doing the
conversion would have to supply a lookup table mapping username -> email
address.
b. CVS tag/branch creation events do not even include a username.
Any suggestions for what to use here?
3. I expect we should set 'committer' to the value determined from CVS
and leave 'author' unused. But I suppose another possibility would be
to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS
author. Which one makes sense?
4. It appears that a commit can only have a single 'from', which I
suppose means that files can only be added to one branch from a single
source branch/revision in a single commit. But CVS branches and tags
can include files from multiple source branches and/or revisions. What
would be the most git-like way to handle this situation? Should the
branch be created in one commit, then have files from other sources
added to it in other commits? Or should (is this even possible?) all
files be added to the branch in a single commit, using multiple "merge"
sources?
5. Is there any significance at all to the order that commits are output
to git-fast-import? Obviously, blobs have to be defined before they are
used, and '<committish>'s have to be defined before they are referenced.
But is there any other significance to the order of commits?
All in all, I don't think that a git back end for cvs2svn would be very
tricky at all. There will be a bit of refactoring work to allow the user
to switch between SVN/git output at runtime, but so far I don't see any
reason that the fundamental algorithms of cvs2svn will have to be changed.
Thanks,
Michael
* Re: Questions about git-fast-import for cvs2svn
From: Sean @ 2007-07-15 16:01 UTC
To: Michael Haggerty; +Cc: Shawn O. Pearce, git
On Sun, 15 Jul 2007 16:11:41 +0200
Michael Haggerty <mhagger@alum.mit.edu> wrote:
Hi Michael,
Will take a stab at answering your questions...
> 1. Is it a problem to create blobs that are never referenced? The
> easiest point to create blobs is when the RCS files are originally
> parsed, but later we discard some CVS revisions, meaning that the
> corresponding blobs would never be needed. Would this be a problem?
Not a problem. Running "git gc" later will cleanup any unused objects.
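A minimal sketch of writing blobs up front, in the order the RCS files are parsed. It assumes the `blob` / `mark` / `data` command layout described in the git-fast-import documentation; the helper name `write_blob` and the sample contents are hypothetical. A blob whose mark is never used by a later commit is simply left unreferenced.

```python
import io


def write_blob(out, mark, content):
    """Write one fast-import 'blob' command for `content` (bytes) under `mark`."""
    out.write(b"blob\n")
    out.write(b"mark :%d\n" % mark)
    # 'data' takes an exact byte count, followed by the raw bytes.
    out.write(b"data %d\n" % len(content))
    out.write(content)
    out.write(b"\n")


stream = io.BytesIO()
write_blob(stream, 1, b"hello\n")
# This blob may never be referenced by any commit; that is harmless.
write_blob(stream, 2, b"never referenced\n")
```

In a real conversion `out` would be the standard input of a `git fast-import` process rather than an in-memory buffer.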
> 2. It appears that author/committer require an email address. How
> important is a valid email address here?
It's not necessary for the operation of Git itself; it's up to you to
decide how important the information is to your project. You should
be able to set an empty email address for author or committer in
git fast-import as "name <>".
> a. CVS commits include a username but not an email address. If an
> email address is really required, then I suppose the person doing the
> conversion would have to supply a lookup table mapping username -> email
> address.
Yes, take a look at the format supported by git-cvsimport and git-svnimport,
which can map each username into an appropriate name and email addy for Git.
> b. CVS tag/branch creation events do not even include a username.
> Any suggestions for what to use here?
Perhaps just use your own username or one specifically created to
run the conversion process.
> 3. I expect we should set 'committer' to the value determined from CVS
> and leave 'author' unused. But I suppose another possibility would be
> to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS
> author. Which one makes sense?
Another option is to just allow Git to set author and committer to the
same value. As noted in the man page: "If author is omitted then
fast-import will automatically use the committer's information for
the author portion of the commit".
> 4. It appears that a commit can only have a single 'from', which I
> suppose means that files can only be added to one branch from a single
> source branch/revision in a single commit. But CVS branches and tags
> can include files from multiple source branches and/or revisions. What
> would be the most git-like way to handle this situation? Should the
> branch be created in one commit, then have files from other sources
> added to it in other commits? Or should (is this even possible?) all
> files be added to the branch in a single commit, using multiple "merge"
> sources?
Git supports the ability to merge from multiple branches at once (known
as an octopus merge). So it's possible to start a new branch, drawing
in files from more than one source branch in a single commit. As i
understand it, fast-import allows only a single "from" line for a commit,
but allows multiple "merge" lines for additional parentage info.
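As a rough illustration of the single "from" plus multiple "merge" layout, here is a sketch that formats one commit command with several parents. The function name, the ref name, the marks and the timestamp are all made up for the example; paths are assumed to need no quoting.

```python
def format_multi_parent_commit(ref, ident, when, tz, message, parents, changes):
    """Return the bytes of one fast-import 'commit' command with 1+ parents."""
    if not parents:
        raise ValueError("need at least one parent")
    msg = message.encode("utf-8")
    out = [
        ("commit %s\n" % ref).encode("utf-8"),
        ("committer %s %d %s\n" % (ident, when, tz)).encode("utf-8"),
        b"data %d\n" % len(msg),
        msg,
        b"\n",
        # The first parent goes on the single 'from' line ...
        ("from %s\n" % parents[0]).encode("utf-8"),
    ]
    # ... and every further parent gets its own 'merge' line.
    for parent in parents[1:]:
        out.append(("merge %s\n" % parent).encode("utf-8"))
    # File modifications: (mode, blob mark, path).
    for mode, blob_mark, path in changes:
        out.append(("M %s :%d %s\n" % (mode, blob_mark, path)).encode("utf-8"))
    out.append(b"\n")
    return b"".join(out)


cmd = format_multi_parent_commit(
    "refs/heads/release-1", "cvs2svn <cvs2svn>", 1184508701, "+0000",
    "Create branch release-1\n", [":10", ":20", ":30"],
    [("100644", 1, "README")],
)
```

Whether a CVS branch that draws files from several sources should really be recorded as a merge is a policy question for the converter; the sketch only shows that the stream format can express it.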
> 5. Is there any significance at all to the order that commits are output
> to git-fast-import? Obviously, blobs have to be defined before they are
> used, and '<committish>'s have to be defined before they are referenced.
> But is there any other significance to the order of commits?
Don't think so, except perhaps for the packfile optimization issues
mentioned in the man page.
HTH,
Sean
* Re: Questions about git-fast-import for cvs2svn
From: Linus Torvalds @ 2007-07-15 18:43 UTC
To: Michael Haggerty; +Cc: Shawn O. Pearce, git
On Sun, 15 Jul 2007, Michael Haggerty wrote:
>
> 1. Is it a problem to create blobs that are never referenced? The
> easiest point to create blobs is when the RCS files are originally
> parsed, but later we discard some CVS revisions, meaning that the
> corresponding blobs would never be needed. Would this be a problem?
No, don't worry about it. The resulting intermediate pack-file may be
unnecessarily big, but you'd want to do a "git gc" to re-pack everything
afterwards *anyway*, since the pack-files git-fast-import generates are
generally not all that optimal, and that will also prune any unreferenced
blobs.
> 2. It appears that author/committer require an email address. How
> important is a valid email address here?
Git itself doesn't really care, and many CVS conversions have just
converted the username into "user <user>", but from a QoI standpoint it's
much nicer if you at least were to allow the kind of conversion that
allows user-name to be associated with an email.
Maybe git-fast-import could be taught to do the kind of user name
conversion that we already do for CVS imports.. Shawn?
> a. CVS commits include a username but not an email address. If an
> email address is really required, then I suppose the person doing the
> conversion would have to supply a lookup table mapping username -> email
> address.
That would be optimal. Note that it's not just user names: it's much nicer
if you can regenerate a readable full name too, so instead of having
something like "torvalds <torvalds>", you could map "torvalds" into "Linus
Torvalds <torvalds@linux-foundation.org>", which is a lot more readable.
But as far as git is concerned, this is all about being _pretty_, it
doesn't really have any semantic meaning!
Anyway, git-cvsimport knows about a magic file ("CVSROOT/users") that can
map user names into full names and emails. Having something equivalent
for a SVN import would be nice (git-svnimport does the same thing, and
uses ".git/svn-authors" as the default source of author name conversion
data).
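A small sketch of such a lookup table on the converter side. It assumes a file of lines shaped like `login = Full Name <email>` (the general shape of the author files used by the existing importers; treat the exact syntax here as an assumption), and it falls back to "user <user>" for unknown logins. Both function names are hypothetical.

```python
def parse_authors(lines):
    """Build a {login: 'Full Name <email>'} dict from 'login = ident' lines."""
    authors = {}
    for line in lines:
        user, sep, ident = line.strip().partition("=")
        if not sep:
            # Skip blank or malformed lines rather than guessing.
            continue
        authors[user.strip()] = ident.strip()
    return authors


def ident_for(user, authors):
    """Return the mapped identity, or 'user <user>' when no mapping exists."""
    return authors.get(user, "%s <%s>" % (user, user))


authors = parse_authors([
    "torvalds = Linus Torvalds <torvalds@linux-foundation.org>",
    "",
])
```

With this table, `ident_for("torvalds", authors)` gives the readable form, while an unmapped login such as `mhagger` comes out as `mhagger <mhagger>`.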
> b. CVS tag/branch creation events do not even include a username.
> Any suggestions for what to use here?
Git tags and branch creation doesn't do that either (unless you use signed
tags): only when you create the first commit on a branch does the user
matter.
But if there really is data that doesn't have any user information at all
(for real *changes*), then I'd just make one up. Again, the user
information really doesn't have any *semantics* in git, it's just meant to
be informational for showing the logs. It's nothing more than a structured
part of the commit (or tag) message.
> 3. I expect we should set 'committer' to the value determined from CVS
> and leave 'author' unused. But I suppose another possibility would be
> to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS
> author. Which one makes sense?
Just make them be the same. Git-fast-import will default to that, if you
only give a committer date/name.
That's what git itself does if you just do a "git commit": the committer
will be the same as the author.
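For illustration, a sketch of a commit command that carries only a committer line, leaving the author to default as described. The helper name, the ref, the mark numbers and the timestamp are hypothetical, and paths are assumed not to need quoting.

```python
def format_commit(ref, mark, ident, when, tz, message, changes):
    """Format a fast-import 'commit' with a committer line and no author line."""
    msg = message.encode("utf-8")
    header = "\n".join([
        "commit %s" % ref,
        "mark :%d" % mark,
        # Only 'committer' is given; fast-import reuses it for the author.
        "committer %s %d %s" % (ident, when, tz),
        "data %d" % len(msg),
    ]) + "\n"
    out = header.encode("utf-8") + msg + b"\n"
    # Each change is (mode, blob mark, path).
    for mode, blob_mark, path in changes:
        out += ("M %s :%d %s\n" % (mode, blob_mark, path)).encode("utf-8")
    return out + b"\n"


commit = format_commit(
    "refs/heads/master", 2, "mhagger <mhagger>", 1184508701, "+0200",
    "Initial import\n", [("100644", 1, "README")],
)
```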
> 4. It appears that a commit can only have a single 'from'
No, commits can have an arbitrary number of parents, and if you create a
tag where the data comes from several sources, you could literally do that
as a really strange merge, and that would probably be the most "correct"
thing to do, even if it might end up looking *really* odd.
[ To be strictly technically correct, I have to admit that I think we
limit the number of parents to 16, but that's not a fundamental limit,
that's just because nobody has ever been so crazy as to need more than
that.
However, there is no "data structure limit" in that number, it's just an
arbitrary "you'd be crazy to generate a merge of that many parents" kind
of thing, and we could lift the limit if you actually think it's worth
it.
I think the most we have ever seen in practice is a merge of 12 parents,
and the people who did that were told to please not do it again, because
it really does make the graph look extremely "cool". ]
> What would be the most git-like way to handle this situation? Should
> the branch be created in one commit, then have files from other sources
> added to it in other commits? Or should (is this even possible?) all
> files be added to the branch in a single commit, using multiple "merge"
> sources?
Using multiple parents and just generating a single commit (it will be
called a "merge", but really, in git terms a commit is just a commit, and
the difference in number of parents is really not a _technical_
difference, it's just a difference for how these things get visualized).
It would be extremely interesting to see how this works in practice, but I
_think_ it would work really well. The possible downsides might be:
- it *may* just end up looking so confusing that people would prefer some
alternate model.
- we might have some performance issues with lots and lots of parents,
and maybe we'd need to fix something. In particular, I can well imagine
that showing the diff for the end result would be "interesting" (read:
"totally useless")
> 5. Is there any significance at all to the order that commits are output
> to git-fast-import? Obviously, blobs have to be defined before they are
> used, and '<committish>'s have to be defined before they are referenced.
> But is there any other significance to the order of commits?
Not afaik. Git internally very fundamentally simply doesn't care (there
simply _is_ no object ordering, there is just objects that point to other
objects), and I don't think git-fast-import could possibly care either.
You do need to be "topologically" sorted (since you cannot even point to
commits without having their SHA1's), but that should be it.
Linus
* Re: Questions about git-fast-import for cvs2svn
From: Steffen Prohaska @ 2007-07-15 18:51 UTC
To: Michael Haggerty, Sean; +Cc: Shawn Pearce, Git Mailing List
On Jul 15, 2007, at 6:01 PM, Sean wrote:
> On Sun, 15 Jul 2007 16:11:41 +0200
> Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
> [...]
>
>> 3. I expect we should set 'committer' to the value determined from CVS
>> and leave 'author' unused. But I suppose another possibility would be
>> to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS
>> author. Which one makes sense?
>
> Another option is to just allow Git to set author and committer to the
> same value. As noted in the man page: "If author is omitted then
> fast-import will automatically use the committer's information for
> the author portion of the commit".
I expect that committer and author would both be set to the value
determined from CVS. CVS doesn't differentiate and I think the
most reasonable assumption in many CVS settings is that the one
who committed a change is the original author.
>> 4. It appears that a commit can only have a single 'from', which I
>> suppose means that files can only be added to one branch from a single
>> source branch/revision in a single commit. But CVS branches and tags
>> can include files from multiple source branches and/or revisions. What
>> would be the most git-like way to handle this situation? Should the
>> branch be created in one commit, then have files from other sources
>> added to it in other commits? Or should (is this even possible?) all
>> files be added to the branch in a single commit, using multiple "merge"
>> sources?
This is really a hard question, which I feel unable to answer.
My feeling is that you would not be able to construct a git
history where branches would need multiple 'froms'. git always
tracks the complete state of all files in the project. So
you can only branch all files at once or no file at all.
It's really hard to say how the situation you described can
be handled.
However, I have a related comment.
Well maintained CVS branches shouldn't suffer from this problem. In
our repository we typically set a tag topic-split on the CVS trunk
and create the branch topic-branch from this tag. Note, some time may
pass before we commit the first change to topic-branch. I'd expect that
a CVS to git importer should handle this situation perfectly. I'd expect
that the git tag topic-split would be set to the last commit common
to the git branch representing the CVS trunk and the git branch
representing the CVS topic-branch. git-cvsimport fails to do so if
the timing of the first commit to the CVS topic-branch is wrong.
To be honest, we have messy branches as well that start off in an
uncontrolled way. But I'd care less about them than about the well
maintained branches.
Michael,
what do you think? Would cvs2svn perfectly handle the well-formed
CVS branches I described?
I already would be very happy if the well-formed branches can be
imported to git and any malformed branch would be reported.
Maybe a second step could be to import malformed branches nonetheless,
perhaps in a non-standard way and give a hint what the difficulty was.
A human may have a chance to fix it using git tools, such as
git-filter-branch or similar.
> Git supports the ability to merge from multiple branches at once (known
> as an octopus merge). So it's possible to start a new branch, drawing
> in files from more than one source branch in a single commit. As I
> understand it, fast-import allows only a single "from" line for a commit,
> but allows multiple "merge" lines for additional parentage info.
>
> [...]
I'm not sure if merges help to solve the situation described by Michael.
From my understanding the situation is more like starting a branch and
later 'cherry-picking' commits from various other branches at different
times. Michael describes a situation where a branch would need to start
from multiple commits. I think this is different from merging.
I propose not to create any merge commits during import from CVS to git.
CVS doesn't track merges and therefore I'd expect that the history created
in git during import should form a tree (without merges). If you have
a custom way to detect merges for a specific CVS repository (e.g. by parsing
CVS commit messages) you can use a grafts file to add them to git later.
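A tiny sketch of producing grafts entries after the import, assuming the usual layout of one line per commit: the commit ID followed by the IDs of all parents it should appear to have, separated by spaces. The helper name is hypothetical and the 40-character IDs below are placeholders, not real objects.

```python
def graft_line(commit, parents):
    """Return one grafts-file line: the commit ID, then its desired parents."""
    return " ".join([commit] + list(parents)) + "\n"


# Placeholder object IDs, only to show the shape of an entry.
merge_commit = "a" * 40
first_parent = "b" * 40
detected_merge_source = "c" * 40

line = graft_line(merge_commit, [first_parent, detected_merge_source])
```

The lines would go into the repository's grafts file; the original first parent has to be listed too, since the entry replaces the whole parent list.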
Steffen
* Re: Questions about git-fast-import for cvs2svn
From: Junio C Hamano @ 2007-07-15 18:55 UTC
To: Sean; +Cc: Michael Haggerty, Shawn O. Pearce, git
Sean <seanlkml@sympatico.ca> writes:
> Will take a stab at answering your questions...
>
>> 1. Is it a problem to create blobs that are never referenced? The
>> easiest point to create blobs is when the RCS files are originally
>> parsed, but later we discard some CVS revisions, meaning that the
>> corresponding blobs would never be needed. Would this be a problem?
>
> Not a problem. Running "git gc" later will cleanup any unused objects.
>
>> 2. It appears that author/committer require an email address. How
>> important is a valid email address here?
>
> It's not necessary for the operation of Git itself; it's up to you to
> decide how important the information is to your project. You should
> be able to set an empty email address for author or committer in
> git fast-import as "name <>".
Don't do this; git-cvsimport and git-svn use "name <name>"
which is a saner compromise. This way, you can add .mailmap to
help later "git shortlog" to map using "<name>" part to more
human friendly name. Mapping at conversion time would also be
good and git-cvsimport knows about it (I do not know about
git-svn).
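To make the .mailmap idea concrete, a sketch that turns a table of CVS logins into mailmap lines of the simple form `Proper Name <commit-email>`, where the "email" is the bare login written by a "name <name>" conversion. The function name and the sample logins are hypothetical.

```python
def mailmap_lines(full_names):
    """Return sorted .mailmap lines mapping '<login>' to a readable name."""
    return ["%s <%s>" % (full_names[user], user) for user in sorted(full_names)]


lines = mailmap_lines({
    "torvalds": "Linus Torvalds",
    "junio": "Junio C Hamano",
})
```

Writing these lines to a `.mailmap` file at the top of the converted repository is what would let "git shortlog" show the friendlier names without redoing the conversion.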
>> b. CVS tag/branch creation events do not even include a username.
>> Any suggestions for what to use here?
>
> Perhaps just use your own username or one specifically created to
> run the conversion process.
I'd suggest to take the person and time information from the
commit that is tagged; that way you can keep the conversion
stable (iow, two conversion runs using the same input data would
produce identical result).
In git we do not record "branch creation event". Also you can
use lightweight tags, which do not have their own data -- which
means you do not have to come up with "the person who made the
tag".
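Both options can be sketched as stream fragments, assuming the `reset` and `tag` commands from the git-fast-import documentation. A lightweight tag is just a ref pointed at a commit; the annotated variant copies the person and time from the tagged commit so that repeated runs stay identical. Function names, marks and the timestamp are hypothetical.

```python
def format_lightweight_tag(name, commit_mark):
    """Point refs/tags/<name> at a commit mark; no tagger is needed."""
    return ("reset refs/tags/%s\nfrom :%d\n\n" % (name, commit_mark)).encode("utf-8")


def format_annotated_tag(name, commit_mark, ident, when, tz, message):
    """Annotated tag whose tagger is taken from the tagged commit's data."""
    msg = message.encode("utf-8")
    header = "tag %s\nfrom :%d\ntagger %s %d %s\ndata %d\n" % (
        name, commit_mark, ident, when, tz, len(msg))
    return header.encode("utf-8") + msg + b"\n"


lightweight = format_lightweight_tag("v1.0", 5)
# ident/when/tz are whatever was recorded for the commit at mark :5.
annotated = format_annotated_tag(
    "v1.0", 5, "mhagger <mhagger>", 1184508701, "+0000", "Tag v1.0\n")
```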
>> 3. I expect we should set 'committer' to the value determined from CVS
>> and leave 'author' unused. But I suppose another possibility would be
>> to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS
>> author. Which one makes sense?
I would set both to "name <name>" from CVS information.
* Re: Questions about git-fast-import for cvs2svn
From: Steffen Prohaska @ 2007-07-15 18:58 UTC
To: Michael Haggerty; +Cc: Sean, Shawn Pearce, Git Mailing List
On Jul 15, 2007, at 8:51 PM, Steffen Prohaska wrote:
> I'm not sure if merges help to solve the situation described by Michael.
> From my understanding the situation is more like starting a branch and
> later 'cherry-picking' commits from various other branches at different
> times. Michael describes a situation where a branch would need to start
> from multiple commits. I think this is different from merging.
[ Hmm.. should have checked my email another time
to avoid the race condition with Linus ...]
Steffen
* Re: Questions about git-fast-import for cvs2svn
From: Robin Rosenberg @ 2007-07-15 21:56 UTC
To: Michael Haggerty; +Cc: Shawn O. Pearce, git
On Sunday 15 July 2007, Michael Haggerty wrote:
> b. CVS tag/branch creation events do not even include a username.
> Any suggestions for what to use here?
The CVSROOT/history file contains the user name and timestamp of the tag
creation. CVS can be told not to update the file. It is appended to by default.
-- robin
* Re: Questions about git-fast-import for cvs2svn
From: Robin H. Johnson @ 2007-07-15 23:21 UTC
To: Git Mailing List
On Sun, Jul 15, 2007 at 11:56:56PM +0200, Robin Rosenberg wrote:
> On Sunday 15 July 2007, Michael Haggerty wrote:
> > b. CVS tag/branch creation events do not even include a username.
> > Any suggestions for what to use here?
> The CVSROOT/history file contains the user name and timestamp of the tag
> creation. CVS can be told not to update the file. It is appended to by default.
This assumes that said file is intact, and that LogFormat was not
changed to not put entries into the file on tag/branch.
--
Robin Hugh Johnson
Gentoo Linux Developer & Council Member
E-Mail : robbat2@gentoo.org
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85
* Re: Questions about git-fast-import for cvs2svn
From: Eric Wong @ 2007-07-16 3:35 UTC
To: Junio C Hamano; +Cc: Sean, Michael Haggerty, Shawn O. Pearce, git
Junio C Hamano <gitster@pobox.com> wrote:
> Sean <seanlkml@sympatico.ca> writes:
> >> 2. It appears that author/committer require an email address. How
> >> important is a valid email address here?
> >
> > It's not necessary for the operation of Git itself; it's up to you to
> > decide how important the information is to your project. You should
> > be able to set an empty email address for author or committer in
> > git fast-import as "name <>".
>
> Don't do this; git-cvsimport and git-svn use "name <name>"
> which is a saner compromise. This way, you can add .mailmap to
> help later "git shortlog" to map using "<name>" part to more
> human friendly name. Mapping at conversion time would also be
> good and git-cvsimport knows about it (I do not know about
> git-svn).
git-svn can do this, too.
I don't use it myself, but I remember the file format is the
same as the one git-svnimport and git-cvsimport use.
--
Eric Wong
* Re: Questions about git-fast-import for cvs2svn
From: Shawn O. Pearce @ 2007-07-16 6:19 UTC
To: Linus Torvalds; +Cc: Michael Haggerty, git
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 15 Jul 2007, Michael Haggerty wrote:
> > 2. It appears that author/committer require an email address. How
> > important is a valid email address here?
>
> Git itself doesn't really care, and many CVS conversions have just
> converted the username into "user <user>", but from a QoI standpoint it's
> much nicer if you at least were to allow the kind of conversion that
> allows user-name to be associated with an email.
>
> Maybe git-fast-import could be taught to do the kind of user name
> conversion that we already do for CVS imports.. Shawn?
It could, but I'm not sure I want to implement it. ;-)
I pretty much view source->Git translation as the business/policy
of the frontend, not of fast-import. But we have three frontends
that all share the same file format (git-cvsimport, git-svnimport,
git-svn), and are all independent implementations. Maybe pushing
it down into a tool like fast-import would benefit a lot of users,
and thus should be done.
I'll put it on my todo list. Which is much longer than I have
time for.
> > 5. Is there any significance at all to the order that commits are output
> > to git-fast-import? Obviously, blobs have to be defined before they are
> > used, and '<committish>'s have to be defined before they are referenced.
> > But is there any other significance to the order of commits?
>
> Not afaik. Git internally very fundamentally simply doesn't care (there
> simply _is_ no object ordering, there is just objects that point to other
> objects), and I don't think git-fast-import could possibly care either.
> You do need to be "topologically" sorted (since you cannot even point to
> commits without having their SHA1's), but that should be it.
Linus is completely correct here. The only requirement on data
ordering is that all parent commits (from/merge lines) must
come before any child that depends on them. But that's a pretty
reasonable request, as almost all VCS systems want data to come in
at least that order, if not something even more strict.
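A front end that holds the whole commit graph in memory could get a valid emission order with a plain topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later); the function name and the commit labels are made up for the example.

```python
from graphlib import TopologicalSorter


def emission_order(parents_of):
    """Order commits so that every parent is emitted before its children.

    `parents_of` maps each commit to the list of its parents
    (its 'from' and 'merge' sources).
    """
    # TopologicalSorter treats the mapped values as predecessors, so
    # static_order() yields parents before the commits that depend on them.
    return list(TopologicalSorter(parents_of).static_order())


order = emission_order({
    "c": ["a", "b"],   # merge of a and b
    "b": ["a"],
    "a": [],           # root commit
})
```

A cycle in the input raises `graphlib.CycleError`, which for a commit graph would indicate a bug in the converter.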
In theory marks could be used to stub in commits and let you feed
them out of order, but to make that work fast-import would need to
buffer them until it saw everything it needed to produce a SHA-1.
Not exactly a good idea.
--
Shawn.