* Questions about git-fast-import for cvs2svn @ 2007-07-15 14:11 Michael Haggerty 2007-07-15 16:01 ` Sean ` (2 more replies) 0 siblings, 3 replies; 10+ messages in thread From: Michael Haggerty @ 2007-07-15 14:11 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: git I've been reading the documentation for git-fast-import (thanks for the fine documentation!) as part of determining how much work it would be to add a git back end to cvs2svn, and I have a few questions. 1. Is it a problem to create blobs that are never referenced? The easiest point to create blobs is when the RCS files are originally parsed, but later we discard some CVS revisions, meaning that the corresponding blobs would never be needed. Would this be a problem? 2. It appears that author/committer require an email address. How important is a valid email address here? a. CVS commits include a username but not an email address. If an email address is really required, then I suppose the person doing the conversion would have to supply a lookup table mapping username -> email address. b. CVS tag/branch creation events do not even include a username. Any suggestions for what to use here? 3. I expect we should set 'committer' to the value determined from CVS and leave 'author' unused. But I suppose another possibility would be to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS author. Which one makes sense? 4. It appears that a commit can only have a single 'from', which I suppose means that files can only be added to one branch from a single source branch/revision in a single commit. But CVS branches and tags can include files from multiple source branches and/or revisions. What would be the most git-like way to handle this situation? Should the branch be created in one commit, then have files from other sources added to it in other commits? Or should (is this even possible?) all files be added to the branch in a single commit, using multiple "merge" sources? 5. 
Is there any significance at all to the order that commits are output to git-fast-import? Obviously, blobs have to be defined before they are used, and '<committish>'s have to be defined before they are referenced. But is there any other significance to the order of commits? All in all, I don't think that a git back end for cvs2svn would be very tricky at all. There will be a bit of refactoring work to allow the user to switch between SVN/git output at runtime, but so far I don't see any reason that the fundamental algorithms of cvs2svn will have to be changed. Thanks, Michael
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 14:11 Questions about git-fast-import for cvs2svn Michael Haggerty @ 2007-07-15 16:01 ` Sean 2007-07-15 18:51 ` Steffen Prohaska 2007-07-15 18:55 ` Junio C Hamano 2007-07-15 18:43 ` Linus Torvalds 2007-07-15 21:56 ` Robin Rosenberg 2 siblings, 2 replies; 10+ messages in thread From: Sean @ 2007-07-15 16:01 UTC (permalink / raw) To: Michael Haggerty; +Cc: Shawn O. Pearce, git On Sun, 15 Jul 2007 16:11:41 +0200 Michael Haggerty <mhagger@alum.mit.edu> wrote: Hi Michael, Will take a stab at answering your questions... > 1. Is it a problem to create blobs that are never referenced? The > easiest point to create blobs is when the RCS files are originally > parsed, but later we discard some CVS revisions, meaning that the > corresponding blobs would never be needed. Would this be a problem? Not a problem. Running "git gc" later will cleanup any unused objects. > 2. It appears that author/committer require an email address. How > important is a valid email address here? It's not necessary for the operation of Git itself; it's up to you to decide how important the information is to your project. You should be able to set an empty email address for author or committer in git fast-import as "name <>". > a. CVS commits include a username but not an email address. If an > email address is really required, then I suppose the person doing the > conversion would have to supply a lookup table mapping username -> email > address. Yes, take a look at the format supported by git-cvsimport and git-svnimport, which can map each username into an appropriate name and email addy for Git. > b. CVS tag/branch creation events do not even include a username. > Any suggestions for what to use here? Perhaps just use your own username or one specifically created to run the conversion process. > 3. I expect we should set 'committer' to the value determined from CVS > and leave 'author' unused. 
But I suppose another possibility would be > to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS > author. Which one makes sense? Another option is to just allow Git to set author and committer to the same value. As noted in the man page: "If author is omitted then fast-import will automatically use the committer's information for the author portion of the commit". > 4. It appears that a commit can only have a single 'from', which I > suppose means that files can only be added to one branch from a single > source branch/revision in a single commit. But CVS branches and tags > can include files from multiple source branches and/or revisions. What > would be the most git-like way to handle this situation? Should the > branch be created in one commit, then have files from other sources > added to it in other commits? Or should (is this even possible?) all > files be added to the branch in a single commit, using multiple "merge" > sources? Git supports the ability to merge from multiple branches at once (known as an octopus merge). So it's possible to start a new branch, drawing in files from more than one source branch in a single commit. As i understand it, fast-import allows only a single "from" line for a commit, but allows multiple "merge" lines for additional parentage info. > 5. Is there any significance at all to the order that commits are output > to git-fast-import? Obviously, blobs have to be defined before they are > used, and '<committish>'s have to be defined before they are referenced. > But is there any other significance to the order of commits? Don't think so, except perhaps for the packfile optimization issues mentioned in the man page. HTH, Sean ^ permalink raw reply [flat|nested] 10+ messages in thread
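A minimal sketch of the single-"from", multiple-"merge" layout described above, written in Python. The helper name format_commit, the mark numbers, and the identity string are hypothetical and chosen only for illustration; the field order follows the commit command as laid out in the git-fast-import man page.

```python
def format_commit(ref, mark, committer, when, message,
                  from_mark, merge_marks=(), file_ops=()):
    """Return the bytes of one fast-import 'commit' command.

    Hypothetical helper: 'committer' is a ready-made "name <email>" string,
    'when' is "<seconds-since-epoch> <utc-offset>", and marks are integers
    that earlier commands in the same stream are assumed to have defined.
    """
    msg = message.encode("utf-8")
    out = f"commit {ref}\n".encode("utf-8")
    out += f"mark :{mark}\n".encode("utf-8")
    out += f"committer {committer} {when}\n".encode("utf-8")
    # The commit message is sent as a byte-counted data block.
    out += b"data %d\n" % len(msg) + msg + b"\n"
    # Exactly one 'from' line names the first parent ...
    out += f"from :{from_mark}\n".encode("utf-8")
    # ... and each additional parent gets its own 'merge' line.
    for m in merge_marks:
        out += f"merge :{m}\n".encode("utf-8")
    # File operations, e.g. "M 100644 :1 README", one per line.
    for op in file_ops:
        out += op.encode("utf-8") + b"\n"
    return out + b"\n"


if __name__ == "__main__":
    import sys
    sys.stdout.buffer.write(format_commit(
        "refs/heads/topic", 10, "mhagger <mhagger>", "1184508701 +0200",
        "Create branch\n", 5, merge_marks=[7, 8],
        file_ops=["M 100644 :1 README"]))
```

Since the author line is left out, fast-import would reuse the committer information for the author, as quoted from the man page above.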
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 16:01 ` Sean @ 2007-07-15 18:51 ` Steffen Prohaska 2007-07-15 18:58 ` Steffen Prohaska 2007-07-15 18:55 ` Junio C Hamano 1 sibling, 1 reply; 10+ messages in thread From: Steffen Prohaska @ 2007-07-15 18:51 UTC (permalink / raw) To: Michael Haggerty, Sean; +Cc: Shawn Pearce, Git Mailing List On Jul 15, 2007, at 6:01 PM, Sean wrote: > On Sun, 15 Jul 2007 16:11:41 +0200 > Michael Haggerty <mhagger@alum.mit.edu> wrote: > > [...] > >> 3. I expect we should set 'committer' to the value determined from >> CVS >> and leave 'author' unused. But I suppose another possibility >> would be >> to set the 'committer' to 'cvs2svn' and the 'author' to the >> original CVS >> author. Which one makes sense? > > Another option is to just allow Git to set author and committer to the > same value. As noted in the man page: "If author is omitted then > fast-import will automatically use the committer's information for > the author portion of the commit". I expect that committer and author would both be set to the value determined from CVS. CVS doesn't differentiate and I think the most reasonable assumption in many CVS settings is that the one who committed a change is the original author. >> 4. It appears that a commit can only have a single 'from', which I >> suppose means that files can only be added to one branch from a >> single >> source branch/revision in a single commit. But CVS branches and tags >> can include files from multiple source branches and/or revisions. >> What >> would be the most git-like way to handle this situation? Should the >> branch be created in one commit, then have files from other sources >> added to it in other commits? Or should (is this even possible?) all >> files be added to the branch in a single commit, using multiple >> "merge" >> sources? This is really a hard question, which I feel unable to answer. 
My feeling is that you would not be able to construct a git history where branches would need multiple 'froms'. git always tracks the complete state of all files in the project. So you can only branch all files at once or no file at all. It's really hard to say how the situation you described can be handled. However, I have a related comment. Well maintained CVS branches shouldn't suffer from this problem. In our repository we typically set a tag topic-split on the CVS trunk and create the branch topic-branch from this tag. Note, some time may pass before we commit the first change to topic-branch. I'd expect that a CVS to git importer should handle this situation perfectly. I'd expect that the git tag topic-split would be set to the last commit common to the git branch representing the CVS trunk and the git branch representing the CVS topic-branch. git-cvsimport fails to do so if the timing of the first commit to the CVS topic-branch is wrong. To be honest, we have messy branches as well that start off in an uncontrolled way. But I'd care less about them than about the well maintained branches. Michael, what do you think. Would cvs2svn perfectly handle the well-formed CVS branches I described? I already would be very happy if the well-formed branches can be imported to git and any malformed branch would be reported. Maybe a second step could be to import malformed branches nonetheless, perhaps in a non-standard way and give a hint what the difficulty was. A human may have a chance to fix it using git tools, such as git-filter-branch or similar. > Git supports the ability to merge from multiple branches at once > (known > as an octopus merge). So it's possible to start a new branch, drawing > in files from more than one source branch in a single commit. As i > understand it, fast-import allows only a single "from" line for a > commit, > but allows multiple "merge" lines for additional parentage info. > > [...] 
I'm not sure if merges help to solve the situation described by Michael. From my understanding the situation is more like starting a branch and later 'cherry-picking' commits from various other branches at different times. Michael describes a situation where a branch would need to start from multiple commits. I think this is different from merging. I propose not to create any merge commits during import from CVS to git. CVS doesn't track merges and therefore I'd expect that the history created in git during import should form a tree (without merges). If you have a custom way to detect merges for a specific CVS repository (e.g. by parsing CVS commit messages) you can use a grafts file to add them to git later. Steffen ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 18:51 ` Steffen Prohaska @ 2007-07-15 18:58 ` Steffen Prohaska 0 siblings, 0 replies; 10+ messages in thread From: Steffen Prohaska @ 2007-07-15 18:58 UTC (permalink / raw) To: Michael Haggerty; +Cc: Sean, Shawn Pearce, Git Mailing List On Jul 15, 2007, at 8:51 PM, Steffen Prohaska wrote: > I'm not sure if merges help to solve the situation described by > Michael. > From my understanding the situation is more like starting a branch and > later 'cherry-picking' commits from various other branches at > different > times. Michael describes a situation where a branch would need to > start > from multiple commits. I think this is different from merging. [ Hmm.. should have checked my email another time to avoid the race condition with Linus ...] Steffen ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 16:01 ` Sean 2007-07-15 18:51 ` Steffen Prohaska @ 2007-07-15 18:55 ` Junio C Hamano 2007-07-16 3:35 ` Eric Wong 1 sibling, 1 reply; 10+ messages in thread From: Junio C Hamano @ 2007-07-15 18:55 UTC (permalink / raw) To: Sean; +Cc: Michael Haggerty, Shawn O. Pearce, git Sean <seanlkml@sympatico.ca> writes: > Will take a stab at answering your questions... > >> 1. Is it a problem to create blobs that are never referenced? The >> easiest point to create blobs is when the RCS files are originally >> parsed, but later we discard some CVS revisions, meaning that the >> corresponding blobs would never be needed. Would this be a problem? > > Not a problem. Running "git gc" later will cleanup any unused objects. > >> 2. It appears that author/committer require an email address. How >> important is a valid email address here? > > It's not necessary for the operation of Git itself; it's up to you to > decide how important the information is to your project. You should > be able to set an empty email address for author or committer in > git fast-import as "name <>". Don't do this; git-cvsimport and git-svn use "name <name>", which is a saner compromise. This way, you can add a .mailmap later so that "git shortlog" can map the "<name>" part to a more human-friendly name. Mapping at conversion time would also be good and git-cvsimport knows about it (I do not know about git-svn). >> b. CVS tag/branch creation events do not even include a username. >> Any suggestions for what to use here? > > Perhaps just use your own username or one specifically created to > run the conversion process. I'd suggest taking the person and time information from the commit that is tagged; that way you can keep the conversion stable (iow, two conversion runs using the same input data would produce identical results). In git we do not record a "branch creation event". 
Also you can use lightweight tags, which do not have their own data -- which means you do not have to come up with "the person who made the tag". >> 3. I expect we should set 'committer' to the value determined from CVS >> and leave 'author' unused. But I suppose another possibility would be >> to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS >> author. Which one makes sense? I would set both to "name <name>" from CVS information.
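As a rough illustration of the lightweight-tag suggestion: fast-import's "reset" command can point a ref at an already-imported commit, which creates a tag without any tagger name or date. The helper name format_lightweight_tag is hypothetical, and the sketch assumes the tagged commit was given a mark earlier in the same stream.

```python
def format_lightweight_tag(tag_name, commit_mark):
    """Return the bytes of a fast-import 'reset' that creates a lightweight tag.

    Hypothetical helper: 'commit_mark' is the integer mark of a commit that
    was defined earlier in the stream. No person or timestamp is needed,
    because a lightweight tag is only a ref pointing at that commit.
    """
    return f"reset refs/tags/{tag_name}\nfrom :{commit_mark}\n\n".encode("utf-8")


if __name__ == "__main__":
    import sys
    sys.stdout.buffer.write(format_lightweight_tag("topic-split", 42))
```

Because nothing in the output depends on who ran the conversion or when, two runs over the same input produce the same stream for such tags.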
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 18:55 ` Junio C Hamano @ 2007-07-16 3:35 ` Eric Wong 0 siblings, 0 replies; 10+ messages in thread From: Eric Wong @ 2007-07-16 3:35 UTC (permalink / raw) To: Junio C Hamano; +Cc: Sean, Michael Haggerty, Shawn O. Pearce, git Junio C Hamano <gitster@pobox.com> wrote: > Sean <seanlkml@sympatico.ca> writes: > >> 2. It appears that author/committer require an email address. How > >> important is a valid email address here? > > > > It's not necessary for the operation of Git itself; it's up to you to > > decide how important the information is to your project. You should > > be able to set an empty email address for author or committer in > > git fast-import as "name <>". > > Don't do this; git-cvsimport and git-svn uses "name <name>" > which is a saner compromise. This way, you can add .mailmap to > help later "git shortlog" to map using "<name>" part to more > human friendly name. Mapping at conversion time would also be > good and git-cvsimport knows about it (I do not know about > git-svn). git-svn can do this, too. I don't use it myself, but I remember the file format is the same as the one git-svnimport and git-cvsimport use. -- Eric Wong ^ permalink raw reply [flat|nested] 10+ messages in thread
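A small sketch of the username mapping discussed above, assuming an authors file with one "username=Full Name <email>" entry per line (the layout of the git-cvsimport author conversion file). The function names parse_author_map and ident are made up for this example; unknown users fall back to the "name <name>" form.

```python
import re

# One entry per line: username=Full Name <email>
# Whitespace around '=' is tolerated.
_AUTHOR_LINE = re.compile(r"^\s*([^=\s]+)\s*=\s*(.*?)\s*<([^>]*)>\s*$")


def parse_author_map(text):
    """Parse an authors file into {username: (full_name, email)}."""
    authors = {}
    for line in text.splitlines():
        m = _AUTHOR_LINE.match(line)
        if m:  # lines that do not match the pattern are skipped
            user, name, email = m.groups()
            authors[user] = (name, email)
    return authors


def ident(user, authors):
    """Return "Full Name <email>" for a CVS username.

    Falls back to "user <user>" when the username is not in the map.
    """
    name, email = authors.get(user, (user, user))
    return f"{name} <{email}>"


if __name__ == "__main__":
    table = parse_author_map(
        "torvalds=Linus Torvalds <torvalds@linux-foundation.org>\n")
    print(ident("torvalds", table))
    print(ident("mhagger", table))
```

The returned string can be used directly in the committer (and, if wanted, author) line of a fast-import commit.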
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 14:11 Questions about git-fast-import for cvs2svn Michael Haggerty 2007-07-15 16:01 ` Sean @ 2007-07-15 18:43 ` Linus Torvalds 2007-07-16 6:19 ` Shawn O. Pearce 2007-07-15 21:56 ` Robin Rosenberg 2 siblings, 1 reply; 10+ messages in thread From: Linus Torvalds @ 2007-07-15 18:43 UTC (permalink / raw) To: Michael Haggerty; +Cc: Shawn O. Pearce, git On Sun, 15 Jul 2007, Michael Haggerty wrote: > > 1. Is it a problem to create blobs that are never referenced? The > easiest point to create blobs is when the RCS files are originally > parsed, but later we discard some CVS revisions, meaning that the > corresponding blobs would never be needed. Would this be a problem? No, don't worry about it. The resulting intermediate pack-file may be unnecessarily big, but you'd want to do a "git gc" to re-pack everything afterwards *anyway*, since the pack-files git-fast-import generates are generally not all that optimal, and that will also prune any unreferenced blobs. > 2. It appears that author/committer require an email address. How > important is a valid email address here? Git itself doesn't really care, and many CVS conversions have just converted the username into "user <user>", but from a QoI standpoint it's much nicer if you at least were to allow the kind of conversion that allows user-name to be associated with an email. Maybe git-fast-import could be taught to do the kind of user name conversion that we already do for CVS imports.. Shawn? > a. CVS commits include a username but not an email address. If an > email address is really required, then I suppose the person doing the > conversion would have to supply a lookup table mapping username -> email > address. That would be optimal. 
Note that it's not just user names: it's much nicer if you can regenerate a readable full name too, so instead of having something like "torvalds <torvalds>", you could map "torvalds" into "Linus Torvalds <torvalds@linux-foundation.org>", which is a lot more readable. But as far as git is concerned, this is all about being _pretty_, it doesn't really have any semantic meaning! Anyway, git-cvsimport knows about a magic file ("CVSROOT/users") that can map user names into full names and emails. Having something equivalent for a SVN import would be nice (git-svnimport does the same thing, and uses ".git/svn-authors" as the default source of author name conversion data). > b. CVS tag/branch creation events do not even include a username. > Any suggestions for what to use here? Git tags and branch creation don't do that either (unless you use signed tags): only when you create the first commit on a branch does the user matter. But if there really is data that doesn't have any user information at all (for real *changes*), then I'd just make one up. Again, the user information really doesn't have any *semantics* in git, it's just meant to be informational for showing the logs. It's nothing more than a structured part of the commit (or tag) message. > 3. I expect we should set 'committer' to the value determined from CVS > and leave 'author' unused. But I suppose another possibility would be > to set the 'committer' to 'cvs2svn' and the 'author' to the original CVS > author. Which one makes sense? Just make them be the same. Git-fast-import will default to that, if you only give a committer date/name. That's what git itself does if you just do a "git commit": the committer will be the same as the author. > 4. 
It appears that a commit can only have a single 'from' No, commits can have an arbitrary number of parents, and if you create a tag where the data comes from several sources, you could literally do that as a really strange merge, and that would probably be the most "correct" thing to do, even if it might end up looking *really* odd. [ To be strictly technically correct, I have to admit that I think we limit the number of parents to 16, but that's not a fundamental limit, that's just because nobody has ever been so crazy as to need more than that. However, there is no "data structure limit" in that number, it's just an arbitrary "you'd be crazy to generate a merge of that many parents" kind of thing, and we could lift the limit if you actually think it's worth it. I think the most we have ever seen in practice is a merge of 12 parents, and the people who did that were told to please not do it again, because it really does make the graph look extremely "cool". ] > What would be the most git-like way to handle this situation? Should > the branch be created in one commit, then have files from other sources > added to it in other commits? Or should (is this even possible?) all > files be added to the branch in a single commit, using multiple "merge" > sources? Using multiple parents and just generating a single commit (it will be called a "merge", but really, in git terms a commit is just a commit, and the difference in number of parents is really not a _technical_ difference, it's just a difference for how these things get visualized). It would be extremely interesting to see how this works in practice, but I _think_ it would work really well. The possible downsides might be: - it *may* just end up looking so confusing that people would prefer some alternate model. - we might have some performance issues with lots and lots of parents, and maybe we'd need to fix something. 
In particular, I can well imagine that showing the diff for the end result would be "interesting" (read: "totally useless") > 5. Is there any significance at all to the order that commits are output > to git-fast-import? Obviously, blobs have to be defined before they are > used, and '<committish>'s have to be defined before they are referenced. > But is there any other significance to the order of commits? Not afaik. Git internally very fundamentally simply doesn't care (there simply _is_ no object ordering, there is just objects that point to other objects), and I don't think git-fast-import could possibly care either. You do need to be "topologically" sorted (since you cannot even point to commits without having their SHA1's), but that should be it. Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
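To make the "blobs first, references later" point concrete, here is a hedged sketch of emitting a blob with a mark while parsing an RCS file. The helper name format_blob is hypothetical; the layout (blob, mark, byte-counted data) follows the fast-import man page. A blob whose mark is never used by any commit simply stays unreferenced until a later "git gc" prunes it.

```python
def format_blob(mark, content):
    """Return the bytes of one fast-import 'blob' command.

    Hypothetical helper: 'mark' is an integer chosen by the frontend and
    'content' is the raw file revision as bytes. Later commits can refer to
    the blob as ":<mark>" in a file-modify line such as "M 100644 :1 path".
    """
    return b"blob\nmark :%d\ndata %d\n" % (mark, len(content)) + content + b"\n"


if __name__ == "__main__":
    import sys
    sys.stdout.buffer.write(format_blob(1, b"hello\n"))
```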
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 18:43 ` Linus Torvalds @ 2007-07-16 6:19 ` Shawn O. Pearce 0 siblings, 0 replies; 10+ messages in thread From: Shawn O. Pearce @ 2007-07-16 6:19 UTC (permalink / raw) To: Linus Torvalds; +Cc: Michael Haggerty, git Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, 15 Jul 2007, Michael Haggerty wrote: > > 2. It appears that author/committer require an email address. How > > important is a valid email address here? > > Git itself doesn't really care, and many CVS conversions have just > converted the username into "user <user>", but from a QoI standpoint it's > much nicer if you at least were to allow the kind of conversion that > allows user-name to be associated with an email. > > Maybe git-fast-import could be taught to do the kind of user name > conversion that we already do for CVS imports.. Shawn? It could, but I'm not sure I want to implement it. ;-) I pretty much view source->Git translation as the business/policy of the frontend, not of fast-import. But we have three frontends that all share the same file format (git-cvsimport, git-svnimport, git-svn), and are all independent implementations. Maybe pushing it down into a tool like fast-import would benefit a lot of users, and thus should be done. I'll put it on my todo list. Which is much longer than I have time for. > > 5. Is there any significance at all to the order that commits are output > > to git-fast-import? Obviously, blobs have to be defined before they are > > used, and '<committish>'s have to be defined before they are referenced. > > But is there any other significance to the order of commits? > > Not afaik. Git internally very fundamentally simply doesn't care (there > simply _is_ no object ordering, there is just objects that point to other > objects), and I don't think git-fast-import could possibly care either. 
> You do need to be "topologically" sorted (since you cannot even point to > commits without having their SHA1's), but that should be it. Linus is completely correct here. The only requirement on data ordering is that all parent commits (from/merge lines) must come before any child that depends on them. But that's a pretty reasonable request, as almost all VCS systems want data to come in at least that order, if not something even more strict. In theory marks could be used to stub in commits and let you feed them out of order, but to make that work fast-import would need to buffer them until it saw everything it needed to produce a SHA-1. Not exactly a good idea. -- Shawn. ^ permalink raw reply [flat|nested] 10+ messages in thread
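The ordering requirement above (every parent named in a from/merge line must be emitted before its children) amounts to a topological sort of the commit graph. A minimal sketch using the Python standard library (graphlib, Python 3.9 or later); the commit identifiers and the function name emission_order are invented for the example.

```python
from graphlib import TopologicalSorter


def emission_order(parents):
    """Return commit ids ordered so that every parent precedes its children.

    'parents' maps each commit id to an iterable of its parent ids (the
    commits that would appear in its from/merge lines). Raises
    graphlib.CycleError if the input is not a DAG.
    """
    return list(TopologicalSorter(parents).static_order())


if __name__ == "__main__":
    history = {
        "root": [],
        "trunk-2": ["root"],
        "branch-1": ["root"],
        "tag-merge": ["trunk-2", "branch-1"],  # one 'from' plus one 'merge'
    }
    for commit in emission_order(history):
        print(commit)
```

Any order satisfying the constraint is acceptable to fast-import, so a frontend is free to pick whichever valid order is cheapest for it to produce.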
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 14:11 Questions about git-fast-import for cvs2svn Michael Haggerty 2007-07-15 16:01 ` Sean 2007-07-15 18:43 ` Linus Torvalds @ 2007-07-15 21:56 ` Robin Rosenberg 2007-07-15 23:21 ` Robin H. Johnson 2 siblings, 1 reply; 10+ messages in thread From: Robin Rosenberg @ 2007-07-15 21:56 UTC (permalink / raw) To: Michael Haggerty; +Cc: Shawn O. Pearce, git On Sunday 15 July 2007, Michael Haggerty wrote: > b. CVS tag/branch creation events do not even include a username. > Any suggestions for what to use here? The CVSROOT/history file contains the user name and timestamp of the tag creation. CVS can be told not to update the file. It is appended to by default. -- robin
* Re: Questions about git-fast-import for cvs2svn 2007-07-15 21:56 ` Robin Rosenberg @ 2007-07-15 23:21 ` Robin H. Johnson 0 siblings, 0 replies; 10+ messages in thread From: Robin H. Johnson @ 2007-07-15 23:21 UTC (permalink / raw) To: Git Mailing List On Sun, Jul 15, 2007 at 11:56:56PM +0200, Robin Rosenberg wrote: > On Sunday 15 July 2007, Michael Haggerty wrote: > > b. CVS tag/branch creation events do not even include a username. > > Any suggestions for what to use here? > The CVSROOT/history file contains the user name and timestamp of the tag > creation. CVS can be told not to update the file. It is appended to by default. This assumes that said file is intact, and that LogFormat was not changed to not put entries into the file on tag/branch. -- Robin Hugh Johnson Gentoo Linux Developer & Council Member E-Mail : robbat2@gentoo.org GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85