* Diff format in packs @ 2006-07-31 21:08 Jon Smirl 2006-07-31 21:16 ` Jakub Narebski 0 siblings, 1 reply; 13+ messages in thread From: Jon Smirl @ 2006-07-31 21:08 UTC (permalink / raw) To: git I see how the diffs are encoded into the pack, but what did they look like before compressing? It would be great if they looked like CVS diffs. I poked around in the doc and I don't see anything. Is this specified somewhere and I missed it? I see that the diff code is from libxdiff but I haven't figured out how it is being used yet. I'm trying to build a small app that takes a CVS ,v and writes out a pack corresponding to the versions. Suggestions on the most efficient strategy for doing this by calling straight into the git C code? Forking off git commands is not very efficient when done a million times. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Diff format in packs 2006-07-31 21:08 Diff format in packs Jon Smirl @ 2006-07-31 21:16 ` Jakub Narebski 2006-07-31 21:20 ` Jon Smirl 0 siblings, 1 reply; 13+ messages in thread From: Jakub Narebski @ 2006-07-31 21:16 UTC (permalink / raw) To: git Jon Smirl wrote: > I'm trying to build a small app that takes a CVS ,v and writes out a > pack corresponding to the versions. Suggestions on the most efficient > strategy for doing this by calling straight into the git C code? > Forking off git commands is not very efficient when done a million > times. Something akin to parsecvs by Keith Packard? -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Diff format in packs 2006-07-31 21:16 ` Jakub Narebski @ 2006-07-31 21:20 ` Jon Smirl 2006-07-31 22:32 ` Shawn Pearce 2006-08-01 0:47 ` Martin Langhoff 0 siblings, 2 replies; 13+ messages in thread From: Jon Smirl @ 2006-07-31 21:20 UTC (permalink / raw) To: Jakub Narebski; +Cc: git On 7/31/06, Jakub Narebski <jnareb@gmail.com> wrote: > Jon Smirl wrote: > > > I'm trying to build a small app that takes a CVS ,v and writes out a > > pack corresponding to the versions. Suggestions on the most efficient > > strategy for doing this by calling straight into the git C code? > > Forking off git commands is not very efficient when done a million > > times. > > Something akin to parsecvs by Keith Packard? I see the error in my thoughts now, I need the fully expanded delta to compute the sha-1 so I might as well use the parsecvs code. I am working on combining cvs2svn, parsecvs and cvsps into something that can handle Mozilla CVS. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Diff format in packs 2006-07-31 21:20 ` Jon Smirl @ 2006-07-31 22:32 ` Shawn Pearce 2006-07-31 23:08 ` Jon Smirl 2006-08-01 0:47 ` Martin Langhoff 1 sibling, 1 reply; 13+ messages in thread From: Shawn Pearce @ 2006-07-31 22:32 UTC (permalink / raw) To: Jon Smirl; +Cc: Jakub Narebski, git Jon Smirl <jonsmirl@gmail.com> wrote: > On 7/31/06, Jakub Narebski <jnareb@gmail.com> wrote: > >Jon Smirl wrote: > > > >> I'm trying to build a small app that takes a CVS ,v and writes out a > >> pack corresponding to the versions. Suggestions on the most efficient > >> strategy for doing this by calling straight into the git C code? > >> Forking off git commands is not very efficient when done a million > >> times. > > > >Something akin to parsecvs by Keith Packard? > > I see the error in my thoughts now, I need the fully expanded delta to > compute the sha-1 so I might as well use the parsecvs code. > > I am working on combining cvs2svn, parsecvs and cvsps into something > that can handle Mozilla CVS. I think you sort of have the right idea. Creating a pack file from scratch without deltas is a very trivial operation. The pack format is documented in Documentation/technical/pack-format.txt. The actual delta format isn't documented here and generating a delta would be somewhat difficult, but creating a pack with no deltas and only zlib compression is pretty simple. And no, GIT doesn't use the same (horrible) delta format as RCS so you definitely are right, you have to expand it before you can compress it. Creating trees and commits from scratch is also really easy. Calling zlib and a SHA1 routine to create the checksum is the hard part. I think I wrote the tree and commit construction part of jgit in a few hours, and that was while I was also being distracted by someone speaking in the front of the room. 
:-) It should be reasonably simple to extract each revision from a single ,v file into its full undeltafied form, compute its SHA1, compress it with zlib, and append it into a pack file. Do that for every file and toss the SHA1 values, file names and revision numbers off into a table somewhere. Then loop back through and generate trees while playing around only with the RCS file paths, timestamps and SHA1 pointers. Again tree generation is extremely simple; it would be trivial to generate tree objects and append them into the same (or another) pack. Finally writing commit objects pointing at the trees is also easy, without calling git-commit. When you are all done run a `git-repack -a -d -f` and let the delta code compress everything down. That first compression might take a little while but it should do a reasonably good job despite the input pack(s) being highly unorganized. So I think I'm suggesting you find a way to generate the base objects yourself right into a pack file, rather than using the higher level GIT executables to do it. You may be able to reuse some of the code in GIT but I know its writer code is organized for writing loose objects, not for appending new objects into a new pack file, so some surgery would probably be required. -- Shawn. ^ permalink raw reply [flat|nested] 13+ messages in thread
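The object construction Shawn describes can be sketched in Python. This is a minimal illustration, not code from git or jgit: the function names (`object_id`, `tree_payload`, `commit_payload`) are made up for this example, and it assumes the standard object layout, where an object's SHA-1 covers a `<type> <size>\0` header followed by the uncompressed payload.

```python
import hashlib


def object_id(obj_type: bytes, payload: bytes) -> bytes:
    """Return the raw 20-byte SHA-1 of a git object.

    The hash covers the header "<type> <size>\\0" followed by the payload.
    """
    header = obj_type + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).digest()


def tree_payload(entries) -> bytes:
    """Build a tree body from (mode, name, raw_sha) tuples.

    Entries are sorted by name, with subdirectories (mode 40000)
    compared as if their name ended in "/".
    """
    def sort_key(entry):
        mode, name, _sha = entry
        return name + b"/" if mode == b"40000" else name

    return b"".join(
        mode + b" " + name + b"\0" + sha
        for mode, name, sha in sorted(entries, key=sort_key)
    )


def commit_payload(tree_sha, parent_shas, author, committer, message) -> bytes:
    """Build a commit body; author/committer look like b"Name <email> 1154380080 +0000"."""
    lines = [b"tree " + tree_sha.hex().encode("ascii")]
    lines += [b"parent " + p.hex().encode("ascii") for p in parent_shas]
    lines.append(b"author " + author)
    lines.append(b"committer " + committer)
    return b"\n".join(lines) + b"\n\n" + message


# Hypothetical usage: one file revision, one tree, one root commit.
blob_sha = object_id(b"blob", b"hello world\n")
tree_sha = object_id(b"tree", tree_payload([(b"100644", b"hello.txt", blob_sha)]))
ident = b"A U Thor <author@example.com> 1154380080 +0000"  # placeholder identity
commit_sha = object_id(
    b"commit", commit_payload(tree_sha, [], ident, ident, b"Initial import\n")
)
```

Only the SHA-1 values and file names are needed to build trees and commits, which matches the suggestion to keep them in a side table while the blobs are being written.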
* Re: Diff format in packs 2006-07-31 22:32 ` Shawn Pearce @ 2006-07-31 23:08 ` Jon Smirl 0 siblings, 0 replies; 13+ messages in thread From: Jon Smirl @ 2006-07-31 23:08 UTC (permalink / raw) To: Shawn Pearce; +Cc: Jakub Narebski, git On 7/31/06, Shawn Pearce <spearce@spearce.org> wrote: > It should be reasonably simple to extract each revision from a > single ,v file into its full undeltafied form, compute its SHA1, > compress it with zlib, and append it into a pack file. Do that > for every file and toss the SHA1 values, file names and revision > numbers off into a table somewhere. cvs2svn has put everything needed into a nice db4 database that I can generate the trees from once I get the revisions into git. I'll modify cvs2svn to record the SHA1 for the deltas when they are originally scanned. The change set algorithm used by cvs2svn is the only one of the bunch that can process moz cvs without errors or losing check-ins. It takes 3.5hrs to generate the changeset db. There are over 1M deltas to sort and 250K commits. > So I think I'm suggesting you find a way to generate the base objects > yourself right into a pack file, rather than using the higher level > GIT executables to do it. You may be able to reuse some of the > code in GIT but I know its writer code is organized for writing > loose objects, not for appending new objects into a new pack file, > so some surgery would probably be required. This makes sense, just put the objects into the pack and forget about deltas for now. I can add deltas in the next rewrite or just use git-repack. Since cvs2svn is in Python I need to figure out how to write a pack file from that environment. At least Python has sha1 and zlib support built-in. What is more important is that I get a working incremental update functioning. I'm hoping that I can rely on cvsps for that. Cutting everyone over at once is unlikely to happen. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 13+ messages in thread
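Since Python ships with SHA-1 and zlib, a delta-free pack can be assembled with the standard library alone. The following is a rough sketch assuming the version 2 layout described in Documentation/technical/pack-format.txt: a 12-byte header, one zlib-compressed entry per object, and a trailing SHA-1 over everything before it. The names `encode_entry_header` and `build_pack` are invented for this example, and the sketch holds the whole pack in memory, which would not suit a 12GB import as-is.

```python
import hashlib
import struct
import zlib

# Object type codes used in pack entry headers.
OBJ_COMMIT, OBJ_TREE, OBJ_BLOB = 1, 2, 3


def encode_entry_header(obj_type: int, size: int) -> bytes:
    """Encode the variable-length type/size header that precedes each entry.

    The first byte carries the type in bits 4-6 and the low 4 bits of the
    size; later bytes carry 7 size bits each. The high bit marks "more follows".
    """
    current = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(current | 0x80)
        current = size & 0x7F
        size >>= 7
    out.append(current)
    return bytes(out)


def build_pack(objects) -> bytes:
    """Build a pack from (type_code, uncompressed_payload) pairs, without deltas."""
    objects = list(objects)
    body = bytearray(b"PACK")
    body += struct.pack(">II", 2, len(objects))  # version 2, object count
    for obj_type, payload in objects:
        body += encode_entry_header(obj_type, len(payload))
        body += zlib.compress(payload)
    body += hashlib.sha1(bytes(body)).digest()  # trailer checksum
    return bytes(body)


pack_bytes = build_pack([(OBJ_BLOB, b"hello world\n")])
```

A pack written this way still needs an index before git can use it; running `git index-pack` on the file and then repacking, as suggested earlier in the thread, would be the likely next step.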
* Re: Diff format in packs 2006-07-31 21:20 ` Jon Smirl 2006-07-31 22:32 ` Shawn Pearce @ 2006-08-01 0:47 ` Martin Langhoff 2006-08-01 1:03 ` Martin Langhoff 2006-08-01 1:13 ` Jon Smirl 1 sibling, 2 replies; 13+ messages in thread From: Martin Langhoff @ 2006-08-01 0:47 UTC (permalink / raw) To: Jon Smirl; +Cc: Jakub Narebski, git Jon, just get all the file versions out of the ,v file and into the GIT repo, then do find .git/objects/ -type f | git-pack-objects. You don't have to even think of generating the packfile yourself. On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote: > I am working on combining cvs2svn, parsecvs and cvsps into something > that can handle Mozilla CVS. If you publish your WIP somewhere, I might be able to jump in and help a bit. I've seen your "challenge" email earlier, but haven't been able to get started yet -- lots of work on other foss fronts. cheers, martin ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Diff format in packs 2006-08-01 0:47 ` Martin Langhoff @ 2006-08-01 1:03 ` Martin Langhoff 2006-08-01 1:13 ` Jon Smirl 1 sibling, 0 replies; 13+ messages in thread From: Martin Langhoff @ 2006-08-01 1:03 UTC (permalink / raw) To: Jon Smirl; +Cc: Jakub Narebski, git On 8/1/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > just get all the file versions out of the ,v file and into the GIT > repo, then do find .git/objects/ -type f | git-pack-objects. You don't > have to even think of generating the packfile yourself. Well, that's a bit of a lie, using bare find won't quite work, but if you look up the "Repacking many disconnected blobs" thread, there's a good discussion of packing objects that are not related to trees yet. martin ^ permalink raw reply [flat|nested] 13+ messages in thread
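One probable reason bare `find` output does not work (an assumption here, the referenced thread may give others) is that git-pack-objects reads object names on stdin, not file paths. A loose object stored at `.git/objects/ab/cdef...` has the name `abcdef...`, so the paths need converting first. The helper below is a hypothetical sketch of that conversion; `loose_object_id` is not a git function.

```python
import re
from pathlib import PurePosixPath

# Two hex digits for the fan-out directory, 38 for the file name.
_LOOSE_OBJECT = re.compile(r"^[0-9a-f]{2}/[0-9a-f]{38}$")


def loose_object_id(path: str, objects_dir: str = ".git/objects"):
    """Map a loose-object path to its 40-character object name.

    Returns None for anything that is not a loose object, such as
    files under objects/pack/ or objects/info/.
    """
    try:
        relative = PurePosixPath(path).relative_to(objects_dir)
    except ValueError:
        return None  # path is not under the objects directory
    relative_str = relative.as_posix()
    if not _LOOSE_OBJECT.match(relative_str):
        return None
    return relative_str.replace("/", "")
```

The resulting names, one per line, are the kind of input git-pack-objects expects on stdin.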
* Re: Diff format in packs 2006-08-01 0:47 ` Martin Langhoff 2006-08-01 1:03 ` Martin Langhoff @ 2006-08-01 1:13 ` Jon Smirl 2006-08-01 2:16 ` Martin Langhoff 1 sibling, 1 reply; 13+ messages in thread From: Jon Smirl @ 2006-08-01 1:13 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jakub Narebski, git On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > Jon, > > just get all the file versions out of the ,v file and into the GIT > repo, then do find .git/objects/ -type f | git-pack-objects. You don't > have to even think of generating the packfile yourself. Moz CVS expands into over 1M files and 12GB in size. I keep getting concerned about algorithms that take days to complete and need 4GB to run. > On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote: > > I am working on combining cvs2svn, parsecvs and cvsps into something > > that can handle Mozilla CVS. > > If you publish your WIP somewhere, I might be able to jump in and help > a bit. I've seen your "challenge" email earlier, but haven't been able > to get started yet -- lots of work on other foss fronts. I haven't got anything useful yet, I keep switching in and out of working on this. I am still trying to work out a viable transition strategy that I can attempt to sell the Mozilla developers on. So far I don't have one. The requirements I have so far: 1) the conversion needs to be reproducible so that Moz staff can do it internally without a lot of hassle. They need to verify that nothing has been lost. 2) It shouldn't need a monster machine to run it and it can't take days to finish 3) It has to have incremental support so that it can be run in parallel with CVS with commits still going to CVS. 4) nothing can be lost out of existing CVS 5) a bonus feature would be a partial repository to avoid the initial 700MB git download. cvsps meets these requirements except for not losing anything out of the existing CVS. cvsps is throwing away some items it finds confusing. 
So far the only algorithm that appears to successfully convert all branches into change sets is the algorithm in cvs2svn. http://cvs2svn.tigris.org/source/browse/cvs2svn/trunk/design-notes.txt?rev=2536&view=markup I've spent more time looking at parsecvs than cvsps, is it reasonable to convert cvsps to the algorithm described above? Another strategy would be to use cvs2svn to build the changeset database and then use cvsps to simply read the changesets out of it and build the git repository. Parsecvs never finishes the conversion; it always hits an error or GPF after 4-5 hours, probably a wild pointer somewhere. -- Jon Smirl jonsmirl@gmail.com
* Re: Diff format in packs 2006-08-01 1:13 ` Jon Smirl @ 2006-08-01 2:16 ` Martin Langhoff 2006-08-01 2:29 ` Jon Smirl 0 siblings, 1 reply; 13+ messages in thread From: Martin Langhoff @ 2006-08-01 2:16 UTC (permalink / raw) To: Jon Smirl; +Cc: Jakub Narebski, git On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote: > On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > > Jon, > > > > just get all the file versions out of the ,v file and into the GIT > > repo, then do find .git/objects/ -type f | git-pack-objects. You don't > > have to even think of generating the packfile yourself. > > Moz CVS expands into over 1M files and 12GB in size. I keep getting > concerned about algorithms that take days to complete and need 4GB to > run. If you run that every 1000 rcs files converted, it will be really cheap in processing and memory footprint. That's not a concern. > > On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote: > > > I am working on combining cvs2svn, parsecvs and cvsps into something > > > that can handle Mozilla CVS. > > > > If you publish your WIP somewhere, I might be able to jump in and help > > a bit. I've seen your "challenge" email earlier, but haven't been able > > to get started yet -- lots of work on other foss fronts. > > I haven't got anything useful yet, I keep switching in and out of > working on this. I am still trying to work out a viable transition > strategy that I can attempt to sell the Mozilla developers on. So far > I don't have one. I understand that, and it's a shame. > The requirements I have so far: Yep to 1..4. I suspect that you can get "there" with a converted cvs2svn transformed to deal with git as you are pursuing, and in dealing with the follow-on imports using git-cvsimport. The only real limitation there is that new branches opened in that transition period may be imported with the root in the wrong place. But for "ongoing" branches, the setup works great. 
I've done it many times with parsecvs for the initial import and git-cvsimport for the subsequent incrementals. > 5) a bonus feature would be a partial repository to avoid the initial > 700MB git download. Agreed. However, I thought I had gotten it to be much slimmer than that, but I may be wrong. Also, a current Moz checkout via cvs is massively chatty, so between bandwidth and latency, I think the git protocol beats cvs for the initial checkout even for Moz. > I've spent more time looking at parsecvs than cvsps, is it reasonable > to convert cvsps to the algorithm described above? Another strategy I don't think cvsps is easily fixable. > would be to use cvs2svn to build the changeset database and then use > cvsps to simply read the changesets out of it and build the git > repository. Once cvs2svn has the db built, it should be easy to write a perl/python script that mimics the output of cvsps. > Parsecvs never finishes the conversion it always hits an error or GPF > after 4-5 hours, probably a wild pointer somewhere. Hmmmm. Nag Keith? martin
* Re: Diff format in packs 2006-08-01 2:16 ` Martin Langhoff @ 2006-08-01 2:29 ` Jon Smirl 2006-08-01 2:36 ` Martin Langhoff 2006-08-01 10:59 ` Johannes Schindelin 0 siblings, 2 replies; 13+ messages in thread From: Jon Smirl @ 2006-08-01 2:29 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jakub Narebski, git On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > > would be to use cvs2svn to build the changeset database and then use > > cvsps to simply read the changesets out of it and build the git > > repository. > > Once cvs2svn has the db built, it should be easy to write a > perl/python script that mimics the output of cvsps. This is an efficiency problem. My strategy instead is to build the git objects for the revisions when the CVS file is initially parsed and then track the sha-1 of the object. Otherwise I end up needing to use CVS to pull each revision out after the changesets are computed and that's a slow process since it involves 1M forks of a cvs process. People on the Python list have been giving me some pointers on how to build the revision files efficiently. That's what I am working on right now. -- Jon Smirl jonsmirl@gmail.com
* Re: Diff format in packs 2006-08-01 2:29 ` Jon Smirl @ 2006-08-01 2:36 ` Martin Langhoff 2006-08-01 10:59 ` Johannes Schindelin 1 sibling, 0 replies; 13+ messages in thread From: Martin Langhoff @ 2006-08-01 2:36 UTC (permalink / raw) To: Jon Smirl; +Cc: Jakub Narebski, git On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote: > On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > > > would be to use cvs2svn to build the changeset database and then use > > > cvsps to simply read the changesets out of it and build the git > > > repository. > > > > Once cvs2svn has the db built, it should be easy to write a > > perl/python script that mimics the output of cvsps. > > This is an efficiency problem. My strategy instead is to build the git Agreed when it comes to the initial import. I thought you were meaning for "driving" git-cvsimport when doing incrementals. > People on the Python list have been giving me some pointers on how to > build the revision files efficiently. That's what I am working on > right now. You can definitely steal that from parsecvs -- it should be trivial to get parsecvs to do the RCS -> GIT part and store on disk the path/revision -> SHA1. Yes? cheers, martin ^ permalink raw reply [flat|nested] 13+ messages in thread
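The path/revision -> SHA1 table mentioned here could be as small as one keyed table. The sketch below uses `sqlite3` only because it is in the Python standard library; that choice, the table name `revision_blob`, and the function names are all assumptions for illustration (the thread notes that cvs2svn itself keeps its data in db4).

```python
import sqlite3


def open_revision_map(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a table mapping (RCS path, revision) to a blob SHA-1."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revision_blob ("
        " path TEXT NOT NULL,"
        " revision TEXT NOT NULL,"
        " sha1 TEXT NOT NULL,"
        " PRIMARY KEY (path, revision))"
    )
    return conn


def record_revision(conn, path: str, revision: str, sha1_hex: str) -> None:
    """Remember which blob a given file revision expanded to."""
    conn.execute(
        "INSERT OR REPLACE INTO revision_blob VALUES (?, ?, ?)",
        (path, revision, sha1_hex),
    )


def lookup_revision(conn, path: str, revision: str):
    """Return the recorded SHA-1 for a file revision, or None if unknown."""
    row = conn.execute(
        "SELECT sha1 FROM revision_blob WHERE path = ? AND revision = ?",
        (path, revision),
    ).fetchone()
    return row[0] if row else None
```

With such a table filled during the initial ,v parse, tree generation would only need lookups and never has to fork a cvs process per revision.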
* Re: Diff format in packs 2006-08-01 2:29 ` Jon Smirl 2006-08-01 2:36 ` Martin Langhoff @ 2006-08-01 10:59 ` Johannes Schindelin 2006-08-01 12:01 ` Jakub Narebski 1 sibling, 1 reply; 13+ messages in thread From: Johannes Schindelin @ 2006-08-01 10:59 UTC (permalink / raw) To: Jon Smirl; +Cc: Martin Langhoff, Jakub Narebski, git Hi Jon, On Mon, 31 Jul 2006, Jon Smirl wrote: > On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote: > > > would be to use cvs2svn to build the changeset database and then use > > > cvsps to simply read the changesets out of it and build the git > > > repository. > > > > Once cvs2svn has the db built, it should be easy to write a > > perl/python script that mimics the output of cvsps. > > This is an efficiency problem. In another mail, you asked for a solution for the initial 700MB download. How about doing it like Linux-2.6, namely have a clean cut, and for those who want, they can load the historical repo, too, and graft it onto the current one? I _think_ that with a new start, incremental git-cvsimport should still work, if you do it cleverly. Obviously, it would _not_ have the full history, but rather add onto the most recent revisions (incremental git-cvsimport detects the revisions to import by author date IIRC). Note that this method will _not_ work, if there are _new_ branches that cvsps has problems with. Ciao, Dscho ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Diff format in packs 2006-08-01 10:59 ` Johannes Schindelin @ 2006-08-01 12:01 ` Jakub Narebski 0 siblings, 0 replies; 13+ messages in thread From: Jakub Narebski @ 2006-08-01 12:01 UTC (permalink / raw) To: git Johannes Schindelin wrote: > In another mail, you asked for a solution for the initial 700MB download. > How about doing it like Linux-2.6, namely have a clean cut, and for those > who want, they can load the historical repo, too, and graft it onto the > current one? Perhaps that would be incentive to finally make "lazy clone"/"shallow clone" support in git. -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 13+ messages in thread