Diff format in packs

* Diff format in packs
@ 2006-07-31 21:08 Jon Smirl
  2006-07-31 21:16 ` Jakub Narebski
  0 siblings, 1 reply; 13+ messages in thread
From: Jon Smirl @ 2006-07-31 21:08 UTC (permalink / raw)
  To: git

I see how the diffs are encoded into the pack, but what do they look
like before compression? It would be great if they looked like CVS
diffs. I poked around in the docs and didn't find anything. Is this
specified somewhere and I missed it? I see that the diff code comes from
libxdiff, but I haven't figured out how it is being used yet.

I'm trying to build a small app that takes a CVS ,v and writes out a
pack corresponding to the versions. Suggestions on the most efficient
strategy for doing this by calling straight into the git C code?
Forking off git commands is not very efficient when done a million
times.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Diff format in packs
  2006-07-31 21:08 Diff format in packs Jon Smirl
@ 2006-07-31 21:16 ` Jakub Narebski
  2006-07-31 21:20   ` Jon Smirl
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Narebski @ 2006-07-31 21:16 UTC (permalink / raw)
  To: git

Jon Smirl wrote:

> I'm trying to build a small app that takes a CVS ,v and writes out a
> pack corresponding to the versions. Suggestions on the most efficient
> strategy for doing this by calling straight into the git C code?
> Forking off git commands is not very efficient when done a million
> times.

Something akin to parsecvs by Keith Packard?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: Diff format in packs
  2006-07-31 21:16 ` Jakub Narebski
@ 2006-07-31 21:20   ` Jon Smirl
  2006-07-31 22:32     ` Shawn Pearce
  2006-08-01  0:47     ` Martin Langhoff
  0 siblings, 2 replies; 13+ messages in thread
From: Jon Smirl @ 2006-07-31 21:20 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On 7/31/06, Jakub Narebski <jnareb@gmail.com> wrote:
> Jon Smirl wrote:
>
> > I'm trying to build a small app that takes a CVS ,v and writes out a
> > pack corresponding to the versions. Suggestions on the most efficient
> > strategy for doing this by calling straight into the git C code?
> > Forking off git commands is not very efficient when done a million
> > times.
>
> Something akin to parsecvs by Keith Packard?

I see the error in my thinking now: I need the fully expanded delta to
compute the sha-1, so I might as well use the parsecvs code.

I am working on combining cvs2svn, parsecvs and cvsps into something
that can handle Mozilla CVS.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Diff format in packs
  2006-07-31 21:20   ` Jon Smirl
@ 2006-07-31 22:32     ` Shawn Pearce
  2006-07-31 23:08       ` Jon Smirl
  2006-08-01  0:47     ` Martin Langhoff
  1 sibling, 1 reply; 13+ messages in thread
From: Shawn Pearce @ 2006-07-31 22:32 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Jakub Narebski, git

Jon Smirl <jonsmirl@gmail.com> wrote:
> On 7/31/06, Jakub Narebski <jnareb@gmail.com> wrote:
> >Jon Smirl wrote:
> >
> >> I'm trying to build a small app that takes a CVS ,v and writes out a
> >> pack corresponding to the versions. Suggestions on the most efficient
> >> strategy for doing this by calling straight into the git C code?
> >> Forking off git commands is not very efficient when done a million
> >> times.
> >
> >Something akin to parsecvs by Keith Packard?
> 
> I see the error in my thoughts now, I need the fully expanded delta to
> compute the sha-1 so I might as well use the parsecvs code.
> 
> I am working on combining cvs2svn, parsecvs and cvsps into something
> that can handle Mozilla CVS.

I think you sort of have the right idea.  Creating a pack file
from scratch without deltas is a very trivial operation.  The pack
format is documented in Documentation/technical/pack-format.txt.
The actual delta format isn't documented here and generating a delta
would be somewhat difficult, but creating a pack with no deltas
and only zlib compression is pretty simple.  And no, GIT doesn't
use the same (horrible) delta format as RCS, so you are definitely
right: you have to expand it before you can compress it.
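
For readers who want to see what a pack delta looks like once it is
inflated, here is a minimal Python sketch of a delta applier. It is not
taken from this thread; it assumes the layout used by git's own delta
code (two base-128 sizes, then a stream of copy and insert opcodes), and
the function names read_varint and apply_delta are made up for
illustration.

```python
def read_varint(buf, pos):
    """Read a little-endian base-128 size; return (value, new_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base, delta):
    """Rebuild the target bytes from `base` and an inflated delta."""
    src_size, pos = read_varint(delta, 0)
    dst_size, pos = read_varint(delta, pos)
    if src_size != len(base):
        raise ValueError("delta was made against a different base")
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:
            # Copy opcode: low 4 bits select offset bytes, next 3 select size bytes.
            offset = size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000
            if offset + size > len(base):
                raise ValueError("copy runs past the end of the base")
            out += base[offset:offset + size]
        elif cmd:
            # Insert opcode: the next `cmd` bytes are literal data.
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("opcode 0 is reserved")
    if len(out) != dst_size:
        raise ValueError("result size does not match the delta header")
    return bytes(out)
```

So a pack delta is a binary copy/insert program against another object,
not a line-oriented diff like the ones stored in an RCS ,v file.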

Creating trees and commits from scratch is also really easy.  Calling
zlib and a SHA1 routine to create the checksum is the hard part.
I think I wrote the tree and commit construction part of jgit in
a few hours, and that was while I was also being distracted by
someone speaking in the front of the room.  :-)

It should be reasonably simple to extract each revision from a
single ,v file into its full undeltafied form, compute its SHA1,
compress it with zlib, and append it into a pack file.  Do that
for every file and toss the SHA1 values, file names and revision
numbers off into a table somewhere.
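
A minimal Python sketch of that per-revision step, assuming the full
text of each revision has already been extracted from the ,v file. The
helper names (git_object_id, store_revision) and the in-memory dict used
as the "table" are assumptions made for illustration, not code from GIT
or parsecvs.

```python
import hashlib
import zlib


def git_object_id(obj_type, payload):
    """SHA-1 over the '<type> <size>\\0' header followed by the raw payload."""
    header = b"%s %d\x00" % (obj_type, len(payload))
    return hashlib.sha1(header + payload).hexdigest()


def store_revision(table, path, revision, content):
    """Record the blob id for (path, revision) and return it with the zlib data."""
    sha = git_object_id(b"blob", content)
    table[(path, revision)] = sha
    # The compressed bytes are what would later be appended to the pack.
    return sha, zlib.compress(content)
```

The id is computed over the uncompressed content plus header, which is
why every revision has to be expanded out of its RCS delta first.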

Then loop back through and generate trees while playing around only
with the RCS file paths, timestamps and SHA1 pointers.  Again tree
generation is extremely simple; it would be trivial to generate
tree objects and append them into the same (or another) pack.
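
As a rough illustration of how little is involved, a tree object can be
assembled in Python as below. This is a sketch under the assumption that
entries arrive as (mode, name, hex SHA1) tuples with bytes modes and
names; build_tree and tree_sort_key are hypothetical names.

```python
import hashlib


def tree_sort_key(entry):
    mode, name, _sha = entry
    # Subdirectories (mode 40000) sort as if their name ended with "/".
    return name + b"/" if mode == b"40000" else name


def build_tree(entries):
    """Return (hex id, raw body) for a tree built from (mode, name, hex_sha)."""
    body = b"".join(
        mode + b" " + name + b"\x00" + bytes.fromhex(hex_sha)
        for mode, name, hex_sha in sorted(entries, key=tree_sort_key)
    )
    store = b"tree %d\x00" % len(body) + body
    return hashlib.sha1(store).hexdigest(), body
```

Each entry is just "mode name", a NUL byte and the 20 raw bytes of the
SHA1 it points at, so the whole step works from the table of SHA1 values
without touching file contents again.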

Finally writing commit objects pointing at the trees is also easy,
without calling git-commit.
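
A commit object is plain text, so it can be sketched in Python in a few
lines. The function name build_commit and the example identity string in
the usage note are placeholders, not values from this thread.

```python
import hashlib


def build_commit(tree, parents, author, committer, message):
    """Return (hex id, raw body) for a commit.

    `tree` and `parents` are hex ids; `author` and `committer` are bytes of
    the form b"Name <email> <unix-time> <+hhmm>"; `message` is bytes.
    """
    lines = [b"tree " + tree.encode("ascii")]
    lines += [b"parent " + p.encode("ascii") for p in parents]
    lines.append(b"author " + author)
    lines.append(b"committer " + committer)
    body = b"\n".join(lines) + b"\n\n" + message
    store = b"commit %d\x00" % len(body) + body
    return hashlib.sha1(store).hexdigest(), body
```

For example, build_commit(tree_id, [], who, who, b"Initial import\n")
with who = b"A U Thor <author@example.com> 1154379600 +0000" yields a
root commit; later commits pass the previous id in the parents list.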

When you are all done run a `git-repack -a -d -f` and let the delta
code compress everything down.  That first compression might take
a little while but it should do a reasonably good job despite the
input pack(s) being highly unorganized.

So I think I'm suggesting you find a way to generate the base objects
yourself right into a pack file, rather than using the higher level
GIT executables to do it.  You may be able to reuse some of the
code in GIT but I know its writer code is organized for writing
loose objects, not for appending new objects into a new pack file,
so some surgery would probably be required.

-- 
Shawn.


* Re: Diff format in packs
  2006-07-31 22:32     ` Shawn Pearce
@ 2006-07-31 23:08       ` Jon Smirl
  0 siblings, 0 replies; 13+ messages in thread
From: Jon Smirl @ 2006-07-31 23:08 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jakub Narebski, git

On 7/31/06, Shawn Pearce <spearce@spearce.org> wrote:
> It should be reasonably simple to extract each revision from a
> single ,v file into its full undeltafied form, compute its SHA1,
> compress it with zlib, and append it into a pack file.  Do that
> for every file and toss the SHA1 values, file names and revision
> numbers off into a table somewhere.

cvs2svn has put everything needed into a nice db4 database that I can
generate the trees from once I get the revisions into git. I'll modify
cvs2svn to record the SHA1 for the deltas when they are originally
scanned. The change set algorithm used by cvs2svn is the only one of
the bunch that can process moz cvs without errors or losing check-ins.
It takes 3.5hrs to generate the changeset db. There are over 1M deltas
to sort and 250K commits.

> So I think I'm suggesting you find a way to generate the base objects
> yourself right into a pack file, rather than using the higher level
> GIT executables to do it.  You may be able to reuse some of the
> code in GIT but I know its writer code is organized for writing
> loose objects, not for appending new objects into a new pack file,
> so some surgery would probably be required.

This makes sense, just put the objects into the pack and forget about
deltas for now. I can add deltas in the next rewrite or just use
git-repack. Since cvs2svn is in Python I need to figure out how to
write a pack file from that environment. At least Python has sha1 and
zlib support built-in.
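
A minimal sketch of such a writer using only hashlib, struct and zlib.
It assumes the version 2 layout from pack-format.txt with no deltas, and
it builds the whole pack in memory for brevity; entry_header and
write_pack are hypothetical names.

```python
import hashlib
import struct
import zlib

OBJ_TYPES = {"commit": 1, "tree": 2, "blob": 3, "tag": 4}


def entry_header(type_code, size):
    """Encode the per-object header: 3 type bits plus a variable-length size."""
    byte = (type_code << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # high bit set: more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


def write_pack(objects):
    """Build a delta-free pack from a list of (type_name, payload) pairs."""
    chunks = [b"PACK", struct.pack(">II", 2, len(objects))]
    for type_name, payload in objects:
        chunks.append(entry_header(OBJ_TYPES[type_name], len(payload)))
        chunks.append(zlib.compress(payload))
    data = b"".join(chunks)
    # The trailer is the SHA-1 of everything written before it.
    return data + hashlib.sha1(data).digest()
```

Two caveats with this sketch: the object count lives in the header, so a
streaming writer has to know it up front or patch it afterwards, and git
also needs a matching .idx file, which git-index-pack can generate from
the finished pack.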

What is more important is that I get a working incremental update
functioning. I'm hoping that I can rely on cvsps for that. Cutting
everyone over at once is unlikely to happen.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Diff format in packs
  2006-07-31 21:20   ` Jon Smirl
  2006-07-31 22:32     ` Shawn Pearce
@ 2006-08-01  0:47     ` Martin Langhoff
  2006-08-01  1:03       ` Martin Langhoff
  2006-08-01  1:13       ` Jon Smirl
  1 sibling, 2 replies; 13+ messages in thread
From: Martin Langhoff @ 2006-08-01  0:47 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Jakub Narebski, git

Jon,

just get all the file versions out of the ,v file and into the GIT
repo, then do find .git/objects/ -type f | git-pack-objects. You don't
have to even think of generating the packfile yourself.

On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> I am working on combining cvs2svn, parsecvs and cvsps into something
> that can handle Mozilla CVS.

If you publish your WIP somewhere, I might be able to jump in and help
a bit. I've seen your "challenge" email earlier, but haven't been able
to get started yet -- lots of work on other foss fronts.

cheers,

martin


* Re: Diff format in packs
  2006-08-01  0:47     ` Martin Langhoff
@ 2006-08-01  1:03       ` Martin Langhoff
  2006-08-01  1:13       ` Jon Smirl
  1 sibling, 0 replies; 13+ messages in thread
From: Martin Langhoff @ 2006-08-01  1:03 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Jakub Narebski, git

On 8/1/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> just get all the file versions out of the ,v file and into the GIT
> repo, then do find .git/objects/ -type f | git-pack-objects. You don't
> have to even think of generating the packfile yourself.

Well, that's a bit of a lie, using bare find won't quite work, but if
you look up the "Repacking many disconnected blobs" thread, there's a
good discussion of packing objects that are not related to trees yet.

martin


* Re: Diff format in packs
  2006-08-01  0:47     ` Martin Langhoff
  2006-08-01  1:03       ` Martin Langhoff
@ 2006-08-01  1:13       ` Jon Smirl
  2006-08-01  2:16         ` Martin Langhoff
  1 sibling, 1 reply; 13+ messages in thread
From: Jon Smirl @ 2006-08-01  1:13 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jakub Narebski, git

On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> Jon,
>
> just get all the file versions out of the ,v file and into the GIT
> repo, then do find .git/objects/ -type f | git-pack-objects. You don't
> have to even think of generating the packfile yourself.

Moz CVS expands into over 1M files and 12GB in size. I keep getting
concerned about algorithms that take days to complete and need 4GB to
run.

> On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > I am working on combining cvs2svn, parsecvs and cvsps into something
> > that can handle Mozilla CVS.
>
> If you publish your WIP somewhere, I might be able to jump in and help
> a bit. I've seen your "challenge" email earlier, but haven't been able
> to get started yet -- lots of work on other foss fronts.

I haven't got anything useful yet, I keep switching in and out of
working on this. I am still trying to work out a viable transition
strategy that I can attempt to sell the Mozilla developers on. So far
I don't have one.

The requirements I have so far:
1) the conversion needs to be reproducible so that Moz staff can do it
internally without a lot of hassle. They need to verify that nothing
has been lost.
2) It shouldn't need a monster machine to run it and it can't take
days to finish
3) It has to have incremental support so that it can be run in
parallel with CVS with commits still going to CVS.
4) nothing can be lost out of existing CVS
5) a bonus feature would be a partial repository to avoid the initial
700MB git download.

cvsps meets these requirements except for not losing anything out of
the existing CVS. cvsps is throwing away some items it finds
confusing. So far the only algorithm that appears to successfully
convert all branches into change sets is the algorithm in cvs2svn.
http://cvs2svn.tigris.org/source/browse/cvs2svn/trunk/design-notes.txt?rev=2536&view=markup

I've spent more time looking at parsecvs than cvsps, is it reasonable
to convert cvsps to the algorithm described above? Another strategy
would be to use cvs2svn to build the changeset database and then use
cvsps to simply read the changesets out of it and build the git
repository.

Parsecvs never finishes the conversion; it always hits an error or GPF
after 4-5 hours, probably a wild pointer somewhere.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Diff format in packs
  2006-08-01  1:13       ` Jon Smirl
@ 2006-08-01  2:16         ` Martin Langhoff
  2006-08-01  2:29           ` Jon Smirl
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Langhoff @ 2006-08-01  2:16 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Jakub Narebski, git

On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> > Jon,
> >
> > just get all the file versions out of the ,v file and into the GIT
> > repo, then do find .git/objects/ -type f | git-pack-objects. You don't
> > have to even think of generating the packfile yourself.
>
> Moz CVS expands into over 1M files and 12GB in size. I keep getting
> concerned about algorithms that take days to complete and need 4GB to
> run.

If you run that every 1000 rcs files converted, it will be really
cheap in processing and memory footprint. That's not a concern.

> > On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > > I am working on combining cvs2svn, parsecvs and cvsps into something
> > > that can handle Mozilla CVS.
> >
> > If you publish your WIP somewhere, I might be able to jump in and help
> > a bit. I've seen your "challenge" email earlier, but haven't been able
> > to get started yet -- lots of work on other foss fronts.
>
> I haven't got anything useful yet, I keep switching in and out of
> working on this. I am still trying to work out a viable transition
> strategy that I can attempt to sell the Mozilla developers on. So far
> I don't have one.

I understand that, and it's a shame.

> The requirements I have so far:

Yep to 1..4. I suspect that you can get "there" with a cvs2svn
converted to deal with git, as you are pursuing, and by
dealing with the follow-on imports using git-cvsimport. The only real
limitation there is that new branches opened in that transition period
may be imported with the root in the wrong place.

But for "ongoing" branches, the setup works great. I've done it many
times with parsecvs for the initial import and git-cvsimport for the
subsequent incrementals.

> 5) a bonus feature would be a partial repository to avoid the initial
> 700MB git download.

Agreed. However, I thought I had gotten it to be much slimmer than
that, but I may be wrong. Also, a current Moz checkout via cvs is
massively chatty, so between bandwidth and latency, I think the git
protocol beats cvs for the initial checkout even for Moz.

> I've spent more time looking at parsecvs than cvsps, is it reasonable
> to convert cvsps to the algorithm described above? Another strategy

I don't think cvsps is easily fixable.

> would be to use cvs2svn to build the changeset database and then use
> cvsps to simply read the changesets out of it and build the git
> repository.

Once cvs2svn has the db built, it should be easy to write a
perl/python script that mimics the output of cvsps.

> Parsecvs never finishes the conversion it always hits an error or GPF
> after 4-5 hours, probably a wild pointer somewhere.

Hmmmm. Nag Keith?

martin


* Re: Diff format in packs
  2006-08-01  2:16         ` Martin Langhoff
@ 2006-08-01  2:29           ` Jon Smirl
  2006-08-01  2:36             ` Martin Langhoff
  2006-08-01 10:59             ` Johannes Schindelin
  0 siblings, 2 replies; 13+ messages in thread
From: Jon Smirl @ 2006-08-01  2:29 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jakub Narebski, git

On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> > would be to use cvs2svn to build the changeset database and then use
> > cvsps to simply read the changesets out of it and build the git
> > repository.
>
> Once cvs2svn has the db built, it should be easy to write a
> perl/python script that mimics the output of cvsps.

This is an efficiency problem. My strategy instead is to build the git
objects for the revisions when the CVS file is initially parsed and
then track the sha-1 of the object. Otherwise I end up needing to use
CVS to pull each revision out after the changesets are computed, and
that's a slow process since it involves 1M forks of a cvs process.

People on the Python list have been giving me some pointers on how to
build the revision files efficiently. That's what I am working on
right now.

-- 
Jon Smirl
jonsmirl@gmail.com


* Re: Diff format in packs
  2006-08-01  2:29           ` Jon Smirl
@ 2006-08-01  2:36             ` Martin Langhoff
  2006-08-01 10:59             ` Johannes Schindelin
  1 sibling, 0 replies; 13+ messages in thread
From: Martin Langhoff @ 2006-08-01  2:36 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Jakub Narebski, git

On 8/1/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> > > would be to use cvs2svn to build the changeset database and then use
> > > cvsps to simply read the changesets out of it and build the git
> > > repository.
> >
> > Once cvs2svn has the db built, it should be easy to write a
> > perl/python script that mimics the output of cvsps.
>
> This is an efficiency problem. My strategy instead is to build the git

Agreed when it comes to the initial import. I thought you meant it
for "driving" git-cvsimport when doing incrementals.

> People on the Python list have been giving me some pointers on how to
> build the revision files efficiently. That's what I am working on
> right now.

You can definitely steal that from parsecvs -- it should be trivial to
get parsecvs to do the RCS -> GIT part and store on disk the
path/revision -> SHA1. Yes?
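
One possible shape for that on-disk path/revision -> SHA1 map, sketched
in Python with the standard sqlite3 module. The choice of sqlite3, the
table name revmap and the helper names are all assumptions made for
illustration; neither parsecvs nor cvs2svn is claimed to store it this
way.

```python
import sqlite3


def open_revision_map(path=":memory:"):
    """Open (or create) the map; pass a file name to keep it on disk."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS revmap ("
        "path TEXT, revision TEXT, sha1 TEXT, "
        "PRIMARY KEY (path, revision))"
    )
    return db


def record_many(db, rows):
    """Store (path, revision, sha1) rows in a single transaction."""
    with db:
        db.executemany("INSERT OR REPLACE INTO revmap VALUES (?, ?, ?)", rows)


def lookup(db, path, revision):
    """Return the recorded SHA1 for one revision, or None if unknown."""
    row = db.execute(
        "SELECT sha1 FROM revmap WHERE path = ? AND revision = ?",
        (path, revision),
    ).fetchone()
    return row[0] if row else None
```

Batching the inserts per ,v file keeps the number of transactions small
compared with committing after every single revision.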

cheers,


martin


* Re: Diff format in packs
  2006-08-01  2:29           ` Jon Smirl
  2006-08-01  2:36             ` Martin Langhoff
@ 2006-08-01 10:59             ` Johannes Schindelin
  2006-08-01 12:01               ` Jakub Narebski
  1 sibling, 1 reply; 13+ messages in thread
From: Johannes Schindelin @ 2006-08-01 10:59 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Martin Langhoff, Jakub Narebski, git

Hi Jon,

On Mon, 31 Jul 2006, Jon Smirl wrote:

> On 7/31/06, Martin Langhoff <martin.langhoff@gmail.com> wrote:
> > > would be to use cvs2svn to build the changeset database and then use
> > > cvsps to simply read the changesets out of it and build the git
> > > repository.
> > 
> > Once cvs2svn has the db built, it should be easy to write a
> > perl/python script that mimics the output of cvsps.
> 
> This is an efficiency problem.

In another mail, you asked for a solution for the initial 700MB download. 
How about doing it like Linux-2.6, namely have a clean cut, and for those 
who want, they can load the historical repo, too, and graft it onto the 
current one?

I _think_ that with a new start, incremental git-cvsimport should still 
work, if you do it cleverly. Obviously, it would _not_ have the full 
history, but rather add onto the most recent revisions (incremental 
git-cvsimport detects the revisions to import by author date IIRC).

Note that this method will _not_ work, if there are _new_ branches that 
cvsps has problems with.

Ciao,
Dscho


* Re: Diff format in packs
  2006-08-01 10:59             ` Johannes Schindelin
@ 2006-08-01 12:01               ` Jakub Narebski
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Narebski @ 2006-08-01 12:01 UTC (permalink / raw)
  To: git

Johannes Schindelin wrote:

> In another mail, you asked for a solution for the initial 700MB download. 
> How about doing it like Linux-2.6, namely have a clean cut, and for those 
> who want, they can load the historical repo, too, and graft it onto the 
> current one?

Perhaps that would be an incentive to finally implement "lazy clone"/
"shallow clone" support in git.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
