* Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-16 17:20 UTC
To: Shawn Pearce, git

Shawn put together a new version of his import utility that packs all
of the deltas from a run into a single blob instead of one blob per
delta. The idea is to put 10 or more deltas into each delta entry
instead of one. The index format would map the 10 sha1's to a single
packed delta entry which would be expanded when needed. Note that you
probably needed multiple entries out of the delta pack to generate the
revision you were looking for, so this is no real loss on extraction.

I ran it overnight on mozcvs. If his delta pack code is correct this
is a huge win.

One entry per delta - 845,420,150
Packed deltas       - 295,018,474

65% smaller

The effect of packing the deltas is to totally eliminate many of the
redundant zlib dictionaries. This is without using a zlib preset
dictionary, which gains another 4% from a 4KB dictionary.

--
Jon Smirl
jonsmirl@gmail.com
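A rough standalone illustration of the effect Jon describes: deflating several similar buffers in one zlib stream lets later buffers compress against the text of the earlier ones, while one stream per buffer rebuilds the adaptive dictionary from scratch each time. The "deltas" below are synthetic strings and the sizes are made up; this is not git code.

/* grouped_deflate_demo.c -- compile with: cc grouped_deflate_demo.c -lz */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define NDELTA 10

int main(void)
{
	char delta[NDELTA][256];
	Bytef out[1 << 16];
	Bytef all[NDELTA * 256];
	uLongf outlen;
	uLong total_separate = 0, grouped, pos = 0;
	int i;

	/* ten small, similar "deltas" against the same base object */
	for (i = 0; i < NDELTA; i++)
		snprintf(delta[i], sizeof(delta[i]),
			 "copy 0-4096 from base; insert \"revision %d tweaked one line\"\n", i);

	/* one zlib stream per delta: the shared structure is paid for ten times */
	for (i = 0; i < NDELTA; i++) {
		outlen = sizeof(out);
		compress2(out, &outlen, (const Bytef *)delta[i],
			  (uLong)strlen(delta[i]), Z_DEFAULT_COMPRESSION);
		total_separate += outlen;
	}

	/* all ten deltas in one stream: deltas 2..10 compress against
	 * the deltas already seen, which is the win being measured */
	for (i = 0; i < NDELTA; i++) {
		memcpy(all + pos, delta[i], strlen(delta[i]));
		pos += strlen(delta[i]);
	}
	outlen = sizeof(out);
	compress2(out, &outlen, all, pos, Z_DEFAULT_COMPRESSION);
	grouped = outlen;

	printf("separate streams: %lu bytes compressed\n", (unsigned long)total_separate);
	printf("single stream:    %lu bytes compressed\n", (unsigned long)grouped);
	return 0;
}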
* Re: Huge win, compressing a window of delta runs as a unit
From: Shawn Pearce
Date: 2006-08-17 4:07 UTC
To: Jon Smirl; Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> Shawn put together a new version of his import utility that packs all
> of the deltas from a run into a single blob instead of one blob per
> delta. The idea is to put 10 or more deltas into each delta entry
> instead of one. The index format would map the 10 sha1's to a single
> packed delta entry which would be expanded when needed. Note that you
> probably needed multiple entries out of the delta pack to generate the
> revision you were looking for, so this is no real loss on extraction.
>
> I ran it overnight on mozcvs. If his delta pack code is correct this
> is a huge win.
>
> One entry per delta - 845,420,150
> Packed deltas       - 295,018,474
>
> 65% smaller
>
> The effect of packing the deltas is to totally eliminate many of the
> redundant zlib dictionaries.

I'm going to try to integrate this into core GIT this weekend.
My current idea is to make use of the OBJ_EXT type flag to add
an extended header field behind the length which describes the
"chunk" as being a delta chain compressed in one zlib stream.
I'm not overly concerned about saving lots of space in the header
here as it looks like we're winning a huge amount of pack space,
so the extended header will probably itself be a couple of bytes.
This keeps the shorter reserved types free for other great ideas. :)

My primary goal of integrating it into core GIT is to take advantage
of verify-pack to check the file fast-import is producing. Plus
having support for it in sha1_file.c will make it easier to
performance test the common access routines that need to be fast,
like commit and tree walking.

My secondary goal is to get a patchset which other folks can try on
their own workloads to see if it's as effective as what Jon is seeing
on the Mozilla archive.

Unfortunately I can't think of a way to make this type of pack
readable by older software. So this could be creating a pretty
big change in the pack format, relatively speaking. :)

--
Shawn.
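Shawn does not spell out the byte layout of this extended header, so the sketch below is one plausible encoding only: the OBJ_EXT value, the EXT_DELTA_CHAIN code, and the one-byte delta count are all assumptions. It reuses the type-in-high-nibble varint scheme that git's encode_header() (quoted later in this thread) already uses.

#include <stddef.h>

#define OBJ_EXT         5   /* placeholder; the real flag's value isn't given here */
#define EXT_DELTA_CHAIN 1   /* hypothetical extension code */

static size_t encode_varint_header(unsigned type, unsigned long size,
				   unsigned char *hdr)
{
	size_t n = 1;
	unsigned char c = (type << 4) | (size & 15);

	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
		n++;
	}
	*hdr = c;
	return n;
}

/* "OBJ_EXT <inflated size> EXT_DELTA_CHAIN <delta count>": a couple of
 * extra bytes per chain, which is roughly what Shawn estimates. */
static size_t encode_chain_header(unsigned long inflated_size,
				  unsigned ndeltas, unsigned char *hdr)
{
	size_t n = encode_varint_header(OBJ_EXT, inflated_size, hdr);
	hdr[n++] = EXT_DELTA_CHAIN;
	hdr[n++] = ndeltas & 0xff;	/* chains are short; one byte is enough here */
	return n;
}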
* Re: Huge win, compressing a window of delta runs as a unit
From: Johannes Schindelin
Date: 2006-08-17 7:56 UTC
To: Shawn Pearce; Cc: Jon Smirl, git

Hi,

On Thu, 17 Aug 2006, Shawn Pearce wrote:

> Jon Smirl <jonsmirl@gmail.com> wrote:
> >
> > One entry per delta - 845,420,150
> > Packed deltas       - 295,018,474
> >
> > 65% smaller

Impressed.

> Unfortunately I can't think of a way to make this type of pack
> readable by older software. So this could be creating a pretty
> big change in the pack format, relatively speaking. :)

Not necessarily: the format could stay the same, where multiple object
names point to the same entry. It should not be a problem, as long as
this change is kept local: the same git tools which created this kind
of pack have to read it.

However, I see some real problems when a public repository has a pack
like this: if the client is older, the server has to do some real
work, especially in the case of a fresh clone. It gets even worse with
a public HTTP repo.

The obvious short-term solution would be to make this opt-in, and use
it only on private repositories.

Then, one could think about how to handle the delta-chains in the pack
protocol (it should be easy, since you just have to store one object
instead of a few), and make this a "supports" option.

Ciao,
Dscho
* Re: Huge win, compressing a window of delta runs as a unit
From: Johannes Schindelin
Date: 2006-08-17 8:07 UTC
To: Shawn Pearce; Cc: Jon Smirl, git

Hi,

On Thu, 17 Aug 2006, Johannes Schindelin wrote:

> However, I see some real problems when a public repository has a pack
> like this: if the client is older, the server has to do some real
> work, especially in the case of a fresh clone. It gets even worse with
> a public HTTP repo.
>
> The obvious short-term solution would be to make this opt-in, and use
> it only on private repositories.
>
> Then, one could think about how to handle the delta-chains in the pack
> protocol (it should be easy, since you just have to store one object
> instead of a few), and make this a "supports" option.

Oh, and I completely forgot: IIUC you have to unpack the (possibly big)
delta-chain until you have the delta you really want, right?

So, you might want to have a clever mechanism to keep the delta-chains
short for recent objects (i.e. objects you are likely to need more
often). At least, the delta-chains should be limited by size (_not_ by
number of deltas: you can have huge deltas, and if you have to unpack 5
huge deltas before getting to the huge delta you really need, it takes
a really long time).

Ciao,
Dscho
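A minimal sketch of the policy Dscho suggests here: close the current group once the cumulative inflated delta size crosses a byte budget, instead of after a fixed number of deltas. MAX_GROUP_BYTES and struct delta_info are invented for illustration; git has no such structures.

#include <stddef.h>

#define MAX_GROUP_BYTES (64 * 1024)	/* arbitrary budget */

struct delta_info {
	unsigned long inflated_size;	/* size of this delta once inflated */
};

/* Return how many leading deltas fit in one group under the byte budget,
 * always admitting at least one so progress is made. */
static size_t group_cutoff(const struct delta_info *d, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		total += d[i].inflated_size;
		if (i && total > MAX_GROUP_BYTES)
			break;
	}
	return i;
}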
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 14:36 UTC
To: Johannes Schindelin; Cc: Shawn Pearce, git

On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> At least, the delta-chains should be limited by size (_not_ by number of
> deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> before getting to the huge delta you really need, it takes a really long
> time).

This is not an obvious conclusion. Reducing our pack from 845MB to
292MB eliminates a huge amount of IO. All of the work I am doing with
this pack has been totally IO bound. The CPU time to get from one
delta to another is minor compared to the IO time increase if you
break the chains up. Each time you break the chain you have to store a
full copy of the file; the IO increase from doing this outweighs the
CPU time reduction.

I would use a two-pack model. One pack holds the 292MB archive and the
second pack is made from recent changes. They can both be the same
format since the chains in the second pack won't be very long. We
might want to put a flag into an archive pack that tells git-repack
not to touch it by default.

As for public servers there is an immediate win in shifting to the new
pack format. Three hour downloads vs one hour, plus the bandwidth
bills. Are the tools smart enough to say this is a newer pack format,
upgrade? It takes far less than two hours to upgrade your git install.

--
Jon Smirl
jonsmirl@gmail.com
* Re: Huge win, compressing a window of delta runs as a unit
From: Johannes Schindelin
Date: 2006-08-17 15:45 UTC
To: Jon Smirl; Cc: Shawn Pearce, git

Hi,

On Thu, 17 Aug 2006, Jon Smirl wrote:

> On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > At least, the delta-chains should be limited by size (_not_ by number of
> > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > before getting to the huge delta you really need, it takes a really long
> > time).
>
> This is not an obvious conclusion.

The big win is bought by having _one_ zlib stream for multiple deltas,
right?

So, every time you want to access the _last_ delta in the chain, you
unpack _all_ of them. This quite obvious conclusion is obviously your
reason to propose two packs, one for the archive of "old" objects, and
one for the "new" objects.

> As for public servers there is an immediate win in shifting to the new
> pack format. Three hour downloads vs one hour, plus the bandwidth
> bills. Are the tools smart enough to say this is a newer pack format,
> upgrade? It takes far less than two hours to upgrade your git install.

Have you thought about a non-upgraded client? The server has to repack
in that case, and if the client wants a clone, you might not even have
the time to kiss your server good-bye before it goes belly up.

There is an obvious choice here as to how fast people would upgrade
their servers.

Ciao,
Dscho
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 16:33 UTC
To: Johannes Schindelin; Cc: Jon Smirl, Shawn Pearce, git

On Thu, 17 Aug 2006, Johannes Schindelin wrote:

> Hi,
>
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > At least, the delta-chains should be limited by size (_not_ by number of
> > > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > > before getting to the huge delta you really need, it takes a really long
> > > time).
> >
> > This is not an obvious conclusion.
>
> The big win is bought by having _one_ zlib stream for multiple deltas,
> right?
>
> So, every time you want to access the _last_ delta in the chain, you
> unpack _all_ of them.

This is the case whether deltas are separately deflated or not.

> This quite obvious conclusion is obviously your
> reason to propose two packs, one for the archive of "old" objects, and
> one for the "new" objects.

Old objects are usually further down the delta chain and also stored
further from the beginning of the pack. Hence "new" objects could
still have quick access: even if a delta chain is all in the same zlib
stream, inflating the whole of the stream to get at "new" objects will
likely not be necessary, just as today only the needed deltas are
inflated.

> > As for public servers there is an immediate win in shifting to the new
> > pack format. Three hour downloads vs one hour, plus the bandwidth
> > bills. Are the tools smart enough to say this is a newer pack format,
> > upgrade? It takes far less than two hours to upgrade your git install.
>
> Have you thought about a non-upgraded client? The server has to repack
> in that case, and if the client wants a clone, you might not even have
> the time to kiss your server good-bye before it goes belly up.

Nah. The only overhead for a server to feed an "old" pack format from a
"new" pack file is to inflate and deflate some data. That is not _that_
costly.


Nicolas
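A small, self-contained illustration of Nicolas's point that an object near the front of a grouped stream can be read without inflating the rest. The "group" here is just three dummy payloads compressed with compress(); the offsets, sizes and strings are invented for the demo, and none of this is git code. Compile with -lz.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
	const char *objs[3] = { "first object ", "second object ", "third object " };
	Bytef plain[256], packed[256], out[256];
	uLongf packedlen = sizeof(packed);
	uLong plainlen = 0, want_off, want_len;
	z_stream zs;
	int i;

	for (i = 0; i < 3; i++) {
		memcpy(plain + plainlen, objs[i], strlen(objs[i]));
		plainlen += strlen(objs[i]);
	}
	compress(packed, &packedlen, plain, plainlen);

	/* want the second object: skip the first, read the second, never
	 * touch the part of the stream holding the third */
	want_off = strlen(objs[0]);
	want_len = strlen(objs[1]);

	memset(&zs, 0, sizeof(zs));
	inflateInit(&zs);
	zs.next_in = packed;
	zs.avail_in = packedlen;
	zs.next_out = out;
	zs.avail_out = want_off + want_len;	/* inflate no further than needed */
	while (zs.total_out < want_off + want_len && inflate(&zs, Z_SYNC_FLUSH) == Z_OK)
		;
	inflateEnd(&zs);

	printf("wanted: %.*s\n", (int)want_len, (const char *)(out + want_off));
	printf("inflated %lu of %lu bytes in the group\n",
	       (unsigned long)(want_off + want_len), (unsigned long)plainlen);
	return 0;
}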
* Re: Huge win, compressing a window of delta runs as a unit
From: Johannes Schindelin
Date: 2006-08-17 17:05 UTC
To: Nicolas Pitre; Cc: Jon Smirl, Shawn Pearce, git

Hi,

On Thu, 17 Aug 2006, Nicolas Pitre wrote:

> On Thu, 17 Aug 2006, Johannes Schindelin wrote:
>
> > On Thu, 17 Aug 2006, Jon Smirl wrote:
> >
> > > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > > At least, the delta-chains should be limited by size (_not_ by number
> > > > of deltas: you can have huge deltas, and if you have to unpack 5 huge
> > > > deltas before getting to the huge delta you really need, it takes a
> > > > really long time).
> > >
> > > This is not an obvious conclusion.
> >
> > The big win is bought by having _one_ zlib stream for multiple deltas,
> > right?
> >
> > So, every time you want to access the _last_ delta in the chain, you
> > unpack _all_ of them.
>
> This is the case whether deltas are separately deflated or not.

Oh, now I get it! The delta chain is comprised of exactly those deltas
which are needed to reconstruct a certain object from another object
which was stored undeltified in the pack. Now it makes sense to me that
they could be bundled before being deflated.

In the pack-index, the objects of that delta-chain could just point to
the delta-chain object, which has a mini-index in its extended header.

Or did I misunderstand again?

Ciao,
Dscho
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 17:22 UTC
To: Nicolas Pitre; Cc: Johannes Schindelin, Shawn Pearce, git

On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> Nah. The only overhead for a server to feed an "old" pack format from a
> "new" pack file is to inflate and deflate some data. That is not _that_
> costly.

It is costly because the Mozilla pack is going to inflate from 295MB
to 845MB. It will take you much longer to download an extra 500MB than
upgrading your client would take.

--
Jon Smirl
jonsmirl@gmail.com
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 18:15 UTC
To: Jon Smirl; Cc: Johannes Schindelin, Shawn Pearce, git

On Thu, 17 Aug 2006, Jon Smirl wrote:

> On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> > Nah. The only overhead for a server to feed an "old" pack format from a
> > "new" pack file is to inflate and deflate some data. That is not _that_
> > costly.
>
> It is costly because the Mozilla pack is going to inflate from 295MB
> to 845MB. It will take you much longer to download an extra 500MB than
> upgrading your client would take.

I didn't say it isn't costly. I said that it isn't _that_ costly to the
point of bringing down a server.


Nicolas
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 17:17 UTC
To: Johannes Schindelin; Cc: Shawn Pearce, git

On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > At least, the delta-chains should be limited by size (_not_ by number of
> > > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > > before getting to the huge delta you really need, it takes a really long
> > > time).
> >
> > This is not an obvious conclusion.
>
> The big win is bought by having _one_ zlib stream for multiple deltas,
> right?
>
> So, every time you want to access the _last_ delta in the chain, you
> unpack _all_ of them. This quite obvious conclusion is obviously your
> reason to propose two packs, one for the archive of "old" objects, and
> one for the "new" objects.

Do some measurements: the IO vs CPU time trade-off is way in favor of
eliminating the IO. It really doesn't take very much CPU to unpack the
delta chain.

The two-pack proposal was not about reducing the delta chain length;
they are reverse deltas, so the newest version is always at the front.
Two packs are used to avoid repacking the 280MB pack when you do a
repack command. It takes 2-3 hours to repack 280MB. Even if you just
copy the old pack to the new one it takes 30 minutes to do all of the
IO.

> > As for public servers there is an immediate win in shifting to the new
> > pack format. Three hour downloads vs one hour, plus the bandwidth
> > bills. Are the tools smart enough to say this is a newer pack format,
> > upgrade? It takes far less than two hours to upgrade your git install.
>
> Have you thought about a non-upgraded client? The server has to repack
> in that case, and if the client wants a clone, you might not even have
> the time to kiss your server good-bye before it goes belly up.

Is there a pack format version number built into the server protocol?
If not, there needs to be one. When there is a mismatch with the server
pack version the client should simply display an error requesting an
upgrade of the client software.

Git should be designed for forward evolution, not infinite backward
compatibility. It is easy to upgrade your client to support the new
protocol. The protocol just needs to ensure that the client reliably
gets an error about the need to upgrade.

Forward evolution implies that a client is able to work with older
servers, but not the inverse, that new servers have to work with old
clients.

> There is an obvious choice here as to how fast people would upgrade
> their servers.
>
> Ciao,
> Dscho

--
Jon Smirl
jonsmirl@gmail.com
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 17:32 UTC
To: Jon Smirl; Cc: Johannes Schindelin, Shawn Pearce, git

On Thu, 17 Aug 2006, Jon Smirl wrote:

> Is there a pack format version number built into the server protocol?
> If not, there needs to be one. When there is a mismatch with the server
> pack version the client should simply display an error requesting an
> upgrade of the client software.

There is a pack version number. Reception of an unknown pack already
produces an error:

	if (!pack_version_ok(hdr->hdr_version))
		die("unknown pack file version %d", ntohl(hdr->hdr_version));

That will occur really early in a pull or clone when using the native
protocol.

> Git should be designed for forward evolution, not infinite backward
> compatibility. It is easy to upgrade your client to support the new
> protocol. The protocol just needs to ensure that the client reliably
> gets an error about the need to upgrade.
>
> Forward evolution implies that a client is able to work with older
> servers, but not the inverse, that new servers have to work with old
> clients.

And still, if we really wanted, the server could have the ability to
stream an old pack format by simply inflating the grouped objects and
deflating them individually all on the fly, which is not necessarily
that costly. But it might still be a good idea to simply have the
ability to generate both formats and make the grouped deltas the
default only after a while.


Nicolas
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 18:06 UTC
To: Nicolas Pitre; Cc: Johannes Schindelin, Shawn Pearce, git

On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > Is there a pack format version number built into the server protocol?
> > If not, there needs to be one. When there is a mismatch with the server
> > pack version the client should simply display an error requesting an
> > upgrade of the client software.
>
> There is a pack version number. Reception of an unknown pack already
> produces an error:
>
>	if (!pack_version_ok(hdr->hdr_version))
>		die("unknown pack file version %d", ntohl(hdr->hdr_version));
>
> That will occur really early in a pull or clone when using the native
> protocol.

Can this be fixed so that if the pack version coming from the server is
greater than the one handled by the client, the client gets a nice hint
about upgrading?

--
Jon Smirl
jonsmirl@gmail.com
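One possible shape for the hint Jon is asking about. This is a sketch, not a patch against git: die() below is a local stand-in, PACK_VERSION_SUPPORTED is a placeholder for whatever the client's newest supported pack version happens to be, and the "2 or 3" check only mirrors the pack_version_ok() test quoted above.

#include <arpa/inet.h>	/* ntohl, htonl */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define PACK_VERSION_SUPPORTED 3u	/* placeholder for the client's maximum */

static void die(const char *fmt, ...)
{
	va_list ap;
	va_start(ap, fmt);
	fprintf(stderr, "fatal: ");
	vfprintf(stderr, fmt, ap);
	fprintf(stderr, "\n");
	va_end(ap);
	exit(128);
}

/* hdr_version_be is the big-endian version word from the pack header */
static void check_pack_version(unsigned int hdr_version_be)
{
	unsigned int version = ntohl(hdr_version_be);

	if (version > PACK_VERSION_SUPPORTED)
		die("pack version %u is newer than this client supports (max %u); "
		    "please upgrade your git installation",
		    version, PACK_VERSION_SUPPORTED);
	/* anything else falls back to the existing message */
	if (version != 2 && version != 3)
		die("unknown pack file version %u", version);
}

int main(void)
{
	check_pack_version(htonl(4));	/* demo: triggers the upgrade hint */
	return 0;
}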
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 17:22 UTC
To: Shawn Pearce; Cc: Jon Smirl, git

On Thu, 17 Aug 2006, Shawn Pearce wrote:

> I'm going to try to integrate this into core GIT this weekend.
> My current idea is to make use of the OBJ_EXT type flag to add
> an extended header field behind the length which describes the
> "chunk" as being a delta chain compressed in one zlib stream.
> I'm not overly concerned about saving lots of space in the header
> here as it looks like we're winning a huge amount of pack space,
> so the extended header will probably itself be a couple of bytes.
> This keeps the shorter reserved types free for other great ideas. :)

We're striving for optimal data storage here, so don't be afraid to use
one of the available types for an "object stream" object. Because when
you think of it, the deflating of multiple objects into a single zlib
stream can be applied to all object types, not only deltas. If ever
deflating many blobs into one zlib stream is deemed worth it then the
encoding will already be ready for it. Also you can leverage existing
code to write headers, etc.

I'd suggest you use OBJ_GROUP = 0 as a new primary object type. The
"size" field in the header could then become the number of objects that
are included in the group. Most of the time that will fit in the low 4
bits of the first header byte, but if there are more than 15 grouped
objects then more bits can be used on the following byte. Anyway, so
far all the code to generate and parse that is already there. If ever
there is a need for more extensions, that could be prefixed with a pure
zero byte (an object group with a zero object count, which is
distinguishable from a real group).

Then, having the number of grouped objects, you just have to list the
usual headers for those objects, which are their type and inflated size
just like regular object headers, including the base sha1 for deltas.
Again, you already have code to produce and parse those. And finally
just append the objects' payload in a single deflated stream.

This way the reading of an object from a group can be optimized if the
object data is located at the beginning of the stream, such that you
only need to inflate the amount of bytes leading to the desired data
(possibly caching those for further delta replaying), inflate the
needed data for the desired object, and then ignore the remainder of
the stream.


Nicolas
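A sketch of the on-disk layout Nicolas describes. This proposal was never adopted by git, so OBJ_GROUP, struct grouped_object and write_group_headers() are illustrative only; the varint header encoding mirrors encode_header() from pack-objects (type in bits 4-6 of the first byte, size continued 7 bits at a time).

#include <stddef.h>
#include <string.h>

#define OBJ_GROUP 0	/* proposed new primary type */
#define OBJ_DELTA 7	/* existing delta type, used for the member headers */

struct grouped_object {
	unsigned type;			/* OBJ_DELTA for delta chains */
	unsigned long inflated_size;	/* size of this member once inflated */
	unsigned char base_sha1[20];	/* base object, for deltas */
};

static size_t encode_header(unsigned type, unsigned long size, unsigned char *hdr)
{
	size_t n = 1;
	unsigned char c = (type << 4) | (size & 15);

	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
		n++;
	}
	*hdr = c;
	return n;
}

/* Emit "OBJ_GROUP <count>", then one regular-looking header per member
 * (type, inflated size, base sha1 for deltas).  The single deflated
 * payload stream would follow this in the pack; deflating it is omitted. */
static size_t write_group_headers(const struct grouped_object *obj, size_t count,
				  unsigned char *out)
{
	size_t n = encode_header(OBJ_GROUP, count, out);
	size_t i;

	for (i = 0; i < count; i++) {
		n += encode_header(obj[i].type, obj[i].inflated_size, out + n);
		if (obj[i].type == OBJ_DELTA) {
			memcpy(out + n, obj[i].base_sha1, 20);
			n += 20;
		}
	}
	return n;
}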
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 18:03 UTC
To: Nicolas Pitre; Cc: Shawn Pearce, git

On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> We're striving for optimal data storage here, so don't be afraid to use
> one of the available types for an "object stream" object. Because when
> you think of it, the deflating of multiple objects into a single zlib
> stream can be applied to all object types, not only deltas. If ever
> deflating many blobs into one zlib stream is deemed worth it then the
> encoding will already be ready for it. Also you can leverage existing
> code to write headers, etc.

Here are two more cases that need to be accounted for in the packs.

1) If you zip something and it gets bigger. We need an entry that just
stores the object without it being zipped. Zipping jpegs or mpegs will
likely make them significantly bigger. Or does zlib already detect this
case and do a null compression?

2) If you delta something and the delta is bigger than the object being
deltified. The delta code should detect this and store the full object
instead of the delta. Again, jpegs and mpegs will trigger this. You may
even want to say that the delta has to be smaller than 80% of the full
object.

Shawn is planning on looking into true dictionary-based compression.
That will generate even more types inside of a pack. If dictionary
compression works out, full-text search can be added with a little more
overhead. True dictionary-based compression has the potential to go
even smaller than the current 280MB. The optimal way to do this is for
each pack to contain its own dictionary.

--
Jon Smirl
jonsmirl@gmail.com
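A sketch of the two guards Jon describes; git's real heuristics live in try_delta(), as the next reply points out, so the names and the 80% cutoff below are Jon's suggestion spelled out for illustration rather than git's actual code.

#include <stddef.h>

enum storage_choice { STORE_PLAIN, STORE_DEFLATED, STORE_DELTA };

/* 1) deflate only helped if the output actually shrank */
static enum storage_choice pick_deflate(size_t plain_len, size_t deflated_len)
{
	return deflated_len < plain_len ? STORE_DEFLATED : STORE_PLAIN;
}

/* 2) keep a delta only if it is under 80% of the full object,
 *    otherwise store the object whole (Jon's suggested cutoff) */
static int delta_is_worth_it(size_t plain_len, size_t delta_len)
{
	return delta_len * 5 < plain_len * 4;
}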
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 18:24 UTC
To: Jon Smirl; Cc: Shawn Pearce, git

On Thu, 17 Aug 2006, Jon Smirl wrote:

> On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> > We're striving for optimal data storage here, so don't be afraid to use
> > one of the available types for an "object stream" object. Because when
> > you think of it, the deflating of multiple objects into a single zlib
> > stream can be applied to all object types, not only deltas. If ever
> > deflating many blobs into one zlib stream is deemed worth it then the
> > encoding will already be ready for it. Also you can leverage existing
> > code to write headers, etc.
>
> Here are two more cases that need to be accounted for in the packs.
>
> 1) If you zip something and it gets bigger. We need an entry that just
> stores the object without it being zipped. Zipping jpegs or mpegs will
> likely make them significantly bigger. Or does zlib already detect this
> case and do a null compression?

zlib appears to do the right thing here.

> 2) If you delta something and the delta is bigger than the object being
> deltified. The delta code should detect this and store the full object
> instead of the delta. Again, jpegs and mpegs will trigger this. You may
> even want to say that the delta has to be smaller than 80% of the full
> object.

I'd invite you to read the try_delta() code in [builtin-]pack-objects.c.
You might find that we're already way more aggressive than that, with
rather simple heuristics that have been tweaked and tuned for over a
year now. The commit logs for that file (before the rename to
builtin-pack-objects.c) are also quite enlightening.


Nicolas
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-18 4:03 UTC
To: Jon Smirl; Cc: Shawn Pearce, git

On Wed, 16 Aug 2006, Jon Smirl wrote:

> Shawn put together a new version of his import utility that packs all
> of the deltas from a run into a single blob instead of one blob per
> delta. The idea is to put 10 or more deltas into each delta entry
> instead of one. The index format would map the 10 sha1's to a single
> packed delta entry which would be expanded when needed. Note that you
> probably needed multiple entries out of the delta pack to generate the
> revision you were looking for, so this is no real loss on extraction.
>
> I ran it overnight on mozcvs. If his delta pack code is correct this
> is a huge win.
>
> One entry per delta - 845,420,150
> Packed deltas       - 295,018,474
>
> 65% smaller

Well, I have to state that I don't believe those results to be right.
Not that I question the results you obtained, but rather that Shawn's
pack generation is most probably broken.

Since I was highly suspicious looking at the above size difference (it
is just too good to be true), I did a quick hack to pack-objects.c to
produce packs with all deltas depending on the same root object
deflated into the same zlib stream. In effect I implemented exactly
what Shawn claims to have done, but directly in pack-objects.c so
git-repack could be used with existing repositories like the Linux
kernel in order to validate the concept.

Results:

One entry per delta        - 122,103,455
Packed (or grouped) deltas - 108,479,014

Only 11% smaller.

I was really enthusiastic at first when I saw the result you obtained,
but I no longer believe it can be right.

Furthermore, such a pack organization has many disadvantages: it
prevents trivial delta data reuse for incremental push/pull operations,
which is a huge performance penalty (bad for servers), as well as
making random object extraction from such a pack somewhat more complex.
Those disadvantages greatly outweigh the 11% size saving IMHO. A better
way to get such a size saving is to increase the window and depth
parameters. For example, a window of 20 and depth of 20 can usually
provide a pack size saving greater than 11% with none of the
disadvantages mentioned above.

I therefore don't think this idea should be pursued any further.

For completeness you can find my changes to pack-objects below. Beware
that this produces packs that cannot be read back, so back up your
repository first before playing with this.
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 448461b..bfae892 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -227,7 +227,7 @@ static int encode_header(enum object_typ
 	int n = 1;
 	unsigned char c;
 
-	if (type < OBJ_COMMIT || type > OBJ_DELTA)
+	if (type < 0 || type > OBJ_DELTA)
 		die("bad type %d", type);
 
 	c = (type << 4) | (size & 15);
@@ -332,6 +332,8 @@ static unsigned long write_object(struct
 	return hdrlen + datalen;
 }
 
+#if 0
+
 static unsigned long write_one(struct sha1file *f,
 			       struct object_entry *e,
 			       unsigned long offset)
@@ -349,6 +351,107 @@ static unsigned long write_one(struct sh
 	return offset;
 }
 
+#else
+
+static unsigned count_delta_childs(struct object_entry *e, unsigned long o)
+{
+	unsigned count = 0;
+	struct object_entry *child = e->delta_child;
+	while (child) {
+		count += 1 + count_delta_childs(child, o);
+		if (child->offset)
+			die("object already seen???");
+		child->offset = o;
+		child = child->delta_sibling;
+	}
+	return count;
+}
+
+static unsigned long write_delta_headers(struct sha1file *f, struct object_entry *e, unsigned long *size)
+{
+	unsigned long offset = 0;
+	struct object_entry *child = e->delta_child;
+	while (child) {
+		unsigned char header[10];
+		unsigned hdrlen;
+		hdrlen = encode_header(OBJ_DELTA, child->delta_size, header);
+		*size += child->delta_size;
+		sha1write(f, header, hdrlen);
+		sha1write(f, child->delta, 20);
+		offset += hdrlen + 20;
+		offset += write_delta_headers(f, child, size);
+		child = child->delta_sibling;
+	}
+	return offset;
+}
+
+static void * concat_delta_data(struct object_entry *e, void *buf)
+{
+	struct object_entry *child = e->delta_child;
+	while (child) {
+		char type[10];
+		unsigned long size;
+		void *data = read_sha1_file(child->sha1, type, &size);
+		if (!data)
+			die("unable to read %s", sha1_to_hex(child->sha1));
+		data = delta_against(data, size, child);
+		memcpy(buf, data, child->delta_size);
+		buf += child->delta_size;
+		free(data);
+		written++;
+		written_delta++;
+		buf = concat_delta_data(child, buf);
+		child = child->delta_sibling;
+	}
+	return buf;
+}
+
+static unsigned long write_one(struct sha1file *f,
+			       struct object_entry *e,
+			       unsigned long offset)
+{
+	unsigned char header[10];
+	unsigned count, hdrlen;
+	unsigned long size;
+	void *buf, *p;
+
+	if (e->offset)
+		/* offset starts from header size and cannot be zero
+		 * if it is written already.
+		 */
+		return offset;
+	if (!e->delta) {
+		e->offset = offset;
+		return offset + write_object(f, e);
+	}
+
+	/* find delta root object */
+	while (e->delta)
+		e = e->delta;
+
+	/* count how many deltas depend on it */
+	count = count_delta_childs(e, offset);
+
+	/* write header for object group */
+	hdrlen = encode_header(0, count, header);
+	sha1write(f, header, hdrlen);
+	offset += hdrlen;
+
+	/* write each object's header and find total data size */
+	size = 0;
+	offset += write_delta_headers(f, e, &size);
+
+	/* write concatenated object data buffer */
+	buf = xmalloc(size);
+	p = concat_delta_data(e, buf);
+	offset += sha1write_compressed(f, buf, p-buf);
+	free(buf);
+
+	return offset;
+}
+
+#endif
+
 static void write_pack_file(void)
 {
 	int i;


Nicolas
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 4:03 ` Nicolas Pitre @ 2006-08-18 12:53 ` Jon Smirl 2006-08-18 16:30 ` Nicolas Pitre 2006-08-18 13:15 ` Jon Smirl 1 sibling, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-18 12:53 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn Pearce, git [-- Attachment #1: Type: text/plain, Size: 370 bytes --] I attached Shawn's code. He is gone until Monday and can't defend it. Do note that I am running this on the Mozilla CVS which is over 10 years old. Some of the files have over 2,000 deltas. I average 10 deltas per file but the distribution is not at all even. Many files get checked-in and never changed, for example 1000's of images. -- Jon Smirl jonsmirl@gmail.com [-- Attachment #2: fast-import3.c --] [-- Type: text/x-csrc, Size: 31535 bytes --] /* Format of STDIN stream: stream ::= cmd*; cmd ::= new_blob | new_branch | new_commit | new_tag ; new_blob ::= 'blob' lf mark? file_content; file_content ::= data; new_branch ::= 'branch' sp ref_str lf ('from' sp (ref_str | hexsha1 | sha1exp_str | idnum) lf)? lf; new_commit ::= 'commit' sp ref_str lf mark? ('author' sp name '<' email '>' ts tz lf)? 'committer' sp name '<' email '>' ts tz lf commit_msg file_change* lf; commit_msg ::= data; file_change ::= 'M' sp mode sp (hexsha1 | idnum) sp path_str lf | 'D' sp path_str lf ; mode ::= '644' | '755'; new_tag ::= 'tag' sp tag_str lf 'from' sp (ref_str | hexsha1 | sha1exp_str | idnum) lf 'tagger' sp name '<' email '>' ts tz lf tag_msg; tag_msg ::= data; # note: the first idnum in a stream should be 1 and subsequent # idnums should not have gaps between values as this will cause # the stream parser to reserve space for the gapped values. An # idnum can be updated in the future to a new object by issuing # a new mark directive with the old idnum. # mark ::= 'mark' sp idnum lf; # note: declen indicates the length of binary_data in bytes. # declen does not include the lf preceeding or trailing the # binary data. # data ::= 'data' sp declen lf binary_data lf; # note: quoted strings are C-style quoting supporting \c for # common escapes of 'c' (e..g \n, \t, \\, \") or \nnn where nnn # is the signed byte value in octal. Note that the only # characters which must actually be escaped to protect the # stream formatting is: \, " and LF. Otherwise these values # are UTF8. # ref_str ::= ref | '"' quoted(ref) '"' ; sha1exp_str ::= sha1exp | '"' quoted(sha1exp) '"' ; tag_str ::= tag | '"' quoted(tag) '"' ; path_str ::= path | '"' quoted(path) '"' ; declen ::= # unsigned 32 bit value, ascii base10 notation; binary_data ::= # file content, not interpreted; sp ::= # ASCII space character; lf ::= # ASCII newline (LF) character; # note: a colon (':') must precede the numerical value assigned to # an idnum. This is to distinguish it from a ref or tag name as # GIT does not permit ':' in ref or tag strings. # idnum ::= ':' declen; path ::= # GIT style file path, e.g. "a/b/c"; ref ::= # GIT ref name, e.g. "refs/heads/MOZ_GECKO_EXPERIMENT"; tag ::= # GIT tag name, e.g. "FIREFOX_1_5"; sha1exp ::= # Any valid GIT SHA1 expression; hexsha1 ::= # SHA1 in hexadecimal format; # note: name and email are UTF8 strings, however name must not # contain '<' or lf and email must not contain any of the # following: '<', '>', lf. 
# name ::= # valid GIT author/committer name; email ::= # valid GIT author/committer email; ts ::= # time since the epoch in seconds, ascii base10 notation; tz ::= # GIT style timezone; */ #include "builtin.h" #include "cache.h" #include "object.h" #include "blob.h" #include "tree.h" #include "delta.h" #include "pack.h" #include "refs.h" #include "csum-file.h" #include "strbuf.h" #include "quote.h" struct object_entry { struct object_entry *next; enum object_type type; unsigned long offset; unsigned char sha1[20]; }; struct object_entry_pool { struct object_entry_pool *next_pool; struct object_entry *next_free; struct object_entry *end; struct object_entry entries[FLEX_ARRAY]; /* more */ }; struct last_object { void *data; unsigned int len; unsigned int depth; unsigned char sha1[20]; }; struct mem_pool { struct mem_pool *next_pool; char *next_free; char *end; char space[FLEX_ARRAY]; /* more */ }; struct atom_str { struct atom_str *next_atom; int str_len; char str_dat[FLEX_ARRAY]; /* more */ }; struct tree_content; struct tree_entry { struct tree_content *tree; struct atom_str* name; unsigned int mode; unsigned char sha1[20]; }; struct tree_content { unsigned int entry_capacity; /* must match avail_tree_content */ unsigned int entry_count; struct tree_entry *entries[FLEX_ARRAY]; /* more */ }; struct avail_tree_content { unsigned int entry_capacity; /* must match tree_content */ struct avail_tree_content *next_avail; }; struct branch { struct branch *table_next_branch; struct branch *active_next_branch; const char *name; unsigned long last_commit; struct tree_entry branch_tree; unsigned char sha1[20]; }; /* Stats and misc. counters */ static int max_depth = 10; static unsigned long alloc_count; static unsigned long branch_count; static unsigned long object_count; static unsigned long duplicate_count; static unsigned long object_count_by_type[9]; static unsigned long duplicate_count_by_type[9]; static unsigned long stream_cnt; /* Memory pools */ static size_t mem_pool_alloc = 2*1024*1024 - sizeof(struct mem_pool); static size_t total_allocd; static struct mem_pool *mem_pool; /* Atom management */ static unsigned int atom_table_sz = 4451; static unsigned int atom_cnt; static struct atom_str **atom_table; /* The .pack file being generated */ static int pack_fd; static unsigned long pack_offset; static unsigned char pack_sha1[20]; static z_stream zs; static int zs_live; static void* zs_out; static unsigned long zs_outlen; /* Table of objects we've written. 
*/ static unsigned int object_entry_alloc = 1000; static struct object_entry_pool *blocks; static struct object_entry *object_table[1 << 16]; /* Our last blob */ static struct last_object last_blob; /* Tree management */ static unsigned int tree_entry_alloc = 1000; static void *avail_tree_entry; static unsigned int avail_tree_table_sz = 100; static struct avail_tree_content **avail_tree_table; /* Branch data */ static unsigned int max_active_branches = 5; static unsigned int cur_active_branches; static unsigned int branch_table_sz = 1039; static struct branch **branch_table; static struct branch *active_branches; /* Input stream parsing */ static struct strbuf command_buf; static unsigned long command_mark; static void alloc_objects(int cnt) { struct object_entry_pool *b; b = xmalloc(sizeof(struct object_entry_pool) + cnt * sizeof(struct object_entry)); b->next_pool = blocks; b->next_free = b->entries; b->end = b->entries + cnt; blocks = b; alloc_count += cnt; } static struct object_entry* new_object(unsigned char *sha1) { struct object_entry *e; if (blocks->next_free == blocks->end) alloc_objects(object_entry_alloc); e = blocks->next_free++; memcpy(e->sha1, sha1, sizeof(e->sha1)); return e; } static struct object_entry* find_object(unsigned char *sha1) { unsigned int h = sha1[0] << 8 | sha1[1]; struct object_entry *e; for (e = object_table[h]; e; e = e->next) if (!memcmp(sha1, e->sha1, sizeof(e->sha1))) return e; return NULL; } static struct object_entry* insert_object(unsigned char *sha1) { unsigned int h = sha1[0] << 8 | sha1[1]; struct object_entry *e = object_table[h]; struct object_entry *p = NULL; while (e) { if (!memcmp(sha1, e->sha1, sizeof(e->sha1))) return e; p = e; e = e->next; } e = new_object(sha1); e->next = NULL; e->offset = 0; if (p) p->next = e; else object_table[h] = e; return e; } static unsigned int hc_str(const char *s, size_t len) { unsigned int r = 0; while (len-- > 0) r = r * 31 + *s++; return r; } static void* pool_alloc(size_t len) { struct mem_pool *p; void *r; for (p = mem_pool; p; p = p->next_pool) if ((p->end - p->next_free >= len)) break; if (!p) { if (len >= (mem_pool_alloc/2)) { total_allocd += len; return xmalloc(len); } total_allocd += sizeof(struct mem_pool) + mem_pool_alloc; p = xmalloc(sizeof(struct mem_pool) + mem_pool_alloc); p->next_pool = mem_pool; p->next_free = p->space; p->end = p->next_free + mem_pool_alloc; mem_pool = p; } r = p->next_free; p->next_free += len; return r; } static void* pool_calloc(size_t count, size_t size) { size_t len = count * size; void *r = pool_alloc(len); memset(r, 0, len); return r; } static char* pool_strdup(const char *s) { char *r = pool_alloc(strlen(s) + 1); strcpy(r, s); return r; } static struct atom_str* to_atom(const char *s, size_t len) { unsigned int hc = hc_str(s, len) % atom_table_sz; struct atom_str *c; for (c = atom_table[hc]; c; c = c->next_atom) if (c->str_len == len && !strncmp(s, c->str_dat, len)) return c; c = pool_alloc(sizeof(struct atom_str) + len + 1); c->str_len = len; strncpy(c->str_dat, s, len); c->str_dat[len] = 0; c->next_atom = atom_table[hc]; atom_table[hc] = c; atom_cnt++; return c; } static struct branch* lookup_branch(const char *name) { unsigned int hc = hc_str(name, strlen(name)) % branch_table_sz; struct branch *b; for (b = branch_table[hc]; b; b = b->table_next_branch) if (!strcmp(name, b->name)) return b; return NULL; } static struct branch* new_branch(const char *name) { unsigned int hc = hc_str(name, strlen(name)) % branch_table_sz; struct branch* b = lookup_branch(name); if (b) 
die("Invalid attempt to create duplicate branch: %s", name); if (check_ref_format(name)) die("Branch name doesn't conform to GIT standards: %s", name); b = pool_calloc(1, sizeof(struct branch)); b->name = pool_strdup(name); b->table_next_branch = branch_table[hc]; branch_table[hc] = b; branch_count++; return b; } static unsigned int hc_entries(unsigned int cnt) { cnt = cnt & 7 ? (cnt / 8) + 1 : cnt / 8; return cnt < avail_tree_table_sz ? cnt : avail_tree_table_sz - 1; } static struct tree_content* new_tree_content(unsigned int cnt) { struct avail_tree_content *f, *l = NULL; struct tree_content *t; unsigned int hc = hc_entries(cnt); for (f = avail_tree_table[hc]; f; l = f, f = f->next_avail) if (f->entry_capacity >= cnt) break; if (f) { if (l) l->next_avail = f->next_avail; else avail_tree_table[hc] = f->next_avail; } else { cnt = cnt & 7 ? ((cnt / 8) + 1) * 8 : cnt; f = pool_alloc(sizeof(*t) + sizeof(t->entries[0]) * cnt); f->entry_capacity = cnt; } t = (struct tree_content*)f; t->entry_count = 0; return t; } static void release_tree_entry(struct tree_entry *e); static void release_tree_content(struct tree_content *t) { struct avail_tree_content *f = (struct avail_tree_content*)t; unsigned int hc = hc_entries(f->entry_capacity); unsigned int i; for (i = 0; i < t->entry_count; i++) release_tree_entry(t->entries[i]); f->next_avail = avail_tree_table[hc]; avail_tree_table[hc] = f; } static struct tree_content* grow_tree_content( struct tree_content *t, int amt) { struct tree_content *r = new_tree_content(t->entry_count + amt); r->entry_count = t->entry_count; memcpy(r->entries,t->entries,t->entry_count*sizeof(t->entries[0])); release_tree_content(t); return r; } static struct tree_entry* new_tree_entry() { struct tree_entry *e; if (!avail_tree_entry) { unsigned int n = tree_entry_alloc; avail_tree_entry = e = xmalloc(n * sizeof(struct tree_entry)); while (n--) { *((void**)e) = e + 1; e++; } } e = avail_tree_entry; avail_tree_entry = *((void**)e); return e; } static void release_tree_entry(struct tree_entry *e) { if (e->tree) release_tree_content(e->tree); *((void**)e) = avail_tree_entry; avail_tree_entry = e; } static void yread(int fd, void *buffer, size_t length) { ssize_t ret = 0; while (ret < length) { ssize_t size = xread(fd, (char *) buffer + ret, length - ret); if (!size) die("Read from descriptor %i: end of stream", fd); if (size < 0) die("Read from descriptor %i: %s", fd, strerror(errno)); ret += size; } } static void ywrite(int fd, void *buffer, size_t length) { ssize_t ret = 0; while (ret < length) { ssize_t size = xwrite(fd, (char *) buffer + ret, length - ret); if (!size) die("Write to descriptor %i: end of file", fd); if (size < 0) die("Write to descriptor %i: %s", fd, strerror(errno)); ret += size; } } static size_t encode_header( enum object_type type, size_t size, unsigned char *hdr) { int n = 1; unsigned char c; if (type < OBJ_COMMIT || type > OBJ_DELTA) die("bad type %d", type); c = (type << 4) | (size & 15); size >>= 4; while (size) { *hdr++ = c | 0x80; c = size & 0x7f; size >>= 7; n++; } *hdr = c; return n; } static void zs_finish() { if (zs_live) { while (deflate(&zs, Z_FINISH) == Z_OK) { if (zs.total_out) { ywrite(pack_fd, zs_out, zs.total_out); pack_offset += zs.total_out; zs.next_out = zs_out; zs.avail_out = zs_outlen; zs.total_out = 0; } } deflateEnd(&zs); free(zs_out); stream_cnt++; } } static int store_object( enum object_type type, void *dat, size_t datlen, struct last_object *last, unsigned char *sha1out) { void *delta; struct object_entry *e; unsigned char 
hdr[96]; unsigned char sha1[20]; unsigned long hdrlen, deltalen; SHA_CTX c; hdrlen = sprintf((char*)hdr,"%s %lu",type_names[type],datlen) + 1; SHA1_Init(&c); SHA1_Update(&c, hdr, hdrlen); SHA1_Update(&c, dat, datlen); SHA1_Final(sha1, &c); if (sha1out) memcpy(sha1out, sha1, sizeof(sha1)); e = insert_object(sha1); if (e->offset) { duplicate_count++; duplicate_count_by_type[type]++; return 1; } e->type = type; e->offset = pack_offset; object_count++; object_count_by_type[type]++; if (last && last->data && last->depth < max_depth) delta = diff_delta(last->data, last->len, dat, datlen, &deltalen, 0); else { zs_finish(); delta = 0; memset(&zs, 0, sizeof(zs)); deflateInit(&zs, zlib_compression_level); zs_live = 1; zs_outlen = 1024*1024; zs_out = xmalloc(zs_outlen); zs.next_out = zs_out; zs.avail_out = zs_outlen; } if (delta) { last->depth++; zs.next_in = delta; zs.avail_in = deltalen; hdrlen = encode_header(OBJ_DELTA, deltalen, hdr); ywrite(pack_fd, hdr, hdrlen); ywrite(pack_fd, last->sha1, sizeof(sha1)); pack_offset += hdrlen + sizeof(sha1); } else { if (last) last->depth = 0; zs.next_in = dat; zs.avail_in = datlen; hdrlen = encode_header(type, datlen, hdr); ywrite(pack_fd, hdr, hdrlen); pack_offset += hdrlen; } while (deflate(&zs, Z_NO_FLUSH) == Z_OK) { if (zs.total_out) { ywrite(pack_fd, zs_out, zs.total_out); pack_offset += zs.total_out; zs.next_out = zs_out; zs.avail_out = zs_outlen; zs.total_out = 0; } } if (delta) free(delta); if (last) { if (last->data) free(last->data); last->data = dat; last->len = datlen; memcpy(last->sha1, sha1, sizeof(sha1)); } return 0; } static const char *get_mode(const char *str, unsigned int *modep) { unsigned char c; unsigned int mode = 0; while ((c = *str++) != ' ') { if (c < '0' || c > '7') return NULL; mode = (mode << 3) + (c - '0'); } *modep = mode; return str; } static void load_tree(struct tree_entry *root) { struct object_entry *myoe; struct tree_content *t; unsigned long size; char *buf; const char *c; char type[20]; root->tree = t = new_tree_content(8); if (!memcmp(root->sha1, null_sha1, 20)) return; myoe = find_object(root->sha1); if (myoe) { die("FIXME"); } else { buf = read_sha1_file(root->sha1, type, &size); if (!buf || strcmp(type, tree_type)) die("Can't load existing tree %s", sha1_to_hex(root->sha1)); } c = buf; while (c != (buf + size)) { struct tree_entry *e = new_tree_entry(); if (t->entry_count == t->entry_capacity) root->tree = t = grow_tree_content(t, 8); t->entries[t->entry_count++] = e; e->tree = NULL; c = get_mode(c, &e->mode); if (!c) die("Corrupt mode in %s", sha1_to_hex(root->sha1)); e->name = to_atom(c, strlen(c)); c += e->name->str_len + 1; memcpy(e->sha1, c, sizeof(e->sha1)); c += 20; } free(buf); } static int tecmp (const void *_a, const void *_b) { struct tree_entry *a = *((struct tree_entry**)_a); struct tree_entry *b = *((struct tree_entry**)_b); return base_name_compare( a->name->str_dat, a->name->str_len, a->mode, b->name->str_dat, b->name->str_len, b->mode); } static void store_tree(struct tree_entry *root) { struct tree_content *t = root->tree; unsigned int i; size_t maxlen; char *buf, *c; if (memcmp(root->sha1, null_sha1, 20)) return; maxlen = 0; for (i = 0; i < t->entry_count; i++) { maxlen += t->entries[i]->name->str_len + 34; if (t->entries[i]->tree) store_tree(t->entries[i]); } qsort(t->entries, t->entry_count, sizeof(t->entries[0]), tecmp); buf = c = xmalloc(maxlen); for (i = 0; i < t->entry_count; i++) { struct tree_entry *e = t->entries[i]; c += sprintf(c, "%o", e->mode); *c++ = ' '; strcpy(c, e->name->str_dat); c 
+= e->name->str_len + 1; memcpy(c, e->sha1, 20); c += 20; } store_object(OBJ_TREE, buf, c - buf, NULL, root->sha1); free(buf); } static int tree_content_set( struct tree_entry *root, const char *p, const unsigned char *sha1, const unsigned int mode) { struct tree_content *t = root->tree; const char *slash1; unsigned int i, n; struct tree_entry *e; slash1 = strchr(p, '/'); if (slash1) n = slash1 - p; else n = strlen(p); for (i = 0; i < t->entry_count; i++) { e = t->entries[i]; if (e->name->str_len == n && !strncmp(p, e->name->str_dat, n)) { if (!slash1) { if (e->mode == mode && !memcmp(e->sha1, sha1, 20)) return 0; e->mode = mode; memcpy(e->sha1, sha1, 20); if (e->tree) { release_tree_content(e->tree); e->tree = NULL; } memcpy(root->sha1, null_sha1, 20); return 1; } if (!S_ISDIR(e->mode)) { e->tree = new_tree_content(8); e->mode = S_IFDIR; } if (!e->tree) load_tree(e); if (tree_content_set(e, slash1 + 1, sha1, mode)) { memcpy(root->sha1, null_sha1, 20); return 1; } return 0; } } if (t->entry_count == t->entry_capacity) root->tree = t = grow_tree_content(t, 8); e = new_tree_entry(); e->name = to_atom(p, n); t->entries[t->entry_count++] = e; if (slash1) { e->tree = new_tree_content(8); e->mode = S_IFDIR; tree_content_set(e, slash1 + 1, sha1, mode); } else { e->tree = NULL; e->mode = mode; memcpy(e->sha1, sha1, 20); } memcpy(root->sha1, null_sha1, 20); return 1; } static int tree_content_remove(struct tree_entry *root, const char *p) { struct tree_content *t = root->tree; const char *slash1; unsigned int i, n; struct tree_entry *e; slash1 = strchr(p, '/'); if (slash1) n = slash1 - p; else n = strlen(p); for (i = 0; i < t->entry_count; i++) { e = t->entries[i]; if (e->name->str_len == n && !strncmp(p, e->name->str_dat, n)) { if (!slash1 || !S_ISDIR(e->mode)) goto del_entry; if (!e->tree) load_tree(e); if (tree_content_remove(e, slash1 + 1)) { if (!e->tree->entry_count) goto del_entry; memcpy(root->sha1, null_sha1, 20); return 1; } return 0; } } return 0; del_entry: for (i++; i < t->entry_count; i++) t->entries[i-1] = t->entries[i]; t->entry_count--; release_tree_entry(e); memcpy(root->sha1, null_sha1, 20); return 1; } static void init_pack_header() { const char* magic = "PACK"; unsigned long version = 3; unsigned long zero = 0; version = htonl(version); ywrite(pack_fd, (char*)magic, 4); ywrite(pack_fd, &version, 4); ywrite(pack_fd, &zero, 4); pack_offset = 4 * 3; } static void fixup_header_footer() { SHA_CTX c; char hdr[8]; unsigned long cnt; char *buf; size_t n; zs_finish(); if (lseek(pack_fd, 0, SEEK_SET) != 0) die("Failed seeking to start: %s", strerror(errno)); SHA1_Init(&c); yread(pack_fd, hdr, 8); SHA1_Update(&c, hdr, 8); cnt = htonl(object_count); SHA1_Update(&c, &cnt, 4); ywrite(pack_fd, &cnt, 4); buf = xmalloc(128 * 1024); for (;;) { n = xread(pack_fd, buf, 128 * 1024); if (n <= 0) break; SHA1_Update(&c, buf, n); } free(buf); SHA1_Final(pack_sha1, &c); ywrite(pack_fd, pack_sha1, sizeof(pack_sha1)); } static int oecmp (const void *_a, const void *_b) { struct object_entry *a = *((struct object_entry**)_a); struct object_entry *b = *((struct object_entry**)_b); return memcmp(a->sha1, b->sha1, sizeof(a->sha1)); } static void write_index(const char *idx_name) { struct sha1file *f; struct object_entry **idx, **c, **last; struct object_entry *e; struct object_entry_pool *o; unsigned int array[256]; int i; /* Build the sorted table of object IDs. 
*/ idx = xmalloc(object_count * sizeof(struct object_entry*)); c = idx; for (o = blocks; o; o = o->next_pool) for (e = o->entries; e != o->next_free; e++) *c++ = e; last = idx + object_count; qsort(idx, object_count, sizeof(struct object_entry*), oecmp); /* Generate the fan-out array. */ c = idx; for (i = 0; i < 256; i++) { struct object_entry **next = c;; while (next < last) { if ((*next)->sha1[0] != i) break; next++; } array[i] = htonl(next - idx); c = next; } f = sha1create("%s", idx_name); sha1write(f, array, 256 * sizeof(int)); for (c = idx; c != last; c++) { unsigned int offset = htonl((*c)->offset); sha1write(f, &offset, 4); sha1write(f, (*c)->sha1, sizeof((*c)->sha1)); } sha1write(f, pack_sha1, sizeof(pack_sha1)); sha1close(f, NULL, 1); free(idx); } static void dump_branches() { static const char *msg = "fast-import"; unsigned int i; struct branch *b; struct ref_lock *lock; for (i = 0; i < branch_table_sz; i++) { for (b = branch_table[i]; b; b = b->table_next_branch) { lock = lock_any_ref_for_update(b->name, NULL, 0); if (!lock || write_ref_sha1(lock, b->sha1, msg) < 0) die("Can't write %s", b->name); } } } static void read_next_command() { read_line(&command_buf, stdin, '\n'); } static void cmd_mark() { if (!strncmp("mark :", command_buf.buf, 6)) { command_mark = strtoul(command_buf.buf + 6, NULL, 10); read_next_command(); } else command_mark = 0; } static void* cmd_data (size_t *size) { size_t n = 0; void *buffer; size_t length; if (strncmp("data ", command_buf.buf, 5)) die("Expected 'data n' command, found: %s", command_buf.buf); length = strtoul(command_buf.buf + 5, NULL, 10); buffer = xmalloc(length); while (n < length) { size_t s = fread((char*)buffer + n, 1, length - n, stdin); if (!s && feof(stdin)) die("EOF in data (%lu bytes remaining)", length - n); n += s; } if (fgetc(stdin) != '\n') die("An lf did not trail the binary data as expected."); *size = length; return buffer; } static void cmd_new_blob() { size_t datlen; void *dat; unsigned char sha1[20]; read_next_command(); cmd_mark(); dat = cmd_data(&datlen); if (store_object(OBJ_BLOB, dat, datlen, &last_blob, sha1)) free(dat); } static void unload_one_branch() { while (cur_active_branches >= max_active_branches) { unsigned long min_commit = ULONG_MAX; struct branch *e, *l = NULL, *p = NULL; for (e = active_branches; e; e = e->active_next_branch) { if (e->last_commit < min_commit) { p = l; min_commit = e->last_commit; } l = e; } if (p) { e = p->active_next_branch; p->active_next_branch = e->active_next_branch; } else { e = active_branches; active_branches = e->active_next_branch; } e->active_next_branch = NULL; if (e->branch_tree.tree) { release_tree_content(e->branch_tree.tree); e->branch_tree.tree = NULL; } cur_active_branches--; } } static void load_branch(struct branch *b) { load_tree(&b->branch_tree); b->active_next_branch = active_branches; active_branches = b; cur_active_branches++; } static void file_change_m(struct branch *b) { const char *p = command_buf.buf + 2; char *p_uq; const char *endp; struct object_entry *oe; unsigned char sha1[20]; unsigned int mode; char type[20]; p = get_mode(p, &mode); if (!p) die("Corrupt mode: %s", command_buf.buf); switch (mode) { case S_IFREG | 0644: case S_IFREG | 0755: case 0644: case 0755: /* ok */ break; default: die("Corrupt mode: %s", command_buf.buf); } if (get_sha1_hex(p, sha1)) die("Invalid SHA1: %s", command_buf.buf); p += 40; if (*p++ != ' ') die("Missing space after SHA1: %s", command_buf.buf); p_uq = unquote_c_style(p, &endp); if (p_uq) { if (*endp) die("Garbage after 
path in: %s", command_buf.buf); p = p_uq; } oe = find_object(sha1); if (oe) { if (oe->type != OBJ_BLOB) die("Not a blob (actually a %s): %s", command_buf.buf, type_names[oe->type]); } else { if (sha1_object_info(sha1, type, NULL)) die("Blob not found: %s", command_buf.buf); if (strcmp(blob_type, type)) die("Not a blob (actually a %s): %s", command_buf.buf, type); } tree_content_set(&b->branch_tree, p, sha1, S_IFREG | mode); if (p_uq) free(p_uq); } static void file_change_d(struct branch *b) { const char *p = command_buf.buf + 2; char *p_uq; const char *endp; p_uq = unquote_c_style(p, &endp); if (p_uq) { if (*endp) die("Garbage after path in: %s", command_buf.buf); p = p_uq; } tree_content_remove(&b->branch_tree, p); if (p_uq) free(p_uq); } static void cmd_new_commit() { struct branch *b; void *msg; size_t msglen; char *str_uq; const char *endp; char *sp; char *author = NULL; char *committer = NULL; char *body; /* Obtain the branch name from the rest of our command */ sp = strchr(command_buf.buf, ' ') + 1; str_uq = unquote_c_style(sp, &endp); if (str_uq) { if (*endp) die("Garbage after ref in: %s", command_buf.buf); sp = str_uq; } b = lookup_branch(sp); if (!b) die("Branch not declared: %s", sp); if (str_uq) free(str_uq); read_next_command(); cmd_mark(); if (!strncmp("author ", command_buf.buf, 7)) { author = strdup(command_buf.buf); read_next_command(); } if (!strncmp("committer ", command_buf.buf, 10)) { committer = strdup(command_buf.buf); read_next_command(); } if (!committer) die("Expected committer but didn't get one"); msg = cmd_data(&msglen); /* ensure the branch is active/loaded */ if (!b->branch_tree.tree) { unload_one_branch(); load_branch(b); } /* file_change* */ for (;;) { read_next_command(); if (1 == command_buf.len) break; else if (!strncmp("M ", command_buf.buf, 2)) file_change_m(b); else if (!strncmp("D ", command_buf.buf, 2)) file_change_d(b); else die("Unsupported file_change: %s", command_buf.buf); } /* build the tree and the commit */ store_tree(&b->branch_tree); body = xmalloc(97 + msglen + (author ? strlen(author) + strlen(committer) : 2 * strlen(committer))); sp = body; sp += sprintf(sp, "tree %s\n", sha1_to_hex(b->branch_tree.sha1)); if (memcmp(b->sha1, null_sha1, 20)) sp += sprintf(sp, "parent %s\n", sha1_to_hex(b->sha1)); if (author) sp += sprintf(sp, "%s\n", author); else sp += sprintf(sp, "author %s\n", committer + 10); sp += sprintf(sp, "%s\n\n", committer); memcpy(sp, msg, msglen); sp += msglen; if (author) free(author); free(committer); free(msg); store_object(OBJ_COMMIT, body, sp - body, NULL, b->sha1); free(body); b->last_commit = object_count_by_type[OBJ_COMMIT]; } static void cmd_new_branch() { struct branch *b; char *str_uq; const char *endp; char *sp; /* Obtain the new branch name from the rest of our command */ sp = strchr(command_buf.buf, ' ') + 1; str_uq = unquote_c_style(sp, &endp); if (str_uq) { if (*endp) die("Garbage after ref in: %s", command_buf.buf); sp = str_uq; } b = new_branch(sp); if (str_uq) free(str_uq); read_next_command(); /* from ... 
*/ if (!strncmp("from ", command_buf.buf, 5)) { const char *from; struct branch *s; from = strchr(command_buf.buf, ' ') + 1; str_uq = unquote_c_style(from, &endp); if (str_uq) { if (*endp) die("Garbage after string in: %s", command_buf.buf); from = str_uq; } s = lookup_branch(from); if (b == s) die("Can't create a branch from itself: %s", b->name); else if (s) { memcpy(b->sha1, s->sha1, 20); memcpy(b->branch_tree.sha1, s->branch_tree.sha1, 20); } else if (!get_sha1(from, b->sha1)) { if (!memcmp(b->sha1, null_sha1, 20)) memcpy(b->branch_tree.sha1, null_sha1, 20); else { unsigned long size; char *buf; buf = read_object_with_reference(b->sha1, type_names[OBJ_COMMIT], &size, b->sha1); if (!buf || size < 46) die("Not a valid commit: %s", from); if (memcmp("tree ", buf, 5) || get_sha1_hex(buf + 5, b->branch_tree.sha1)) die("The commit %s is corrupt", sha1_to_hex(b->sha1)); free(buf); } } else die("Invalid ref name or SHA1 expression: %s", from); if (str_uq) free(str_uq); read_next_command(); } else { memcpy(b->sha1, null_sha1, 20); memcpy(b->branch_tree.sha1, null_sha1, 20); } if (command_buf.eof || command_buf.len > 1) die("An lf did not terminate the branch command as expected."); } int main(int argc, const char **argv) { const char *base_name = argv[1]; int est_obj_cnt = atoi(argv[2]); char *pack_name; char *idx_name; struct stat sb; setup_ident(); git_config(git_default_config); pack_name = xmalloc(strlen(base_name) + 6); sprintf(pack_name, "%s.pack", base_name); idx_name = xmalloc(strlen(base_name) + 5); sprintf(idx_name, "%s.idx", base_name); pack_fd = open(pack_name, O_RDWR|O_CREAT|O_EXCL, 0666); if (pack_fd < 0) die("Can't create %s: %s", pack_name, strerror(errno)); init_pack_header(); alloc_objects(est_obj_cnt); strbuf_init(&command_buf); atom_table = xcalloc(atom_table_sz, sizeof(struct atom_str*)); branch_table = xcalloc(branch_table_sz, sizeof(struct branch*)); avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*)); for (;;) { read_next_command(); if (command_buf.eof) break; else if (!strcmp("blob", command_buf.buf)) cmd_new_blob(); else if (!strncmp("branch ", command_buf.buf, 7)) cmd_new_branch(); else if (!strncmp("commit ", command_buf.buf, 7)) cmd_new_commit(); else die("Unsupported command: %s", command_buf.buf); } fixup_header_footer(); close(pack_fd); write_index(idx_name); dump_branches(); fprintf(stderr, "%s statistics:\n", argv[0]); fprintf(stderr, "---------------------------------------------------\n"); fprintf(stderr, "Alloc'd objects: %10lu (%10lu overflow )\n", alloc_count, alloc_count - est_obj_cnt); fprintf(stderr, "Total objects: %10lu (%10lu duplicates)\n", object_count, duplicate_count); fprintf(stderr, " blobs : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_BLOB], duplicate_count_by_type[OBJ_BLOB]); fprintf(stderr, " trees : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_TREE], duplicate_count_by_type[OBJ_TREE]); fprintf(stderr, " commits: %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_COMMIT], duplicate_count_by_type[OBJ_COMMIT]); fprintf(stderr, " tags : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_TAG], duplicate_count_by_type[OBJ_TAG]); fprintf(stderr, "Total streams: %10u\n", stream_cnt); fprintf(stderr, "Total branches: %10lu\n", branch_count); fprintf(stderr, "Total atoms: %10u\n", atom_cnt); fprintf(stderr, "Memory total: %10lu KiB\n", (total_allocd + alloc_count*sizeof(struct object_entry))/1024); fprintf(stderr, " pools: %10lu KiB\n", total_allocd/1024); fprintf(stderr, " objects: %10lu 
KiB\n", (alloc_count*sizeof(struct object_entry))/1024); fprintf(stderr, "---------------------------------------------------\n"); stat(pack_name, &sb); fprintf(stderr, "Pack size: %10lu KiB\n", (unsigned long)(sb.st_size/1024)); stat(idx_name, &sb); fprintf(stderr, "Index size: %10lu KiB\n", (unsigned long)(sb.st_size/1024)); fprintf(stderr, "\n"); return 0; } ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 12:53 ` Jon Smirl @ 2006-08-18 16:30 ` Nicolas Pitre 2006-08-18 16:56 ` Jon Smirl 0 siblings, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-18 16:30 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Fri, 18 Aug 2006, Jon Smirl wrote: > I attached Shawn's code. He is gone until Monday and can't defend it. I will have a look at it next week as I'll be gone for the weekend as well. > Do note that I am running this on the Mozilla CVS which is over 10 > years old. Some of the files have over 2,000 deltas. I average 10 > deltas per file but the distribution is not at all even. Many files > get checked-in and never changed, for example 1000's of images. This is IMHO more evidence that something is wrong with the results you obtained. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 16:30 ` Nicolas Pitre @ 2006-08-18 16:56 ` Jon Smirl 2006-08-21 3:45 ` Nicolas Pitre 0 siblings, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-18 16:56 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn Pearce, git On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > I attached Shawn's code. He is gone until Monday and can't defend it. > > I will have a look at it next week as I'll be gone for the weekend as > well. I looked at it some and couldn't see anything obviously wrong with it, but it wasn't a detailed inspection. As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB. The 295MB git pack number does not have commits and trees in it, it is revisions only. It is quite possible that even when the trees and commits are added to the 295MB number, it still won't end up being near the 541MB number. We already know xdiff is better at generating diffs than CVS diff. And I still have to add 250K log entries and trees. Those are going to add at least 100MB. I will get back to work on the import code over the weekend. One of my 2 year olds gave me a cold so I have been useless the last couple of days. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 16:56 ` Jon Smirl @ 2006-08-21 3:45 ` Nicolas Pitre 2006-08-21 6:46 ` Shawn Pearce 0 siblings, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 3:45 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Fri, 18 Aug 2006, Jon Smirl wrote: > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > > > I attached Shawn's code. He is gone until Monday and can't defend it. > > > > I will have a look at it next week as I'll be gone for the weekend as > > well. > > I looked at it some and couldn't see anything obviously wrong with it, > but it wasn't a detailed inspection. I looked at it too and the code looks OK. This doesn't mean there is no problem at a higher level though. The deltification process is extremely crude and I think this is the cause of the original pack size. For example, last April we discovered that a small change in the heuristics to determine base delta objects in git-pack-objects could create a pack size regression of up to 4x the size of the same pack created before such change. It is also possible for a denser delta stream, once deflated, to end up larger than a less dense one. Just to say that many tweaks and heuristics have been implemented and studied in git-pack-objects for over a year now in order to get the really small packs we have today. And a really subtle and innocent-looking change can break it size-wise. So what I think is happening with the fastimport code is that the delta selection is not really good. It is certainly much better than no delta at all but still not optimal, which smells like deja vu to me. Then by deflating them all together the redundant information that the bad delta set still carries along is eliminated -- thanks to zlib sort of mitigating the real issue. But... as my recent experiments show, the grouping of related deltas into a single zlib stream doesn't produce significant improvements when implemented directly into git-pack-objects. Certainly not worth the inconveniences and costs it brings along. I even think that if you used git-repack -a -f on the pack produced by the import process, with each delta deflated individually just like it did originally, then the repacked pack would _also_ shrink significantly. Most probably around 4x just like you observed with the grouping of deltas in the same zlib stream. Not only would git-repack make it much smaller, but it also provides a much better layout where all objects for recent commits are stored together at the beginning of the pack. The fastimport code is instead storing them scattered all over the pack for every commit by making all revisions of each file next to each other, which will cause horrible access patterns and really bad IO. So I think that trying to make fastimport too clever is wrong. It should instead focus on creating an initial pack as fast as possible and then rely on a final git-repack pass to produce the shrunken pack. I really doubt the import code could ever do a better job than git-pack-objects does. If I can make a suggestion, you should forget about this multiple deltas in one zlib stream for now and focus on making the import process work all the way to tree and commit objects instead. Then, and only then, if git-repack -a -f doesn't produce a satisfactory pack size, we could look at better pack encoding. 
And so far the grouping of related deltas in one zlib stream is _not_ a better encoding given the rather small improvement over unmodified git-pack-objects vs the inconveniences and cost it brings with it. > As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB. > The 295MB git pack number does not have commits and trees in it, it is > revisions only. Running git-repack -a -f from a recent GIT on the Mozilla repo converted through cvsps and friends produces a pack smaller than 500MB. I even brought it down to 430MB by using non-default delta window and depth. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 3:45 ` Nicolas Pitre @ 2006-08-21 6:46 ` Shawn Pearce 2006-08-21 10:24 ` Jakub Narebski 2006-08-21 16:23 ` Jon Smirl 0 siblings, 2 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-21 6:46 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jon Smirl, git Nicolas Pitre <nico@cam.org> wrote: > I looked at it too and the code looks OK. Based on the stats Jon provided me it appeared as though the code skipped about 9% of the objects, and at the time I thought I new why but looking at it again right now I'm not sure why. :-) Assuming the objects were roughly the same size (probably a bad assumption, but lets use it anyway) we're talking about approx. 9% increase, bringing the size to around 360 MB. Since that is still only blobs and lacks trees&commits, and you were able to get a pack from the earlier import attempts down to 430 MB (including trees&commits) everything you pointed out with regards to delta base selection not being optimal is probably the major culprit here. > So I think that trying to make fastimport too clever is wrong. It > should instead focus on creating an initial pack as fast as possible and > then rely on a final git-repack pass to produce the shrinked pack. I > really doubt the import code could ever make a better job than > git-pack-objects does. I'm mixed on that. My goal with fast-import was to construct a pack as cheaply as possible in such a way that IO is relatively minimized during pack construction and later during an initial repacking, as well as to avoid all fork'ing costs associated with many update-index and repack calls. fast-import is assuming the front end driving it is being somewhat nice by feeding revisions in newest-oldest order, one file at a time. If this assumption is false it will produce a pack but one that is so large the IO to repack it will be horrible (but probably would still be better than if these were all loose objects). fast-import is also assuming that pack construction speed and memory usage required to build that pack are more important than subsequent reads of the resulting pack. Why? Because I fully expected the pack to undergo a `git-repack -a -d -f` with a rather large delta depth before it really got used by clients which cared about performance or disk space. :-) Whether or not git-repack can handle such a large pack is another story. Apparently it can already take a ~600 MB pack and do something useful to it in a reasonable time span so I'm putting any limits that might be in there off for now. But the more I work with Jon on this Mozilla import process the more I'm realizing that: - fast-import.c's other name is "GIT in 1325 lines of C code, all of which more or less already exists elsewhere in a form that wasn't useful in this context without refactoring it in major ways so I got lazy and rewrote it in another equally useless way"; - the strategy of how I'm slamming a very large number of objects into a pack may be useful in situations other than a foreign SCM import process. I can see someone wanting to create a large commit with a lot of modified objects. Doing this with update-index and write-tree into loose objects would be far slower than just generating a new pack if the number of objects you are writing exceeds about 100 on Windows or ~1k on UNIX; - Jon has lots of CPU time to spare and possibly even some memory. fast-import is barely using CPU and barely uses memory (its quite lightweight!) 
but its definately producing a suboptimal pack as a result. Just doing slightly better delta selection before committing an object to the output pack may prevent the need to use -f during git-repack and produce something almost as optimal. Right now fast-import is only lacking the following major features, at which point it *should* be able to process the entire Mozilla repository, assuming Jon can get his symbol closing problem in cvs2svn solved: - reload a branch which has been swapped out (requires reading the pack that is currently being constructed back into memory, which since its not really a valid pack yet and lacks an index reusing the code in sha1_file.c is slightly interesting); - generate a tag object into the pack; - track association ID numbers to SHA1 IDs (needed to properly create a branch from a prior commit or a tag from a prior commit as the prior commit's SHA ID isn't known to the front end); Working on the multiple objects in one zlib stream code over the weekend also showed me how to use sha1_file.c to perform the first of the three (I think) and the remaining two are trivial. So my weekend wasn't a total loss. :-) I plan on getting that code finished up in the next day or so and get a new draft of it off to Jon. Maybe by then he'll also have something coming out of cvs2svn's pass 9. > > As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB. > > The 295MB git pack number does not have commits and trees in it, it is > > revisions only. > > Running git-repack -a -f from a recent GIT on the Mozilla repo converted > through cvsps and friends produces a pack smaller than 500MB. I even > brought it down to 430MB by using non default delta window and depth. I'd still like to find a way to beat that 430 MB current best-yet-still-valid size. :-) I think we'll get there. Shaving off 4-5% with a pack specific dictionary may be worthwhile on such large packs, especially ones that are considered "historical" and probably will never get repacked again. Folding objects into a single zlib stream may also be worth it, but for now I'm going to wait and see what mileage we can get from the existing repack code and the existing pack file format. -- Shawn. ^ permalink raw reply [flat|nested] 36+ messages in thread
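The third item on Shawn's list, tracking association ID numbers to SHA1 IDs, only needs a small lookup table keyed by the frontend's mark number. A hypothetical sketch in the same chained-bucket style fast-import already uses elsewhere; this is not Shawn's code, the names are made up, and xmalloc() is git's allocation wrapper:

    struct mark_entry {
        struct mark_entry *next;
        unsigned long mark;
        unsigned char sha1[20];
    };

    #define MARK_BUCKETS 1024
    static struct mark_entry *mark_table[MARK_BUCKETS];

    static void insert_mark(unsigned long mark, const unsigned char *sha1)
    {
        struct mark_entry *e = xmalloc(sizeof(struct mark_entry));
        e->mark = mark;
        memcpy(e->sha1, sha1, 20);
        e->next = mark_table[mark % MARK_BUCKETS];
        mark_table[mark % MARK_BUCKETS] = e;
    }

    static const unsigned char *lookup_mark(unsigned long mark)
    {
        struct mark_entry *e;
        for (e = mark_table[mark % MARK_BUCKETS]; e; e = e->next)
            if (e->mark == mark)
                return e->sha1;
        return NULL;
    }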
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 6:46 ` Shawn Pearce @ 2006-08-21 10:24 ` Jakub Narebski 2006-08-21 16:23 ` Jon Smirl 1 sibling, 0 replies; 36+ messages in thread From: Jakub Narebski @ 2006-08-21 10:24 UTC (permalink / raw) To: git Shawn Pearce wrote: > - the strategy of how I'm slamming a very large number of objects > into a pack may be useful in situations other than a foreign > SCM import process. I can see someone wanting to create a > large commit with a lot of modified objects. Doing this with > update-index and write-tree into loose objects would be far > slower than just generating a new pack if the number of objects > you are writing exceeds about 100 on Windows or ~1k on UNIX; Like e.g. an initial import of a large project, or incorporating some subproject into a project (e.g. gitk and gitweb in git)? -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 6:46 ` Shawn Pearce 2006-08-21 10:24 ` Jakub Narebski @ 2006-08-21 16:23 ` Jon Smirl 1 sibling, 0 replies; 36+ messages in thread From: Jon Smirl @ 2006-08-21 16:23 UTC (permalink / raw) To: Shawn Pearce; +Cc: Nicolas Pitre, git On 8/21/06, Shawn Pearce <spearce@spearce.org> wrote: > Based on the stats Jon provided me it appeared as though the code > skipped about 9% of the objects, and at the time I thought I new > why but looking at it again right now I'm not sure why. :-) 9% of the objects may be duplicates because of branching. I have no way of detecting them since I am not maintaining a map of sha1s encountered. The way CVS branching works some revisions are built using forward diffs and some reverse, I can't tell when they will end up with the same result. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 4:03 ` Nicolas Pitre 2006-08-18 12:53 ` Jon Smirl @ 2006-08-18 13:15 ` Jon Smirl 2006-08-18 13:36 ` Johannes Schindelin 2006-08-18 16:25 ` Nicolas Pitre 1 sibling, 2 replies; 36+ messages in thread From: Jon Smirl @ 2006-08-18 13:15 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn Pearce, git On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > A better way to get such a size saving is to increase the window and > depth parameters. For example, a window of 20 and depth of 20 can > usually provide a pack size saving greater than 11% with none of the > disadvantages mentioned above. Our window size is effectively infinite. I am handing him all of the revisions from a single file in optimal order. This includes branches. He takes these revisions, runs xdiff on them, and then puts the entire result into a single zlib blob. I suspect the size reduction is directly proportional to the age of the repository. The kernel repository only has three years' worth of data in it. Linus has the full history in another repository that is not in general distribution. We can get it from him when he gets back from vacation. If the repository doesn't contain long delta chains, the optimization doesn't help that much. On the other hand it doesn't hurt either since the chains weren't long. My repository is four times as old as the kernel one and I am getting 4x the benefit. This is a good format for archival data that is infrequently accessed. That is why I proposed a two-pack system. One pack would contain the archival data and be highly optimized for size, and the second pack would contain recent changes and be optimized for speed. The size optimization is important for controlling bandwidth costs. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
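To make the grouping concrete: the idea is to run one deflate stream across an entire run of deltas so that later deltas reuse the match window built up by earlier ones, rather than paying for a fresh zlib dictionary per delta. A minimal sketch of that idea follows; it is not Shawn's actual code, the function name is made up, and error handling and output-buffer growth are omitted for brevity.

    #include <string.h>
    #include <zlib.h>

    /* Deflate a run of related deltas as one zlib stream.  Assumes
     * 'out' is large enough to hold the whole compressed result. */
    static unsigned long deflate_delta_run(unsigned char *out, unsigned long outlen,
                                           unsigned char **delta, unsigned long *size,
                                           int count)
    {
        z_stream s;
        int i;

        memset(&s, 0, sizeof(s));
        deflateInit(&s, Z_DEFAULT_COMPRESSION);
        s.next_out = out;
        s.avail_out = outlen;
        for (i = 0; i < count; i++) {
            s.next_in = delta[i];
            s.avail_in = size[i];
            /* Z_NO_FLUSH keeps one compression window alive across all deltas */
            deflate(&s, i + 1 < count ? Z_NO_FLUSH : Z_FINISH);
        }
        deflateEnd(&s);
        return s.total_out;
    }

Reading delta k back then means inflating the stream up to and including entry k, which is why the thread notes there is little extra extraction cost when several entries from the run are needed anyway, but it is also what makes delta reuse during later repacks awkward.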
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 13:15 ` Jon Smirl @ 2006-08-18 13:36 ` Johannes Schindelin 2006-08-18 13:50 ` Jon Smirl 2006-08-18 16:25 ` Nicolas Pitre 1 sibling, 1 reply; 36+ messages in thread From: Johannes Schindelin @ 2006-08-18 13:36 UTC (permalink / raw) To: Jon Smirl; +Cc: Nicolas Pitre, Shawn Pearce, git Hi, On Fri, 18 Aug 2006, Jon Smirl wrote: > I suspect the size reduction is directly proportional to the age of > the repository. The kernel repository only has three years worth of > data in it. Linus has the full history in another repository that is > not in general distribution. We can get it from him when he gets back > from vacation. Maybe you mean http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git Ciao, Dscho ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 13:36 ` Johannes Schindelin @ 2006-08-18 13:50 ` Jon Smirl 2006-08-19 19:25 ` Linus Torvalds 0 siblings, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-18 13:50 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Nicolas Pitre, Shawn Pearce, git On 8/18/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > Hi, > > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > I suspect the size reduction is directly proportional to the age of > > the repository. The kernel repository only has three years worth of > > data in it. Linus has the full history in another repository that is > > not in general distribution. We can get it from him when he gets back > > from vacation. > > Maybe you mean > > http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git That one only goes to 2002, the full one goes back to around 1990. > > Ciao, > Dscho > > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 13:50 ` Jon Smirl @ 2006-08-19 19:25 ` Linus Torvalds 0 siblings, 0 replies; 36+ messages in thread From: Linus Torvalds @ 2006-08-19 19:25 UTC (permalink / raw) To: Jon Smirl; +Cc: Johannes Schindelin, Nicolas Pitre, Shawn Pearce, git On Fri, 18 Aug 2006, Jon Smirl wrote: > On 8/18/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi, > > > > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > > > I suspect the size reduction is directly proportional to the age of > > > the repository. The kernel repository only has three years worth of > > > data in it. Linus has the full history in another repository that is > > > not in general distribution. We can get it from him when he gets back > > > from vacation. > > > > Maybe you mean > > > > http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git > > That one only goes to 2002, the full one goes back to around 1990. I don't actually have such a "full" history. It would be wonderful if somebody took the time to try to piece such a thing together (and git actually makes that a _lot_ easier than some other SCM's, because you can just import random versions in any order, and then re-stich just the commit history when you add a new thing in the middle, without generating any new trees or deltas or anything strange at all). But it's a lot of work. I tried to do a "Linux-Historic" archive about a year ago (and imported some of the old kernels I had), but I gave up, just because it was such a pain to try to do a good job and try to find old release notes etc to import into the changelogs etc. Oh, well. So the only "old" history I have is indeed that BK conversion by Thomas Gleixner. Any pre-BK stuff only exists as patches and tar-balls on various ftp sites (and I don't have any magic repository of my own, so everybody else can do exactly as well as I could, with possibly the exception that I might remember some random details about some old release history - but considering my memory, that's pretty unlikely too. Google is your friend) Linus ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 13:15 ` Jon Smirl 2006-08-18 13:36 ` Johannes Schindelin @ 2006-08-18 16:25 ` Nicolas Pitre 2006-08-21 7:06 ` Shawn Pearce 1 sibling, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-18 16:25 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Fri, 18 Aug 2006, Jon Smirl wrote: > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > > A better way to get such a size saving is to increase the window and > > depth parameters. For example, a window of 20 and depth of 20 can > > usually provide a pack size saving greater than 11% with none of the > > disadvantages mentioned above. > > Our window size is effectively infinite. I am handing him all of the > revisions from a single file in optimal order. This includes branches. In GIT packing terms this is infinite delta _depth_ not _window_. > He takes these revisions, runs xdiff on them, and then puts the entire > result into a single zlib blob. It is not a good idea to have infinite delta depth. The time to browse the repository history then becomes exponential with the number of revisions, making the value of such a repository a bit questionable (you might as well preserve only the last 2 years of history instead, since anything older than that would likely be too slow and painful to use with infinite delta depth). But just for comparison I did a repack -a -f on the kernel repository with --window=50 --depth=5000, which should be a good approximation of the best possible delta matching with infinite depth. Default delta params (window=10 depth=10) : 122103455 Aggressive deltas (window=50 depth=5000) : 105870516 Reduction : 13% OK let's try it with delta chains in the same zlib stream using the patch I posted yesterday (with a minor tweak allowing the usage of -f with git-repack). Aggressive and grouped deltas (window=50 depth=5000) : 99860685 This is a mere 5.7% reduction over the non-grouped deltas, less than the 11% reduction I obtained yesterday when the delta depth is kept reasonably short. The increased delta depth is likely to make a large difference on old repos with long history, maybe more so and with much less complexity than the delta grouping. > I suspect the size reduction is directly proportional to the age of > the repository. The kernel repository only has three years worth of > data in it. Linus has the full history in another repository that is > not in general distribution. We can get it from him when he gets back > from vacation. > > If the repository doesn't contain long delta chains the optimization > doesn't help that much. On the other hand it doesn't hurt either since > the chains weren't long. My repository is four times as old as the > kernel one and I am getting 4x the benefit. No that cannot be right. Let's assume every whole object is 10 in size and every delta is 1. You therefore can have 1 base object and 10 delta objects, effectively storing 11 objects for a size of 20. You therefore have a 1.8 vs 10 size ratio. If the delta depth is 100 then you potentially have 1 base object and 100 deltas for a size ratio of 1.1 vs 10. If the delta depth is 1000 the ratio becomes 1.01 vs 10. The size saving is therefore _not_ proportional to the age of the repository. It rather tends to be asymptotic with the delta ratio (but imposes an exponential runtime cost when fetching objects out of it). The fact that your 4x old repository has a 4x size saving can be due only to a packing malfunction, I would say. 
Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
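Nicolas' arithmetic can be written out compactly. With whole objects of size 10 and deltas of size 1, a chain of depth d stores 1 + d objects in 10 + d units, so the average cost per object is

    \frac{10 + d}{1 + d} \;\longrightarrow\; 1 \qquad (d \to \infty)

which gives roughly 1.8 at d = 10, 1.1 at d = 100 and 1.01 at d = 1000: the saving flattens out instead of growing in proportion to the repository's age.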
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 16:25 ` Nicolas Pitre @ 2006-08-21 7:06 ` Shawn Pearce 2006-08-21 14:07 ` Jon Smirl 2006-08-21 15:46 ` Nicolas Pitre 0 siblings, 2 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-21 7:06 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jon Smirl, git Nicolas Pitre <nico@cam.org> wrote: > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > > > A better way to get such a size saving is to increase the window and > > > depth parameters. For example, a window of 20 and depth of 20 can > > > usually provide a pack size saving greater than 11% with none of the > > > disadvantages mentioned above. > > > > Our window size is effectively infinite. I am handing him all of the > > revisions from a single file in optimal order. This includes branches. > > In GIT packing terms this is infinite delta _depth_ not _window_. We're not using infinite anything. fast-import is basically doing window=1 and depth=10. We only examine the last blob to see if we can get a delta against it. If we do, we write that delta out; otherwise we reset our delta chain and write the complete object. We also reset our chain after writing out 10 deltas, each of which used the immediately prior object as its base. Since I just found out that in some cases the Mozilla repository has 1000s of revisions per file[*1*] and in others only 1 revision per file, we probably should be adjusting this depth to have a maximum of 500 while also having the frontend send us a "I'm switching files now" marker so we know to not even bother trying to delta the new blob against the last blob as they are likely to not delta well[*2*]. > Default delta params (window=10 depth=10) : 122103455 > Aggressive deltas (window=50 depth=5000) : 105870516 > Aggressive and grouped deltas (window=50 depth=5000) : 99860685 Although complex, the aggressive and grouped deltas appear to have saved you 18.2% on this repository. That's not something to ignore. A reasonably optimal local pack dictionary could save at least 4%[*3*]. Whacking 22% off a 400 MB pack is saving 88 MB. Transferring that over the network on an initial clone is like downloading all of Eclipse. Or an uncompressed kernel tarball... [*1*] Jon noted this in another email in this thread but I'm too lazy to look up the hyperlink right now. [*2*] Granted in some cases they may delta very well against each other but I think the probability of that occurring is low enough that it's not worth worrying about in fast-import.c; we can let repack's strategy deal with it instead. [*3*] I wrote a brain-dead simple local dictionary selector in Perl. It's horribly far from being ideal. But it is consistently saving us 4% on the GIT and the Mozilla repository and it's pretty darn fast. Shockingly the C keywords didn't gain us very much here; it's project-specific text that's the real win. Looking at chunks which are frequently copied in deltas from base objects and breaking those chunks up into smaller common chunks, then loading those most frequent common chunks into the pack dictionary would most likely produce far better results. -- Shawn. ^ permalink raw reply [flat|nested] 36+ messages in thread
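The window=1, depth=10 behaviour described above boils down to very little code. A rough sketch of the decision, assuming git's diff_delta() from diff-delta.c; write_deflated() and the OBJ_DELTA tag here are only placeholders for fast-import's real output path, not its actual interface:

    struct last_object {
        void *data;
        unsigned long len;
        unsigned int depth;     /* deltas written since the last full copy */
    };

    static void emit_blob(struct last_object *last, void *data, unsigned long len)
    {
        void *delta = NULL;
        unsigned long deltalen = 0;

        if (last->data && last->depth < 10)
            delta = diff_delta(last->data, last->len, data, len, &deltalen, 0);

        if (delta && deltalen < len) {
            write_deflated(OBJ_DELTA, delta, deltalen);  /* placeholder helper */
            last->depth++;
        } else {
            write_deflated(OBJ_BLOB, data, len);         /* full copy resets the chain */
            last->depth = 0;
        }
        free(delta);
        last->data = data;      /* only the immediately prior blob is a base candidate */
        last->len = len;
    }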
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 7:06 ` Shawn Pearce @ 2006-08-21 14:07 ` Jon Smirl 0 siblings, 0 replies; 36+ messages in thread From: Jon Smirl @ 2006-08-21 14:07 UTC (permalink / raw) To: Shawn Pearce; +Cc: Nicolas Pitre, git Mozilla CVS files have between 1 and 1700 deltas. The average is ten, but the deviation is high. I believe fewer than 10 files have over 1000 deltas; they are all in the root directory and related to the build process. Many files have no deltas, or because of CVS artifacts all of the revisions are identical. I am still IO bound. Random access IO is the problem, not stream IO. I have to open and read 110,000 (5GB total) files. It takes about 2hrs to do all of the IO. I'm not CPU bound yet but as we make things more efficient, I am getting closer to being CPU bound. Forking is not an option. It can take days to fork 1M copies of an app. I have used oprofile on parsecvs. It spends 60% of the time in the kernel processing fork calls. Parsecvs runs for 6hrs on mozcvs and dies without finishing. I am back to working on the branch code. I'm over the cold I got from my 2 yr old. It is slow going now. I am in the phase where the import process runs without error for 5-10 minutes and then dies from some unusual branch case. I fix it up and try again. I am slowly identifying and removing all the code in cvs2svn that puts the branches and symbols into their own subdirectories. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 7:06 ` Shawn Pearce 2006-08-21 14:07 ` Jon Smirl @ 2006-08-21 15:46 ` Nicolas Pitre 2006-08-21 16:14 ` Jon Smirl 1 sibling, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 15:46 UTC (permalink / raw) To: Shawn Pearce; +Cc: Jon Smirl, git On Mon, 21 Aug 2006, Shawn Pearce wrote: > Nicolas Pitre <nico@cam.org> wrote: > > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > > > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > > > > A better way to get such a size saving is to increase the window and > > > > depth parameters. For example, a window of 20 and depth of 20 can > > > > usually provide a pack size saving greater than 11% with none of the > > > > disadvantages mentioned above. > > > > > > Our window size is effectively infinite. I am handing him all of the > > > revisions from a single file in optimal order. This includes branches. > > > > In GIT packing terms this is infinite delta _depth_ not _window_. > > We're not using infinite anything. > > fast-import is basically doing window=1 and depth=10. Yes, that is what I figured out looking at the code. > We only examine the last blob to see if we can get a delta against > it. If we do we write that delta out; otherwise we reset our delta > chain and write the complete object. We also reset our chain after > writing out 10 deltas, each of which used the immediately prior > object as its base. I think this is a perfectly sensible thing to do for the initial import pass. This will get the data in GIT format quickly and save disk space over no deltas at all. No need to seek for optimal packing at that point but only lay things out for a faster git-repack pass to complete the import. Also, like I said before, you cannot get a nice object layout from your import process since all revisions of each file end up next to each other while in practice what you want is all objects for each _commit_ to be next to each other. Otherwise checking out a commit will seek all over the pack like mad. A final repack using git-repack will create that nice object layout for you and that is something you cannot do on the fly. The git-pack-objects code is also where all the accumulated knowledge for best delta packing heuristics has been created over the last year. It would be counter productive to cut yourself from its value. If anything more clever can be done with packing it probably should be done there. Finally, I don't see how you could create nice deltas for tree objects on the fly like you do with blobs without making your code significantly more complex and use more memory due to the recursive nature of directory information. This is no a problem for git-pack-objects since tree objects are no different from blob objects when deltifying them. Again this is why I think you should focus on the quickest generation of a reasonable intermediate pack and let git-repack make it optimal. > Since I just found out that in some cases the Mozilla repository has > 1000s of revisions per file[*1*] and in others only 1 revision per > file we probably should be adjusting this depth to have a maximum > of 500 while also having the frontend send us a "I'm switching > files now" marker so we know to not even bother trying to delta > the new blob against the last blob as they are likely to not > delta well[*2*]. No please don't do that. A deep delta chain won't produce a dramatic size difference but it _will_ make it really costly to read objects back from the pack later on. 
For example it might make the next git-repack many times longer but the size saving will not be more than 20%. If you agree with me that a final git-repack is the best thing to do then the intermediate pack doesn't have to be that small. > Although complex the aggressive and grouped deltas appears to > have saved you 18.2% on this repository. That's not something > to ignore. But the aggressive delta alone saved 13% already. The remaining saving is questionable since, like I said, it prevents the delta data reuse optimization which is so nice in order to reduce the load on servers to nearly zero. > A reasonably optimal local pack dictionary could save > at least 4%[*3*]. Whacking 22% off a 400 MB pack is saving 88 MB. This is still speculations though. The separate dictionary is certainly a good idea to try out and it is way more flexible than the grouped deltas. But I doubt the combined concepts will all add up size savings linearly. > Transferring that over the network on an initial clone is like > downloading all of Eclipse. Or an uncompressed kernel tarball... ... something that most people can afford these days. Plus, not using grouped objects would make future shallow clones much less costly on the server as well, while this is something that most people will be most interested in with such large repos and which has much higher bandwidth saver value. Of course this is a trade-off. Tighter packs are likely to be more costly to use. So far we've contained that cost to the _initial_ generation of the pack (using -a -f with git-repack) while subsequent repacking for network transfer were basically free on CPU. The grouped object concept saves around 7 to 11% on pack size in my tests, but it impose a non negligible cost on any further repacking as well as an additional cost on the fetching of objects out of the pack. > [*3*] I wrote a brain-dead simple local dictionary selecter in Perl. > Its horribly far from being ideal. But it is consistently > saving us 4% on the GIT and the Mozilla repository and its > pretty darn fast. Shockingly the C keywords didn't gain > us very much here; its project specific text that's the > real win. > > Looking at chunks which are frequently copied in deltas > from base objects and breaking those chunks up into > smaller common chunks, then loading those most frequent > common chunks into the pack dictionary would most likely > produce far better results. I think such a global dictionary per pack is something that could be really nice. Again it limits the cost to the initial repack process while having no cost impact neither on the delta reuse optimization nor on the object checkout which is not the case for object grouping. Certainly worth experimenting ... but independently from any import process I think. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
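The preset-dictionary idea, by contrast, maps onto calls zlib already provides: the writer primes each stream with deflateSetDictionary() right after deflateInit(), and the reader feeds the same bytes to inflateSetDictionary() when inflate() returns Z_NEED_DICT. A minimal sketch of the writer side, with the hard part (choosing the dictionary contents) left out and error checking omitted:

    #include <string.h>
    #include <zlib.h>

    static unsigned long deflate_with_dict(unsigned char *out, unsigned long outlen,
                                           const unsigned char *in, unsigned long inlen,
                                           const unsigned char *dict, unsigned int dictlen)
    {
        z_stream s;

        memset(&s, 0, sizeof(s));
        deflateInit(&s, Z_DEFAULT_COMPRESSION);
        deflateSetDictionary(&s, dict, dictlen);   /* prime with pack-wide common text */
        s.next_in = (unsigned char *)in;
        s.avail_in = inlen;
        s.next_out = out;
        s.avail_out = outlen;
        deflate(&s, Z_FINISH);
        deflateEnd(&s);
        return s.total_out;
    }

Because the dictionary only seeds the compression window of each individual stream, it is consistent with Nicolas' point that it leaves the delta-reuse and object-checkout paths untouched.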
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 15:46 ` Nicolas Pitre @ 2006-08-21 16:14 ` Jon Smirl 2006-08-21 17:48 ` Nicolas Pitre 0 siblings, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-21 16:14 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn Pearce, git How about making the length of delta chains an exponential function of the number of revs? In Mozilla, configure.in has 1,700 revs and it is a 250K file. If you store a full copy every 10 revs that is 43MB (prezip) of data that almost no one is going to look at. The chain lengths should reflect the relative probability that someone is going to ask to see the revs. That is not at all a uniform function. Personally I am still in favor of a two-pack system. One archival pack stores everything in a single chain, and size, not speed, is its most important attribute. It is marked read-only and only functions as an archive; git-repack never touches it. It might even use a more compact compression algorithm. The second pack is for storing more recent revisions. The archival pack would be constructed such that none of the files needed for the head revisions of any branch are in it. They would all be in the second pack. Over time the second pack may grow large and another archival pack can be created. The first one would still be maintained in its read-only form. git could be optimized to always search for objects in non-archival packs before even opening the index of an archival one. This may be a path to partial repositories. Instead of downloading the real archival pack I could download just an index for it. The index entries would be marked to indicate that these objects are valid but not-present. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
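For the record, the 43MB figure is just

    \frac{1700}{10} \times 250\,\mathrm{KB} = 170 \times 250\,\mathrm{KB} \approx 42.5\,\mathrm{MB}

of uncompressed full copies for a single file, which is what makes a uniform store-a-full-copy-every-10-revs rule look wasteful for rarely read history.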
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 16:14 ` Jon Smirl @ 2006-08-21 17:48 ` Nicolas Pitre 2006-08-21 17:55 ` Nicolas Pitre 0 siblings, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 17:48 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Mon, 21 Aug 2006, Jon Smirl wrote: > How about making the length of delta chains an exponential function of > the number of revs? In Mozilla configure.in has 1,700 revs and it is a > 250K file. If you store a full copy every 10 revs that is 43MB > (prezip) of data that almost no one is going to look at. > The chains > lengths should reflect the relative probability that someone is going > to ask to see the revs. That is not at all a uniform function. 1) You can do that already with stock git-repack. 2) The current git-pack-objects code can and does produce a delta "tree" off of a single base object. It doesn't have to be a linear list. Therefore, even if the default depth is 10, you may as well have many deltas pointing to the _same_ base object effectively making the compression ratio much larger than 1/10. If for example each object has 2 delta childs, and each of those deltas also have 2 delta childs, you could have up to 39366 delta objects attached to a _single_ undeltified base object. And of course the git-pack-objects code doesn't limit the number of delta childs in any way so this could theoritically be infinite even though the max depth is 10. OK the delta matching window limits that somehow but nothing prevents you from repacking with a larger window since that parameter has no penalty on the reading of objects out of the pack. > Personally I am still in favor of a two pack system. One archival pack > stores everything in a single chain and size, not speed, is it's most > important attribute. It is marked readonly and only functions as an > archive; git-repack never touches it. It might even use a more compact > compression algorithm. > > The second pack is for storing more recent revisions. The archival > pack would be constructed such that none of the files needed for the > head revisions of any branch are in it. They would all be in the > second pack. Personally I'm still against that. All arguments put forward for a different or multiple packing system are based on unproven assumptions so far and none of those arguments actually present significant advantages over the current system. For example, I was really intriged by the potential of object grouping into a single zlib stream at first, but it turns out not to be so great after actual testing. I still think that a global zlib dictionary is a good idea. Not because it looks like it'll make packs enormously smaller, but rather because it impose no performance regression over the current system. And I strongly believe that you have to have a really strong case for promoting a solution that carries performance regression, something like over 25% smaller packs for example. But so far it didn't happen. > This may be a path to partial repositories. Instead of downloading the > real archival pack I could download just an index for it. The index > entries would be marked to indicate that these objects are valid but > not-present. An index without the actual objects is simply useless. You could do without the index entirely in that case anyway since the mere presence of an entry in the index doesn't give you anything really useful. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 17:48 ` Nicolas Pitre @ 2006-08-21 17:55 ` Nicolas Pitre 2006-08-21 18:01 ` Nicolas Pitre 0 siblings, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 17:55 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Mon, 21 Aug 2006, Nicolas Pitre wrote: > If for example each object has 2 delta childs, and each of those deltas > also have 2 delta childs, you could have up to 39366 delta objects > attached to a _single_ undeltified base object. Sorry I've got the math wrong. That is 3582 deltas, given a conservative number of delta childs = 2. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 17:55 ` Nicolas Pitre @ 2006-08-21 18:01 ` Nicolas Pitre 0 siblings, 0 replies; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 18:01 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Mon, 21 Aug 2006, Nicolas Pitre wrote: > On Mon, 21 Aug 2006, Nicolas Pitre wrote: > > > If for example each object has 2 delta childs, and each of those deltas > > also have 2 delta childs, you could have up to 39366 delta objects > > attached to a _single_ undeltified base object. > > Sorry I've got the math wrong. That is 3582 deltas, given a > conservative number of delta childs = 2. OK I just can't count. That is 2046. But since I made a fool of myself already, why not try with 3 delta childs for a delta depth of 10 just for fun, now that I should have it right. The answer is 88572. Anyway I hope you've got my point now. ;-) Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
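The numbers being corrected here are a geometric series: with b delta children per object and a maximum delta depth d, a single undeltified base can carry

    \sum_{k=1}^{d} b^k = \frac{b^{d+1} - b}{b - 1}

deltas, so d = 10 gives 2^{11} - 2 = 2046 for b = 2 and (3^{11} - 3)/2 = 88572 for b = 3, the two figures Nicolas settles on.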