* Huge win, compressing a window of delta runs as a unit
@ 2006-08-16 17:20 Jon Smirl
2006-08-17 4:07 ` Shawn Pearce
2006-08-18 4:03 ` Nicolas Pitre
0 siblings, 2 replies; 36+ messages in thread
From: Jon Smirl @ 2006-08-16 17:20 UTC (permalink / raw)
To: Shawn Pearce, git
Shawn put together a new version of his import utility that packs all
of the deltas from a run into a single blob instead of one blob per
delta. The idea is to put 10 or more deltas into each delta entry
instead of one. The index format would map the 10 sha1's to a single
packed delta entry which would be expanded when needed. Note that you
probably needed multiple entries out of the delta pack to generate the
revision you were looking for so this is no real loss on extraction.
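Conceptually the layout is something like this (a rough sketch with
made-up names, not Shawn's actual fast-import code):

struct packed_delta_group {
        unsigned int  nr_deltas;        /* e.g. 10 deltas share one entry */
        unsigned char sha1[10][20];     /* the objects this group can rebuild */
        /* ...followed by ONE zlib stream holding all 10 delta payloads,
         * instead of 10 separately deflated delta entries. */
};

The index maps each of those 10 sha1's to the same entry; rebuilding
any one of them means inflating the shared stream far enough and
applying the deltas in order, which is mostly work you had to do anyway.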
I ran it overnight on mozcvs. If his delta pack code is correct this
is a huge win.
One entry per delta - 845,420,150
Packed deltas - 295,018,474
65% smaller
The effect of packing the deltas is to totally eliminate many of the
redundant zlib dictionaries.
This is without using a zlib dictionary, which gains another 4% from a
4KB dictionary.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-16 17:20 Huge win, compressing a window of delta runs as a unit Jon Smirl
@ 2006-08-17 4:07 ` Shawn Pearce
2006-08-17 7:56 ` Johannes Schindelin
2006-08-17 17:22 ` Nicolas Pitre
2006-08-18 4:03 ` Nicolas Pitre
1 sibling, 2 replies; 36+ messages in thread
From: Shawn Pearce @ 2006-08-17 4:07 UTC (permalink / raw)
To: Jon Smirl; +Cc: git
Jon Smirl <jonsmirl@gmail.com> wrote:
> Shawn put together a new version of his import utility that packs all
> of the deltas from a run into a single blob instead of one blob per
> delta. The idea is to put 10 or more deltas into each delta entry
> instead of one. The index format would map the 10 sha1's to a single
> packed delta entry which would be expanded when needed. Note that you
> probably needed multiple entries out of the delta pack to generate the
> revision you were looking for so this is no real loss on extraction.
>
> I ran it overnight on mozcvs. If his delta pack code is correct this
> is a huge win.
>
> One entry per delta - 845,420,150
> Packed deltas - 295,018,474
> 65% smaller
>
> The effect of packing the deltas is to totally eliminate many of the
> redundant zlib dictionaries.
I'm going to try to integrate this into core GIT this weekend.
My current idea is to make use of the OBJ_EXT type flag to add
an extended header field behind the length which describes the
"chunk" as being a delta chain compressed in one zlib stream.
I'm not overly concerned about saving lots of space in the header
here as it looks like we're winning a huge amount of pack space,
so the extended header will probably itself be a couple of bytes.
This keeps the shorter reserved types free for other great ideas. :)
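As a strawman, the extended header could look like this (purely
illustrative; the names and layout are assumptions, not a final
encoding):

/* Strawman OBJ_EXT header: the reserved extension type (assumed to be
 * value 0 here), one sub-type byte marking "delta chain in a single
 * zlib stream", then a varint count of the chained deltas. */
#define EXT_DELTA_CHAIN 1

static int encode_ext_header(unsigned long nr_deltas, unsigned char *hdr)
{
        int n = 0;
        hdr[n++] = 0 << 4;                 /* OBJ_EXT entry */
        hdr[n++] = EXT_DELTA_CHAIN;        /* which extension follows */
        while (nr_deltas >= 0x80) {        /* varint delta count */
                hdr[n++] = (nr_deltas & 0x7f) | 0x80;
                nr_deltas >>= 7;
        }
        hdr[n++] = nr_deltas;
        return n;                          /* a couple of bytes for small runs */
}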
My primary goal of integrating it into core GIT is to take
advantage of verify-pack to check the file fast-import is producing.
Plus having support for it in sha1_file.c will make it easier to
performance test the common access routines that need to be fast,
like commit and tree walking.
My secondary goal is to get a patchset which other folks can try
on their own workloads to see if it's as effective as what Jon is
seeing on the Mozilla archive.
Unfortunately I can't think of a way to make this type of pack
readable by older software. So this could be creating a pretty
big change in the pack format, relatively speaking. :)
--
Shawn.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 4:07 ` Shawn Pearce
@ 2006-08-17 7:56 ` Johannes Schindelin
2006-08-17 8:07 ` Johannes Schindelin
2006-08-17 17:22 ` Nicolas Pitre
1 sibling, 1 reply; 36+ messages in thread
From: Johannes Schindelin @ 2006-08-17 7:56 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jon Smirl, git
Hi,
On Thu, 17 Aug 2006, Shawn Pearce wrote:
> Jon Smirl <jonsmirl@gmail.com> wrote:
> >
> > One entry per delta - 845,420,150
> > Packed deltas - 295,018,474
> > 65% smaller
Impressed.
> Unfortunately I can't think of a way to make this type of pack
> readable by older software. So this could be creating a pretty
> big change in the pack format, relatively speaking. :)
Not necessarily: the format could stay the same, where multiple object
names point to the same entry.
It should not be a problem, as long as this change is kept local: the same
git tools which created this kind of pack have to read it.
However, I see some real problems when a public repository has a pack like
this: if the client is older, the server has to do some real work,
especially in the case of a fresh clone. It gets even worse with a public
HTTP repo.
The obvious short-term solution would be to make this opt-in, and use it
only on private repositories.
Then, one could think about how to handle the delta-chains in the pack
protocol (it should be easy, since you just have to store one object
instead of a few), and make this a "supports" option.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 7:56 ` Johannes Schindelin
@ 2006-08-17 8:07 ` Johannes Schindelin
2006-08-17 14:36 ` Jon Smirl
0 siblings, 1 reply; 36+ messages in thread
From: Johannes Schindelin @ 2006-08-17 8:07 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jon Smirl, git
Hi,
On Thu, 17 Aug 2006, Johannes Schindelin wrote:
> However, I see some real problems when a public repository has a pack like
> this: if the client is older, the server has to do some real work,
> especially in the case of a fresh clone. It gets even worse with a public
> HTTP repo.
>
> The obvious short-term solution would be to make this opt-in, and use it
> only on private repositories.
>
> Then, one could think about how to handle the delta-chains in the pack
> protocol (it should be easy, since you just have to store one object
> instead of a few), and make this a "supports" option.
Oh, and I completely forgot: IIUC you have to unpack the (possibly big)
delta-chain until you have the delta you really want, right? So, you might
want to have a clever mechanism to keep the delta-chains short for recent
objects (i.e. objects you are likely to need more often).
At least, the delta-chains should be limited by size (_not_ by number of
deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
before getting to the huge delta you really need, it takes really long).
Ciao,
Dscho
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 8:07 ` Johannes Schindelin
@ 2006-08-17 14:36 ` Jon Smirl
2006-08-17 15:45 ` Johannes Schindelin
0 siblings, 1 reply; 36+ messages in thread
From: Jon Smirl @ 2006-08-17 14:36 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Shawn Pearce, git
On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> At least, the delta-chains should be limited by size (_not_ by number of
> deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> before getting to the huge delta you really need, it takes really long).
This is not an obvious conclusion. Reducing our pack from 845MB to
292MB eliminates a huge amount of IO. All of the work I am doing with
this pack has been totally IO bound. The CPU time to get from one
delta to another is minor compared to the IO time increase if you
break the chains up. Each time you break the chain you have to store a
full copy of the file; the IO increase from doing this outweighs the
CPU time reduction.
I would use a two pack model. One pack holds the 292MB archive and the
second pack is made from recent changes. They can both be the same
format since the chains in the second pack won't be very long. We
might want to put a flag into an archive pack that tells git-repack
not to touch it by default.
As for public servers there is an immediate win in shifting to the new
pack format. Three hour downloads vs one hour, plus the bandwidth
bills. Are the tools smart enough to say this is a newer pack format,
upgrade? It takes far less than two hours to upgrade your git install.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 14:36 ` Jon Smirl
@ 2006-08-17 15:45 ` Johannes Schindelin
2006-08-17 16:33 ` Nicolas Pitre
2006-08-17 17:17 ` Jon Smirl
0 siblings, 2 replies; 36+ messages in thread
From: Johannes Schindelin @ 2006-08-17 15:45 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
Hi,
On Thu, 17 Aug 2006, Jon Smirl wrote:
> On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > At least, the delta-chains should be limited by size (_not_ by number of
> > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > before getting to the huge delta you really need, it takes really long).
>
> This is not an obvious conclusion.
The big win is bought by having _one_ zlib stream for multiple deltas,
right?
So, every time you want to access the _last_ delta in the chain, you unpack
_all_ of them. This quite obvious conclusion is obviously your reason to
propose two packs, one for the archive of "old" objects, and one for the
"new" objects.
> As for public servers there is an immediate win in shifting to the new
> pack format. Three hour downloads vs one hour, plus the bandwidth
> bills. Are the tools smart enough to say this is a newer pack format,
> upgrade? It takes far less than two hours to upgrade your git install.
Have you thought about a non-upgraded client? The server has to repack in
that case, and if the client wants a clone, you might not even have the
time to kiss your server good-bye before it goes belly up.
There is an obvious choice to be made here about how fast people would
upgrade their servers.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 15:45 ` Johannes Schindelin
@ 2006-08-17 16:33 ` Nicolas Pitre
2006-08-17 17:05 ` Johannes Schindelin
2006-08-17 17:22 ` Jon Smirl
2006-08-17 17:17 ` Jon Smirl
1 sibling, 2 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-17 16:33 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Jon Smirl, Shawn Pearce, git
On Thu, 17 Aug 2006, Johannes Schindelin wrote:
> Hi,
>
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > At least, the delta-chains should be limited by size (_not_ by number of
> > > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > > before getting to the huge delta you really need, it takes really long).
> >
> > This is not an obvious conclusion.
>
> The big win is bought by having _one_ zlib stream for multiple deltas,
> right?
>
> So, every time you want to access the _last_ delta in the chain, you unpack
> _all_ of them.
This is the case whether deltas are separately deflated or not.
> This quite obvious conclusion is obviously your reason to
> propose two packs, one for the archive of "old" objects, and one for the
> "new" objects.
Old objects are usually further down the delta chain and also stored
further from the beginning of the pack. Hence "new" objects could still
have quick access: even if a delta chain is all in the same zlib
stream, it is likely that inflating the whole zlib stream won't be
necessary to get at "new" objects, just like now, where only the
needed deltas are inflated.
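For illustration, getting at an object near the front of a shared
stream needs only something like this (a minimal zlib sketch, not
git's actual sha1_file.c code):

#include <string.h>
#include <zlib.h>

/* Inflate only the first `want` bytes of a deflated stream; the tail
 * of the stream (the older deltas) is never touched. */
static int inflate_prefix(unsigned char *src, unsigned long srclen,
                          unsigned char *dst, unsigned long want)
{
        z_stream s;
        int st;

        memset(&s, 0, sizeof(s));
        if (inflateInit(&s) != Z_OK)
                return -1;
        s.next_in = src;
        s.avail_in = srclen;
        s.next_out = dst;
        s.avail_out = want;
        do {
                st = inflate(&s, Z_NO_FLUSH);
        } while (st == Z_OK && s.avail_out);
        inflateEnd(&s);
        return (st == Z_OK || st == Z_STREAM_END) ? 0 : -1;
}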
> > As for public servers there is an immediate win in shifting to the new
> > pack format. Three hour downloads vs one hour, plus the bandwidth
> > bills. Are the tools smart enough to say this is a newer pack format,
> > upgrade? It takes far less than two hours to upgrade your git install.
>
> Have you thought about a non-upgraded client? The server has to repack in
> that case, and if the client wants a clone, you might not even have the
> time to kiss your server good-bye before it goes belly up.
Nah. The only overhead for a server to feed an "old" pack format from a
"new" pack file is to inflate and deflate some data. That is not _that_
costly.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 16:33 ` Nicolas Pitre
@ 2006-08-17 17:05 ` Johannes Schindelin
2006-08-17 17:22 ` Jon Smirl
1 sibling, 0 replies; 36+ messages in thread
From: Johannes Schindelin @ 2006-08-17 17:05 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jon Smirl, Shawn Pearce, git
Hi,
On Thu, 17 Aug 2006, Nicolas Pitre wrote:
> On Thu, 17 Aug 2006, Johannes Schindelin wrote:
>
> > On Thu, 17 Aug 2006, Jon Smirl wrote:
> >
> > > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > > At least, the delta-chains should be limited by size (_not_ by number of
> > > > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > > > before getting to the huge delta you really need, it takes really long).
> > >
> > > This is not an obvious conclusion.
> >
> > The big win is bought by having _one_ zlib stream for multiple deltas,
> > right?
> >
> > So, every time you want to access the _last_ delta in the chain, you unpack
> > _all_ of them.
>
> This is the case whether deltas are separately deflated or not.
Oh, now I get it! The delta chain consists of exactly those deltas
which are needed to reconstruct a certain object from another object
which was stored undeltified in the pack.
Now it makes sense to me that they could be bundled before being deflated.
In the pack-index, the objects of that delta-chain could just point to
the delta-chain object, which has a mini-index in its extended header.
Or did I misunderstand again?
Ciao,
Dscho
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 15:45 ` Johannes Schindelin
2006-08-17 16:33 ` Nicolas Pitre
@ 2006-08-17 17:17 ` Jon Smirl
2006-08-17 17:32 ` Nicolas Pitre
1 sibling, 1 reply; 36+ messages in thread
From: Jon Smirl @ 2006-08-17 17:17 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Shawn Pearce, git
On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > At least, the delta-chains should be limited by size (_not_ by number of
> > > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > > before getting to the huge delta you really need, it takes really long).
> >
> > This is not an obvious conclusion.
>
> The big win is bought by having _one_ zlib stream for multiple deltas,
> right?
>
> So, every time you want to access the _last_ delta in the chain, you unpack
> _all_ of them. This quite obvious conclusion is obviously your reason to
> propose two packs, one for the archive of "old" objects, and one for the
> "new" objects.
Do some measurements; the IO vs. CPU time trade-off is way in favor of
eliminating the IO. It really doesn't take very much CPU to unpack the
delta chain.
The two-pack proposal was not about reducing the delta chain length;
they are reverse deltas, so the newest version is always at the front.
Two packs are used to avoid repacking the 280MB pack when you do a
repack command. It takes 2-3 hours to repack 280MB. Even if you
just copy the old pack to the new one it takes 30 minutes to do all of
the IO.
> > As for public servers there is an immediate win in shifting to the new
> > pack format. Three hour downloads vs one hour, plus the bandwidth
> > bills. Are the tools smart enough to say this is a newer pack format,
> > upgrade? It takes far less than two hours to upgrade your git install.
>
> Have you thought about a non-upgraded client? The server has to repack in
> that case, and if the client wants a clone, you might not even have the
> time to kiss your server good-bye before it goes belly up.
Is there a pack format version number built into the server protocol?
If not, there needs to be one. When there is a mismatch with the server
pack version, the client should simply display an error requesting an
upgrade of the client software.
Git should be designed for forward evolution, not infinite backward
compatibility. It is easy to upgrade your client to support the new
protocol. The protocol just needs to ensure that the client reliably
gets an error about the need to upgrade.
Forward evolution implies that a client is able to work with older
servers, but not the inverse, that new servers have to work with old
clients.
> There is an obvious choice here as to how fast people would upgrade their
> servers.
>
> Ciao,
> Dscho
>
>
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 4:07 ` Shawn Pearce
2006-08-17 7:56 ` Johannes Schindelin
@ 2006-08-17 17:22 ` Nicolas Pitre
2006-08-17 18:03 ` Jon Smirl
1 sibling, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-17 17:22 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jon Smirl, git
On Thu, 17 Aug 2006, Shawn Pearce wrote:
> I'm going to try to integrate this into core GIT this weekend.
> My current idea is to make use of the OBJ_EXT type flag to add
> an extended header field behind the length which describes the
> "chunk" as being a delta chain compressed in one zlib stream.
> I'm not overly concerned about saving lots of space in the header
> here as it looks like we're winning a huge amount of pack space,
> so the extended header will probably itself be a couple of bytes.
> This keeps the shorter reserved types free for other great ideas. :)
We're striving for optimal data storage here so don't be afraid to use
one of the available types for an "object stream" object. Because when
you think of it, the deflating of multiple objects into a single zlib
stream can be applied to all object types, not only deltas. If ever
deflating many blobs into one zlib stream is deemed worth it then the
encoding will already be ready for it. Also you can leverage existing
code to write headers, etc.
I'd suggest you use OBJ_GROUP = 0 as a new primary object type. The
"size" field in the header could then become the number of objects
that are included in the group. Most of the time that will fit in the
low 4 bits of the first header byte, but if there are more than 15
grouped objects then more bits can be used on the following byte.
Anyway so far all the code to generate and parse that is already there.
If ever there is a need for more extensions, they could be prefixed with
a pure zero byte (an object group with a zero object count, which is
distinguishable from a real group).
Then, having the number of grouped objects, you just have to list the
usual headers for those objects, which are their type and inflated size
just like regular object headers, including the base sha1 for deltas.
Again you already have code to produce and parse those.
And finally just append the objects' payload in a single deflated stream.
This way the reading of an object from a group can be optimized when the
object data is located near the beginning of the stream: you only
need to inflate the bytes leading up to the desired data
(possibly caching those for further delta replaying), inflate
the data for the desired object, and then ignore the remainder
of the stream.
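Reading such a group header back is just the existing varint scheme
again, roughly (a sketch that mirrors how pack entry headers are
already parsed, not new code):

static unsigned long decode_group_header(const unsigned char *p,
                                         unsigned int *type,
                                         unsigned long *count)
{
        const unsigned char *start = p;
        unsigned char c = *p++;
        int shift = 4;

        *type = (c >> 4) & 7;              /* OBJ_GROUP would be 0 */
        *count = c & 15;                   /* up to 15 objects in one byte */
        while (c & 0x80) {                 /* larger groups spill into more bytes */
                c = *p++;
                *count += (unsigned long)(c & 0x7f) << shift;
                shift += 7;
        }
        return p - start;                  /* header bytes consumed */
}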
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 16:33 ` Nicolas Pitre
2006-08-17 17:05 ` Johannes Schindelin
@ 2006-08-17 17:22 ` Jon Smirl
2006-08-17 18:15 ` Nicolas Pitre
1 sibling, 1 reply; 36+ messages in thread
From: Jon Smirl @ 2006-08-17 17:22 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Schindelin, Shawn Pearce, git
On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> Nah. The only overhead for a server to feed an "old" pack format from a
> "new" pack file is to inflate and deflate some data. That is not _that_
> costly.
It is costly because the Mozilla pack is going to inflate from 295MB
to 845MB. It will take you much longer to download an extra 500MB than
upgrading your client would take.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 17:17 ` Jon Smirl
@ 2006-08-17 17:32 ` Nicolas Pitre
2006-08-17 18:06 ` Jon Smirl
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-17 17:32 UTC (permalink / raw)
To: Jon Smirl; +Cc: Johannes Schindelin, Shawn Pearce, git
On Thu, 17 Aug 2006, Jon Smirl wrote:
> Is there a pack format version number built into the server protocol?
> If not, there needs to be one. When there is a mismatch with the server
> pack version, the client should simply display an error requesting an
> upgrade of the client software.
There is a pack version number. Reception of an unknown pack already
produces an error:
if (!pack_version_ok(hdr->hdr_version))
die("unknown pack file version %d", ntohl(hdr->hdr_version));
That will occur really early in a pull or clone when using the native
protocol.
> Git should be designed for forward evolution, not infinite backward
> compatibility. It is easy to upgrade your client to support the new
> protocol. The protocol just needs to ensure that the client reliably
> gets an error about the need to upgrade.
>
> Forward evolution implies that a client is able to work with older
> servers, but not the inverse, that new servers have to work with old
> clients.
And still, if we really wanted, the server could have the ability to
stream an old pack format by simply inflating the grouped objects and
deflating them individually all on the fly, which is not necessarily
that costly. But it might still be a good idea to simply have the
ability to generate both formats and make the grouped deltas the default
only after a while.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 17:22 ` Nicolas Pitre
@ 2006-08-17 18:03 ` Jon Smirl
2006-08-17 18:24 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Jon Smirl @ 2006-08-17 18:03 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn Pearce, git
On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> We're striving for optimal data storage here so don't be afraid to use
> one of the available types for an "object stream" object. Because when
> you think of it, the deflating of multiple objects into a single zlib
> stream can be applied to all object types, not only deltas. If ever
> deflating many blobs into one zlib stream is deemed worth it then the
> encoding will already be ready for it. Also you can leverage existing
> code to write headers, etc.
Here are two more cases that need to be accounted for in the packs.
1) If you zip something and it gets bigger, we need an entry that just
stores the object without it being zipped. Zipping jpegs or mpegs will
likely make them significantly bigger. Or does zlib already
detect this case and do a null compression?
2) If you delta something and the delta is bigger than the object
being deltaed, the delta code should detect this and store the full
object instead of the delta. Again jpegs and mpegs will trigger this.
You may even want to say that the delta has to be smaller than 80% of
the full object.
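Something like the following cutoff is what case 2 amounts to (a
hypothetical helper for illustration, not existing git code):

/* Keep a delta only if it is under 80% of the full object's size;
 * otherwise store the object whole. */
static int delta_is_worth_it(unsigned long delta_size, unsigned long obj_size)
{
        return delta_size * 100 < obj_size * 80;
}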
Shawn is planning on looking into true dictionary based compression.
That will generate even more types inside a pack. If dictionary
compression works out, full-text search can be added with a little more
overhead. True dictionary based compression has the potential to go
even smaller than the current 280MB. The optimal way to do this is for
each pack to contain its own dictionary.
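On the zlib side a per-pack dictionary is just the stock
preset-dictionary API, roughly (a sketch, nothing git-specific):

#include <string.h>
#include <zlib.h>

/* Prime a deflate stream with a preset dictionary (e.g. a few KB of
 * strings common to the pack).  The reader must pass the very same
 * dictionary to inflateSetDictionary() when inflate() asks for it. */
static int deflate_init_with_dict(z_stream *zs,
                                  const unsigned char *dict, unsigned int dictlen)
{
        memset(zs, 0, sizeof(*zs));
        if (deflateInit(zs, Z_DEFAULT_COMPRESSION) != Z_OK)
                return -1;
        return deflateSetDictionary(zs, dict, dictlen) == Z_OK ? 0 : -1;
}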
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 17:32 ` Nicolas Pitre
@ 2006-08-17 18:06 ` Jon Smirl
0 siblings, 0 replies; 36+ messages in thread
From: Jon Smirl @ 2006-08-17 18:06 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Johannes Schindelin, Shawn Pearce, git
On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > Is there a pack format version number built into the server protocol?
> > If not, there needs to be one. When there is a mismatch with the server
> > pack version, the client should simply display an error requesting an
> > upgrade of the client software.
>
> There is a pack version number. Reception of an unknown pack already
> produces an error:
>
> if (!pack_version_ok(hdr->hdr_version))
> die("unknown pack file version %d", ntohl(hdr->hdr_version));
>
> That will occur really early in a pull or clone when using the native
> protocol.
Can this be fixed so that if the pack protocol coming from the server
is newer than the one handled by the client, the client gets a nice
hint about upgrading?
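Even just changing the message would be enough, something like
(wording hypothetical):

if (!pack_version_ok(hdr->hdr_version))
        die("pack version %d not supported; please upgrade your git client",
            ntohl(hdr->hdr_version));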
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 17:22 ` Jon Smirl
@ 2006-08-17 18:15 ` Nicolas Pitre
0 siblings, 0 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-17 18:15 UTC (permalink / raw)
To: Jon Smirl; +Cc: Johannes Schindelin, Shawn Pearce, git
On Thu, 17 Aug 2006, Jon Smirl wrote:
> On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> > Nah. The only overhead for a server to feed an "old" pack format from a
> > "new" pack file is to inflate and deflate some data. That is not _that_
> > costly.
>
> It is costly because the Mozilla pack is going to inflate from 295MB
> to 845MB. It will take you much longer to download an extra 500MB than
> upgrading your client would take.
I didn't say it isn't costly. I said that it isn't _that_ costly to the
point of bringing down a server.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-17 18:03 ` Jon Smirl
@ 2006-08-17 18:24 ` Nicolas Pitre
0 siblings, 0 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-17 18:24 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
On Thu, 17 Aug 2006, Jon Smirl wrote:
> On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> > We're striving for optimal data storage here so don't be afraid to use
> > one of the available types for an "object stream" object. Because when
> > you think of it, the deflating of multiple objects into a single zlib
> > stream can be applied to all object types, not only deltas. If ever
> > deflating many blobs into one zlib stream is deemed worth it then the
> > encoding will already be ready for it. Also you can leverage existing
> > code to write headers, etc.
>
> Here are two more cases that need to be accounted for in the packs.
>
> 1) If you zip something and it gets bigger, we need an entry that just
> stores the object without it being zipped. Zipping jpegs or mpegs will
> likely make them significantly bigger. Or does zlib already
> detect this case and do a null compression?
zlib appears to do the right thing here.
> 2) If you delta something and the delta is bigger than the object
> being deltaed, the delta code should detect this and store the full
> object instead of the delta. Again jpegs and mpegs will trigger this.
> You may even want to say that the delta has to be smaller than 80% of
> the full object.
I'd invite you to read the try_delta() code in
[builtin-]pack-objects.c. You might find that we're already way
more aggressive than that, with rather simple heuristics that have been
tweaked and tuned for over a year now. The commit logs for that file
(before the rename to builtin-pack-objects.c) are also quite
enlightening.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-16 17:20 Huge win, compressing a window of delta runs as a unit Jon Smirl
2006-08-17 4:07 ` Shawn Pearce
@ 2006-08-18 4:03 ` Nicolas Pitre
2006-08-18 12:53 ` Jon Smirl
2006-08-18 13:15 ` Jon Smirl
1 sibling, 2 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-18 4:03 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
On Wed, 16 Aug 2006, Jon Smirl wrote:
> Shawn put together a new version of his import utility that packs all
> of the deltas from a run into a single blob instead of one blob per
> delta. The idea is to put 10 or more deltas into each delta entry
> instead of one. The index format would map the 10 sha1's to a single
> packed delta entry which would be expanded when needed. Note that you
> probably needed multiple entries out of the delta pack to generate the
> revision you were looking for so this is no real loss on extraction.
>
> I ran it overnight on mozcvs. If his delta pack code is correct this
> is a huge win.
>
> One entry per delta - 845,420,150
> Packed deltas - 295,018,474
> 65% smaller
Well, I have to state that I don't believe those results to be right.
Not that I question the results you obtained, but rather that Shawn's
pack generation is most probably broken.
Since I was highly suspicious looking at the above size difference (it
is just too good to be true), I did a quick hack to pack-objects.c to
produce packs with all deltas depending on the same root object deflated
into the same zlib stream. In effect I implemented exactly what Shawn
claims to have done, but directly in pack-objects.c so git-repack could
be used with existing repositories like the linux kernel in order to
validate the concept.
Results:
One entry per delta - 122,103,455
Packed (or grouped) deltas : 108,479,014
Only 11% smaller.
I was really enthusiastic at first when I saw the result you obtained,
but I no longer believe it can be right.
Furthermore, such a pack organization has many disadvantages: it
prevents trivial delta data reuse for incremental push/pull operations,
which is a huge performance penalty (bad for servers), as well as making
random object extraction from such a pack somewhat more complex. Those
disadvantages greatly outweigh the 11% size saving IMHO.
A better way to get such a size saving is to increase the window and
depth parameters. For example, a window of 20 and depth of 20 can
usually provide a pack size saving greater than 11% with none of the
disadvantages mentioned above.
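With the existing tools that is simply a repack with bigger
parameters, e.g. something like:

git-repack -a -d --window=20 --depth=20

at the cost of more CPU time during the repack itself.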
I therefore don't think this idea should be pursued any further.
For completeness you can find my changes to pack-objects below. Beware
that this produces packs that cannot be read back, so back up your
repository before playing with this.
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 448461b..bfae892 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -227,7 +227,7 @@ static int encode_header(enum object_typ
int n = 1;
unsigned char c;
- if (type < OBJ_COMMIT || type > OBJ_DELTA)
+ if (type < 0 || type > OBJ_DELTA)
die("bad type %d", type);
c = (type << 4) | (size & 15);
@@ -332,6 +332,8 @@ static unsigned long write_object(struct
return hdrlen + datalen;
}
+#if 0
+
static unsigned long write_one(struct sha1file *f,
struct object_entry *e,
unsigned long offset)
@@ -349,6 +351,107 @@ static unsigned long write_one(struct sh
return offset;
}
+#else
+
+static unsigned count_delta_childs(struct object_entry *e, unsigned long o)
+{
+ unsigned count = 0;
+ struct object_entry *child = e->delta_child;
+ while (child) {
+ count += 1 + count_delta_childs(child, o);
+ if (child->offset)
+ die("object already seen???");
+ child->offset = o;
+ child = child->delta_sibling;
+ }
+ return count;
+}
+
+static unsigned long write_delta_headers(struct sha1file *f, struct object_entry *e, unsigned long *size)
+{
+ unsigned long offset = 0;
+ struct object_entry *child = e->delta_child;
+ while (child) {
+ unsigned char header[10];
+ unsigned hdrlen;
+ hdrlen = encode_header(OBJ_DELTA, child->delta_size, header);
+ *size += child->delta_size;
+ sha1write(f, header, hdrlen);
+ sha1write(f, child->delta, 20);
+ offset += hdrlen + 20;
+ offset += write_delta_headers(f, child, size);
+ child = child->delta_sibling;
+ }
+ return offset;
+}
+
+static void * concat_delta_data(struct object_entry *e, void *buf)
+{
+ struct object_entry *child = e->delta_child;
+ while (child) {
+ char type[10];
+ unsigned long size;
+ void *data = read_sha1_file(child->sha1, type, &size);
+ if (!data)
+ die("unable to read %s", sha1_to_hex(child->sha1));
+ data = delta_against(data, size, child);
+ memcpy(buf, data, child->delta_size);
+ buf += child->delta_size;
+ free(data);
+ written++;
+ written_delta++;
+ buf = concat_delta_data(child, buf);
+ child = child->delta_sibling;
+ }
+ return buf;
+}
+
+static unsigned long write_one(struct sha1file *f,
+ struct object_entry *e,
+ unsigned long offset)
+{
+ unsigned char header[10];
+ unsigned count, hdrlen;
+ unsigned long size;
+ void *buf, *p;
+
+ if (e->offset)
+ /* offset starts from header size and cannot be zero
+ * if it is written already.
+ */
+ return offset;
+ if (!e->delta) {
+ e->offset = offset;
+ return offset + write_object(f, e);
+ }
+
+ /* find delta root object */
+ while (e->delta)
+ e = e->delta;
+
+ /* count how many deltas depend on it */
+ count = count_delta_childs(e, offset);
+
+ /* write header for object group */
+ hdrlen = encode_header(0, count, header);
+ sha1write(f, header, hdrlen);
+ offset += hdrlen;
+
+ /* write each object's header and find total data size */
+ size = 0;
+ offset += write_delta_headers(f, e, &size);
+
+ /* write concatenated object data buffer */
+ buf = xmalloc(size);
+ p = concat_delta_data(e, buf);
+ offset += sha1write_compressed(f, buf, p-buf);
+ free(buf);
+
+ return offset;
+}
+
+#endif
+
static void write_pack_file(void)
{
int i;
Nicolas
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 4:03 ` Nicolas Pitre
@ 2006-08-18 12:53 ` Jon Smirl
2006-08-18 16:30 ` Nicolas Pitre
2006-08-18 13:15 ` Jon Smirl
1 sibling, 1 reply; 36+ messages in thread
From: Jon Smirl @ 2006-08-18 12:53 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn Pearce, git
[-- Attachment #1: Type: text/plain, Size: 370 bytes --]
I attached Shawn's code. He is gone until Monday and can't defend it.
Do note that I am running this on the Mozilla CVS, which is over 10
years old. Some of the files have over 2,000 deltas. I average 10
deltas per file but the distribution is not at all even. Many files
get checked in and never change, for example 1000's of images.
--
Jon Smirl
jonsmirl@gmail.com
[-- Attachment #2: fast-import3.c --]
[-- Type: text/x-csrc, Size: 31535 bytes --]
/*
Format of STDIN stream:
stream ::= cmd*;
cmd ::= new_blob
| new_branch
| new_commit
| new_tag
;
new_blob ::= 'blob' lf
mark?
file_content;
file_content ::= data;
new_branch ::= 'branch' sp ref_str lf
('from' sp (ref_str | hexsha1 | sha1exp_str | idnum) lf)?
lf;
new_commit ::= 'commit' sp ref_str lf
mark?
('author' sp name '<' email '>' ts tz lf)?
'committer' sp name '<' email '>' ts tz lf
commit_msg
file_change*
lf;
commit_msg ::= data;
file_change ::= 'M' sp mode sp (hexsha1 | idnum) sp path_str lf
| 'D' sp path_str lf
;
mode ::= '644' | '755';
new_tag ::= 'tag' sp tag_str lf
'from' sp (ref_str | hexsha1 | sha1exp_str | idnum) lf
'tagger' sp name '<' email '>' ts tz lf
tag_msg;
tag_msg ::= data;
# note: the first idnum in a stream should be 1 and subsequent
# idnums should not have gaps between values as this will cause
# the stream parser to reserve space for the gapped values. An
# idnum can be updated in the future to a new object by issuing
# a new mark directive with the old idnum.
#
mark ::= 'mark' sp idnum lf;
# note: declen indicates the length of binary_data in bytes.
# declen does not include the lf preceeding or trailing the
# binary data.
#
data ::= 'data' sp declen lf
binary_data
lf;
# note: quoted strings are C-style quoting supporting \c for
# common escapes of 'c' (e..g \n, \t, \\, \") or \nnn where nnn
# is the signed byte value in octal. Note that the only
# characters which must actually be escaped to protect the
# stream formatting is: \, " and LF. Otherwise these values
# are UTF8.
#
ref_str ::= ref | '"' quoted(ref) '"' ;
sha1exp_str ::= sha1exp | '"' quoted(sha1exp) '"' ;
tag_str ::= tag | '"' quoted(tag) '"' ;
path_str ::= path | '"' quoted(path) '"' ;
declen ::= # unsigned 32 bit value, ascii base10 notation;
binary_data ::= # file content, not interpreted;
sp ::= # ASCII space character;
lf ::= # ASCII newline (LF) character;
# note: a colon (':') must precede the numerical value assigned to
# an idnum. This is to distinguish it from a ref or tag name as
# GIT does not permit ':' in ref or tag strings.
#
idnum ::= ':' declen;
path ::= # GIT style file path, e.g. "a/b/c";
ref ::= # GIT ref name, e.g. "refs/heads/MOZ_GECKO_EXPERIMENT";
tag ::= # GIT tag name, e.g. "FIREFOX_1_5";
sha1exp ::= # Any valid GIT SHA1 expression;
hexsha1 ::= # SHA1 in hexadecimal format;
# note: name and email are UTF8 strings, however name must not
# contain '<' or lf and email must not contain any of the
# following: '<', '>', lf.
#
name ::= # valid GIT author/committer name;
email ::= # valid GIT author/committer email;
ts ::= # time since the epoch in seconds, ascii base10 notation;
tz ::= # GIT style timezone;
*/
#include "builtin.h"
#include "cache.h"
#include "object.h"
#include "blob.h"
#include "tree.h"
#include "delta.h"
#include "pack.h"
#include "refs.h"
#include "csum-file.h"
#include "strbuf.h"
#include "quote.h"
struct object_entry
{
struct object_entry *next;
enum object_type type;
unsigned long offset;
unsigned char sha1[20];
};
struct object_entry_pool
{
struct object_entry_pool *next_pool;
struct object_entry *next_free;
struct object_entry *end;
struct object_entry entries[FLEX_ARRAY]; /* more */
};
struct last_object
{
void *data;
unsigned int len;
unsigned int depth;
unsigned char sha1[20];
};
struct mem_pool
{
struct mem_pool *next_pool;
char *next_free;
char *end;
char space[FLEX_ARRAY]; /* more */
};
struct atom_str
{
struct atom_str *next_atom;
int str_len;
char str_dat[FLEX_ARRAY]; /* more */
};
struct tree_content;
struct tree_entry
{
struct tree_content *tree;
struct atom_str* name;
unsigned int mode;
unsigned char sha1[20];
};
struct tree_content
{
unsigned int entry_capacity; /* must match avail_tree_content */
unsigned int entry_count;
struct tree_entry *entries[FLEX_ARRAY]; /* more */
};
struct avail_tree_content
{
unsigned int entry_capacity; /* must match tree_content */
struct avail_tree_content *next_avail;
};
struct branch
{
struct branch *table_next_branch;
struct branch *active_next_branch;
const char *name;
unsigned long last_commit;
struct tree_entry branch_tree;
unsigned char sha1[20];
};
/* Stats and misc. counters */
static int max_depth = 10;
static unsigned long alloc_count;
static unsigned long branch_count;
static unsigned long object_count;
static unsigned long duplicate_count;
static unsigned long object_count_by_type[9];
static unsigned long duplicate_count_by_type[9];
static unsigned long stream_cnt;
/* Memory pools */
static size_t mem_pool_alloc = 2*1024*1024 - sizeof(struct mem_pool);
static size_t total_allocd;
static struct mem_pool *mem_pool;
/* Atom management */
static unsigned int atom_table_sz = 4451;
static unsigned int atom_cnt;
static struct atom_str **atom_table;
/* The .pack file being generated */
static int pack_fd;
static unsigned long pack_offset;
static unsigned char pack_sha1[20];
static z_stream zs;
static int zs_live;
static void* zs_out;
static unsigned long zs_outlen;
/* Table of objects we've written. */
static unsigned int object_entry_alloc = 1000;
static struct object_entry_pool *blocks;
static struct object_entry *object_table[1 << 16];
/* Our last blob */
static struct last_object last_blob;
/* Tree management */
static unsigned int tree_entry_alloc = 1000;
static void *avail_tree_entry;
static unsigned int avail_tree_table_sz = 100;
static struct avail_tree_content **avail_tree_table;
/* Branch data */
static unsigned int max_active_branches = 5;
static unsigned int cur_active_branches;
static unsigned int branch_table_sz = 1039;
static struct branch **branch_table;
static struct branch *active_branches;
/* Input stream parsing */
static struct strbuf command_buf;
static unsigned long command_mark;
static void alloc_objects(int cnt)
{
struct object_entry_pool *b;
b = xmalloc(sizeof(struct object_entry_pool)
+ cnt * sizeof(struct object_entry));
b->next_pool = blocks;
b->next_free = b->entries;
b->end = b->entries + cnt;
blocks = b;
alloc_count += cnt;
}
static struct object_entry* new_object(unsigned char *sha1)
{
struct object_entry *e;
if (blocks->next_free == blocks->end)
alloc_objects(object_entry_alloc);
e = blocks->next_free++;
memcpy(e->sha1, sha1, sizeof(e->sha1));
return e;
}
static struct object_entry* find_object(unsigned char *sha1)
{
unsigned int h = sha1[0] << 8 | sha1[1];
struct object_entry *e;
for (e = object_table[h]; e; e = e->next)
if (!memcmp(sha1, e->sha1, sizeof(e->sha1)))
return e;
return NULL;
}
static struct object_entry* insert_object(unsigned char *sha1)
{
unsigned int h = sha1[0] << 8 | sha1[1];
struct object_entry *e = object_table[h];
struct object_entry *p = NULL;
while (e) {
if (!memcmp(sha1, e->sha1, sizeof(e->sha1)))
return e;
p = e;
e = e->next;
}
e = new_object(sha1);
e->next = NULL;
e->offset = 0;
if (p)
p->next = e;
else
object_table[h] = e;
return e;
}
static unsigned int hc_str(const char *s, size_t len)
{
unsigned int r = 0;
while (len-- > 0)
r = r * 31 + *s++;
return r;
}
static void* pool_alloc(size_t len)
{
struct mem_pool *p;
void *r;
for (p = mem_pool; p; p = p->next_pool)
if ((p->end - p->next_free >= len))
break;
if (!p) {
if (len >= (mem_pool_alloc/2)) {
total_allocd += len;
return xmalloc(len);
}
total_allocd += sizeof(struct mem_pool) + mem_pool_alloc;
p = xmalloc(sizeof(struct mem_pool) + mem_pool_alloc);
p->next_pool = mem_pool;
p->next_free = p->space;
p->end = p->next_free + mem_pool_alloc;
mem_pool = p;
}
r = p->next_free;
p->next_free += len;
return r;
}
static void* pool_calloc(size_t count, size_t size)
{
size_t len = count * size;
void *r = pool_alloc(len);
memset(r, 0, len);
return r;
}
static char* pool_strdup(const char *s)
{
char *r = pool_alloc(strlen(s) + 1);
strcpy(r, s);
return r;
}
static struct atom_str* to_atom(const char *s, size_t len)
{
unsigned int hc = hc_str(s, len) % atom_table_sz;
struct atom_str *c;
for (c = atom_table[hc]; c; c = c->next_atom)
if (c->str_len == len && !strncmp(s, c->str_dat, len))
return c;
c = pool_alloc(sizeof(struct atom_str) + len + 1);
c->str_len = len;
strncpy(c->str_dat, s, len);
c->str_dat[len] = 0;
c->next_atom = atom_table[hc];
atom_table[hc] = c;
atom_cnt++;
return c;
}
static struct branch* lookup_branch(const char *name)
{
unsigned int hc = hc_str(name, strlen(name)) % branch_table_sz;
struct branch *b;
for (b = branch_table[hc]; b; b = b->table_next_branch)
if (!strcmp(name, b->name))
return b;
return NULL;
}
static struct branch* new_branch(const char *name)
{
unsigned int hc = hc_str(name, strlen(name)) % branch_table_sz;
struct branch* b = lookup_branch(name);
if (b)
die("Invalid attempt to create duplicate branch: %s", name);
if (check_ref_format(name))
die("Branch name doesn't conform to GIT standards: %s", name);
b = pool_calloc(1, sizeof(struct branch));
b->name = pool_strdup(name);
b->table_next_branch = branch_table[hc];
branch_table[hc] = b;
branch_count++;
return b;
}
static unsigned int hc_entries(unsigned int cnt)
{
cnt = cnt & 7 ? (cnt / 8) + 1 : cnt / 8;
return cnt < avail_tree_table_sz ? cnt : avail_tree_table_sz - 1;
}
static struct tree_content* new_tree_content(unsigned int cnt)
{
struct avail_tree_content *f, *l = NULL;
struct tree_content *t;
unsigned int hc = hc_entries(cnt);
for (f = avail_tree_table[hc]; f; l = f, f = f->next_avail)
if (f->entry_capacity >= cnt)
break;
if (f) {
if (l)
l->next_avail = f->next_avail;
else
avail_tree_table[hc] = f->next_avail;
} else {
cnt = cnt & 7 ? ((cnt / 8) + 1) * 8 : cnt;
f = pool_alloc(sizeof(*t) + sizeof(t->entries[0]) * cnt);
f->entry_capacity = cnt;
}
t = (struct tree_content*)f;
t->entry_count = 0;
return t;
}
static void release_tree_entry(struct tree_entry *e);
static void release_tree_content(struct tree_content *t)
{
struct avail_tree_content *f = (struct avail_tree_content*)t;
unsigned int hc = hc_entries(f->entry_capacity);
unsigned int i;
for (i = 0; i < t->entry_count; i++)
release_tree_entry(t->entries[i]);
f->next_avail = avail_tree_table[hc];
avail_tree_table[hc] = f;
}
static struct tree_content* grow_tree_content(
struct tree_content *t,
int amt)
{
struct tree_content *r = new_tree_content(t->entry_count + amt);
r->entry_count = t->entry_count;
memcpy(r->entries,t->entries,t->entry_count*sizeof(t->entries[0]));
release_tree_content(t);
return r;
}
static struct tree_entry* new_tree_entry()
{
struct tree_entry *e;
if (!avail_tree_entry) {
unsigned int n = tree_entry_alloc;
avail_tree_entry = e = xmalloc(n * sizeof(struct tree_entry));
while (n--) {
*((void**)e) = e + 1;
e++;
}
}
e = avail_tree_entry;
avail_tree_entry = *((void**)e);
return e;
}
static void release_tree_entry(struct tree_entry *e)
{
if (e->tree)
release_tree_content(e->tree);
*((void**)e) = avail_tree_entry;
avail_tree_entry = e;
}
static void yread(int fd, void *buffer, size_t length)
{
ssize_t ret = 0;
while (ret < length) {
ssize_t size = xread(fd, (char *) buffer + ret, length - ret);
if (!size)
die("Read from descriptor %i: end of stream", fd);
if (size < 0)
die("Read from descriptor %i: %s", fd, strerror(errno));
ret += size;
}
}
static void ywrite(int fd, void *buffer, size_t length)
{
ssize_t ret = 0;
while (ret < length) {
ssize_t size = xwrite(fd, (char *) buffer + ret, length - ret);
if (!size)
die("Write to descriptor %i: end of file", fd);
if (size < 0)
die("Write to descriptor %i: %s", fd, strerror(errno));
ret += size;
}
}
static size_t encode_header(
enum object_type type,
size_t size,
unsigned char *hdr)
{
int n = 1;
unsigned char c;
if (type < OBJ_COMMIT || type > OBJ_DELTA)
die("bad type %d", type);
c = (type << 4) | (size & 15);
size >>= 4;
while (size) {
*hdr++ = c | 0x80;
c = size & 0x7f;
size >>= 7;
n++;
}
*hdr = c;
return n;
}
static void zs_finish()
{
if (zs_live) {
while (deflate(&zs, Z_FINISH) == Z_OK) {
if (zs.total_out) {
ywrite(pack_fd, zs_out, zs.total_out);
pack_offset += zs.total_out;
zs.next_out = zs_out;
zs.avail_out = zs_outlen;
zs.total_out = 0;
}
}
deflateEnd(&zs);
free(zs_out);
stream_cnt++;
}
}
static int store_object(
enum object_type type,
void *dat,
size_t datlen,
struct last_object *last,
unsigned char *sha1out)
{
void *delta;
struct object_entry *e;
unsigned char hdr[96];
unsigned char sha1[20];
unsigned long hdrlen, deltalen;
SHA_CTX c;
hdrlen = sprintf((char*)hdr,"%s %lu",type_names[type],datlen) + 1;
SHA1_Init(&c);
SHA1_Update(&c, hdr, hdrlen);
SHA1_Update(&c, dat, datlen);
SHA1_Final(sha1, &c);
if (sha1out)
memcpy(sha1out, sha1, sizeof(sha1));
e = insert_object(sha1);
if (e->offset) {
duplicate_count++;
duplicate_count_by_type[type]++;
return 1;
}
e->type = type;
e->offset = pack_offset;
object_count++;
object_count_by_type[type]++;
if (last && last->data && last->depth < max_depth)
delta = diff_delta(last->data, last->len,
dat, datlen,
&deltalen, 0);
else {
zs_finish();
delta = 0;
memset(&zs, 0, sizeof(zs));
deflateInit(&zs, zlib_compression_level);
zs_live = 1;
zs_outlen = 1024*1024;
zs_out = xmalloc(zs_outlen);
zs.next_out = zs_out;
zs.avail_out = zs_outlen;
}
if (delta) {
last->depth++;
zs.next_in = delta;
zs.avail_in = deltalen;
hdrlen = encode_header(OBJ_DELTA, deltalen, hdr);
ywrite(pack_fd, hdr, hdrlen);
ywrite(pack_fd, last->sha1, sizeof(sha1));
pack_offset += hdrlen + sizeof(sha1);
} else {
if (last)
last->depth = 0;
zs.next_in = dat;
zs.avail_in = datlen;
hdrlen = encode_header(type, datlen, hdr);
ywrite(pack_fd, hdr, hdrlen);
pack_offset += hdrlen;
}
while (deflate(&zs, Z_NO_FLUSH) == Z_OK) {
if (zs.total_out) {
ywrite(pack_fd, zs_out, zs.total_out);
pack_offset += zs.total_out;
zs.next_out = zs_out;
zs.avail_out = zs_outlen;
zs.total_out = 0;
}
}
if (delta)
free(delta);
if (last) {
if (last->data)
free(last->data);
last->data = dat;
last->len = datlen;
memcpy(last->sha1, sha1, sizeof(sha1));
}
return 0;
}
static const char *get_mode(const char *str, unsigned int *modep)
{
unsigned char c;
unsigned int mode = 0;
while ((c = *str++) != ' ') {
if (c < '0' || c > '7')
return NULL;
mode = (mode << 3) + (c - '0');
}
*modep = mode;
return str;
}
static void load_tree(struct tree_entry *root)
{
struct object_entry *myoe;
struct tree_content *t;
unsigned long size;
char *buf;
const char *c;
char type[20];
root->tree = t = new_tree_content(8);
if (!memcmp(root->sha1, null_sha1, 20))
return;
myoe = find_object(root->sha1);
if (myoe) {
die("FIXME");
} else {
buf = read_sha1_file(root->sha1, type, &size);
if (!buf || strcmp(type, tree_type))
die("Can't load existing tree %s", sha1_to_hex(root->sha1));
}
c = buf;
while (c != (buf + size)) {
struct tree_entry *e = new_tree_entry();
if (t->entry_count == t->entry_capacity)
root->tree = t = grow_tree_content(t, 8);
t->entries[t->entry_count++] = e;
e->tree = NULL;
c = get_mode(c, &e->mode);
if (!c)
die("Corrupt mode in %s", sha1_to_hex(root->sha1));
e->name = to_atom(c, strlen(c));
c += e->name->str_len + 1;
memcpy(e->sha1, c, sizeof(e->sha1));
c += 20;
}
free(buf);
}
static int tecmp (const void *_a, const void *_b)
{
struct tree_entry *a = *((struct tree_entry**)_a);
struct tree_entry *b = *((struct tree_entry**)_b);
return base_name_compare(
a->name->str_dat, a->name->str_len, a->mode,
b->name->str_dat, b->name->str_len, b->mode);
}
static void store_tree(struct tree_entry *root)
{
struct tree_content *t = root->tree;
unsigned int i;
size_t maxlen;
char *buf, *c;
if (memcmp(root->sha1, null_sha1, 20))
return;
maxlen = 0;
for (i = 0; i < t->entry_count; i++) {
maxlen += t->entries[i]->name->str_len + 34;
if (t->entries[i]->tree)
store_tree(t->entries[i]);
}
qsort(t->entries, t->entry_count, sizeof(t->entries[0]), tecmp);
buf = c = xmalloc(maxlen);
for (i = 0; i < t->entry_count; i++) {
struct tree_entry *e = t->entries[i];
c += sprintf(c, "%o", e->mode);
*c++ = ' ';
strcpy(c, e->name->str_dat);
c += e->name->str_len + 1;
memcpy(c, e->sha1, 20);
c += 20;
}
store_object(OBJ_TREE, buf, c - buf, NULL, root->sha1);
free(buf);
}
static int tree_content_set(
struct tree_entry *root,
const char *p,
const unsigned char *sha1,
const unsigned int mode)
{
struct tree_content *t = root->tree;
const char *slash1;
unsigned int i, n;
struct tree_entry *e;
slash1 = strchr(p, '/');
if (slash1)
n = slash1 - p;
else
n = strlen(p);
for (i = 0; i < t->entry_count; i++) {
e = t->entries[i];
if (e->name->str_len == n && !strncmp(p, e->name->str_dat, n)) {
if (!slash1) {
if (e->mode == mode && !memcmp(e->sha1, sha1, 20))
return 0;
e->mode = mode;
memcpy(e->sha1, sha1, 20);
if (e->tree) {
release_tree_content(e->tree);
e->tree = NULL;
}
memcpy(root->sha1, null_sha1, 20);
return 1;
}
if (!S_ISDIR(e->mode)) {
e->tree = new_tree_content(8);
e->mode = S_IFDIR;
}
if (!e->tree)
load_tree(e);
if (tree_content_set(e, slash1 + 1, sha1, mode)) {
memcpy(root->sha1, null_sha1, 20);
return 1;
}
return 0;
}
}
if (t->entry_count == t->entry_capacity)
root->tree = t = grow_tree_content(t, 8);
e = new_tree_entry();
e->name = to_atom(p, n);
t->entries[t->entry_count++] = e;
if (slash1) {
e->tree = new_tree_content(8);
e->mode = S_IFDIR;
tree_content_set(e, slash1 + 1, sha1, mode);
} else {
e->tree = NULL;
e->mode = mode;
memcpy(e->sha1, sha1, 20);
}
memcpy(root->sha1, null_sha1, 20);
return 1;
}
static int tree_content_remove(struct tree_entry *root, const char *p)
{
struct tree_content *t = root->tree;
const char *slash1;
unsigned int i, n;
struct tree_entry *e;
slash1 = strchr(p, '/');
if (slash1)
n = slash1 - p;
else
n = strlen(p);
for (i = 0; i < t->entry_count; i++) {
e = t->entries[i];
if (e->name->str_len == n && !strncmp(p, e->name->str_dat, n)) {
if (!slash1 || !S_ISDIR(e->mode))
goto del_entry;
if (!e->tree)
load_tree(e);
if (tree_content_remove(e, slash1 + 1)) {
if (!e->tree->entry_count)
goto del_entry;
memcpy(root->sha1, null_sha1, 20);
return 1;
}
return 0;
}
}
return 0;
del_entry:
for (i++; i < t->entry_count; i++)
t->entries[i-1] = t->entries[i];
t->entry_count--;
release_tree_entry(e);
memcpy(root->sha1, null_sha1, 20);
return 1;
}
static void init_pack_header()
{
const char* magic = "PACK";
unsigned long version = 3;
unsigned long zero = 0;
version = htonl(version);
ywrite(pack_fd, (char*)magic, 4);
ywrite(pack_fd, &version, 4);
ywrite(pack_fd, &zero, 4);
pack_offset = 4 * 3;
}
static void fixup_header_footer()
{
SHA_CTX c;
char hdr[8];
unsigned long cnt;
char *buf;
size_t n;
zs_finish();
if (lseek(pack_fd, 0, SEEK_SET) != 0)
die("Failed seeking to start: %s", strerror(errno));
SHA1_Init(&c);
yread(pack_fd, hdr, 8);
SHA1_Update(&c, hdr, 8);
cnt = htonl(object_count);
SHA1_Update(&c, &cnt, 4);
ywrite(pack_fd, &cnt, 4);
buf = xmalloc(128 * 1024);
for (;;) {
n = xread(pack_fd, buf, 128 * 1024);
if (n <= 0)
break;
SHA1_Update(&c, buf, n);
}
free(buf);
SHA1_Final(pack_sha1, &c);
ywrite(pack_fd, pack_sha1, sizeof(pack_sha1));
}
static int oecmp (const void *_a, const void *_b)
{
struct object_entry *a = *((struct object_entry**)_a);
struct object_entry *b = *((struct object_entry**)_b);
return memcmp(a->sha1, b->sha1, sizeof(a->sha1));
}
static void write_index(const char *idx_name)
{
struct sha1file *f;
struct object_entry **idx, **c, **last;
struct object_entry *e;
struct object_entry_pool *o;
unsigned int array[256];
int i;
/* Build the sorted table of object IDs. */
idx = xmalloc(object_count * sizeof(struct object_entry*));
c = idx;
for (o = blocks; o; o = o->next_pool)
for (e = o->entries; e != o->next_free; e++)
*c++ = e;
last = idx + object_count;
qsort(idx, object_count, sizeof(struct object_entry*), oecmp);
/* Generate the fan-out array. */
c = idx;
for (i = 0; i < 256; i++) {
struct object_entry **next = c;;
while (next < last) {
if ((*next)->sha1[0] != i)
break;
next++;
}
array[i] = htonl(next - idx);
c = next;
}
f = sha1create("%s", idx_name);
sha1write(f, array, 256 * sizeof(int));
for (c = idx; c != last; c++) {
unsigned int offset = htonl((*c)->offset);
sha1write(f, &offset, 4);
sha1write(f, (*c)->sha1, sizeof((*c)->sha1));
}
sha1write(f, pack_sha1, sizeof(pack_sha1));
sha1close(f, NULL, 1);
free(idx);
}
static void dump_branches()
{
static const char *msg = "fast-import";
unsigned int i;
struct branch *b;
struct ref_lock *lock;
for (i = 0; i < branch_table_sz; i++) {
for (b = branch_table[i]; b; b = b->table_next_branch) {
lock = lock_any_ref_for_update(b->name, NULL, 0);
if (!lock || write_ref_sha1(lock, b->sha1, msg) < 0)
die("Can't write %s", b->name);
}
}
}
static void read_next_command()
{
read_line(&command_buf, stdin, '\n');
}
static void cmd_mark()
{
if (!strncmp("mark :", command_buf.buf, 6)) {
command_mark = strtoul(command_buf.buf + 6, NULL, 10);
read_next_command();
}
else
command_mark = 0;
}
static void* cmd_data (size_t *size)
{
size_t n = 0;
void *buffer;
size_t length;
if (strncmp("data ", command_buf.buf, 5))
die("Expected 'data n' command, found: %s", command_buf.buf);
length = strtoul(command_buf.buf + 5, NULL, 10);
buffer = xmalloc(length);
while (n < length) {
size_t s = fread((char*)buffer + n, 1, length - n, stdin);
if (!s && feof(stdin))
die("EOF in data (%lu bytes remaining)", length - n);
n += s;
}
if (fgetc(stdin) != '\n')
die("An lf did not trail the binary data as expected.");
*size = length;
return buffer;
}
static void cmd_new_blob()
{
size_t datlen;
void *dat;
unsigned char sha1[20];
read_next_command();
cmd_mark();
dat = cmd_data(&datlen);
if (store_object(OBJ_BLOB, dat, datlen, &last_blob, sha1))
free(dat);
}
static void unload_one_branch()
{
while (cur_active_branches >= max_active_branches) {
unsigned long min_commit = ULONG_MAX;
struct branch *e, *l = NULL, *p = NULL;
for (e = active_branches; e; e = e->active_next_branch) {
if (e->last_commit < min_commit) {
p = l;
min_commit = e->last_commit;
}
l = e;
}
if (p) {
e = p->active_next_branch;
p->active_next_branch = e->active_next_branch;
} else {
e = active_branches;
active_branches = e->active_next_branch;
}
e->active_next_branch = NULL;
if (e->branch_tree.tree) {
release_tree_content(e->branch_tree.tree);
e->branch_tree.tree = NULL;
}
cur_active_branches--;
}
}
static void load_branch(struct branch *b)
{
load_tree(&b->branch_tree);
b->active_next_branch = active_branches;
active_branches = b;
cur_active_branches++;
}
static void file_change_m(struct branch *b)
{
const char *p = command_buf.buf + 2;
char *p_uq;
const char *endp;
struct object_entry *oe;
unsigned char sha1[20];
unsigned int mode;
char type[20];
p = get_mode(p, &mode);
if (!p)
die("Corrupt mode: %s", command_buf.buf);
switch (mode) {
case S_IFREG | 0644:
case S_IFREG | 0755:
case 0644:
case 0755:
/* ok */
break;
default:
die("Corrupt mode: %s", command_buf.buf);
}
if (get_sha1_hex(p, sha1))
die("Invalid SHA1: %s", command_buf.buf);
p += 40;
if (*p++ != ' ')
die("Missing space after SHA1: %s", command_buf.buf);
p_uq = unquote_c_style(p, &endp);
if (p_uq) {
if (*endp)
die("Garbage after path in: %s", command_buf.buf);
p = p_uq;
}
oe = find_object(sha1);
if (oe) {
if (oe->type != OBJ_BLOB)
die("Not a blob (actually a %s): %s",
command_buf.buf, type_names[oe->type]);
} else {
if (sha1_object_info(sha1, type, NULL))
die("Blob not found: %s", command_buf.buf);
if (strcmp(blob_type, type))
die("Not a blob (actually a %s): %s",
command_buf.buf, type);
}
tree_content_set(&b->branch_tree, p, sha1, S_IFREG | mode);
if (p_uq)
free(p_uq);
}
static void file_change_d(struct branch *b)
{
const char *p = command_buf.buf + 2;
char *p_uq;
const char *endp;
p_uq = unquote_c_style(p, &endp);
if (p_uq) {
if (*endp)
die("Garbage after path in: %s", command_buf.buf);
p = p_uq;
}
tree_content_remove(&b->branch_tree, p);
if (p_uq)
free(p_uq);
}
static void cmd_new_commit()
{
struct branch *b;
void *msg;
size_t msglen;
char *str_uq;
const char *endp;
char *sp;
char *author = NULL;
char *committer = NULL;
char *body;
/* Obtain the branch name from the rest of our command */
sp = strchr(command_buf.buf, ' ') + 1;
str_uq = unquote_c_style(sp, &endp);
if (str_uq) {
if (*endp)
die("Garbage after ref in: %s", command_buf.buf);
sp = str_uq;
}
b = lookup_branch(sp);
if (!b)
die("Branch not declared: %s", sp);
if (str_uq)
free(str_uq);
read_next_command();
cmd_mark();
if (!strncmp("author ", command_buf.buf, 7)) {
author = strdup(command_buf.buf);
read_next_command();
}
if (!strncmp("committer ", command_buf.buf, 10)) {
committer = strdup(command_buf.buf);
read_next_command();
}
if (!committer)
die("Expected committer but didn't get one");
msg = cmd_data(&msglen);
/* ensure the branch is active/loaded */
if (!b->branch_tree.tree) {
unload_one_branch();
load_branch(b);
}
/* file_change* */
for (;;) {
read_next_command();
if (1 == command_buf.len)
break;
else if (!strncmp("M ", command_buf.buf, 2))
file_change_m(b);
else if (!strncmp("D ", command_buf.buf, 2))
file_change_d(b);
else
die("Unsupported file_change: %s", command_buf.buf);
}
/* build the tree and the commit */
store_tree(&b->branch_tree);
body = xmalloc(97 + msglen
+ (author
? strlen(author) + strlen(committer)
: 2 * strlen(committer)));
sp = body;
sp += sprintf(sp, "tree %s\n", sha1_to_hex(b->branch_tree.sha1));
if (memcmp(b->sha1, null_sha1, 20))
sp += sprintf(sp, "parent %s\n", sha1_to_hex(b->sha1));
if (author)
sp += sprintf(sp, "%s\n", author);
else
sp += sprintf(sp, "author %s\n", committer + 10);
sp += sprintf(sp, "%s\n\n", committer);
memcpy(sp, msg, msglen);
sp += msglen;
if (author)
free(author);
free(committer);
free(msg);
store_object(OBJ_COMMIT, body, sp - body, NULL, b->sha1);
free(body);
b->last_commit = object_count_by_type[OBJ_COMMIT];
}
static void cmd_new_branch()
{
struct branch *b;
char *str_uq;
const char *endp;
char *sp;
/* Obtain the new branch name from the rest of our command */
sp = strchr(command_buf.buf, ' ') + 1;
str_uq = unquote_c_style(sp, &endp);
if (str_uq) {
if (*endp)
die("Garbage after ref in: %s", command_buf.buf);
sp = str_uq;
}
b = new_branch(sp);
if (str_uq)
free(str_uq);
read_next_command();
/* from ... */
if (!strncmp("from ", command_buf.buf, 5)) {
const char *from;
struct branch *s;
from = strchr(command_buf.buf, ' ') + 1;
str_uq = unquote_c_style(from, &endp);
if (str_uq) {
if (*endp)
die("Garbage after string in: %s", command_buf.buf);
from = str_uq;
}
s = lookup_branch(from);
if (b == s)
die("Can't create a branch from itself: %s", b->name);
else if (s) {
memcpy(b->sha1, s->sha1, 20);
memcpy(b->branch_tree.sha1, s->branch_tree.sha1, 20);
} else if (!get_sha1(from, b->sha1)) {
if (!memcmp(b->sha1, null_sha1, 20))
memcpy(b->branch_tree.sha1, null_sha1, 20);
else {
unsigned long size;
char *buf;
buf = read_object_with_reference(b->sha1,
type_names[OBJ_COMMIT], &size, b->sha1);
if (!buf || size < 46)
die("Not a valid commit: %s", from);
if (memcmp("tree ", buf, 5)
|| get_sha1_hex(buf + 5, b->branch_tree.sha1))
die("The commit %s is corrupt", sha1_to_hex(b->sha1));
free(buf);
}
} else
die("Invalid ref name or SHA1 expression: %s", from);
if (str_uq)
free(str_uq);
read_next_command();
} else {
memcpy(b->sha1, null_sha1, 20);
memcpy(b->branch_tree.sha1, null_sha1, 20);
}
if (command_buf.eof || command_buf.len > 1)
die("An lf did not terminate the branch command as expected.");
}
int main(int argc, const char **argv)
{
const char *base_name = argv[1];
int est_obj_cnt = atoi(argv[2]);
char *pack_name;
char *idx_name;
struct stat sb;
setup_ident();
git_config(git_default_config);
pack_name = xmalloc(strlen(base_name) + 6);
sprintf(pack_name, "%s.pack", base_name);
idx_name = xmalloc(strlen(base_name) + 5);
sprintf(idx_name, "%s.idx", base_name);
pack_fd = open(pack_name, O_RDWR|O_CREAT|O_EXCL, 0666);
if (pack_fd < 0)
die("Can't create %s: %s", pack_name, strerror(errno));
init_pack_header();
alloc_objects(est_obj_cnt);
strbuf_init(&command_buf);
atom_table = xcalloc(atom_table_sz, sizeof(struct atom_str*));
branch_table = xcalloc(branch_table_sz, sizeof(struct branch*));
avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*));
for (;;) {
read_next_command();
if (command_buf.eof)
break;
else if (!strcmp("blob", command_buf.buf))
cmd_new_blob();
else if (!strncmp("branch ", command_buf.buf, 7))
cmd_new_branch();
else if (!strncmp("commit ", command_buf.buf, 7))
cmd_new_commit();
else
die("Unsupported command: %s", command_buf.buf);
}
fixup_header_footer();
close(pack_fd);
write_index(idx_name);
dump_branches();
fprintf(stderr, "%s statistics:\n", argv[0]);
fprintf(stderr, "---------------------------------------------------\n");
fprintf(stderr, "Alloc'd objects: %10lu (%10lu overflow )\n", alloc_count, alloc_count - est_obj_cnt);
fprintf(stderr, "Total objects: %10lu (%10lu duplicates)\n", object_count, duplicate_count);
fprintf(stderr, " blobs : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_BLOB], duplicate_count_by_type[OBJ_BLOB]);
fprintf(stderr, " trees : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_TREE], duplicate_count_by_type[OBJ_TREE]);
fprintf(stderr, " commits: %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_COMMIT], duplicate_count_by_type[OBJ_COMMIT]);
fprintf(stderr, " tags : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_TAG], duplicate_count_by_type[OBJ_TAG]);
fprintf(stderr, "Total streams: %10u\n", stream_cnt);
fprintf(stderr, "Total branches: %10lu\n", branch_count);
fprintf(stderr, "Total atoms: %10u\n", atom_cnt);
fprintf(stderr, "Memory total: %10lu KiB\n", (total_allocd + alloc_count*sizeof(struct object_entry))/1024);
fprintf(stderr, " pools: %10lu KiB\n", total_allocd/1024);
fprintf(stderr, " objects: %10lu KiB\n", (alloc_count*sizeof(struct object_entry))/1024);
fprintf(stderr, "---------------------------------------------------\n");
stat(pack_name, &sb);
fprintf(stderr, "Pack size: %10lu KiB\n", (unsigned long)(sb.st_size/1024));
stat(idx_name, &sb);
fprintf(stderr, "Index size: %10lu KiB\n", (unsigned long)(sb.st_size/1024));
fprintf(stderr, "\n");
return 0;
}
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 4:03 ` Nicolas Pitre
2006-08-18 12:53 ` Jon Smirl
@ 2006-08-18 13:15 ` Jon Smirl
2006-08-18 13:36 ` Johannes Schindelin
2006-08-18 16:25 ` Nicolas Pitre
1 sibling, 2 replies; 36+ messages in thread
From: Jon Smirl @ 2006-08-18 13:15 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn Pearce, git
On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> A better way to get such a size saving is to increase the window and
> depth parameters. For example, a window of 20 and depth of 20 can
> usually provide a pack size saving greater than 11% with none of the
> disadvantages mentioned above.
Our window size is effectively infinite. I am handing him all of the
revisions from a single file in optimal order. This includes branches.
He takes these revisions, runs xdiff on them, and then puts the entire
result into a single zlib blob.
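As a rough illustration (a minimal sketch, not Shawn's actual code; the
delta buffers and output sizing are assumed), deflating a whole run of
deltas into one continuous zlib stream, so later deltas can reuse
strings already seen in earlier ones, looks roughly like this:

#include <string.h>
#include <zlib.h>

struct delta_buf {             /* hypothetical holder for one delta */
	unsigned char *data;
	unsigned long len;
};

/* Compress 'count' consecutive deltas into one zlib stream; the output
 * buffer is assumed large enough for this sketch. */
static unsigned long deflate_delta_run(struct delta_buf *deltas, int count,
	unsigned char *out, unsigned long outlen)
{
	z_stream s;
	unsigned long n;
	int i;

	memset(&s, 0, sizeof(s));
	if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK)
		return 0;
	s.next_out = out;
	s.avail_out = outlen;
	for (i = 0; i < count; i++) {
		s.next_in = deltas[i].data;
		s.avail_in = deltas[i].len;
		/* Z_NO_FLUSH keeps one continuous stream across deltas */
		if (deflate(&s, Z_NO_FLUSH) != Z_OK)
			break;
	}
	if (deflate(&s, Z_FINISH) != Z_STREAM_END) {
		deflateEnd(&s);
		return 0;
	}
	n = s.total_out;
	deflateEnd(&s);
	return n;
}

The trade-off is that getting any single delta back out requires
inflating from the start of the stream, which is the extraction cost
discussed later in this thread.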
I suspect the size reduction is directly proportional to the age of
the repository. The kernel repository only has three years worth of
data in it. Linus has the full history in another repository that is
not in general distribution. We can get it from him when he gets back
from vacation.
If the repository doesn't contain long delta chains the optimization
doesn't help that much. On the other hand it doesn't hurt either since
the chains weren't long. My repository is four times as old as the
kernel one and I am getting 4x the benefit.
This is a good format for archival data that is infrequently accessed.
That is why I proposed a two-pack system. One pack would contain the
archival data and be highly optimized for size, and the second pack
would contain recent changes and be optimized for speed. The size
optimization is important for controlling bandwidth costs.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 13:15 ` Jon Smirl
@ 2006-08-18 13:36 ` Johannes Schindelin
2006-08-18 13:50 ` Jon Smirl
2006-08-18 16:25 ` Nicolas Pitre
1 sibling, 1 reply; 36+ messages in thread
From: Johannes Schindelin @ 2006-08-18 13:36 UTC (permalink / raw)
To: Jon Smirl; +Cc: Nicolas Pitre, Shawn Pearce, git
Hi,
On Fri, 18 Aug 2006, Jon Smirl wrote:
> I suspect the size reduction is directly proportional to the age of
> the repository. The kernel repository only has three years worth of
> data in it. Linus has the full history in another repository that is
> not in general distribution. We can get it from him when he gets back
> from vacation.
Maybe you mean
http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git
Ciao,
Dscho
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 13:36 ` Johannes Schindelin
@ 2006-08-18 13:50 ` Jon Smirl
2006-08-19 19:25 ` Linus Torvalds
0 siblings, 1 reply; 36+ messages in thread
From: Jon Smirl @ 2006-08-18 13:50 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Nicolas Pitre, Shawn Pearce, git
On 8/18/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Fri, 18 Aug 2006, Jon Smirl wrote:
>
> > I suspect the size reduction is directly proportional to the age of
> > the repository. The kernel repository only has three years worth of
> > data in it. Linus has the full history in another repository that is
> > not in general distribution. We can get it from him when he gets back
> > from vacation.
>
> Maybe you mean
>
> http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git
That one only goes to 2002, the full one goes back to around 1990.
>
> Ciao,
> Dscho
>
>
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 13:15 ` Jon Smirl
2006-08-18 13:36 ` Johannes Schindelin
@ 2006-08-18 16:25 ` Nicolas Pitre
2006-08-21 7:06 ` Shawn Pearce
1 sibling, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-18 16:25 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
On Fri, 18 Aug 2006, Jon Smirl wrote:
> On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> > A better way to get such a size saving is to increase the window and
> > depth parameters. For example, a window of 20 and depth of 20 can
> > usually provide a pack size saving greater than 11% with none of the
> > disadvantages mentioned above.
>
> Our window size is effectively infinite. I am handing him all of the
> revisions from a single file in optimal order. This includes branches.
In GIT packing terms this is infinite delta _depth_ not _window_.
> He takes these revisions, runs xdiff on them, and then puts the entire
> result into a single zlib blob.
It is not a good idea to have infinite delta depth. The time to
browse the repository history then becomes exponential with the number
of revisions, making the value of such a repository a bit questionable
(you could as well preserve only the last 2 years of history instead,
since anything further back than that with infinite delta depth is
likely to be too slow and painful to use).
But just for comparison I did a repack -a -f on the kernel repository
with --window=50 --depth=5000, which should be a good approximation of
the best possible delta matching with infinite depth.
Default delta params (window=10 depth=10) : 122103455
Aggressive deltas (window=50 depth=5000) : 105870516
Reduction : 13%
OK let's try it with delta chains in the same zlib stream using the
patch I posted yesterday (with a minor tweak allowing the usage of -f
with git-repack).
Aggressive and grouped deltas (window=50 depth=5000) : 99860685
This is a mere 5.7% reduction over the non-grouped deltas, less than the
11% reduction I obtained yesterday when the delta depth is kept
reasonably short.
The increased delta depth is likely to make a large difference on old
repos with long history, maybe more so and with much less
complexity than the delta grouping.
> I suspect the size reduction is directly proportional to the age of
> the repository. The kernel repository only has three years worth of
> data in it. Linus has the full history in another repository that is
> not in general distribution. We can get it from him when he gets back
> from vacation.
>
> If the repository doesn't contain long delta chains the optimization
> doesn't help that much. On the other hand it doesn't hurt either since
> the chains weren't long. My repository is four times as old as the
> kernel one and I am getting 4x the benefit.
No, that cannot be right.
Let's assume every whole object is 10 in size and every delta is 1.
With a delta depth of 10 you can therefore have 1 base object and 10
delta objects, effectively storing 11 objects in a size of 20. That is
a size ratio of about 1.8 vs 10 per object.
If the delta depth is 100 then you potentially have 1 base object and
100 deltas for a size ratio of 1.1 vs 10.
If the delta depth is 1000 the ratio becomes 1.01 vs 10.
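The same arithmetic as a throwaway C check (not git code):

#include <stdio.h>

int main(void)
{
	/* one base object of size 10, every delta of size 1 */
	int depths[] = { 10, 100, 1000 };
	int i;

	for (i = 0; i < 3; i++) {
		int d = depths[i];
		/* 1 base + d deltas stored, representing d + 1 objects */
		double avg = (10.0 + d) / (d + 1);
		printf("depth %4d: average stored size %.2f vs 10 undeltified\n",
			d, avg);
	}
	return 0;
}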
The size saving is therefore _not_ proportional to the age of the
repository. It rather tends to be asymptotic with the delta ratio (but
imposes an exponential runtime cost when fetching objects out of it).
The fact that your 4x older repository shows a 4x size saving can only
be due to a packing malfunction, I would say.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 12:53 ` Jon Smirl
@ 2006-08-18 16:30 ` Nicolas Pitre
2006-08-18 16:56 ` Jon Smirl
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-18 16:30 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
On Fri, 18 Aug 2006, Jon Smirl wrote:
> I attached Shawn's code. He is gone until Monday and can't defend it.
I will have a look at it next week as I'll be gone for the weekend as
well.
> Do note that I am running this on the Mozilla CVS which is over 10
> years old. Some of the files have over 2,000 deltas. I average 10
> deltas per file but the distribution is not at all even. Many files
> get checked-in and never changed, for example 1000's of images.
This is IMHO more evidence that something is wrong with the results
you obtained.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 16:30 ` Nicolas Pitre
@ 2006-08-18 16:56 ` Jon Smirl
2006-08-21 3:45 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Jon Smirl @ 2006-08-18 16:56 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn Pearce, git
On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> On Fri, 18 Aug 2006, Jon Smirl wrote:
>
> > I attached Shawn's code. He is gone until Monday and can't defend it.
>
> I will have a look at it next week as I'll be gone for the weekend as
> well.
I looked at it some and couldn't see anything obviously wrong with it,
but it wasn't a detailed inspection.
As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB.
The 295MB git pack number does not have commits and trees in it, it is
revisions only.
It is not impossible that, when the trees and commits are added to the
295MB number, it will end up being near the 541MB number. We already
know xdiff is better at generating diffs than CVS diff. And I still
have to add 250K log entries and trees. Those are going to add at
least 100MB.
I will get back to work on the import code over the weekend. One of my
2-year-olds gave me a cold, so I have been useless for the last couple
of days.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 13:50 ` Jon Smirl
@ 2006-08-19 19:25 ` Linus Torvalds
0 siblings, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2006-08-19 19:25 UTC (permalink / raw)
To: Jon Smirl; +Cc: Johannes Schindelin, Nicolas Pitre, Shawn Pearce, git
On Fri, 18 Aug 2006, Jon Smirl wrote:
> On 8/18/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > Hi,
> >
> > On Fri, 18 Aug 2006, Jon Smirl wrote:
> >
> > > I suspect the size reduction is directly proportional to the age of
> > > the repository. The kernel repository only has three years worth of
> > > data in it. Linus has the full history in another repository that is
> > > not in general distribution. We can get it from him when he gets back
> > > from vacation.
> >
> > Maybe you mean
> >
> > http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git
>
> That one only goes to 2002, the full one goes back to around 1990.
I don't actually have such a "full" history. It would be wonderful if
somebody took the time to try to piece such a thing together (and git
actually makes that a _lot_ easier than some other SCM's, because you can
just import random versions in any order, and then re-stitch just the
commit history when you add a new thing in the middle, without generating
any new trees or deltas or anything strange at all).
But it's a lot of work. I tried to do a "Linux-Historic" archive about a
year ago (and imported some of the old kernels I had), but I gave up, just
because it was such a pain to try to do a good job and try to find old
release notes etc to import into the changelogs etc.
Oh, well.
So the only "old" history I have is indeed that BK conversion by Thomas
Gleixner. Any pre-BK stuff only exists as patches and tar-balls on various
ftp sites (and I don't have any magic repository of my own, so everybody
else can do exactly as well as I could, with possibly the exception that I
might remember some random details about some old release history - but
considering my memory, that's pretty unlikely too. Google is your friend)
Linus
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 16:56 ` Jon Smirl
@ 2006-08-21 3:45 ` Nicolas Pitre
2006-08-21 6:46 ` Shawn Pearce
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-21 3:45 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
On Fri, 18 Aug 2006, Jon Smirl wrote:
> On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> > On Fri, 18 Aug 2006, Jon Smirl wrote:
> >
> > > I attached Shawn's code. He is gone until Monday and can't defend it.
> >
> > I will have a look at it next week as I'll be gone for the weekend as
> > well.
>
> I looked at it some and couldn't see anything obviously wrong with it,
> but it wasn't a detailed inspection.
I looked at it too and the code looks OK.
This doesn't mean there is no problem at a higher level though. The
deltification process is extremely crude and I think this is the cause
of the original pack's large size.
For example, last April we discovered that a small change in the
heuristics to determine delta base objects in git-pack-objects could
create a pack size regression of up to 4x the size of the same pack
created before such a change.
It is also possible for a denser delta stream, once deflated, to end up
larger than a less dense one.
Just to say that many tweaks and heuristics have been implemented and
studied in git-pack-objects for over a year now in order to get the
really small packs we have today. And a really subtle and
innocent-looking change can break it size-wise.
So what I think is happening with the fastimport code is that the delta
selection is not really good. It is certainly much better than no
deltas at all, but still not optimal, which smells like deja vu to me.
Then, by deflating them all together, the redundant information that the
bad delta set still carries along is eliminated -- thanks to zlib sort
of mitigating the real issue.
But... as my recent experiments show, the grouping of related deltas
into a single zlib stream doesn't produce significant improvements when
implemented directly in git-pack-objects. It is certainly not worth the
inconveniences and costs it brings along. I even think that if you used
git-repack -a -f on the pack produced by the import process, with every
delta deflated individually just like it did originally, then the
repacked pack would _also_ shrink significantly. Most probably around
4x, just like you observed with the grouping of deltas in the same zlib
stream.
Not only would git-repack make it much smaller, but it also provides a
much better layout where all objects for recent commits are stored
together at the beginning of the pack. The fastimport code instead
stores them scattered all over the pack for every commit, by putting all
revisions of each file next to each other, which will cause horrible
access patterns and really bad IO.
So I think that trying to make fastimport too clever is wrong. It
should instead focus on creating an initial pack as fast as possible and
then rely on a final git-repack pass to produce the shrunken pack. I
really doubt the import code could ever do a better job than
git-pack-objects does.
If I can make a suggestion, you should forget about this
multiple-deltas-in-one-zlib-stream idea for now and focus on making the
import process work all the way to tree and commit objects instead.
Then, and only then, if git-repack -a -f doesn't produce a satisfactory
pack size, we could look at better pack encoding. And so far the
grouping of related deltas in one zlib stream is _not_ a better
encoding, given the rather small improvement over unmodified
git-pack-objects vs the inconveniences and cost it brings with it.
> As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB.
> The 295MB git pack number does not have commits and trees in it, it is
> revisions only.
Running git-repack -a -f from a recent GIT on the Mozilla repo converted
through cvsps and friends produces a pack smaller than 500MB. I even
brought it down to 430MB by using a non-default delta window and depth.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 3:45 ` Nicolas Pitre
@ 2006-08-21 6:46 ` Shawn Pearce
2006-08-21 10:24 ` Jakub Narebski
2006-08-21 16:23 ` Jon Smirl
0 siblings, 2 replies; 36+ messages in thread
From: Shawn Pearce @ 2006-08-21 6:46 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jon Smirl, git
Nicolas Pitre <nico@cam.org> wrote:
> I looked at it too and the code looks OK.
Based on the stats Jon provided me, it appeared as though the code
skipped about 9% of the objects, and at the time I thought I knew
why, but looking at it again right now I'm not sure why. :-)
Assuming the objects were roughly the same size (probably a bad
assumption, but let's use it anyway) we're talking about an approx. 9%
increase, bringing the size to around 360 MB.
Since that is still only blobs and lacks trees&commits, and you were
able to get a pack from the earlier import attempts down to 430 MB
(including trees&commits), everything you pointed out with regard
to delta base selection not being optimal is probably the major
culprit here.
> So I think that trying to make fastimport too clever is wrong. It
> should instead focus on creating an initial pack as fast as possible and
> then rely on a final git-repack pass to produce the shrunken pack. I
> really doubt the import code could ever do a better job than
> git-pack-objects does.
I'm mixed on that.
My goal with fast-import was to construct a pack as cheaply as
possible in such a way that IO is relatively minimized during
pack construction and later during an initial repacking, as well
as to avoid all fork'ing costs associated with many update-index
and repack calls.
fast-import is assuming the front end driving it is being somewhat
nice by feeding revisions in newest-oldest order, one file at a time.
If this assumption is false it will produce a pack but one that is
so large the IO to repack it will be horrible (but probably would
still be better than if these were all loose objects).
fast-import is also assuming that pack construction speed and memory
usage required to build that pack are more important than subsequent
reads of the resulting pack.
Why?
Because I fully expected the pack to undergo a `git-repack -a -d -f`
with a rather large delta depth before it really got used by
clients which cared about performance or disk space. :-)
Whether or not git-repack can handle such a large pack is another
story. Apparently it can already take a ~600 MB pack and do
something useful to it in a reasonable time span, so I'm putting off
worrying about any limits that might be in there for now.
But the more I work with Jon on this Mozilla import process the more
I'm realizing that:
- fast-import.c's other name is "GIT in 1325 lines of C code, all
of which more or less already exists elsewhere in a form that
wasn't useful in this context without refactoring it in major
ways so I got lazy and rewrote it in another equally useless way";
- the strategy of how I'm slamming a very large number of objects
into a pack may be useful in situations other than a foreign
SCM import process. I can see someone wanting to create a
large commit with a lot of modified objects. Doing this with
update-index and write-tree into loose objects would be far
slower than just generating a new pack if the number of objects
you are writing exceeds about 100 on Windows or ~1k on UNIX;
- Jon has lots of CPU time to spare and possibly even some memory.
fast-import is barely using CPU and barely uses memory (it's quite
lightweight!) but it's definitely producing a suboptimal pack
as a result. Just doing slightly better delta selection before
committing an object to the output pack may prevent the need to
use -f during git-repack and produce something almost as optimal.
Right now fast-import is only lacking the following major features;
once they are in place it *should* be able to process the entire Mozilla
repository, assuming Jon can get his symbol closing problem in
cvs2svn solved:
- reload a branch which has been swapped out (requires reading
the pack that is currently being constructed back into memory,
which, since it's not really a valid pack yet and lacks an index,
makes reusing the code in sha1_file.c slightly interesting);
- generate a tag object into the pack;
- track the association of ID numbers to SHA1 IDs (needed to properly
create a branch from a prior commit or a tag from a prior commit,
as the prior commit's SHA ID isn't known to the front end);
Working on the multiple-objects-in-one-zlib-stream code over the
weekend also showed me how to use sha1_file.c to implement the first
of the three (I think), and the remaining two are trivial. So my
weekend wasn't a total loss. :-)
I plan on getting that code finished up in the next day or so and
get a new draft of it off to Jon. Maybe by then he'll also have
something coming out of cvs2svn's pass 9.
> > As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB.
> > The 295MB git pack number does not have commits and trees in it, it is
> > revisions only.
>
> Running git-repack -a -f from a recent GIT on the Mozilla repo converted
> through cvsps and friends produces a pack smaller than 500MB. I even
> brought it down to 430MB by using non default delta window and depth.
I'd still like to find a way to beat that 430 MB current
best-yet-still-valid size. :-)
I think we'll get there. Shaving off 4-5% with a pack-specific
dictionary may be worthwhile on such large packs, especially ones
that are considered "historical" and probably will never get repacked
again. Folding objects into a single zlib stream may also be worth
it, but for now I'm going to wait and see what mileage we can get
from the existing repack code and the existing pack file format.
--
Shawn.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-18 16:25 ` Nicolas Pitre
@ 2006-08-21 7:06 ` Shawn Pearce
2006-08-21 14:07 ` Jon Smirl
2006-08-21 15:46 ` Nicolas Pitre
0 siblings, 2 replies; 36+ messages in thread
From: Shawn Pearce @ 2006-08-21 7:06 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jon Smirl, git
Nicolas Pitre <nico@cam.org> wrote:
> On Fri, 18 Aug 2006, Jon Smirl wrote:
>
> > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> > > A better way to get such a size saving is to increase the window and
> > > depth parameters. For example, a window of 20 and depth of 20 can
> > > usually provide a pack size saving greater than 11% with none of the
> > > disadvantages mentioned above.
> >
> > Our window size is effectively infinite. I am handing him all of the
> > revisions from a single file in optimal order. This includes branches.
>
> In GIT packing terms this is infinite delta _depth_ not _window_.
We're not using infinite anything.
fast-import is basically doing window=1 and depth=10.
We only examine the last blob to see if we can get a delta against
it. If we do we write that delta out; otherwise we reset our delta
chain and write the complete object. We also reset our chain after
writing out 10 deltas, each of which used the immediately prior
object as its base.
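In other words the policy is roughly the following (a simplified
sketch, not the actual fast-import code: write_delta_entry() and
write_whole_entry() are hypothetical placeholders, while diff_delta()
is git's existing delta generator declared in delta.h):

#include <stdlib.h>

#define MAX_DEPTH 10

/* from git's delta.h */
extern void *diff_delta(void *from, unsigned long from_len,
	void *to, unsigned long to_len,
	unsigned long *delta_len, unsigned long max_len);

/* hypothetical pack writers, not real functions */
extern void write_delta_entry(void *delta, unsigned long len);
extern void write_whole_entry(void *data, unsigned long len);

struct last_object {
	void *data;          /* previous blob of the current file */
	unsigned long len;
	unsigned int depth;  /* how many deltas deep the current chain is */
};

/* Takes ownership of 'data'.  Window of 1: only the immediately prior
 * blob is considered as a delta base; the chain restarts after
 * MAX_DEPTH deltas or whenever no delta could be produced. */
static void store_blob(struct last_object *last, void *data, unsigned long len)
{
	void *delta = NULL;
	unsigned long delta_len = 0;

	if (last->data && last->depth < MAX_DEPTH)
		delta = diff_delta(last->data, last->len, data, len, &delta_len, 0);

	if (delta) {
		write_delta_entry(delta, delta_len);
		last->depth++;
		free(delta);
	} else {
		write_whole_entry(data, len);
		last->depth = 0;    /* chain restarts at this whole object */
	}
	free(last->data);
	last->data = data;
	last->len = len;
}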
Since I just found out that in some cases the Mozilla repository has
1000s of revisions per file[*1*] and in others only 1 revision per
file, we probably should be adjusting this depth to have a maximum
of 500, while also having the frontend send us an "I'm switching
files now" marker so we know not to even bother trying to delta
the new blob against the last blob, as they are likely to not
delta well[*2*].
> Default delta params (window=10 depth=10) : 122103455
> Aggressive deltas (window=50 depth=5000) : 105870516
> Aggressive and grouped deltas (window=50 depth=5000) : 99860685
Although complex, the aggressive and grouped deltas appear to
have saved you 18.2% on this repository. That's not something
to ignore. A reasonably optimal local pack dictionary could save
at least 4%[*3*]. Whacking 22% off a 400 MB pack is saving 88 MB.
Transferring that over the network on an initial clone is like
downloading all of Eclipse. Or an uncompressed kernel tarball...
[*1*] Jon noted this in another email in this thread but I'm too
lazy to look up the hyperlink right now.
[*2*] Granted, in some cases they may delta very well against each
other, but I think the probability of that occurring is low
enough that it's not worth worrying about in fast-import.c;
we can let repack's strategy deal with it instead.
[*3*] I wrote a brain-dead simple local dictionary selector in Perl.
It's horribly far from being ideal. But it is consistently
saving us 4% on the GIT and the Mozilla repository and it's
pretty darn fast. Shockingly, the C keywords didn't gain
us very much here; it's project-specific text that's the
real win.
Looking at chunks which are frequently copied in deltas
from base objects and breaking those chunks up into
smaller common chunks, then loading those most frequent
common chunks into the pack dictionary would most likely
produce far better results.
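For reference, wiring such a preset dictionary into the compressor is a
single deflateSetDictionary() call per object. A minimal sketch (the
dictionary selection itself, what the Perl script above does, is
assumed as an input; the reading side would have to call
inflateSetDictionary() with the same bytes):

#include <string.h>
#include <zlib.h>

/* Deflate one object with a pack-wide preset dictionary so that common,
 * project-specific strings are nearly free in every object.  The output
 * buffer is assumed large enough for this sketch. */
static int deflate_with_dictionary(const unsigned char *dict, unsigned int dict_len,
	unsigned char *in, unsigned long in_len,
	unsigned char *out, unsigned long out_len,
	unsigned long *result_len)
{
	z_stream s;

	memset(&s, 0, sizeof(s));
	if (deflateInit(&s, Z_DEFAULT_COMPRESSION) != Z_OK)
		return -1;
	/* every object in the pack reuses this same dictionary */
	if (deflateSetDictionary(&s, dict, dict_len) != Z_OK) {
		deflateEnd(&s);
		return -1;
	}
	s.next_in = in;
	s.avail_in = in_len;
	s.next_out = out;
	s.avail_out = out_len;
	if (deflate(&s, Z_FINISH) != Z_STREAM_END) {
		deflateEnd(&s);
		return -1;
	}
	*result_len = s.total_out;
	return deflateEnd(&s) == Z_OK ? 0 : -1;
}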
--
Shawn.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 6:46 ` Shawn Pearce
@ 2006-08-21 10:24 ` Jakub Narebski
2006-08-21 16:23 ` Jon Smirl
1 sibling, 0 replies; 36+ messages in thread
From: Jakub Narebski @ 2006-08-21 10:24 UTC (permalink / raw)
To: git
Shawn Pearce wrote:
> - the strategy of how I'm slamming a very large number of objects
> into a pack may be useful in situations other than a foreign
> SCM import process. I can see someone wanting to create a
> large commit with a lot of modified objects. Doing this with
> update-index and write-tree into loose objects would be far
> slower than just generating a new pack if the number of objects
> you are writing exceeds about 100 on Windows or ~1k on UNIX;
Like e.g. the initial import of a large project, or incorporating some
subproject into a project (e.g. gitk and gitweb in git)?
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 7:06 ` Shawn Pearce
@ 2006-08-21 14:07 ` Jon Smirl
2006-08-21 15:46 ` Nicolas Pitre
1 sibling, 0 replies; 36+ messages in thread
From: Jon Smirl @ 2006-08-21 14:07 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Nicolas Pitre, git
Mozilla CVS files have between 1 and 1700 deltas. The average is ten,
but the deviation is high. I believe fewer than 10 files have over 1000
deltas; they are all in the root directory and related to the build
process. Many files have no deltas, or, because of CVS artifacts, all of
the revisions are identical.
I am still IO bound. Random access IO is the problem, not stream IO. I
have to open and read 110,000 files (5GB total). It takes about 2hrs
to do all of the IO. I'm not CPU bound yet but as we make things more
efficient, I am getting closer to being CPU bound.
Forking is not an option. It can take days to fork 1M copies of an
app. I have used oprofile on parsecvs. It spends 60% of the time in
the kernel processing fork calls. Parsecvs runs for 6hrs on mozcvs and
dies without finishing.
I am back to working on the branch code. I'm over the cold I got from
my 2 yr old. It is slow going now. I am in the phase where the import
process runs without error for 5-10 minutes and then dies from some
unusual branch case. I fix it up and try again. I am slowly
identifying and removing all the code in cvs2svn that puts the
branches and symbols into their own subdirectories.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 7:06 ` Shawn Pearce
2006-08-21 14:07 ` Jon Smirl
@ 2006-08-21 15:46 ` Nicolas Pitre
2006-08-21 16:14 ` Jon Smirl
1 sibling, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-21 15:46 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jon Smirl, git
On Mon, 21 Aug 2006, Shawn Pearce wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > On Fri, 18 Aug 2006, Jon Smirl wrote:
> >
> > > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> > > > A better way to get such a size saving is to increase the window and
> > > > depth parameters. For example, a window of 20 and depth of 20 can
> > > > usually provide a pack size saving greater than 11% with none of the
> > > > disadvantages mentioned above.
> > >
> > > Our window size is effectively infinite. I am handing him all of the
> > > revisions from a single file in optimal order. This includes branches.
> >
> > In GIT packing terms this is infinite delta _depth_ not _window_.
>
> We're not using infinite anything.
>
> fast-import is basically doing window=1 and depth=10.
Yes, that is what I figured out looking at the code.
> We only examine the last blob to see if we can get a delta against
> it. If we do we write that delta out; otherwise we reset our delta
> chain and write the complete object. We also reset our chain after
> writing out 10 deltas, each of which used the immediately prior
> object as its base.
I think this is a perfectly sensible thing to do for the initial import
pass. This will get the data into GIT format quickly and save disk
space over no deltas at all. There is no need to aim for optimal
packing at that point; just lay things out for a faster git-repack pass
to complete the import.
Also, like I said before, you cannot get a nice object layout from your
import process since all revisions of each file end up next to each
other while in practice what you want is all objects for each _commit_
to be next to each other. Otherwise checking out a commit will seek all
over the pack like mad. A final repack using git-repack will create
that nice object layout for you and that is something you cannot do
on the fly.
The git-pack-objects code is also where all the accumulated knowledge
about the best delta packing heuristics has been built up over the last
year. It would be counterproductive to cut yourself off from its value.
If anything more clever can be done with packing, it probably should be
done there.
Finally, I don't see how you could create nice deltas for tree objects
on the fly like you do with blobs without making your code significantly
more complex and using more memory, due to the recursive nature of
directory information. This is not a problem for git-pack-objects since
tree objects are no different from blob objects when deltifying them.
Again, this is why I think you should focus on the quickest generation
of a reasonable intermediate pack and let git-repack make it optimal.
> Since I just found out that in some cases the Mozilla repository has
> 1000s of revisions per file[*1*] and in others only 1 revision per
> file we probably should be adjusting this depth to have a maximum
> of 500 while also having the frontend send us a "I'm switching
> files now" marker so we know to not even bother trying to delta
> the new blob against the last blob as they are likely to not
> delta well[*2*].
No, please don't do that. A deep delta chain won't produce a dramatic
size difference but it _will_ make it really costly to read objects back
from the pack later on. For example, it might make the next git-repack
take many times longer while the size saving will not be more than 20%.
If you agree with me that a final git-repack is the best thing to do,
then the intermediate pack doesn't have to be that small.
> Although complex, the aggressive and grouped deltas appear to
> have saved you 18.2% on this repository. That's not something
> to ignore.
But the aggressive deltas alone saved 13% already. The remaining saving
is questionable since, like I said, it prevents the delta data reuse
optimization which is so nice for reducing the load on servers to
nearly zero.
> A reasonably optimal local pack dictionary could save
> at least 4%[*3*]. Whacking 22% off a 400 MB pack is saving 88 MB.
This is still speculation though. The separate dictionary is certainly
a good idea to try out and it is way more flexible than the grouped
deltas. But I doubt the combined concepts will add up their size
savings linearly.
> Transferring that over the network on an initial clone is like
> downloading all of Eclipse. Or an uncompressed kernel tarball...
... something that most people can afford these days. Plus, not using
grouped objects would make future shallow clones much less costly on the
server as well, and that is what most people will be most interested in
with such large repos; it also has much higher bandwidth-saving value.
Of course this is a trade-off. Tighter packs are likely to be more
costly to use. So far we've contained that cost to the _initial_
generation of the pack (using -a -f with git-repack) while subsequent
repackings for network transfer were basically free on CPU. The grouped
object concept saves around 7 to 11% on pack size in my tests, but it
imposes a non-negligible cost on any further repacking as well as an
additional cost on the fetching of objects out of the pack.
> [*3*] I wrote a brain-dead simple local dictionary selector in Perl.
> It's horribly far from being ideal. But it is consistently
> saving us 4% on the GIT and the Mozilla repository and it's
> pretty darn fast. Shockingly, the C keywords didn't gain
> us very much here; it's project-specific text that's the
> real win.
>
> Looking at chunks which are frequently copied in deltas
> from base objects and breaking those chunks up into
> smaller common chunks, then loading those most frequent
> common chunks into the pack dictionary would most likely
> produce far better results.
I think such a global dictionary per pack is something that could be
really nice. Again it limits the cost to the initial repack process
while having no cost impact on either the delta reuse optimization or
the object checkout, which is not the case for object grouping.
Certainly worth experimenting with ... but independently of any import
process, I think.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 15:46 ` Nicolas Pitre
@ 2006-08-21 16:14 ` Jon Smirl
2006-08-21 17:48 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Jon Smirl @ 2006-08-21 16:14 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Shawn Pearce, git
How about making the length of delta chains an exponential function of
the number of revs? In Mozilla, configure.in has 1,700 revs and it is a
250K file. If you store a full copy every 10 revs, that is about 43MB
of prezip data (170 full copies x 250K) that almost no one is going to
look at. The chain lengths should reflect the relative probability that
someone is going to ask to see the revs. That is not at all a uniform
function.
Personally I am still in favor of a two-pack system. One archival pack
stores everything in a single chain, and size, not speed, is its most
important attribute. It is marked readonly and only functions as an
archive; git-repack never touches it. It might even use a more compact
compression algorithm.
The second pack is for storing more recent revisions. The archival
pack would be constructed such that none of the files needed for the
head revisions of any branch are in it. They would all be in the
second pack.
Over time the second pack may grow large and another archival pack
can be created. The first one would still be maintained in its
readonly form. git could be optimized to always search for objects in
non-archival packs before even opening the index of an archival one.
This may be a path to partial repositories. Instead of downloading the
real archival pack I could download just an index for it. The index
entries would be marked to indicate that these objects are valid but
not-present.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 6:46 ` Shawn Pearce
2006-08-21 10:24 ` Jakub Narebski
@ 2006-08-21 16:23 ` Jon Smirl
1 sibling, 0 replies; 36+ messages in thread
From: Jon Smirl @ 2006-08-21 16:23 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Nicolas Pitre, git
On 8/21/06, Shawn Pearce <spearce@spearce.org> wrote:
> Based on the stats Jon provided me it appeared as though the code
> skipped about 9% of the objects, and at the time I thought I knew
> why but looking at it again right now I'm not sure why. :-)
9% of the objects may be duplicates because of branching. I have no
way of detecting them since I am not maintaining a map of the sha1s
encountered. The way CVS branching works, some revisions are built
using forward diffs and some reverse; I can't tell when they will end
up with the same result.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 16:14 ` Jon Smirl
@ 2006-08-21 17:48 ` Nicolas Pitre
2006-08-21 17:55 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-21 17:48 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
On Mon, 21 Aug 2006, Jon Smirl wrote:
> How about making the length of delta chains an exponential function of
> the number of revs? In Mozilla configure.in has 1,700 revs and it is a
> 250K file. If you store a full copy every 10 revs that is 43MB
> (prezip) of data that almost no one is going to look at.
> The chain
> lengths should reflect the relative probability that someone is going
> to ask to see the revs. That is not at all a uniform function.
1) You can do that already with stock git-repack.
2) The current git-pack-objects code can and does produce a delta "tree"
off of a single base object. It doesn't have to be a linear list.
Therefore, even if the default depth is 10, you may very well have many
deltas pointing to the _same_ base object, effectively making the
compression ratio much larger than 1/10.
If for example each object has 2 delta children, and each of those
deltas also has 2 delta children, you could have up to 39366 delta
objects attached to a _single_ undeltified base object.
And of course the git-pack-objects code doesn't limit the number of
delta children in any way, so this could theoretically be infinite even
though the max depth is 10. OK, the delta matching window limits that
somewhat, but nothing prevents you from repacking with a larger window
since that parameter has no penalty on the reading of objects out of
the pack.
> Personally I am still in favor of a two-pack system. One archival pack
> stores everything in a single chain, and size, not speed, is its most
> important attribute. It is marked readonly and only functions as an
> archive; git-repack never touches it. It might even use a more compact
> compression algorithm.
>
> The second pack is for storing more recent revisions. The archival
> pack would be constructed such that none of the files needed for the
> head revisions of any branch are in it. They would all be in the
> second pack.
Personally I'm still against that. All arguments put forward for a
different or multiple packing system are based on unproven assumptions
so far and none of those arguments actually present significant
advantages over the current system. For example, I was really intrigued
by the potential of object grouping into a single zlib stream at first,
but it turns out not to be so great after actual testing.
I still think that a global zlib dictionary is a good idea. Not because
it looks like it'll make packs enormously smaller, but rather because it
imposes no performance regression over the current system. And I
strongly believe that you have to have a really strong case for
promoting a solution that carries a performance regression, something
like over 25% smaller packs for example. But so far that hasn't
happened.
> This may be a path to partial repositories. Instead of downloading the
> real archival pack I could download just an index for it. The index
> entries would be marked to indicate that these objects are valid but
> not-present.
An index without the actual objects is simply useless. You could do
without the index entirely in that case anyway since the mere presence
of an entry in the index doesn't give you anything really useful.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 17:48 ` Nicolas Pitre
@ 2006-08-21 17:55 ` Nicolas Pitre
2006-08-21 18:01 ` Nicolas Pitre
0 siblings, 1 reply; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-21 17:55 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
On Mon, 21 Aug 2006, Nicolas Pitre wrote:
> If for example each object has 2 delta children, and each of those
> deltas also has 2 delta children, you could have up to 39366 delta objects
> attached to a _single_ undeltified base object.
Sorry, I've got the math wrong. That is 3582 deltas, given a
conservative number of delta children = 2.
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit
2006-08-21 17:55 ` Nicolas Pitre
@ 2006-08-21 18:01 ` Nicolas Pitre
0 siblings, 0 replies; 36+ messages in thread
From: Nicolas Pitre @ 2006-08-21 18:01 UTC (permalink / raw)
To: Jon Smirl; +Cc: Shawn Pearce, git
On Mon, 21 Aug 2006, Nicolas Pitre wrote:
> On Mon, 21 Aug 2006, Nicolas Pitre wrote:
>
> > If for example each object has 2 delta children, and each of those
> > deltas also has 2 delta children, you could have up to 39366 delta objects
> > attached to a _single_ undeltified base object.
>
> Sorry I've got the math wrong. That is 3582 deltas, given a
> conservative number of delta children = 2.
OK I just can't count. That is 2046.
But since I made a fool of myself already, why not try with 3 delta
children for a delta depth of 10 just for fun, now that I should have it
right. The answer is 88572.
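For the record, these counts are just the geometric series
k + k^2 + ... + k^d for k delta children per object and maximum depth d.
A throwaway check, not git code:

#include <stdio.h>

static unsigned long delta_count(unsigned long k, int depth)
{
	unsigned long level = 1, total = 0;
	int i;

	for (i = 0; i < depth; i++) {
		level *= k;     /* k^(i+1) deltas live at depth i+1 */
		total += level;
	}
	return total;
}

int main(void)
{
	printf("k=2, depth=10: %lu\n", delta_count(2, 10));  /* 2046 */
	printf("k=3, depth=10: %lu\n", delta_count(3, 10));  /* 88572 */
	return 0;
}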
Anyway I hope you've got my point now. ;-)
Nicolas
^ permalink raw reply [flat|nested] 36+ messages in thread
Thread overview: 36+ messages
2006-08-16 17:20 Huge win, compressing a window of delta runs as a unit Jon Smirl
2006-08-17 4:07 ` Shawn Pearce
2006-08-17 7:56 ` Johannes Schindelin
2006-08-17 8:07 ` Johannes Schindelin
2006-08-17 14:36 ` Jon Smirl
2006-08-17 15:45 ` Johannes Schindelin
2006-08-17 16:33 ` Nicolas Pitre
2006-08-17 17:05 ` Johannes Schindelin
2006-08-17 17:22 ` Jon Smirl
2006-08-17 18:15 ` Nicolas Pitre
2006-08-17 17:17 ` Jon Smirl
2006-08-17 17:32 ` Nicolas Pitre
2006-08-17 18:06 ` Jon Smirl
2006-08-17 17:22 ` Nicolas Pitre
2006-08-17 18:03 ` Jon Smirl
2006-08-17 18:24 ` Nicolas Pitre
2006-08-18 4:03 ` Nicolas Pitre
2006-08-18 12:53 ` Jon Smirl
2006-08-18 16:30 ` Nicolas Pitre
2006-08-18 16:56 ` Jon Smirl
2006-08-21 3:45 ` Nicolas Pitre
2006-08-21 6:46 ` Shawn Pearce
2006-08-21 10:24 ` Jakub Narebski
2006-08-21 16:23 ` Jon Smirl
2006-08-18 13:15 ` Jon Smirl
2006-08-18 13:36 ` Johannes Schindelin
2006-08-18 13:50 ` Jon Smirl
2006-08-19 19:25 ` Linus Torvalds
2006-08-18 16:25 ` Nicolas Pitre
2006-08-21 7:06 ` Shawn Pearce
2006-08-21 14:07 ` Jon Smirl
2006-08-21 15:46 ` Nicolas Pitre
2006-08-21 16:14 ` Jon Smirl
2006-08-21 17:48 ` Nicolas Pitre
2006-08-21 17:55 ` Nicolas Pitre
2006-08-21 18:01 ` Nicolas Pitre