* Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-16 17:20 UTC
To: Shawn Pearce, git

Shawn put together a new version of his import utility that packs all
of the deltas from a run into a single blob instead of one blob per
delta. The idea is to put 10 or more deltas into each delta entry
instead of one. The index format would map the 10 sha1's to a single
packed delta entry which would be expanded when needed. Note that you
probably needed multiple entries out of the delta pack to generate the
revision you were looking for, so this is no real loss on extraction.

I ran it overnight on mozcvs. If his delta pack code is correct this
is a huge win.

One entry per delta - 845,420,150
Packed deltas       - 295,018,474

65% smaller

The effect of packing the deltas is to totally eliminate many of the
redundant zlib dictionaries. This is without using a zlib preset
dictionary, which gains another 4% from a 4KB dictionary.

--
Jon Smirl
jonsmirl@gmail.com
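A rough standalone illustration of the effect Jon describes: deflating several similar buffers in one zlib stream lets later buffers compress against the text of the earlier ones, while one stream per buffer rebuilds the adaptive dictionary from scratch each time. The "deltas" below are synthetic strings and the sizes are made up; this is not git code.

/* grouped_deflate_demo.c -- compile with: cc grouped_deflate_demo.c -lz */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define NDELTA 10

int main(void)
{
	char delta[NDELTA][256];
	Bytef out[1 << 16];
	Bytef all[NDELTA * 256];
	uLongf outlen;
	uLong total_separate = 0, grouped, pos = 0;
	int i;

	/* ten small, similar "deltas" against the same base object */
	for (i = 0; i < NDELTA; i++)
		snprintf(delta[i], sizeof(delta[i]),
			 "copy 0-4096 from base; insert \"revision %d tweaked one line\"\n", i);

	/* one zlib stream per delta: the shared structure is paid for ten times */
	for (i = 0; i < NDELTA; i++) {
		outlen = sizeof(out);
		compress2(out, &outlen, (const Bytef *)delta[i],
			  (uLong)strlen(delta[i]), Z_DEFAULT_COMPRESSION);
		total_separate += outlen;
	}

	/* all ten deltas in one stream: deltas 2..10 compress against
	 * the deltas already seen, which is the win being measured */
	for (i = 0; i < NDELTA; i++) {
		memcpy(all + pos, delta[i], strlen(delta[i]));
		pos += strlen(delta[i]);
	}
	outlen = sizeof(out);
	compress2(out, &outlen, all, pos, Z_DEFAULT_COMPRESSION);
	grouped = outlen;

	printf("separate streams: %lu bytes compressed\n", (unsigned long)total_separate);
	printf("single stream:    %lu bytes compressed\n", (unsigned long)grouped);
	return 0;
}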
* Re: Huge win, compressing a window of delta runs as a unit
From: Shawn Pearce
Date: 2006-08-17 4:07 UTC
To: Jon Smirl; Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> Shawn put together a new version of his import utility that packs all
> of the deltas from a run into a single blob instead of one blob per
> delta. The idea is to put 10 or more deltas into each delta entry
> instead of one. The index format would map the 10 sha1's to a single
> packed delta entry which would be expanded when needed. Note that you
> probably needed multiple entries out of the delta pack to generate the
> revision you were looking for, so this is no real loss on extraction.
>
> I ran it overnight on mozcvs. If his delta pack code is correct this
> is a huge win.
>
> One entry per delta - 845,420,150
> Packed deltas       - 295,018,474
>
> 65% smaller
>
> The effect of packing the deltas is to totally eliminate many of the
> redundant zlib dictionaries.

I'm going to try to integrate this into core GIT this weekend.
My current idea is to make use of the OBJ_EXT type flag to add
an extended header field behind the length which describes the
"chunk" as being a delta chain compressed in one zlib stream.
I'm not overly concerned about saving lots of space in the header
here as it looks like we're winning a huge amount of pack space,
so the extended header will probably itself be a couple of bytes.
This keeps the shorter reserved types free for other great ideas. :)

My primary goal of integrating it into core GIT is to take advantage
of verify-pack to check the file fast-import is producing. Plus
having support for it in sha1_file.c will make it easier to
performance test the common access routines that need to be fast,
like commit and tree walking.

My secondary goal is to get a patchset which other folks can try on
their own workloads to see if it's as effective as what Jon is seeing
on the Mozilla archive.

Unfortunately I can't think of a way to make this type of pack
readable by older software. So this could be creating a pretty
big change in the pack format, relatively speaking. :)

--
Shawn.
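Shawn does not spell out the byte layout of this extended header, so the sketch below is one plausible encoding only: the OBJ_EXT value, the EXT_DELTA_CHAIN code, and the one-byte delta count are all assumptions. It reuses the type-in-high-nibble varint scheme that git's encode_header() (quoted later in this thread) already uses.

#include <stddef.h>

#define OBJ_EXT         5   /* placeholder; the real flag's value isn't given here */
#define EXT_DELTA_CHAIN 1   /* hypothetical extension code */

static size_t encode_varint_header(unsigned type, unsigned long size,
				   unsigned char *hdr)
{
	size_t n = 1;
	unsigned char c = (type << 4) | (size & 15);

	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
		n++;
	}
	*hdr = c;
	return n;
}

/* "OBJ_EXT <inflated size> EXT_DELTA_CHAIN <delta count>": a couple of
 * extra bytes per chain, which is roughly what Shawn estimates. */
static size_t encode_chain_header(unsigned long inflated_size,
				  unsigned ndeltas, unsigned char *hdr)
{
	size_t n = encode_varint_header(OBJ_EXT, inflated_size, hdr);
	hdr[n++] = EXT_DELTA_CHAIN;
	hdr[n++] = ndeltas & 0xff;	/* chains are short; one byte is enough here */
	return n;
}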
* Re: Huge win, compressing a window of delta runs as a unit
From: Johannes Schindelin
Date: 2006-08-17 7:56 UTC
To: Shawn Pearce; Cc: Jon Smirl, git

Hi,

On Thu, 17 Aug 2006, Shawn Pearce wrote:

> Jon Smirl <jonsmirl@gmail.com> wrote:
> >
> > One entry per delta - 845,420,150
> > Packed deltas       - 295,018,474
> >
> > 65% smaller

Impressed.

> Unfortunately I can't think of a way to make this type of pack
> readable by older software. So this could be creating a pretty
> big change in the pack format, relatively speaking. :)

Not necessarily: the format could stay the same, where multiple object
names point to the same entry. It should not be a problem, as long as
this change is kept local: the same git tools which created this kind
of pack have to read it.

However, I see some real problems when a public repository has a pack
like this: if the client is older, the server has to do some real
work, especially in the case of a fresh clone. It gets even worse with
a public HTTP repo.

The obvious short-term solution would be to make this opt-in, and use
it only on private repositories.

Then, one could think about how to handle the delta-chains in the pack
protocol (it should be easy, since you just have to store one object
instead of a few), and make this a "supports" option.

Ciao,
Dscho
* Re: Huge win, compressing a window of delta runs as a unit
From: Johannes Schindelin
Date: 2006-08-17 8:07 UTC
To: Shawn Pearce; Cc: Jon Smirl, git

Hi,

On Thu, 17 Aug 2006, Johannes Schindelin wrote:

> However, I see some real problems when a public repository has a pack
> like this: if the client is older, the server has to do some real
> work, especially in the case of a fresh clone. It gets even worse with
> a public HTTP repo.
>
> The obvious short-term solution would be to make this opt-in, and use
> it only on private repositories.
>
> Then, one could think about how to handle the delta-chains in the pack
> protocol (it should be easy, since you just have to store one object
> instead of a few), and make this a "supports" option.

Oh, and I completely forgot: IIUC you have to unpack the (possibly big)
delta-chain until you have the delta you really want, right?

So, you might want to have a clever mechanism to keep the delta-chains
short for recent objects (i.e. objects you are likely to need more
often). At least, the delta-chains should be limited by size (_not_ by
number of deltas: you can have huge deltas, and if you have to unpack 5
huge deltas before getting to the huge delta you really need, it takes
a really long time).

Ciao,
Dscho
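A minimal sketch of the policy Dscho suggests here: close the current group once the cumulative inflated delta size crosses a byte budget, instead of after a fixed number of deltas. MAX_GROUP_BYTES and struct delta_info are invented for illustration; git has no such structures.

#include <stddef.h>

#define MAX_GROUP_BYTES (64 * 1024)	/* arbitrary budget */

struct delta_info {
	unsigned long inflated_size;	/* size of this delta once inflated */
};

/* Return how many leading deltas fit in one group under the byte budget,
 * always admitting at least one so progress is made. */
static size_t group_cutoff(const struct delta_info *d, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		total += d[i].inflated_size;
		if (i && total > MAX_GROUP_BYTES)
			break;
	}
	return i;
}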
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 14:36 UTC
To: Johannes Schindelin; Cc: Shawn Pearce, git

On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> At least, the delta-chains should be limited by size (_not_ by number of
> deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> before getting to the huge delta you really need, it takes a really long
> time).

This is not an obvious conclusion. Reducing our pack from 845MB to
292MB eliminates a huge amount of IO. All of the work I am doing with
this pack has been totally IO bound. The CPU time to get from one
delta to another is minor compared to the IO time increase if you
break the chains up. Each time you break the chain you have to store a
full copy of the file; the IO increase from doing this outweighs the
CPU time reduction.

I would use a two-pack model. One pack holds the 292MB archive and the
second pack is made from recent changes. They can both be the same
format since the chains in the second pack won't be very long. We
might want to put a flag into an archive pack that tells git-repack
not to touch it by default.

As for public servers there is an immediate win in shifting to the new
pack format. Three hour downloads vs one hour, plus the bandwidth
bills. Are the tools smart enough to say this is a newer pack format,
upgrade? It takes far less than two hours to upgrade your git install.

--
Jon Smirl
jonsmirl@gmail.com
* Re: Huge win, compressing a window of delta runs as a unit
From: Johannes Schindelin
Date: 2006-08-17 15:45 UTC
To: Jon Smirl; Cc: Shawn Pearce, git

Hi,

On Thu, 17 Aug 2006, Jon Smirl wrote:

> On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > At least, the delta-chains should be limited by size (_not_ by number of
> > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > before getting to the huge delta you really need, it takes a really long
> > time).
>
> This is not an obvious conclusion.

The big win is bought by having _one_ zlib stream for multiple deltas,
right?

So, every time you want to access the _last_ delta in the chain, you
unpack _all_ of them. This quite obvious conclusion is obviously your
reason to propose two packs, one for the archive of "old" objects, and
one for the "new" objects.

> As for public servers there is an immediate win in shifting to the new
> pack format. Three hour downloads vs one hour, plus the bandwidth
> bills. Are the tools smart enough to say this is a newer pack format,
> upgrade? It takes far less than two hours to upgrade your git install.

Have you thought about a non-upgraded client? The server has to repack
in that case, and if the client wants a clone, you might not even have
the time to kiss your server good-bye before it goes belly up.

There is an obvious choice here as to how fast people would upgrade
their servers.

Ciao,
Dscho
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 16:33 UTC
To: Johannes Schindelin; Cc: Jon Smirl, Shawn Pearce, git

On Thu, 17 Aug 2006, Johannes Schindelin wrote:

> Hi,
>
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > At least, the delta-chains should be limited by size (_not_ by number of
> > > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > > before getting to the huge delta you really need, it takes a really long
> > > time).
> >
> > This is not an obvious conclusion.
>
> The big win is bought by having _one_ zlib stream for multiple deltas,
> right?
>
> So, every time you want to access the _last_ delta in the chain, you
> unpack _all_ of them.

This is the case whether deltas are separately deflated or not.

> This quite obvious conclusion is obviously your
> reason to propose two packs, one for the archive of "old" objects, and
> one for the "new" objects.

Old objects are usually further down the delta chain and also stored
further from the beginning of the pack. Hence "new" objects could
still have quick access: even if a delta chain is all in the same zlib
stream, inflating the whole of the stream to get at "new" objects will
likely not be necessary, just as today only the needed deltas are
inflated.

> > As for public servers there is an immediate win in shifting to the new
> > pack format. Three hour downloads vs one hour, plus the bandwidth
> > bills. Are the tools smart enough to say this is a newer pack format,
> > upgrade? It takes far less than two hours to upgrade your git install.
>
> Have you thought about a non-upgraded client? The server has to repack
> in that case, and if the client wants a clone, you might not even have
> the time to kiss your server good-bye before it goes belly up.

Nah. The only overhead for a server to feed an "old" pack format from a
"new" pack file is to inflate and deflate some data. That is not _that_
costly.


Nicolas
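A small, self-contained illustration of Nicolas's point that an object near the front of a grouped stream can be read without inflating the rest. The "group" here is just three dummy payloads compressed with compress(); the offsets, sizes and strings are invented for the demo, and none of this is git code. Compile with -lz.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
	const char *objs[3] = { "first object ", "second object ", "third object " };
	Bytef plain[256], packed[256], out[256];
	uLongf packedlen = sizeof(packed);
	uLong plainlen = 0, want_off, want_len;
	z_stream zs;
	int i;

	for (i = 0; i < 3; i++) {
		memcpy(plain + plainlen, objs[i], strlen(objs[i]));
		plainlen += strlen(objs[i]);
	}
	compress(packed, &packedlen, plain, plainlen);

	/* want the second object: skip the first, read the second, never
	 * touch the part of the stream holding the third */
	want_off = strlen(objs[0]);
	want_len = strlen(objs[1]);

	memset(&zs, 0, sizeof(zs));
	inflateInit(&zs);
	zs.next_in = packed;
	zs.avail_in = packedlen;
	zs.next_out = out;
	zs.avail_out = want_off + want_len;	/* inflate no further than needed */
	while (zs.total_out < want_off + want_len && inflate(&zs, Z_SYNC_FLUSH) == Z_OK)
		;
	inflateEnd(&zs);

	printf("wanted: %.*s\n", (int)want_len, (const char *)(out + want_off));
	printf("inflated %lu of %lu bytes in the group\n",
	       (unsigned long)(want_off + want_len), (unsigned long)plainlen);
	return 0;
}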
* Re: Huge win, compressing a window of delta runs as a unit
From: Johannes Schindelin
Date: 2006-08-17 17:05 UTC
To: Nicolas Pitre; Cc: Jon Smirl, Shawn Pearce, git

Hi,

On Thu, 17 Aug 2006, Nicolas Pitre wrote:

> On Thu, 17 Aug 2006, Johannes Schindelin wrote:
>
> > On Thu, 17 Aug 2006, Jon Smirl wrote:
> >
> > > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > > At least, the delta-chains should be limited by size (_not_ by number
> > > > of deltas: you can have huge deltas, and if you have to unpack 5 huge
> > > > deltas before getting to the huge delta you really need, it takes a
> > > > really long time).
> > >
> > > This is not an obvious conclusion.
> >
> > The big win is bought by having _one_ zlib stream for multiple deltas,
> > right?
> >
> > So, every time you want to access the _last_ delta in the chain, you
> > unpack _all_ of them.
>
> This is the case whether deltas are separately deflated or not.

Oh, now I get it! The delta chain is comprised of exactly those deltas
which are needed to reconstruct a certain object from another object
which was stored undeltified in the pack. Now it makes sense to me that
they could be bundled before being deflated.

In the pack-index, the objects of that delta-chain could just point to
the delta-chain object, which has a mini-index in its extended header.

Or did I misunderstand again?

Ciao,
Dscho
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 17:22 UTC
To: Nicolas Pitre; Cc: Johannes Schindelin, Shawn Pearce, git

On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> Nah. The only overhead for a server to feed an "old" pack format from a
> "new" pack file is to inflate and deflate some data. That is not _that_
> costly.

It is costly because the Mozilla pack is going to inflate from 295MB
to 845MB. It will take you much longer to download an extra 500MB than
upgrading your client would take.

--
Jon Smirl
jonsmirl@gmail.com
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 18:15 UTC
To: Jon Smirl; Cc: Johannes Schindelin, Shawn Pearce, git

On Thu, 17 Aug 2006, Jon Smirl wrote:

> On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> > Nah. The only overhead for a server to feed an "old" pack format from a
> > "new" pack file is to inflate and deflate some data. That is not _that_
> > costly.
>
> It is costly because the Mozilla pack is going to inflate from 295MB
> to 845MB. It will take you much longer to download an extra 500MB than
> upgrading your client would take.

I didn't say it isn't costly. I said that it isn't _that_ costly to the
point of bringing down a server.


Nicolas
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 17:17 UTC
To: Johannes Schindelin; Cc: Shawn Pearce, git

On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > On 8/17/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > At least, the delta-chains should be limited by size (_not_ by number of
> > > deltas: you can have huge deltas, and if you have to unpack 5 huge deltas
> > > before getting to the huge delta you really need, it takes a really long
> > > time).
> >
> > This is not an obvious conclusion.
>
> The big win is bought by having _one_ zlib stream for multiple deltas,
> right?
>
> So, every time you want to access the _last_ delta in the chain, you
> unpack _all_ of them. This quite obvious conclusion is obviously your
> reason to propose two packs, one for the archive of "old" objects, and
> one for the "new" objects.

Do some measurements: the IO vs CPU time trade-off is way in favor of
eliminating the IO. It really doesn't take very much CPU to unpack the
delta chain.

The two-pack proposal was not about reducing the delta chain length;
they are reverse deltas, so the newest version is always at the front.
Two packs are used to avoid repacking the 280MB pack when you do a
repack command. It takes 2-3 hours to repack 280MB. Even if you just
copy the old pack to the new one it takes 30 minutes to do all of the
IO.

> > As for public servers there is an immediate win in shifting to the new
> > pack format. Three hour downloads vs one hour, plus the bandwidth
> > bills. Are the tools smart enough to say this is a newer pack format,
> > upgrade? It takes far less than two hours to upgrade your git install.
>
> Have you thought about a non-upgraded client? The server has to repack
> in that case, and if the client wants a clone, you might not even have
> the time to kiss your server good-bye before it goes belly up.

Is there a pack format version number built into the server protocol?
If not, there needs to be one. When there is a mismatch with the server
pack version the client should simply display an error requesting an
upgrade of the client software.

Git should be designed for forward evolution, not infinite backward
compatibility. It is easy to upgrade your client to support the new
protocol. The protocol just needs to ensure that the client reliably
gets an error about the need to upgrade.

Forward evolution implies that a client is able to work with older
servers, but not the inverse, that new servers have to work with old
clients.

> There is an obvious choice here as to how fast people would upgrade
> their servers.
>
> Ciao,
> Dscho

--
Jon Smirl
jonsmirl@gmail.com
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 17:32 UTC
To: Jon Smirl; Cc: Johannes Schindelin, Shawn Pearce, git

On Thu, 17 Aug 2006, Jon Smirl wrote:

> Is there a pack format version number built into the server protocol?
> If not, there needs to be one. When there is a mismatch with the server
> pack version the client should simply display an error requesting an
> upgrade of the client software.

There is a pack version number. Reception of an unknown pack already
produces an error:

	if (!pack_version_ok(hdr->hdr_version))
		die("unknown pack file version %d", ntohl(hdr->hdr_version));

That will occur really early in a pull or clone when using the native
protocol.

> Git should be designed for forward evolution, not infinite backward
> compatibility. It is easy to upgrade your client to support the new
> protocol. The protocol just needs to ensure that the client reliably
> gets an error about the need to upgrade.
>
> Forward evolution implies that a client is able to work with older
> servers, but not the inverse, that new servers have to work with old
> clients.

And still, if we really wanted, the server could have the ability to
stream an old pack format by simply inflating the grouped objects and
deflating them individually all on the fly, which is not necessarily
that costly. But it might still be a good idea to simply have the
ability to generate both formats and make the grouped deltas the
default only after a while.


Nicolas
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 18:06 UTC
To: Nicolas Pitre; Cc: Johannes Schindelin, Shawn Pearce, git

On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> On Thu, 17 Aug 2006, Jon Smirl wrote:
>
> > Is there a pack format version number built into the server protocol?
> > If not, there needs to be one. When there is a mismatch with the server
> > pack version the client should simply display an error requesting an
> > upgrade of the client software.
>
> There is a pack version number. Reception of an unknown pack already
> produces an error:
>
>	if (!pack_version_ok(hdr->hdr_version))
>		die("unknown pack file version %d", ntohl(hdr->hdr_version));
>
> That will occur really early in a pull or clone when using the native
> protocol.

Can this be fixed so that if the pack version coming from the server is
greater than the one handled by the client, the client gets a nice hint
about upgrading?

--
Jon Smirl
jonsmirl@gmail.com
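One possible shape for the hint Jon is asking about. This is a sketch, not a patch against git: die() below is a local stand-in, PACK_VERSION_SUPPORTED is a placeholder for whatever the client's newest supported pack version happens to be, and the "2 or 3" check only mirrors the pack_version_ok() test quoted above.

#include <arpa/inet.h>	/* ntohl, htonl */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define PACK_VERSION_SUPPORTED 3u	/* placeholder for the client's maximum */

static void die(const char *fmt, ...)
{
	va_list ap;
	va_start(ap, fmt);
	fprintf(stderr, "fatal: ");
	vfprintf(stderr, fmt, ap);
	fprintf(stderr, "\n");
	va_end(ap);
	exit(128);
}

/* hdr_version_be is the big-endian version word from the pack header */
static void check_pack_version(unsigned int hdr_version_be)
{
	unsigned int version = ntohl(hdr_version_be);

	if (version > PACK_VERSION_SUPPORTED)
		die("pack version %u is newer than this client supports (max %u); "
		    "please upgrade your git installation",
		    version, PACK_VERSION_SUPPORTED);
	/* anything else falls back to the existing message */
	if (version != 2 && version != 3)
		die("unknown pack file version %u", version);
}

int main(void)
{
	check_pack_version(htonl(4));	/* demo: triggers the upgrade hint */
	return 0;
}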
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 17:22 UTC
To: Shawn Pearce; Cc: Jon Smirl, git

On Thu, 17 Aug 2006, Shawn Pearce wrote:

> I'm going to try to integrate this into core GIT this weekend.
> My current idea is to make use of the OBJ_EXT type flag to add
> an extended header field behind the length which describes the
> "chunk" as being a delta chain compressed in one zlib stream.
> I'm not overly concerned about saving lots of space in the header
> here as it looks like we're winning a huge amount of pack space,
> so the extended header will probably itself be a couple of bytes.
> This keeps the shorter reserved types free for other great ideas. :)

We're striving for optimal data storage here, so don't be afraid to use
one of the available types for an "object stream" object. Because when
you think of it, the deflating of multiple objects into a single zlib
stream can be applied to all object types, not only deltas. If ever
deflating many blobs into one zlib stream is deemed worth it then the
encoding will already be ready for it. Also you can leverage existing
code to write headers, etc.

I'd suggest you use OBJ_GROUP = 0 as a new primary object type. The
"size" field in the header could then become the number of objects that
are included in the group. Most of the time that will fit in the low 4
bits of the first header byte, but if there are more than 15 grouped
objects then more bits can be used on the following byte. Anyway, so
far all the code to generate and parse that is already there. If ever
there is a need for more extensions, that could be prefixed with a pure
zero byte (an object group with a zero object count, which is
distinguishable from a real group).

Then, having the number of grouped objects, you just have to list the
usual headers for those objects, which are their type and inflated size
just like regular object headers, including the base sha1 for deltas.
Again, you already have code to produce and parse those. And finally
just append the objects' payload in a single deflated stream.

This way the reading of an object from a group can be optimized if the
object data is located at the beginning of the stream, such that you
only need to inflate the amount of bytes leading to the desired data
(possibly caching those for further delta replaying), inflate the
needed data for the desired object, and then ignore the remainder of
the stream.


Nicolas
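A sketch of the on-disk layout Nicolas describes. This proposal was never adopted by git, so OBJ_GROUP, struct grouped_object and write_group_headers() are illustrative only; the varint header encoding mirrors encode_header() from pack-objects (type in bits 4-6 of the first byte, size continued 7 bits at a time).

#include <stddef.h>
#include <string.h>

#define OBJ_GROUP 0	/* proposed new primary type */
#define OBJ_DELTA 7	/* existing delta type, used for the member headers */

struct grouped_object {
	unsigned type;			/* OBJ_DELTA for delta chains */
	unsigned long inflated_size;	/* size of this member once inflated */
	unsigned char base_sha1[20];	/* base object, for deltas */
};

static size_t encode_header(unsigned type, unsigned long size, unsigned char *hdr)
{
	size_t n = 1;
	unsigned char c = (type << 4) | (size & 15);

	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
		n++;
	}
	*hdr = c;
	return n;
}

/* Emit "OBJ_GROUP <count>", then one regular-looking header per member
 * (type, inflated size, base sha1 for deltas).  The single deflated
 * payload stream would follow this in the pack; deflating it is omitted. */
static size_t write_group_headers(const struct grouped_object *obj, size_t count,
				  unsigned char *out)
{
	size_t n = encode_header(OBJ_GROUP, count, out);
	size_t i;

	for (i = 0; i < count; i++) {
		n += encode_header(obj[i].type, obj[i].inflated_size, out + n);
		if (obj[i].type == OBJ_DELTA) {
			memcpy(out + n, obj[i].base_sha1, 20);
			n += 20;
		}
	}
	return n;
}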
* Re: Huge win, compressing a window of delta runs as a unit
From: Jon Smirl <jonsmirl@gmail.com>
Date: 2006-08-17 18:03 UTC
To: Nicolas Pitre; Cc: Shawn Pearce, git

On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> We're striving for optimal data storage here, so don't be afraid to use
> one of the available types for an "object stream" object. Because when
> you think of it, the deflating of multiple objects into a single zlib
> stream can be applied to all object types, not only deltas. If ever
> deflating many blobs into one zlib stream is deemed worth it then the
> encoding will already be ready for it. Also you can leverage existing
> code to write headers, etc.

Here are two more cases that need to be accounted for in the packs.

1) If you zip something and it gets bigger. We need an entry that just
stores the object without it being zipped. Zipping jpegs or mpegs will
likely make them significantly bigger. Or does zlib already detect this
case and do a null compression?

2) If you delta something and the delta is bigger than the object being
deltified. The delta code should detect this and store the full object
instead of the delta. Again, jpegs and mpegs will trigger this. You may
even want to say that the delta has to be smaller than 80% of the full
object.

Shawn is planning on looking into true dictionary-based compression.
That will generate even more types inside of a pack. If dictionary
compression works out, full-text search can be added with a little more
overhead. True dictionary-based compression has the potential to go
even smaller than the current 280MB. The optimal way to do this is for
each pack to contain its own dictionary.

--
Jon Smirl
jonsmirl@gmail.com
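A sketch of the two guards Jon describes; git's real heuristics live in try_delta(), as the next reply points out, so the names and the 80% cutoff below are Jon's suggestion spelled out for illustration rather than git's actual code.

#include <stddef.h>

enum storage_choice { STORE_PLAIN, STORE_DEFLATED, STORE_DELTA };

/* 1) deflate only helped if the output actually shrank */
static enum storage_choice pick_deflate(size_t plain_len, size_t deflated_len)
{
	return deflated_len < plain_len ? STORE_DEFLATED : STORE_PLAIN;
}

/* 2) keep a delta only if it is under 80% of the full object,
 *    otherwise store the object whole (Jon's suggested cutoff) */
static int delta_is_worth_it(size_t plain_len, size_t delta_len)
{
	return delta_len * 5 < plain_len * 4;
}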
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-17 18:24 UTC
To: Jon Smirl; Cc: Shawn Pearce, git

On Thu, 17 Aug 2006, Jon Smirl wrote:

> On 8/17/06, Nicolas Pitre <nico@cam.org> wrote:
> > We're striving for optimal data storage here, so don't be afraid to use
> > one of the available types for an "object stream" object. Because when
> > you think of it, the deflating of multiple objects into a single zlib
> > stream can be applied to all object types, not only deltas. If ever
> > deflating many blobs into one zlib stream is deemed worth it then the
> > encoding will already be ready for it. Also you can leverage existing
> > code to write headers, etc.
>
> Here are two more cases that need to be accounted for in the packs.
>
> 1) If you zip something and it gets bigger. We need an entry that just
> stores the object without it being zipped. Zipping jpegs or mpegs will
> likely make them significantly bigger. Or does zlib already detect this
> case and do a null compression?

zlib appears to do the right thing here.

> 2) If you delta something and the delta is bigger than the object being
> deltified. The delta code should detect this and store the full object
> instead of the delta. Again, jpegs and mpegs will trigger this. You may
> even want to say that the delta has to be smaller than 80% of the full
> object.

I'd invite you to read the try_delta() code in [builtin-]pack-objects.c.
You might find that we're already way more aggressive than that, with
rather simple heuristics that have been tweaked and tuned for over a
year now. The commit logs for that file (before the rename to
builtin-pack-objects.c) are also quite enlightening.


Nicolas
* Re: Huge win, compressing a window of delta runs as a unit
From: Nicolas Pitre
Date: 2006-08-18 4:03 UTC
To: Jon Smirl; Cc: Shawn Pearce, git

On Wed, 16 Aug 2006, Jon Smirl wrote:

> Shawn put together a new version of his import utility that packs all
> of the deltas from a run into a single blob instead of one blob per
> delta. The idea is to put 10 or more deltas into each delta entry
> instead of one. The index format would map the 10 sha1's to a single
> packed delta entry which would be expanded when needed. Note that you
> probably needed multiple entries out of the delta pack to generate the
> revision you were looking for, so this is no real loss on extraction.
>
> I ran it overnight on mozcvs. If his delta pack code is correct this
> is a huge win.
>
> One entry per delta - 845,420,150
> Packed deltas       - 295,018,474
>
> 65% smaller

Well, I have to state that I don't believe those results to be right.
Not that I question the results you obtained, but rather that Shawn's
pack generation is most probably broken.

Since I was highly suspicious looking at the above size difference (it
is just too good to be true), I did a quick hack to pack-objects.c to
produce packs with all deltas depending on the same root object
deflated into the same zlib stream. In effect I implemented exactly
what Shawn claims to have done, but directly in pack-objects.c so
git-repack could be used with existing repositories like the Linux
kernel in order to validate the concept.

Results:

One entry per delta        - 122,103,455
Packed (or grouped) deltas - 108,479,014

Only 11% smaller.

I was really enthusiastic at first when I saw the result you obtained,
but I no longer believe it can be right.

Furthermore, such a pack organization has many disadvantages: it
prevents trivial delta data reuse for incremental push/pull operations,
which is a huge performance penalty (bad for servers), as well as
making random object extraction from such a pack somewhat more complex.
Those disadvantages greatly outweigh the 11% size saving IMHO. A better
way to get such a size saving is to increase the window and depth
parameters. For example, a window of 20 and depth of 20 can usually
provide a pack size saving greater than 11% with none of the
disadvantages mentioned above.

I therefore don't think this idea should be pursued any further.

For completeness you can find my changes to pack-objects below. Beware
that this produces packs that cannot be read back, so back up your
repository first before playing with this.
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 448461b..bfae892 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -227,7 +227,7 @@ static int encode_header(enum object_typ
 	int n = 1;
 	unsigned char c;
 
-	if (type < OBJ_COMMIT || type > OBJ_DELTA)
+	if (type < 0 || type > OBJ_DELTA)
 		die("bad type %d", type);
 
 	c = (type << 4) | (size & 15);
@@ -332,6 +332,8 @@ static unsigned long write_object(struct
 	return hdrlen + datalen;
 }
 
+#if 0
+
 static unsigned long write_one(struct sha1file *f,
 			       struct object_entry *e,
 			       unsigned long offset)
@@ -349,6 +351,107 @@ static unsigned long write_one(struct sh
 	return offset;
 }
 
+#else
+
+static unsigned count_delta_childs(struct object_entry *e, unsigned long o)
+{
+	unsigned count = 0;
+	struct object_entry *child = e->delta_child;
+	while (child) {
+		count += 1 + count_delta_childs(child, o);
+		if (child->offset)
+			die("object already seen???");
+		child->offset = o;
+		child = child->delta_sibling;
+	}
+	return count;
+}
+
+static unsigned long write_delta_headers(struct sha1file *f, struct object_entry *e, unsigned long *size)
+{
+	unsigned long offset = 0;
+	struct object_entry *child = e->delta_child;
+	while (child) {
+		unsigned char header[10];
+		unsigned hdrlen;
+		hdrlen = encode_header(OBJ_DELTA, child->delta_size, header);
+		*size += child->delta_size;
+		sha1write(f, header, hdrlen);
+		sha1write(f, child->delta, 20);
+		offset += hdrlen + 20;
+		offset += write_delta_headers(f, child, size);
+		child = child->delta_sibling;
+	}
+	return offset;
+}
+
+static void * concat_delta_data(struct object_entry *e, void *buf)
+{
+	struct object_entry *child = e->delta_child;
+	while (child) {
+		char type[10];
+		unsigned long size;
+		void *data = read_sha1_file(child->sha1, type, &size);
+		if (!data)
+			die("unable to read %s", sha1_to_hex(child->sha1));
+		data = delta_against(data, size, child);
+		memcpy(buf, data, child->delta_size);
+		buf += child->delta_size;
+		free(data);
+		written++;
+		written_delta++;
+		buf = concat_delta_data(child, buf);
+		child = child->delta_sibling;
+	}
+	return buf;
+}
+
+static unsigned long write_one(struct sha1file *f,
+			       struct object_entry *e,
+			       unsigned long offset)
+{
+	unsigned char header[10];
+	unsigned count, hdrlen;
+	unsigned long size;
+	void *buf, *p;
+
+	if (e->offset)
+		/* offset starts from header size and cannot be zero
+		 * if it is written already.
+		 */
+		return offset;
+	if (!e->delta) {
+		e->offset = offset;
+		return offset + write_object(f, e);
+	}
+
+	/* find delta root object */
+	while (e->delta)
+		e = e->delta;
+
+	/* count how many deltas depend on it */
+	count = count_delta_childs(e, offset);
+
+	/* write header for object group */
+	hdrlen = encode_header(0, count, header);
+	sha1write(f, header, hdrlen);
+	offset += hdrlen;
+
+	/* write each object's header and find total data size */
+	size = 0;
+	offset += write_delta_headers(f, e, &size);
+
+	/* write concatenated object data buffer */
+	buf = xmalloc(size);
+	p = concat_delta_data(e, buf);
+	offset += sha1write_compressed(f, buf, p-buf);
+	free(buf);
+
+	return offset;
+}
+
+#endif
+
 static void write_pack_file(void)
 {
 	int i;


Nicolas
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 4:03 ` Nicolas Pitre @ 2006-08-18 12:53 ` Jon Smirl 2006-08-18 16:30 ` Nicolas Pitre 2006-08-18 13:15 ` Jon Smirl 1 sibling, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-18 12:53 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn Pearce, git [-- Attachment #1: Type: text/plain, Size: 370 bytes --] I attached Shawn's code. He is gone until Monday and can't defend it. Do note that I am running this on the Mozilla CVS which is over 10 years old. Some of the files have over 2,000 deltas. I average 10 deltas per file but the distribution is not at all even. Many files get checked-in and never changed, for example 1000's of images. -- Jon Smirl jonsmirl@gmail.com [-- Attachment #2: fast-import3.c --] [-- Type: text/x-csrc, Size: 31535 bytes --] /* Format of STDIN stream: stream ::= cmd*; cmd ::= new_blob | new_branch | new_commit | new_tag ; new_blob ::= 'blob' lf mark? file_content; file_content ::= data; new_branch ::= 'branch' sp ref_str lf ('from' sp (ref_str | hexsha1 | sha1exp_str | idnum) lf)? lf; new_commit ::= 'commit' sp ref_str lf mark? ('author' sp name '<' email '>' ts tz lf)? 'committer' sp name '<' email '>' ts tz lf commit_msg file_change* lf; commit_msg ::= data; file_change ::= 'M' sp mode sp (hexsha1 | idnum) sp path_str lf | 'D' sp path_str lf ; mode ::= '644' | '755'; new_tag ::= 'tag' sp tag_str lf 'from' sp (ref_str | hexsha1 | sha1exp_str | idnum) lf 'tagger' sp name '<' email '>' ts tz lf tag_msg; tag_msg ::= data; # note: the first idnum in a stream should be 1 and subsequent # idnums should not have gaps between values as this will cause # the stream parser to reserve space for the gapped values. An # idnum can be updated in the future to a new object by issuing # a new mark directive with the old idnum. # mark ::= 'mark' sp idnum lf; # note: declen indicates the length of binary_data in bytes. # declen does not include the lf preceeding or trailing the # binary data. # data ::= 'data' sp declen lf binary_data lf; # note: quoted strings are C-style quoting supporting \c for # common escapes of 'c' (e..g \n, \t, \\, \") or \nnn where nnn # is the signed byte value in octal. Note that the only # characters which must actually be escaped to protect the # stream formatting is: \, " and LF. Otherwise these values # are UTF8. # ref_str ::= ref | '"' quoted(ref) '"' ; sha1exp_str ::= sha1exp | '"' quoted(sha1exp) '"' ; tag_str ::= tag | '"' quoted(tag) '"' ; path_str ::= path | '"' quoted(path) '"' ; declen ::= # unsigned 32 bit value, ascii base10 notation; binary_data ::= # file content, not interpreted; sp ::= # ASCII space character; lf ::= # ASCII newline (LF) character; # note: a colon (':') must precede the numerical value assigned to # an idnum. This is to distinguish it from a ref or tag name as # GIT does not permit ':' in ref or tag strings. # idnum ::= ':' declen; path ::= # GIT style file path, e.g. "a/b/c"; ref ::= # GIT ref name, e.g. "refs/heads/MOZ_GECKO_EXPERIMENT"; tag ::= # GIT tag name, e.g. "FIREFOX_1_5"; sha1exp ::= # Any valid GIT SHA1 expression; hexsha1 ::= # SHA1 in hexadecimal format; # note: name and email are UTF8 strings, however name must not # contain '<' or lf and email must not contain any of the # following: '<', '>', lf. 
# name ::= # valid GIT author/committer name; email ::= # valid GIT author/committer email; ts ::= # time since the epoch in seconds, ascii base10 notation; tz ::= # GIT style timezone; */ #include "builtin.h" #include "cache.h" #include "object.h" #include "blob.h" #include "tree.h" #include "delta.h" #include "pack.h" #include "refs.h" #include "csum-file.h" #include "strbuf.h" #include "quote.h" struct object_entry { struct object_entry *next; enum object_type type; unsigned long offset; unsigned char sha1[20]; }; struct object_entry_pool { struct object_entry_pool *next_pool; struct object_entry *next_free; struct object_entry *end; struct object_entry entries[FLEX_ARRAY]; /* more */ }; struct last_object { void *data; unsigned int len; unsigned int depth; unsigned char sha1[20]; }; struct mem_pool { struct mem_pool *next_pool; char *next_free; char *end; char space[FLEX_ARRAY]; /* more */ }; struct atom_str { struct atom_str *next_atom; int str_len; char str_dat[FLEX_ARRAY]; /* more */ }; struct tree_content; struct tree_entry { struct tree_content *tree; struct atom_str* name; unsigned int mode; unsigned char sha1[20]; }; struct tree_content { unsigned int entry_capacity; /* must match avail_tree_content */ unsigned int entry_count; struct tree_entry *entries[FLEX_ARRAY]; /* more */ }; struct avail_tree_content { unsigned int entry_capacity; /* must match tree_content */ struct avail_tree_content *next_avail; }; struct branch { struct branch *table_next_branch; struct branch *active_next_branch; const char *name; unsigned long last_commit; struct tree_entry branch_tree; unsigned char sha1[20]; }; /* Stats and misc. counters */ static int max_depth = 10; static unsigned long alloc_count; static unsigned long branch_count; static unsigned long object_count; static unsigned long duplicate_count; static unsigned long object_count_by_type[9]; static unsigned long duplicate_count_by_type[9]; static unsigned long stream_cnt; /* Memory pools */ static size_t mem_pool_alloc = 2*1024*1024 - sizeof(struct mem_pool); static size_t total_allocd; static struct mem_pool *mem_pool; /* Atom management */ static unsigned int atom_table_sz = 4451; static unsigned int atom_cnt; static struct atom_str **atom_table; /* The .pack file being generated */ static int pack_fd; static unsigned long pack_offset; static unsigned char pack_sha1[20]; static z_stream zs; static int zs_live; static void* zs_out; static unsigned long zs_outlen; /* Table of objects we've written. 
*/ static unsigned int object_entry_alloc = 1000; static struct object_entry_pool *blocks; static struct object_entry *object_table[1 << 16]; /* Our last blob */ static struct last_object last_blob; /* Tree management */ static unsigned int tree_entry_alloc = 1000; static void *avail_tree_entry; static unsigned int avail_tree_table_sz = 100; static struct avail_tree_content **avail_tree_table; /* Branch data */ static unsigned int max_active_branches = 5; static unsigned int cur_active_branches; static unsigned int branch_table_sz = 1039; static struct branch **branch_table; static struct branch *active_branches; /* Input stream parsing */ static struct strbuf command_buf; static unsigned long command_mark; static void alloc_objects(int cnt) { struct object_entry_pool *b; b = xmalloc(sizeof(struct object_entry_pool) + cnt * sizeof(struct object_entry)); b->next_pool = blocks; b->next_free = b->entries; b->end = b->entries + cnt; blocks = b; alloc_count += cnt; } static struct object_entry* new_object(unsigned char *sha1) { struct object_entry *e; if (blocks->next_free == blocks->end) alloc_objects(object_entry_alloc); e = blocks->next_free++; memcpy(e->sha1, sha1, sizeof(e->sha1)); return e; } static struct object_entry* find_object(unsigned char *sha1) { unsigned int h = sha1[0] << 8 | sha1[1]; struct object_entry *e; for (e = object_table[h]; e; e = e->next) if (!memcmp(sha1, e->sha1, sizeof(e->sha1))) return e; return NULL; } static struct object_entry* insert_object(unsigned char *sha1) { unsigned int h = sha1[0] << 8 | sha1[1]; struct object_entry *e = object_table[h]; struct object_entry *p = NULL; while (e) { if (!memcmp(sha1, e->sha1, sizeof(e->sha1))) return e; p = e; e = e->next; } e = new_object(sha1); e->next = NULL; e->offset = 0; if (p) p->next = e; else object_table[h] = e; return e; } static unsigned int hc_str(const char *s, size_t len) { unsigned int r = 0; while (len-- > 0) r = r * 31 + *s++; return r; } static void* pool_alloc(size_t len) { struct mem_pool *p; void *r; for (p = mem_pool; p; p = p->next_pool) if ((p->end - p->next_free >= len)) break; if (!p) { if (len >= (mem_pool_alloc/2)) { total_allocd += len; return xmalloc(len); } total_allocd += sizeof(struct mem_pool) + mem_pool_alloc; p = xmalloc(sizeof(struct mem_pool) + mem_pool_alloc); p->next_pool = mem_pool; p->next_free = p->space; p->end = p->next_free + mem_pool_alloc; mem_pool = p; } r = p->next_free; p->next_free += len; return r; } static void* pool_calloc(size_t count, size_t size) { size_t len = count * size; void *r = pool_alloc(len); memset(r, 0, len); return r; } static char* pool_strdup(const char *s) { char *r = pool_alloc(strlen(s) + 1); strcpy(r, s); return r; } static struct atom_str* to_atom(const char *s, size_t len) { unsigned int hc = hc_str(s, len) % atom_table_sz; struct atom_str *c; for (c = atom_table[hc]; c; c = c->next_atom) if (c->str_len == len && !strncmp(s, c->str_dat, len)) return c; c = pool_alloc(sizeof(struct atom_str) + len + 1); c->str_len = len; strncpy(c->str_dat, s, len); c->str_dat[len] = 0; c->next_atom = atom_table[hc]; atom_table[hc] = c; atom_cnt++; return c; } static struct branch* lookup_branch(const char *name) { unsigned int hc = hc_str(name, strlen(name)) % branch_table_sz; struct branch *b; for (b = branch_table[hc]; b; b = b->table_next_branch) if (!strcmp(name, b->name)) return b; return NULL; } static struct branch* new_branch(const char *name) { unsigned int hc = hc_str(name, strlen(name)) % branch_table_sz; struct branch* b = lookup_branch(name); if (b) 
die("Invalid attempt to create duplicate branch: %s", name); if (check_ref_format(name)) die("Branch name doesn't conform to GIT standards: %s", name); b = pool_calloc(1, sizeof(struct branch)); b->name = pool_strdup(name); b->table_next_branch = branch_table[hc]; branch_table[hc] = b; branch_count++; return b; } static unsigned int hc_entries(unsigned int cnt) { cnt = cnt & 7 ? (cnt / 8) + 1 : cnt / 8; return cnt < avail_tree_table_sz ? cnt : avail_tree_table_sz - 1; } static struct tree_content* new_tree_content(unsigned int cnt) { struct avail_tree_content *f, *l = NULL; struct tree_content *t; unsigned int hc = hc_entries(cnt); for (f = avail_tree_table[hc]; f; l = f, f = f->next_avail) if (f->entry_capacity >= cnt) break; if (f) { if (l) l->next_avail = f->next_avail; else avail_tree_table[hc] = f->next_avail; } else { cnt = cnt & 7 ? ((cnt / 8) + 1) * 8 : cnt; f = pool_alloc(sizeof(*t) + sizeof(t->entries[0]) * cnt); f->entry_capacity = cnt; } t = (struct tree_content*)f; t->entry_count = 0; return t; } static void release_tree_entry(struct tree_entry *e); static void release_tree_content(struct tree_content *t) { struct avail_tree_content *f = (struct avail_tree_content*)t; unsigned int hc = hc_entries(f->entry_capacity); unsigned int i; for (i = 0; i < t->entry_count; i++) release_tree_entry(t->entries[i]); f->next_avail = avail_tree_table[hc]; avail_tree_table[hc] = f; } static struct tree_content* grow_tree_content( struct tree_content *t, int amt) { struct tree_content *r = new_tree_content(t->entry_count + amt); r->entry_count = t->entry_count; memcpy(r->entries,t->entries,t->entry_count*sizeof(t->entries[0])); release_tree_content(t); return r; } static struct tree_entry* new_tree_entry() { struct tree_entry *e; if (!avail_tree_entry) { unsigned int n = tree_entry_alloc; avail_tree_entry = e = xmalloc(n * sizeof(struct tree_entry)); while (n--) { *((void**)e) = e + 1; e++; } } e = avail_tree_entry; avail_tree_entry = *((void**)e); return e; } static void release_tree_entry(struct tree_entry *e) { if (e->tree) release_tree_content(e->tree); *((void**)e) = avail_tree_entry; avail_tree_entry = e; } static void yread(int fd, void *buffer, size_t length) { ssize_t ret = 0; while (ret < length) { ssize_t size = xread(fd, (char *) buffer + ret, length - ret); if (!size) die("Read from descriptor %i: end of stream", fd); if (size < 0) die("Read from descriptor %i: %s", fd, strerror(errno)); ret += size; } } static void ywrite(int fd, void *buffer, size_t length) { ssize_t ret = 0; while (ret < length) { ssize_t size = xwrite(fd, (char *) buffer + ret, length - ret); if (!size) die("Write to descriptor %i: end of file", fd); if (size < 0) die("Write to descriptor %i: %s", fd, strerror(errno)); ret += size; } } static size_t encode_header( enum object_type type, size_t size, unsigned char *hdr) { int n = 1; unsigned char c; if (type < OBJ_COMMIT || type > OBJ_DELTA) die("bad type %d", type); c = (type << 4) | (size & 15); size >>= 4; while (size) { *hdr++ = c | 0x80; c = size & 0x7f; size >>= 7; n++; } *hdr = c; return n; } static void zs_finish() { if (zs_live) { while (deflate(&zs, Z_FINISH) == Z_OK) { if (zs.total_out) { ywrite(pack_fd, zs_out, zs.total_out); pack_offset += zs.total_out; zs.next_out = zs_out; zs.avail_out = zs_outlen; zs.total_out = 0; } } deflateEnd(&zs); free(zs_out); stream_cnt++; } } static int store_object( enum object_type type, void *dat, size_t datlen, struct last_object *last, unsigned char *sha1out) { void *delta; struct object_entry *e; unsigned char 
hdr[96]; unsigned char sha1[20]; unsigned long hdrlen, deltalen; SHA_CTX c; hdrlen = sprintf((char*)hdr,"%s %lu",type_names[type],datlen) + 1; SHA1_Init(&c); SHA1_Update(&c, hdr, hdrlen); SHA1_Update(&c, dat, datlen); SHA1_Final(sha1, &c); if (sha1out) memcpy(sha1out, sha1, sizeof(sha1)); e = insert_object(sha1); if (e->offset) { duplicate_count++; duplicate_count_by_type[type]++; return 1; } e->type = type; e->offset = pack_offset; object_count++; object_count_by_type[type]++; if (last && last->data && last->depth < max_depth) delta = diff_delta(last->data, last->len, dat, datlen, &deltalen, 0); else { zs_finish(); delta = 0; memset(&zs, 0, sizeof(zs)); deflateInit(&zs, zlib_compression_level); zs_live = 1; zs_outlen = 1024*1024; zs_out = xmalloc(zs_outlen); zs.next_out = zs_out; zs.avail_out = zs_outlen; } if (delta) { last->depth++; zs.next_in = delta; zs.avail_in = deltalen; hdrlen = encode_header(OBJ_DELTA, deltalen, hdr); ywrite(pack_fd, hdr, hdrlen); ywrite(pack_fd, last->sha1, sizeof(sha1)); pack_offset += hdrlen + sizeof(sha1); } else { if (last) last->depth = 0; zs.next_in = dat; zs.avail_in = datlen; hdrlen = encode_header(type, datlen, hdr); ywrite(pack_fd, hdr, hdrlen); pack_offset += hdrlen; } while (deflate(&zs, Z_NO_FLUSH) == Z_OK) { if (zs.total_out) { ywrite(pack_fd, zs_out, zs.total_out); pack_offset += zs.total_out; zs.next_out = zs_out; zs.avail_out = zs_outlen; zs.total_out = 0; } } if (delta) free(delta); if (last) { if (last->data) free(last->data); last->data = dat; last->len = datlen; memcpy(last->sha1, sha1, sizeof(sha1)); } return 0; } static const char *get_mode(const char *str, unsigned int *modep) { unsigned char c; unsigned int mode = 0; while ((c = *str++) != ' ') { if (c < '0' || c > '7') return NULL; mode = (mode << 3) + (c - '0'); } *modep = mode; return str; } static void load_tree(struct tree_entry *root) { struct object_entry *myoe; struct tree_content *t; unsigned long size; char *buf; const char *c; char type[20]; root->tree = t = new_tree_content(8); if (!memcmp(root->sha1, null_sha1, 20)) return; myoe = find_object(root->sha1); if (myoe) { die("FIXME"); } else { buf = read_sha1_file(root->sha1, type, &size); if (!buf || strcmp(type, tree_type)) die("Can't load existing tree %s", sha1_to_hex(root->sha1)); } c = buf; while (c != (buf + size)) { struct tree_entry *e = new_tree_entry(); if (t->entry_count == t->entry_capacity) root->tree = t = grow_tree_content(t, 8); t->entries[t->entry_count++] = e; e->tree = NULL; c = get_mode(c, &e->mode); if (!c) die("Corrupt mode in %s", sha1_to_hex(root->sha1)); e->name = to_atom(c, strlen(c)); c += e->name->str_len + 1; memcpy(e->sha1, c, sizeof(e->sha1)); c += 20; } free(buf); } static int tecmp (const void *_a, const void *_b) { struct tree_entry *a = *((struct tree_entry**)_a); struct tree_entry *b = *((struct tree_entry**)_b); return base_name_compare( a->name->str_dat, a->name->str_len, a->mode, b->name->str_dat, b->name->str_len, b->mode); } static void store_tree(struct tree_entry *root) { struct tree_content *t = root->tree; unsigned int i; size_t maxlen; char *buf, *c; if (memcmp(root->sha1, null_sha1, 20)) return; maxlen = 0; for (i = 0; i < t->entry_count; i++) { maxlen += t->entries[i]->name->str_len + 34; if (t->entries[i]->tree) store_tree(t->entries[i]); } qsort(t->entries, t->entry_count, sizeof(t->entries[0]), tecmp); buf = c = xmalloc(maxlen); for (i = 0; i < t->entry_count; i++) { struct tree_entry *e = t->entries[i]; c += sprintf(c, "%o", e->mode); *c++ = ' '; strcpy(c, e->name->str_dat); c 
+= e->name->str_len + 1; memcpy(c, e->sha1, 20); c += 20; } store_object(OBJ_TREE, buf, c - buf, NULL, root->sha1); free(buf); } static int tree_content_set( struct tree_entry *root, const char *p, const unsigned char *sha1, const unsigned int mode) { struct tree_content *t = root->tree; const char *slash1; unsigned int i, n; struct tree_entry *e; slash1 = strchr(p, '/'); if (slash1) n = slash1 - p; else n = strlen(p); for (i = 0; i < t->entry_count; i++) { e = t->entries[i]; if (e->name->str_len == n && !strncmp(p, e->name->str_dat, n)) { if (!slash1) { if (e->mode == mode && !memcmp(e->sha1, sha1, 20)) return 0; e->mode = mode; memcpy(e->sha1, sha1, 20); if (e->tree) { release_tree_content(e->tree); e->tree = NULL; } memcpy(root->sha1, null_sha1, 20); return 1; } if (!S_ISDIR(e->mode)) { e->tree = new_tree_content(8); e->mode = S_IFDIR; } if (!e->tree) load_tree(e); if (tree_content_set(e, slash1 + 1, sha1, mode)) { memcpy(root->sha1, null_sha1, 20); return 1; } return 0; } } if (t->entry_count == t->entry_capacity) root->tree = t = grow_tree_content(t, 8); e = new_tree_entry(); e->name = to_atom(p, n); t->entries[t->entry_count++] = e; if (slash1) { e->tree = new_tree_content(8); e->mode = S_IFDIR; tree_content_set(e, slash1 + 1, sha1, mode); } else { e->tree = NULL; e->mode = mode; memcpy(e->sha1, sha1, 20); } memcpy(root->sha1, null_sha1, 20); return 1; } static int tree_content_remove(struct tree_entry *root, const char *p) { struct tree_content *t = root->tree; const char *slash1; unsigned int i, n; struct tree_entry *e; slash1 = strchr(p, '/'); if (slash1) n = slash1 - p; else n = strlen(p); for (i = 0; i < t->entry_count; i++) { e = t->entries[i]; if (e->name->str_len == n && !strncmp(p, e->name->str_dat, n)) { if (!slash1 || !S_ISDIR(e->mode)) goto del_entry; if (!e->tree) load_tree(e); if (tree_content_remove(e, slash1 + 1)) { if (!e->tree->entry_count) goto del_entry; memcpy(root->sha1, null_sha1, 20); return 1; } return 0; } } return 0; del_entry: for (i++; i < t->entry_count; i++) t->entries[i-1] = t->entries[i]; t->entry_count--; release_tree_entry(e); memcpy(root->sha1, null_sha1, 20); return 1; } static void init_pack_header() { const char* magic = "PACK"; unsigned long version = 3; unsigned long zero = 0; version = htonl(version); ywrite(pack_fd, (char*)magic, 4); ywrite(pack_fd, &version, 4); ywrite(pack_fd, &zero, 4); pack_offset = 4 * 3; } static void fixup_header_footer() { SHA_CTX c; char hdr[8]; unsigned long cnt; char *buf; size_t n; zs_finish(); if (lseek(pack_fd, 0, SEEK_SET) != 0) die("Failed seeking to start: %s", strerror(errno)); SHA1_Init(&c); yread(pack_fd, hdr, 8); SHA1_Update(&c, hdr, 8); cnt = htonl(object_count); SHA1_Update(&c, &cnt, 4); ywrite(pack_fd, &cnt, 4); buf = xmalloc(128 * 1024); for (;;) { n = xread(pack_fd, buf, 128 * 1024); if (n <= 0) break; SHA1_Update(&c, buf, n); } free(buf); SHA1_Final(pack_sha1, &c); ywrite(pack_fd, pack_sha1, sizeof(pack_sha1)); } static int oecmp (const void *_a, const void *_b) { struct object_entry *a = *((struct object_entry**)_a); struct object_entry *b = *((struct object_entry**)_b); return memcmp(a->sha1, b->sha1, sizeof(a->sha1)); } static void write_index(const char *idx_name) { struct sha1file *f; struct object_entry **idx, **c, **last; struct object_entry *e; struct object_entry_pool *o; unsigned int array[256]; int i; /* Build the sorted table of object IDs. 
*/ idx = xmalloc(object_count * sizeof(struct object_entry*)); c = idx; for (o = blocks; o; o = o->next_pool) for (e = o->entries; e != o->next_free; e++) *c++ = e; last = idx + object_count; qsort(idx, object_count, sizeof(struct object_entry*), oecmp); /* Generate the fan-out array. */ c = idx; for (i = 0; i < 256; i++) { struct object_entry **next = c;; while (next < last) { if ((*next)->sha1[0] != i) break; next++; } array[i] = htonl(next - idx); c = next; } f = sha1create("%s", idx_name); sha1write(f, array, 256 * sizeof(int)); for (c = idx; c != last; c++) { unsigned int offset = htonl((*c)->offset); sha1write(f, &offset, 4); sha1write(f, (*c)->sha1, sizeof((*c)->sha1)); } sha1write(f, pack_sha1, sizeof(pack_sha1)); sha1close(f, NULL, 1); free(idx); } static void dump_branches() { static const char *msg = "fast-import"; unsigned int i; struct branch *b; struct ref_lock *lock; for (i = 0; i < branch_table_sz; i++) { for (b = branch_table[i]; b; b = b->table_next_branch) { lock = lock_any_ref_for_update(b->name, NULL, 0); if (!lock || write_ref_sha1(lock, b->sha1, msg) < 0) die("Can't write %s", b->name); } } } static void read_next_command() { read_line(&command_buf, stdin, '\n'); } static void cmd_mark() { if (!strncmp("mark :", command_buf.buf, 6)) { command_mark = strtoul(command_buf.buf + 6, NULL, 10); read_next_command(); } else command_mark = 0; } static void* cmd_data (size_t *size) { size_t n = 0; void *buffer; size_t length; if (strncmp("data ", command_buf.buf, 5)) die("Expected 'data n' command, found: %s", command_buf.buf); length = strtoul(command_buf.buf + 5, NULL, 10); buffer = xmalloc(length); while (n < length) { size_t s = fread((char*)buffer + n, 1, length - n, stdin); if (!s && feof(stdin)) die("EOF in data (%lu bytes remaining)", length - n); n += s; } if (fgetc(stdin) != '\n') die("An lf did not trail the binary data as expected."); *size = length; return buffer; } static void cmd_new_blob() { size_t datlen; void *dat; unsigned char sha1[20]; read_next_command(); cmd_mark(); dat = cmd_data(&datlen); if (store_object(OBJ_BLOB, dat, datlen, &last_blob, sha1)) free(dat); } static void unload_one_branch() { while (cur_active_branches >= max_active_branches) { unsigned long min_commit = ULONG_MAX; struct branch *e, *l = NULL, *p = NULL; for (e = active_branches; e; e = e->active_next_branch) { if (e->last_commit < min_commit) { p = l; min_commit = e->last_commit; } l = e; } if (p) { e = p->active_next_branch; p->active_next_branch = e->active_next_branch; } else { e = active_branches; active_branches = e->active_next_branch; } e->active_next_branch = NULL; if (e->branch_tree.tree) { release_tree_content(e->branch_tree.tree); e->branch_tree.tree = NULL; } cur_active_branches--; } } static void load_branch(struct branch *b) { load_tree(&b->branch_tree); b->active_next_branch = active_branches; active_branches = b; cur_active_branches++; } static void file_change_m(struct branch *b) { const char *p = command_buf.buf + 2; char *p_uq; const char *endp; struct object_entry *oe; unsigned char sha1[20]; unsigned int mode; char type[20]; p = get_mode(p, &mode); if (!p) die("Corrupt mode: %s", command_buf.buf); switch (mode) { case S_IFREG | 0644: case S_IFREG | 0755: case 0644: case 0755: /* ok */ break; default: die("Corrupt mode: %s", command_buf.buf); } if (get_sha1_hex(p, sha1)) die("Invalid SHA1: %s", command_buf.buf); p += 40; if (*p++ != ' ') die("Missing space after SHA1: %s", command_buf.buf); p_uq = unquote_c_style(p, &endp); if (p_uq) { if (*endp) die("Garbage after 
path in: %s", command_buf.buf); p = p_uq; } oe = find_object(sha1); if (oe) { if (oe->type != OBJ_BLOB) die("Not a blob (actually a %s): %s", command_buf.buf, type_names[oe->type]); } else { if (sha1_object_info(sha1, type, NULL)) die("Blob not found: %s", command_buf.buf); if (strcmp(blob_type, type)) die("Not a blob (actually a %s): %s", command_buf.buf, type); } tree_content_set(&b->branch_tree, p, sha1, S_IFREG | mode); if (p_uq) free(p_uq); } static void file_change_d(struct branch *b) { const char *p = command_buf.buf + 2; char *p_uq; const char *endp; p_uq = unquote_c_style(p, &endp); if (p_uq) { if (*endp) die("Garbage after path in: %s", command_buf.buf); p = p_uq; } tree_content_remove(&b->branch_tree, p); if (p_uq) free(p_uq); } static void cmd_new_commit() { struct branch *b; void *msg; size_t msglen; char *str_uq; const char *endp; char *sp; char *author = NULL; char *committer = NULL; char *body; /* Obtain the branch name from the rest of our command */ sp = strchr(command_buf.buf, ' ') + 1; str_uq = unquote_c_style(sp, &endp); if (str_uq) { if (*endp) die("Garbage after ref in: %s", command_buf.buf); sp = str_uq; } b = lookup_branch(sp); if (!b) die("Branch not declared: %s", sp); if (str_uq) free(str_uq); read_next_command(); cmd_mark(); if (!strncmp("author ", command_buf.buf, 7)) { author = strdup(command_buf.buf); read_next_command(); } if (!strncmp("committer ", command_buf.buf, 10)) { committer = strdup(command_buf.buf); read_next_command(); } if (!committer) die("Expected committer but didn't get one"); msg = cmd_data(&msglen); /* ensure the branch is active/loaded */ if (!b->branch_tree.tree) { unload_one_branch(); load_branch(b); } /* file_change* */ for (;;) { read_next_command(); if (1 == command_buf.len) break; else if (!strncmp("M ", command_buf.buf, 2)) file_change_m(b); else if (!strncmp("D ", command_buf.buf, 2)) file_change_d(b); else die("Unsupported file_change: %s", command_buf.buf); } /* build the tree and the commit */ store_tree(&b->branch_tree); body = xmalloc(97 + msglen + (author ? strlen(author) + strlen(committer) : 2 * strlen(committer))); sp = body; sp += sprintf(sp, "tree %s\n", sha1_to_hex(b->branch_tree.sha1)); if (memcmp(b->sha1, null_sha1, 20)) sp += sprintf(sp, "parent %s\n", sha1_to_hex(b->sha1)); if (author) sp += sprintf(sp, "%s\n", author); else sp += sprintf(sp, "author %s\n", committer + 10); sp += sprintf(sp, "%s\n\n", committer); memcpy(sp, msg, msglen); sp += msglen; if (author) free(author); free(committer); free(msg); store_object(OBJ_COMMIT, body, sp - body, NULL, b->sha1); free(body); b->last_commit = object_count_by_type[OBJ_COMMIT]; } static void cmd_new_branch() { struct branch *b; char *str_uq; const char *endp; char *sp; /* Obtain the new branch name from the rest of our command */ sp = strchr(command_buf.buf, ' ') + 1; str_uq = unquote_c_style(sp, &endp); if (str_uq) { if (*endp) die("Garbage after ref in: %s", command_buf.buf); sp = str_uq; } b = new_branch(sp); if (str_uq) free(str_uq); read_next_command(); /* from ... 
*/ if (!strncmp("from ", command_buf.buf, 5)) { const char *from; struct branch *s; from = strchr(command_buf.buf, ' ') + 1; str_uq = unquote_c_style(from, &endp); if (str_uq) { if (*endp) die("Garbage after string in: %s", command_buf.buf); from = str_uq; } s = lookup_branch(from); if (b == s) die("Can't create a branch from itself: %s", b->name); else if (s) { memcpy(b->sha1, s->sha1, 20); memcpy(b->branch_tree.sha1, s->branch_tree.sha1, 20); } else if (!get_sha1(from, b->sha1)) { if (!memcmp(b->sha1, null_sha1, 20)) memcpy(b->branch_tree.sha1, null_sha1, 20); else { unsigned long size; char *buf; buf = read_object_with_reference(b->sha1, type_names[OBJ_COMMIT], &size, b->sha1); if (!buf || size < 46) die("Not a valid commit: %s", from); if (memcmp("tree ", buf, 5) || get_sha1_hex(buf + 5, b->branch_tree.sha1)) die("The commit %s is corrupt", sha1_to_hex(b->sha1)); free(buf); } } else die("Invalid ref name or SHA1 expression: %s", from); if (str_uq) free(str_uq); read_next_command(); } else { memcpy(b->sha1, null_sha1, 20); memcpy(b->branch_tree.sha1, null_sha1, 20); } if (command_buf.eof || command_buf.len > 1) die("An lf did not terminate the branch command as expected."); } int main(int argc, const char **argv) { const char *base_name = argv[1]; int est_obj_cnt = atoi(argv[2]); char *pack_name; char *idx_name; struct stat sb; setup_ident(); git_config(git_default_config); pack_name = xmalloc(strlen(base_name) + 6); sprintf(pack_name, "%s.pack", base_name); idx_name = xmalloc(strlen(base_name) + 5); sprintf(idx_name, "%s.idx", base_name); pack_fd = open(pack_name, O_RDWR|O_CREAT|O_EXCL, 0666); if (pack_fd < 0) die("Can't create %s: %s", pack_name, strerror(errno)); init_pack_header(); alloc_objects(est_obj_cnt); strbuf_init(&command_buf); atom_table = xcalloc(atom_table_sz, sizeof(struct atom_str*)); branch_table = xcalloc(branch_table_sz, sizeof(struct branch*)); avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*)); for (;;) { read_next_command(); if (command_buf.eof) break; else if (!strcmp("blob", command_buf.buf)) cmd_new_blob(); else if (!strncmp("branch ", command_buf.buf, 7)) cmd_new_branch(); else if (!strncmp("commit ", command_buf.buf, 7)) cmd_new_commit(); else die("Unsupported command: %s", command_buf.buf); } fixup_header_footer(); close(pack_fd); write_index(idx_name); dump_branches(); fprintf(stderr, "%s statistics:\n", argv[0]); fprintf(stderr, "---------------------------------------------------\n"); fprintf(stderr, "Alloc'd objects: %10lu (%10lu overflow )\n", alloc_count, alloc_count - est_obj_cnt); fprintf(stderr, "Total objects: %10lu (%10lu duplicates)\n", object_count, duplicate_count); fprintf(stderr, " blobs : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_BLOB], duplicate_count_by_type[OBJ_BLOB]); fprintf(stderr, " trees : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_TREE], duplicate_count_by_type[OBJ_TREE]); fprintf(stderr, " commits: %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_COMMIT], duplicate_count_by_type[OBJ_COMMIT]); fprintf(stderr, " tags : %10lu (%10lu duplicates)\n", object_count_by_type[OBJ_TAG], duplicate_count_by_type[OBJ_TAG]); fprintf(stderr, "Total streams: %10u\n", stream_cnt); fprintf(stderr, "Total branches: %10lu\n", branch_count); fprintf(stderr, "Total atoms: %10u\n", atom_cnt); fprintf(stderr, "Memory total: %10lu KiB\n", (total_allocd + alloc_count*sizeof(struct object_entry))/1024); fprintf(stderr, " pools: %10lu KiB\n", total_allocd/1024); fprintf(stderr, " objects: %10lu 
KiB\n", (alloc_count*sizeof(struct object_entry))/1024); fprintf(stderr, "---------------------------------------------------\n"); stat(pack_name, &sb); fprintf(stderr, "Pack size: %10lu KiB\n", (unsigned long)(sb.st_size/1024)); stat(idx_name, &sb); fprintf(stderr, "Index size: %10lu KiB\n", (unsigned long)(sb.st_size/1024)); fprintf(stderr, "\n"); return 0; } ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 12:53 ` Jon Smirl @ 2006-08-18 16:30 ` Nicolas Pitre 2006-08-18 16:56 ` Jon Smirl 0 siblings, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-18 16:30 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Fri, 18 Aug 2006, Jon Smirl wrote: > I attached Shawn's code. He is gone until Monday and can't defend it. I will have a look at it next week as I'll be gone for the weekend as well. > Do note that I am running this on the Mozilla CVS which is over 10 > years old. Some of the files have over 2,000 deltas. I average 10 > deltas per file but the distribution is not at all even. Many files > get checked-in and never changed, for example 1000's of images. This is IMHO more evidence that something is wrong with the results you obtained. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 16:30 ` Nicolas Pitre @ 2006-08-18 16:56 ` Jon Smirl 2006-08-21 3:45 ` Nicolas Pitre 0 siblings, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-18 16:56 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn Pearce, git On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > I attached Shawn's code. He is gone until Monday and can't defend it. > > I will have a look at it next week as I'll be gone for the weekend as > well. I looked at it some and couldn't see anything obviously wrong with it, but it wasn't a detailed inspection. As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB. The 295MB git pack number does not have commits and trees in it, it is revisions only. It is quite possible that even when the trees and commits are added to the 295MB number, it still won't end up being near the 541MB number. We already know xdiff is better at generating diffs than CVS diff. And I still have to add 250K log entries and trees. Those are going to add at least 100MB. I will get back to work on the import code over the weekend. One of my 2 year olds gave me a cold so I have been useless the last couple of days. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 16:56 ` Jon Smirl @ 2006-08-21 3:45 ` Nicolas Pitre 2006-08-21 6:46 ` Shawn Pearce 0 siblings, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 3:45 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Fri, 18 Aug 2006, Jon Smirl wrote: > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > > > I attached Shawn's code. He is gone until Monday and can't defend it. > > > > I will have a look at it next week as I'll be gone for the weekend as > > well. > > I looked at it some and couldn't see anything obviously wrong with it, > but it wasn't a detailed inspection. I looked at it too and the code looks OK. This doesn't mean there is no problem at a higher level though. The deltification process is extremely crude and I think this is the cause of the original pack size. For example, last April we discovered that a small change in the heuristics to determine base delta objects in git-pack-objects could create a pack size regression of up to 4x the size of the same pack created before such change. It is also possible for a denser delta stream, once deflated, to end up larger than a less dense one. Just to say that many tweaks and heuristics have been implemented and studied in git-pack-objects for over a year now in order to get the really small packs we have today. And a really subtle and innocent-looking change can break it size-wise. So what I think is happening with the fastimport code is that the delta selection is not really good. It is certainly much better than no delta at all but still not optimal, which smells like deja vu to me. Then by deflating them all together the redundant information that the bad delta set still carries along is eliminated -- thanks to zlib sort of mitigating the real issue. But... as my recent experiments show, the grouping of related deltas into a single zlib stream doesn't produce significant improvements when implemented directly into git-pack-objects. Certainly not worth the inconveniences and costs it brings along. I even think that if you used git-repack -a -f on the pack produced by the import process, with each delta deflated individually just like it did originally, then the repacked pack would _also_ shrink significantly. Most probably around 4x just like you observed with the grouping of deltas in the same zlib stream. Not only would git-repack make it much smaller, but it also provides a much better layout where all objects for recent commits are stored together at the beginning of the pack. The fastimport code is instead storing them scattered all over the pack for every commit by making all revisions of each file next to each other, which will cause horrible access patterns and really bad IO. So I think that trying to make fastimport too clever is wrong. It should instead focus on creating an initial pack as fast as possible and then rely on a final git-repack pass to produce the shrunken pack. I really doubt the import code could ever do a better job than git-pack-objects does. If I can make a suggestion, you should forget about this multiple deltas in one zlib stream for now and focus on making the import process work all the way to tree and commit objects instead. Then, and only then, if git-repack -a -f doesn't produce a satisfactory pack size, we could look at better pack encoding. 
And so far the grouping of related deltas in one zlib stream is _not_ a better encoding given the rather small improvement over unmodified git-pack-objects vs the inconveniences and cost it brings with it. > As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB. > The 295MB git pack number does not have commits and trees in it, it is > revisions only. Running git-repack -a -f from a recent GIT on the Mozilla repo converted through cvsps and friends produces a pack smaller than 500MB. I even brought it down to 430MB by using non-default delta window and depth. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 3:45 ` Nicolas Pitre @ 2006-08-21 6:46 ` Shawn Pearce 2006-08-21 10:24 ` Jakub Narebski 2006-08-21 16:23 ` Jon Smirl 0 siblings, 2 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-21 6:46 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jon Smirl, git Nicolas Pitre <nico@cam.org> wrote: > I looked at it too and the code looks OK. Based on the stats Jon provided me it appeared as though the code skipped about 9% of the objects, and at the time I thought I new why but looking at it again right now I'm not sure why. :-) Assuming the objects were roughly the same size (probably a bad assumption, but lets use it anyway) we're talking about approx. 9% increase, bringing the size to around 360 MB. Since that is still only blobs and lacks trees&commits, and you were able to get a pack from the earlier import attempts down to 430 MB (including trees&commits) everything you pointed out with regards to delta base selection not being optimal is probably the major culprit here. > So I think that trying to make fastimport too clever is wrong. It > should instead focus on creating an initial pack as fast as possible and > then rely on a final git-repack pass to produce the shrinked pack. I > really doubt the import code could ever make a better job than > git-pack-objects does. I'm mixed on that. My goal with fast-import was to construct a pack as cheaply as possible in such a way that IO is relatively minimized during pack construction and later during an initial repacking, as well as to avoid all fork'ing costs associated with many update-index and repack calls. fast-import is assuming the front end driving it is being somewhat nice by feeding revisions in newest-oldest order, one file at a time. If this assumption is false it will produce a pack but one that is so large the IO to repack it will be horrible (but probably would still be better than if these were all loose objects). fast-import is also assuming that pack construction speed and memory usage required to build that pack are more important than subsequent reads of the resulting pack. Why? Because I fully expected the pack to undergo a `git-repack -a -d -f` with a rather large delta depth before it really got used by clients which cared about performance or disk space. :-) Whether or not git-repack can handle such a large pack is another story. Apparently it can already take a ~600 MB pack and do something useful to it in a reasonable time span so I'm putting any limits that might be in there off for now. But the more I work with Jon on this Mozilla import process the more I'm realizing that: - fast-import.c's other name is "GIT in 1325 lines of C code, all of which more or less already exists elsewhere in a form that wasn't useful in this context without refactoring it in major ways so I got lazy and rewrote it in another equally useless way"; - the strategy of how I'm slamming a very large number of objects into a pack may be useful in situations other than a foreign SCM import process. I can see someone wanting to create a large commit with a lot of modified objects. Doing this with update-index and write-tree into loose objects would be far slower than just generating a new pack if the number of objects you are writing exceeds about 100 on Windows or ~1k on UNIX; - Jon has lots of CPU time to spare and possibly even some memory. fast-import is barely using CPU and barely uses memory (its quite lightweight!) 
but its definately producing a suboptimal pack as a result. Just doing slightly better delta selection before committing an object to the output pack may prevent the need to use -f during git-repack and produce something almost as optimal. Right now fast-import is only lacking the following major features, at which point it *should* be able to process the entire Mozilla repository, assuming Jon can get his symbol closing problem in cvs2svn solved: - reload a branch which has been swapped out (requires reading the pack that is currently being constructed back into memory, which since its not really a valid pack yet and lacks an index reusing the code in sha1_file.c is slightly interesting); - generate a tag object into the pack; - track association ID numbers to SHA1 IDs (needed to properly create a branch from a prior commit or a tag from a prior commit as the prior commit's SHA ID isn't known to the front end); Working on the multiple objects in one zlib stream code over the weekend also showed me how to use sha1_file.c to perform the first of the three (I think) and the remaining two are trivial. So my weekend wasn't a total loss. :-) I plan on getting that code finished up in the next day or so and get a new draft of it off to Jon. Maybe by then he'll also have something coming out of cvs2svn's pass 9. > > As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB. > > The 295MB git pack number does not have commits and trees in it, it is > > revisions only. > > Running git-repack -a -f from a recent GIT on the Mozilla repo converted > through cvsps and friends produces a pack smaller than 500MB. I even > brought it down to 430MB by using non default delta window and depth. I'd still like to find a way to beat that 430 MB current best-yet-still-valid size. :-) I think we'll get there. Shaving off 4-5% with a pack specific dictionary may be worthwhile on such large packs, especially ones that are considered "historical" and probably will never get repacked again. Folding objects into a single zlib stream may also be worth it, but for now I'm going to wait and see what mileage we can get from the existing repack code and the existing pack file format. -- Shawn. ^ permalink raw reply [flat|nested] 36+ messages in thread
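The third item on Shawn's list, tracking association ID numbers to SHA1 IDs, only needs a small lookup table keyed by the frontend's mark number. A hypothetical sketch in the same chained-bucket style fast-import already uses elsewhere; this is not Shawn's code, the names are made up, and xmalloc() is git's allocation wrapper:

    struct mark_entry {
        struct mark_entry *next;
        unsigned long mark;
        unsigned char sha1[20];
    };

    #define MARK_BUCKETS 1024
    static struct mark_entry *mark_table[MARK_BUCKETS];

    static void insert_mark(unsigned long mark, const unsigned char *sha1)
    {
        struct mark_entry *e = xmalloc(sizeof(struct mark_entry));
        e->mark = mark;
        memcpy(e->sha1, sha1, 20);
        e->next = mark_table[mark % MARK_BUCKETS];
        mark_table[mark % MARK_BUCKETS] = e;
    }

    static const unsigned char *lookup_mark(unsigned long mark)
    {
        struct mark_entry *e;
        for (e = mark_table[mark % MARK_BUCKETS]; e; e = e->next)
            if (e->mark == mark)
                return e->sha1;
        return NULL;
    }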
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 6:46 ` Shawn Pearce @ 2006-08-21 10:24 ` Jakub Narebski 2006-08-21 16:23 ` Jon Smirl 1 sibling, 0 replies; 36+ messages in thread From: Jakub Narebski @ 2006-08-21 10:24 UTC (permalink / raw) To: git Shawn Pearce wrote: > - the strategy of how I'm slamming a very large number of objects > into a pack may be useful in situations other than a foreign > SCM import process. I can see someone wanting to create a > large commit with a lot of modified objects. Doing this with > update-index and write-tree into loose objects would be far > slower than just generating a new pack if the number of objects > you are writing exceeds about 100 on Windows or ~1k on UNIX; Like e.g. an initial import of a large project, or incorporating some subproject into a project (e.g. gitk and gitweb in git)? -- Jakub Narebski Warsaw, Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 6:46 ` Shawn Pearce 2006-08-21 10:24 ` Jakub Narebski @ 2006-08-21 16:23 ` Jon Smirl 1 sibling, 0 replies; 36+ messages in thread From: Jon Smirl @ 2006-08-21 16:23 UTC (permalink / raw) To: Shawn Pearce; +Cc: Nicolas Pitre, git On 8/21/06, Shawn Pearce <spearce@spearce.org> wrote: > Based on the stats Jon provided me it appeared as though the code > skipped about 9% of the objects, and at the time I thought I new > why but looking at it again right now I'm not sure why. :-) 9% of the objects may be duplicates because of branching. I have no way of detecting them since I am not maintaining a map of sha1s encountered. The way CVS branching works some revisions are built using forward diffs and some reverse, I can't tell when they will end up with the same result. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 4:03 ` Nicolas Pitre 2006-08-18 12:53 ` Jon Smirl @ 2006-08-18 13:15 ` Jon Smirl 2006-08-18 13:36 ` Johannes Schindelin 2006-08-18 16:25 ` Nicolas Pitre 1 sibling, 2 replies; 36+ messages in thread From: Jon Smirl @ 2006-08-18 13:15 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn Pearce, git On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > A better way to get such a size saving is to increase the window and > depth parameters. For example, a window of 20 and depth of 20 can > usually provide a pack size saving greater than 11% with none of the > disadvantages mentioned above. Our window size is effectively infinite. I am handing him all of the revisions from a single file in optimal order. This includes branches. He takes these revisions, runs xdiff on them, and then puts the entire result into a single zlib blob. I suspect the size reduction is directly proportional to the age of the repository. The kernel repository only has three years' worth of data in it. Linus has the full history in another repository that is not in general distribution. We can get it from him when he gets back from vacation. If the repository doesn't contain long delta chains, the optimization doesn't help that much. On the other hand it doesn't hurt either since the chains weren't long. My repository is four times as old as the kernel one and I am getting 4x the benefit. This is a good format for archival data that is infrequently accessed. That is why I proposed a two-pack system. One pack would contain the archival data and be highly optimized for size, and the second pack would contain recent changes and be optimized for speed. The size optimization is important for controlling bandwidth costs. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
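To make the grouping concrete: the idea is to run one deflate stream across an entire run of deltas so that later deltas reuse the match window built up by earlier ones, rather than paying for a fresh zlib dictionary per delta. A minimal sketch of that idea follows; it is not Shawn's actual code, the function name is made up, and error handling and output-buffer growth are omitted for brevity.

    #include <string.h>
    #include <zlib.h>

    /* Deflate a run of related deltas as one zlib stream.  Assumes
     * 'out' is large enough to hold the whole compressed result. */
    static unsigned long deflate_delta_run(unsigned char *out, unsigned long outlen,
                                           unsigned char **delta, unsigned long *size,
                                           int count)
    {
        z_stream s;
        int i;

        memset(&s, 0, sizeof(s));
        deflateInit(&s, Z_DEFAULT_COMPRESSION);
        s.next_out = out;
        s.avail_out = outlen;
        for (i = 0; i < count; i++) {
            s.next_in = delta[i];
            s.avail_in = size[i];
            /* Z_NO_FLUSH keeps one compression window alive across all deltas */
            deflate(&s, i + 1 < count ? Z_NO_FLUSH : Z_FINISH);
        }
        deflateEnd(&s);
        return s.total_out;
    }

Reading delta k back then means inflating the stream up to and including entry k, which is why the thread notes there is little extra extraction cost when several entries from the run are needed anyway, but it is also what makes delta reuse during later repacks awkward.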
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 13:15 ` Jon Smirl @ 2006-08-18 13:36 ` Johannes Schindelin 2006-08-18 13:50 ` Jon Smirl 2006-08-18 16:25 ` Nicolas Pitre 1 sibling, 1 reply; 36+ messages in thread From: Johannes Schindelin @ 2006-08-18 13:36 UTC (permalink / raw) To: Jon Smirl; +Cc: Nicolas Pitre, Shawn Pearce, git Hi, On Fri, 18 Aug 2006, Jon Smirl wrote: > I suspect the size reduction is directly proportional to the age of > the repository. The kernel repository only has three years worth of > data in it. Linus has the full history in another repository that is > not in general distribution. We can get it from him when he gets back > from vacation. Maybe you mean http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git Ciao, Dscho ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 13:36 ` Johannes Schindelin @ 2006-08-18 13:50 ` Jon Smirl 2006-08-19 19:25 ` Linus Torvalds 0 siblings, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-18 13:50 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Nicolas Pitre, Shawn Pearce, git On 8/18/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > Hi, > > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > I suspect the size reduction is directly proportional to the age of > > the repository. The kernel repository only has three years worth of > > data in it. Linus has the full history in another repository that is > > not in general distribution. We can get it from him when he gets back > > from vacation. > > Maybe you mean > > http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git That one only goes to 2002, the full one goes back to around 1990. > > Ciao, > Dscho > > -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 13:50 ` Jon Smirl @ 2006-08-19 19:25 ` Linus Torvalds 0 siblings, 0 replies; 36+ messages in thread From: Linus Torvalds @ 2006-08-19 19:25 UTC (permalink / raw) To: Jon Smirl; +Cc: Johannes Schindelin, Nicolas Pitre, Shawn Pearce, git On Fri, 18 Aug 2006, Jon Smirl wrote: > On 8/18/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi, > > > > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > > > I suspect the size reduction is directly proportional to the age of > > > the repository. The kernel repository only has three years worth of > > > data in it. Linus has the full history in another repository that is > > > not in general distribution. We can get it from him when he gets back > > > from vacation. > > > > Maybe you mean > > > > http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/tglx/history.git > > That one only goes to 2002, the full one goes back to around 1990. I don't actually have such a "full" history. It would be wonderful if somebody took the time to try to piece such a thing together (and git actually makes that a _lot_ easier than some other SCM's, because you can just import random versions in any order, and then re-stich just the commit history when you add a new thing in the middle, without generating any new trees or deltas or anything strange at all). But it's a lot of work. I tried to do a "Linux-Historic" archive about a year ago (and imported some of the old kernels I had), but I gave up, just because it was such a pain to try to do a good job and try to find old release notes etc to import into the changelogs etc. Oh, well. So the only "old" history I have is indeed that BK conversion by Thomas Gleixner. Any pre-BK stuff only exists as patches and tar-balls on various ftp sites (and I don't have any magic repository of my own, so everybody else can do exactly as well as I could, with possibly the exception that I might remember some random details about some old release history - but considering my memory, that's pretty unlikely too. Google is your friend) Linus ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 13:15 ` Jon Smirl 2006-08-18 13:36 ` Johannes Schindelin @ 2006-08-18 16:25 ` Nicolas Pitre 2006-08-21 7:06 ` Shawn Pearce 1 sibling, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-18 16:25 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Fri, 18 Aug 2006, Jon Smirl wrote: > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > > A better way to get such a size saving is to increase the window and > > depth parameters. For example, a window of 20 and depth of 20 can > > usually provide a pack size saving greater than 11% with none of the > > disadvantages mentioned above. > > Our window size is effectively infinite. I am handing him all of the > revisions from a single file in optimal order. This includes branches. In GIT packing terms this is infinite delta _depth_ not _window_. > He takes these revisions, runs xdiff on them, and then puts the entire > result into a single zlib blob. It is not a good idea to have infinite delta depth. The time to browse the repository history then becomes exponential with the number of revisions, making the value of such a repository a bit questionable (you might as well preserve only the last 2 years of history instead, since anything older than that would likely be too slow and painful to use with infinite delta depth). But just for comparison I did a repack -a -f on the kernel repository with --window=50 --depth=5000, which should be a good approximation of the best possible delta matching with infinite depth. Default delta params (window=10 depth=10) : 122103455 Aggressive deltas (window=50 depth=5000) : 105870516 Reduction : 13% OK let's try it with delta chains in the same zlib stream using the patch I posted yesterday (with a minor tweak allowing the usage of -f with git-repack). Aggressive and grouped deltas (window=50 depth=5000) : 99860685 This is a mere 5.7% reduction over the non-grouped deltas, less than the 11% reduction I obtained yesterday when the delta depth is kept reasonably short. The increased delta depth is likely to make a large difference on old repos with long history, maybe more so and with much less complexity than the delta grouping. > I suspect the size reduction is directly proportional to the age of > the repository. The kernel repository only has three years worth of > data in it. Linus has the full history in another repository that is > not in general distribution. We can get it from him when he gets back > from vacation. > > If the repository doesn't contain long delta chains the optimization > doesn't help that much. On the other hand it doesn't hurt either since > the chains weren't long. My repository is four times as old as the > kernel one and I am getting 4x the benefit. No that cannot be right. Let's assume every whole object is 10 in size and every delta is 1. You therefore can have 1 base object and 10 delta objects, effectively storing 11 objects for a size of 20. You therefore have a 1.8 vs 10 size ratio. If the delta depth is 100 then you potentially have 1 base object and 100 deltas for a size ratio of 1.1 vs 10. If the delta depth is 1000 the ratio becomes 1.01 vs 10. The size saving is therefore _not_ proportional to the age of the repository. It rather tends to be asymptotic with the delta ratio (but imposes an exponential runtime cost when fetching objects out of it). The fact that your 4x old repository has a 4x size saving can be due only to a packing malfunction, I would say. 
Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
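Nicolas' arithmetic can be written out compactly. With whole objects of size 10 and deltas of size 1, a chain of depth d stores 1 + d objects in 10 + d units, so the average cost per object is

    \frac{10 + d}{1 + d} \;\longrightarrow\; 1 \qquad (d \to \infty)

which gives roughly 1.8 at d = 10, 1.1 at d = 100 and 1.01 at d = 1000: the saving flattens out instead of growing in proportion to the repository's age.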
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-18 16:25 ` Nicolas Pitre @ 2006-08-21 7:06 ` Shawn Pearce 2006-08-21 14:07 ` Jon Smirl 2006-08-21 15:46 ` Nicolas Pitre 0 siblings, 2 replies; 36+ messages in thread From: Shawn Pearce @ 2006-08-21 7:06 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jon Smirl, git Nicolas Pitre <nico@cam.org> wrote: > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > > > A better way to get such a size saving is to increase the window and > > > depth parameters. For example, a window of 20 and depth of 20 can > > > usually provide a pack size saving greater than 11% with none of the > > > disadvantages mentioned above. > > > > Our window size is effectively infinite. I am handing him all of the > > revisions from a single file in optimal order. This includes branches. > > In GIT packing terms this is infinite delta _depth_ not _window_. We're not using infinite anything. fast-import is basically doing window=1 and depth=10. We only examine the last blob to see if we can get a delta against it. If we do, we write that delta out; otherwise we reset our delta chain and write the complete object. We also reset our chain after writing out 10 deltas, each of which used the immediately prior object as its base. Since I just found out that in some cases the Mozilla repository has 1000s of revisions per file[*1*] and in others only 1 revision per file, we probably should be adjusting this depth to have a maximum of 500 while also having the frontend send us a "I'm switching files now" marker so we know to not even bother trying to delta the new blob against the last blob as they are likely to not delta well[*2*]. > Default delta params (window=10 depth=10) : 122103455 > Aggressive deltas (window=50 depth=5000) : 105870516 > Aggressive and grouped deltas (window=50 depth=5000) : 99860685 Although complex, the aggressive and grouped deltas appear to have saved you 18.2% on this repository. That's not something to ignore. A reasonably optimal local pack dictionary could save at least 4%[*3*]. Whacking 22% off a 400 MB pack is saving 88 MB. Transferring that over the network on an initial clone is like downloading all of Eclipse. Or an uncompressed kernel tarball... [*1*] Jon noted this in another email in this thread but I'm too lazy to look up the hyperlink right now. [*2*] Granted in some cases they may delta very well against each other but I think the probability of that occurring is low enough that it's not worth worrying about in fast-import.c; we can let repack's strategy deal with it instead. [*3*] I wrote a brain-dead simple local dictionary selector in Perl. It's horribly far from being ideal. But it is consistently saving us 4% on the GIT and the Mozilla repository and it's pretty darn fast. Shockingly the C keywords didn't gain us very much here; it's project-specific text that's the real win. Looking at chunks which are frequently copied in deltas from base objects and breaking those chunks up into smaller common chunks, then loading those most frequent common chunks into the pack dictionary would most likely produce far better results. -- Shawn. ^ permalink raw reply [flat|nested] 36+ messages in thread
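The window=1, depth=10 behaviour described above boils down to very little code. A rough sketch of the decision, assuming git's diff_delta() from diff-delta.c; write_deflated() and the OBJ_DELTA tag here are only placeholders for fast-import's real output path, not its actual interface:

    struct last_object {
        void *data;
        unsigned long len;
        unsigned int depth;     /* deltas written since the last full copy */
    };

    static void emit_blob(struct last_object *last, void *data, unsigned long len)
    {
        void *delta = NULL;
        unsigned long deltalen = 0;

        if (last->data && last->depth < 10)
            delta = diff_delta(last->data, last->len, data, len, &deltalen, 0);

        if (delta && deltalen < len) {
            write_deflated(OBJ_DELTA, delta, deltalen);  /* placeholder helper */
            last->depth++;
        } else {
            write_deflated(OBJ_BLOB, data, len);         /* full copy resets the chain */
            last->depth = 0;
        }
        free(delta);
        last->data = data;      /* only the immediately prior blob is a base candidate */
        last->len = len;
    }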
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 7:06 ` Shawn Pearce @ 2006-08-21 14:07 ` Jon Smirl 0 siblings, 0 replies; 36+ messages in thread From: Jon Smirl @ 2006-08-21 14:07 UTC (permalink / raw) To: Shawn Pearce; +Cc: Nicolas Pitre, git Mozilla CVS files have between 1 and 1700 deltas. The average is ten, but the deviation is high. I believe fewer than 10 files have over 1000 deltas; they are all in the root directory and related to the build process. Many files have no deltas, or because of CVS artifacts all of the revisions are identical. I am still IO bound. Random access IO is the problem, not stream IO. I have to open and read 110,000 (5GB total) files. It takes about 2hrs to do all of the IO. I'm not CPU bound yet but as we make things more efficient, I am getting closer to being CPU bound. Forking is not an option. It can take days to fork 1M copies of an app. I have used oprofile on parsecvs. It spends 60% of the time in the kernel processing fork calls. Parsecvs runs for 6hrs on mozcvs and dies without finishing. I am back to working on the branch code. I'm over the cold I got from my 2 yr old. It is slow going now. I am in the phase where the import process runs without error for 5-10 minutes and then dies from some unusual branch case. I fix it up and try again. I am slowly identifying and removing all the code in cvs2svn that puts the branches and symbols into their own subdirectories. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 7:06 ` Shawn Pearce 2006-08-21 14:07 ` Jon Smirl @ 2006-08-21 15:46 ` Nicolas Pitre 2006-08-21 16:14 ` Jon Smirl 1 sibling, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 15:46 UTC (permalink / raw) To: Shawn Pearce; +Cc: Jon Smirl, git On Mon, 21 Aug 2006, Shawn Pearce wrote: > Nicolas Pitre <nico@cam.org> wrote: > > On Fri, 18 Aug 2006, Jon Smirl wrote: > > > > > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote: > > > > A better way to get such a size saving is to increase the window and > > > > depth parameters. For example, a window of 20 and depth of 20 can > > > > usually provide a pack size saving greater than 11% with none of the > > > > disadvantages mentioned above. > > > > > > Our window size is effectively infinite. I am handing him all of the > > > revisions from a single file in optimal order. This includes branches. > > > > In GIT packing terms this is infinite delta _depth_ not _window_. > > We're not using infinite anything. > > fast-import is basically doing window=1 and depth=10. Yes, that is what I figured out looking at the code. > We only examine the last blob to see if we can get a delta against > it. If we do we write that delta out; otherwise we reset our delta > chain and write the complete object. We also reset our chain after > writing out 10 deltas, each of which used the immediately prior > object as its base. I think this is a perfectly sensible thing to do for the initial import pass. This will get the data in GIT format quickly and save disk space over no deltas at all. No need to seek for optimal packing at that point but only lay things out for a faster git-repack pass to complete the import. Also, like I said before, you cannot get a nice object layout from your import process since all revisions of each file end up next to each other while in practice what you want is all objects for each _commit_ to be next to each other. Otherwise checking out a commit will seek all over the pack like mad. A final repack using git-repack will create that nice object layout for you and that is something you cannot do on the fly. The git-pack-objects code is also where all the accumulated knowledge for best delta packing heuristics has been created over the last year. It would be counter productive to cut yourself from its value. If anything more clever can be done with packing it probably should be done there. Finally, I don't see how you could create nice deltas for tree objects on the fly like you do with blobs without making your code significantly more complex and use more memory due to the recursive nature of directory information. This is no a problem for git-pack-objects since tree objects are no different from blob objects when deltifying them. Again this is why I think you should focus on the quickest generation of a reasonable intermediate pack and let git-repack make it optimal. > Since I just found out that in some cases the Mozilla repository has > 1000s of revisions per file[*1*] and in others only 1 revision per > file we probably should be adjusting this depth to have a maximum > of 500 while also having the frontend send us a "I'm switching > files now" marker so we know to not even bother trying to delta > the new blob against the last blob as they are likely to not > delta well[*2*]. No please don't do that. A deep delta chain won't produce a dramatic size difference but it _will_ make it really costly to read objects back from the pack later on. 
For example it might make the next git-repack many times longer but the size saving will not be more than 20%. If you agree with me that a final git-repack is the best thing to do then the intermediate pack doesn't have to be that small. > Although complex the aggressive and grouped deltas appears to > have saved you 18.2% on this repository. That's not something > to ignore. But the aggressive delta alone saved 13% already. The remaining saving is questionable since, like I said, it prevents the delta data reuse optimization which is so nice in order to reduce the load on servers to nearly zero. > A reasonably optimal local pack dictionary could save > at least 4%[*3*]. Whacking 22% off a 400 MB pack is saving 88 MB. This is still speculations though. The separate dictionary is certainly a good idea to try out and it is way more flexible than the grouped deltas. But I doubt the combined concepts will all add up size savings linearly. > Transferring that over the network on an initial clone is like > downloading all of Eclipse. Or an uncompressed kernel tarball... ... something that most people can afford these days. Plus, not using grouped objects would make future shallow clones much less costly on the server as well, while this is something that most people will be most interested in with such large repos and which has much higher bandwidth saver value. Of course this is a trade-off. Tighter packs are likely to be more costly to use. So far we've contained that cost to the _initial_ generation of the pack (using -a -f with git-repack) while subsequent repacking for network transfer were basically free on CPU. The grouped object concept saves around 7 to 11% on pack size in my tests, but it impose a non negligible cost on any further repacking as well as an additional cost on the fetching of objects out of the pack. > [*3*] I wrote a brain-dead simple local dictionary selecter in Perl. > Its horribly far from being ideal. But it is consistently > saving us 4% on the GIT and the Mozilla repository and its > pretty darn fast. Shockingly the C keywords didn't gain > us very much here; its project specific text that's the > real win. > > Looking at chunks which are frequently copied in deltas > from base objects and breaking those chunks up into > smaller common chunks, then loading those most frequent > common chunks into the pack dictionary would most likely > produce far better results. I think such a global dictionary per pack is something that could be really nice. Again it limits the cost to the initial repack process while having no cost impact neither on the delta reuse optimization nor on the object checkout which is not the case for object grouping. Certainly worth experimenting ... but independently from any import process I think. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
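The preset-dictionary idea, by contrast, maps onto calls zlib already provides: the writer primes each stream with deflateSetDictionary() right after deflateInit(), and the reader feeds the same bytes to inflateSetDictionary() when inflate() returns Z_NEED_DICT. A minimal sketch of the writer side, with the hard part (choosing the dictionary contents) left out and error checking omitted:

    #include <string.h>
    #include <zlib.h>

    static unsigned long deflate_with_dict(unsigned char *out, unsigned long outlen,
                                           const unsigned char *in, unsigned long inlen,
                                           const unsigned char *dict, unsigned int dictlen)
    {
        z_stream s;

        memset(&s, 0, sizeof(s));
        deflateInit(&s, Z_DEFAULT_COMPRESSION);
        deflateSetDictionary(&s, dict, dictlen);   /* prime with pack-wide common text */
        s.next_in = (unsigned char *)in;
        s.avail_in = inlen;
        s.next_out = out;
        s.avail_out = outlen;
        deflate(&s, Z_FINISH);
        deflateEnd(&s);
        return s.total_out;
    }

Because the dictionary only seeds the compression window of each individual stream, it is consistent with Nicolas' point that it leaves the delta-reuse and object-checkout paths untouched.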
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 15:46 ` Nicolas Pitre @ 2006-08-21 16:14 ` Jon Smirl 2006-08-21 17:48 ` Nicolas Pitre 0 siblings, 1 reply; 36+ messages in thread From: Jon Smirl @ 2006-08-21 16:14 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Shawn Pearce, git How about making the length of delta chains an exponential function of the number of revs? In Mozilla, configure.in has 1,700 revs and it is a 250K file. If you store a full copy every 10 revs that is 43MB (prezip) of data that almost no one is going to look at. The chain lengths should reflect the relative probability that someone is going to ask to see the revs. That is not at all a uniform function. Personally I am still in favor of a two-pack system. One archival pack stores everything in a single chain, and size, not speed, is its most important attribute. It is marked read-only and only functions as an archive; git-repack never touches it. It might even use a more compact compression algorithm. The second pack is for storing more recent revisions. The archival pack would be constructed such that none of the files needed for the head revisions of any branch are in it. They would all be in the second pack. Over time the second pack may grow large and another archival pack can be created. The first one would still be maintained in its read-only form. git could be optimized to always search for objects in non-archival packs before even opening the index of an archival one. This may be a path to partial repositories. Instead of downloading the real archival pack I could download just an index for it. The index entries would be marked to indicate that these objects are valid but not-present. -- Jon Smirl jonsmirl@gmail.com ^ permalink raw reply [flat|nested] 36+ messages in thread
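For the record, the 43MB figure is just

    \frac{1700}{10} \times 250\,\mathrm{KB} = 170 \times 250\,\mathrm{KB} \approx 42.5\,\mathrm{MB}

of uncompressed full copies for a single file, which is what makes a uniform store-a-full-copy-every-10-revs rule look wasteful for rarely read history.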
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 16:14 ` Jon Smirl @ 2006-08-21 17:48 ` Nicolas Pitre 2006-08-21 17:55 ` Nicolas Pitre 0 siblings, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 17:48 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Mon, 21 Aug 2006, Jon Smirl wrote: > How about making the length of delta chains an exponential function of > the number of revs? In Mozilla configure.in has 1,700 revs and it is a > 250K file. If you store a full copy every 10 revs that is 43MB > (prezip) of data that almost no one is going to look at. > The chains > lengths should reflect the relative probability that someone is going > to ask to see the revs. That is not at all a uniform function. 1) You can do that already with stock git-repack. 2) The current git-pack-objects code can and does produce a delta "tree" off of a single base object. It doesn't have to be a linear list. Therefore, even if the default depth is 10, you may as well have many deltas pointing to the _same_ base object effectively making the compression ratio much larger than 1/10. If for example each object has 2 delta childs, and each of those deltas also have 2 delta childs, you could have up to 39366 delta objects attached to a _single_ undeltified base object. And of course the git-pack-objects code doesn't limit the number of delta childs in any way so this could theoritically be infinite even though the max depth is 10. OK the delta matching window limits that somehow but nothing prevents you from repacking with a larger window since that parameter has no penalty on the reading of objects out of the pack. > Personally I am still in favor of a two pack system. One archival pack > stores everything in a single chain and size, not speed, is it's most > important attribute. It is marked readonly and only functions as an > archive; git-repack never touches it. It might even use a more compact > compression algorithm. > > The second pack is for storing more recent revisions. The archival > pack would be constructed such that none of the files needed for the > head revisions of any branch are in it. They would all be in the > second pack. Personally I'm still against that. All arguments put forward for a different or multiple packing system are based on unproven assumptions so far and none of those arguments actually present significant advantages over the current system. For example, I was really intriged by the potential of object grouping into a single zlib stream at first, but it turns out not to be so great after actual testing. I still think that a global zlib dictionary is a good idea. Not because it looks like it'll make packs enormously smaller, but rather because it impose no performance regression over the current system. And I strongly believe that you have to have a really strong case for promoting a solution that carries performance regression, something like over 25% smaller packs for example. But so far it didn't happen. > This may be a path to partial repositories. Instead of downloading the > real archival pack I could download just an index for it. The index > entries would be marked to indicate that these objects are valid but > not-present. An index without the actual objects is simply useless. You could do without the index entirely in that case anyway since the mere presence of an entry in the index doesn't give you anything really useful. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 17:48 ` Nicolas Pitre @ 2006-08-21 17:55 ` Nicolas Pitre 2006-08-21 18:01 ` Nicolas Pitre 0 siblings, 1 reply; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 17:55 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Mon, 21 Aug 2006, Nicolas Pitre wrote: > If for example each object has 2 delta childs, and each of those deltas > also have 2 delta childs, you could have up to 39366 delta objects > attached to a _single_ undeltified base object. Sorry I've got the math wrong. That is 3582 deltas, given a conservative number of delta childs = 2. Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Huge win, compressing a window of delta runs as a unit 2006-08-21 17:55 ` Nicolas Pitre @ 2006-08-21 18:01 ` Nicolas Pitre 0 siblings, 0 replies; 36+ messages in thread From: Nicolas Pitre @ 2006-08-21 18:01 UTC (permalink / raw) To: Jon Smirl; +Cc: Shawn Pearce, git On Mon, 21 Aug 2006, Nicolas Pitre wrote: > On Mon, 21 Aug 2006, Nicolas Pitre wrote: > > > If for example each object has 2 delta childs, and each of those deltas > > also have 2 delta childs, you could have up to 39366 delta objects > > attached to a _single_ undeltified base object. > > Sorry I've got the math wrong. That is 3582 deltas, given a > conservative number of delta childs = 2. OK I just can't count. That is 2046. But since I made a fool of myself already, why not try with 3 delta childs for a delta depth of 10 just for fun, now that I should have it right. The answer is 88572. Anyway I hope you've got my point now. ;-) Nicolas ^ permalink raw reply [flat|nested] 36+ messages in thread
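The numbers being corrected here are a geometric series: with b delta children per object and a maximum delta depth d, a single undeltified base can carry

    \sum_{k=1}^{d} b^k = \frac{b^{d+1} - b}{b - 1}

deltas, so d = 10 gives 2^{11} - 2 = 2046 for b = 2 and (3^{11} - 3)/2 = 88572 for b = 3, the two figures Nicolas settles on.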