* git repack: --depth=100000 causing larger not smaller pack file?
@ 2009-03-17 19:05 Kjetil Barvik
From: Kjetil Barvik @ 2009-03-17 19:05 UTC
To: git
aloha!
Yesterday I ran the following command on the updated GIT repository:
  git repack -adf --window=250000 --depth=100000
After 280 minutes or so it finished, but the strange thing was that
the resulting pack-file was larger than before. I had expected that
it would be smaller, or at least the same size as before.
kjetil git (my_next)$ ls -l .git/objects/pack/*
-r-------- 1 kjetil kjetil 2757280 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.idx
-r-------- 1 kjetil kjetil 19961120 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.pack
Before I started, the pack file was around 19 250 000 bytes, and was
the result of the following commands:
1) git repack -adf --window=250000 --depth=20000
   - not completely sure about the --window number here
   - the resulting pack file was a little less than 19 100 000 bytes
2) 'git fetch' to get the latest GIT patches
3) since 'git fetch' always makes an extra new "small" pack file, I ran
   the command 'git repack -ad --window=40000 --depth=10000' to
   get one single pack file of 19 250 000 bytes or so.
I can think of one thing which is special about the "--depth=100000"
number, and that is that it is now larger than the total number of
objects in the pack, which is around 96000 to 97000 or so.
I have run 'git fsck --strict --full' on the pack with no resulting
error/debug output or change in the file size.
Any help on how to debug this?
-- kjetil

* Re: git repack: --depth=100000 causing larger not smaller pack file?
From: Nicolas Pitre @ 2009-03-17 20:38 UTC
To: Kjetil Barvik; +Cc: git

On Tue, 17 Mar 2009, Kjetil Barvik wrote:

> aloha!
>
> Yesterday I ran the following command on the updated GIT repository:
>
>   git repack -adf --window=250000 --depth=100000
>
> After 280 minutes or so it finished, but the strange thing was that
> the resulting pack-file was larger than before. I had expected that
> it would be smaller, or at least the same size as before.
>
> kjetil git (my_next)$ ls -l .git/objects/pack/*
> -r-------- 1 kjetil kjetil 2757280 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.idx
> -r-------- 1 kjetil kjetil 19961120 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.pack
>
> Before I started, the pack file was around 19 250 000 bytes, and was
> the result of the following commands:
>
> 1) git repack -adf --window=250000 --depth=20000
>    - not completely sure about the --window number here
>    - the resulting pack file was a little less than 19 100 000 bytes
>
> 2) 'git fetch' to get the latest GIT patches
>
> 3) since 'git fetch' always makes an extra new "small" pack file, I ran
>    the command 'git repack -ad --window=40000 --depth=10000' to
>    get one single pack file of 19 250 000 bytes or so.
>
> I can think of one thing which is special about the "--depth=100000"
> number, and that is that it is now larger than the total number of
> objects in the pack, which is around 96000 to 97000 or so.

No, the depth should have zero negative influence on the pack size.
For tight compression, the larger the better.

What this will impact though is runtime access to the pack data
afterward. The deeper a given object is, the slower its access will be.
But since the object recency order tends to put newer objects at the top
of a delta chain, this should impact older objects more than recent ones.

> I have run 'git fsck --strict --full' on the pack with no resulting
> error/debug output or change in the file size.

There shouldn't be any.

> Any help on how to debug this?

I doubt there is anything to debug. In this case the window size is
used to evaluate a threshold slope for matching objects in the delta
search. What we want is a broader delta tree more than a deep one in
order to have more deltas with a lower depth limit. Therefore a size
threshold is applied, based on the object distance in the delta search
window (see commit c83f032e and the other ones referenced therein).

By providing a big window value, the threshold slope becomes rather flat
and ineffective, and this changes the delta match outcome. While delta
selection is based on the uncompressed delta result, the compressed size
of different deltas with the same size may vary. I suspect you might
have been unlucky in that regard and this could explain the negative
effect on the pack size.

Nicolas

* Re: git repack: --depth=100000 causing larger not smaller pack file?
From: Kjetil Barvik @ 2009-03-23 10:11 UTC
To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> writes:

> On Tue, 17 Mar 2009, Kjetil Barvik wrote:
>
>> aloha!
>>
>> Yesterday I ran the following command on the updated GIT repository:
>>
>>   git repack -adf --window=250000 --depth=100000
>>
>> After 280 minutes or so it finished, but the strange thing was that
>> the resulting pack-file was larger than before. I had expected that
>> it would be smaller, or at least the same size as before.

[snip]

>> I can think of one thing which is special about the "--depth=100000"
>> number, and that is that it is now larger than the total number of
>> objects in the pack, which is around 96000 to 97000 or so.
>
> No, the depth should have zero negative influence on the pack size.
> For tight compression, the larger the better. What this will impact
> though is runtime access to the pack data afterward. The deeper a
> given object is, the slower its access will be. But since the object
> recency order tends to put newer objects at the top of a delta chain,
> this should impact older objects more than recent ones.

I have done some more tests, and have copied the whole git/ directory
to a new directory (such that I do not accidentally add or delete any
objects/commits), and have made the following table:

All pack file sizes, F (in bytes), below were computed with the
following git command:

  git repack -adf --window=250000 --depth=D

     D   |     F      | (F - F_prev) / (D - D_prev)
  -------|------------|-----------------------------
    5000 |  19129934  |
   10000 |  19128956  |    -978 /  5000 =  -0.1956
   15000 |  19126077  |   -2879 /  5000 =  -0.5758
   20000 |  19126077  |       0 /  5000 =   0
   25000 |  19126077  |       0 /  5000 =   0
   30000 |  19197575  |   71498 /  5000 =  14.2996
   45000 |  19312240  |  114665 / 15000 =   7.6443
   60000 |  19560083  |  247843 / 15000 =  16.5229
   75000 |  19803043  |  242960 / 15000 =  16.1973
   90000 |  19669923  | -133120 / 15000 =  -8.8747
   95000 |  20463780  |  793857 /  5000 = 158.7714

From the table it seems that you get the smallest pack file (for this
particular repository) when the --depth value is somewhere between 15000
and 25000. And, when the --depth value was 95000 the resulting pack
file was 20463780 - 19126077 = 1 337 703 bytes, about 1.28 MiB, or 7%
larger than this.

> I doubt there is anything to debug. In this case the window size is
> used to evaluate a threshold slope for matching objects in the delta
> search. What we want is a broader delta tree more than a deep one in
> order to have more deltas with a lower depth limit. Therefore a size
> threshold is applied, based on the object distance in the delta search
> window (see commit c83f032e and the other ones referenced therein).
>
> By providing a big window value, the threshold slope becomes rather flat
> and ineffective, and this changes the delta match outcome. While delta
> selection is based on the uncompressed delta result, the compressed size
> of different deltas with the same size may vary. I suspect you might
> have been unlucky in that regard and this could explain the negative
> effect on the pack size.

From the table above it seems that I have been unlucky with _all_
--depth values above 25000 or so.

Question: is there some low level GIT command I can run to compare 2
pack files to maybe be able to see the reason behind the above table?
Maybe to see some details about how many deltas there are, how big
each is, total sizes, etc.

-- kjetil

PS! I have the following in my $HOME/.gitconfig file:

  [repack]
          UseDeltaBaseOffset = true
  [gc]
          auto = 25
          autopacklimit = 1
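
For readers who want to recheck the last column of the table in this message, here is a minimal Python sketch that recomputes it from the posted (D, F) pairs. Only the numbers come from the thread; the names MEASUREMENTS, slopes and smallest are made up for this illustration.

```python
# (D, F) pairs copied from the table in the message above:
# D is the --depth value, F the resulting pack size in bytes.
MEASUREMENTS = [
    (5000, 19129934),
    (10000, 19128956),
    (15000, 19126077),
    (20000, 19126077),
    (25000, 19126077),
    (30000, 19197575),
    (45000, 19312240),
    (60000, 19560083),
    (75000, 19803043),
    (90000, 19669923),
    (95000, 20463780),
]


def slopes(rows):
    """Return (D, F, F - F_prev, slope) for every row after the first."""
    result = []
    for (d_prev, f_prev), (d, f) in zip(rows, rows[1:]):
        delta_f = f - f_prev
        result.append((d, f, delta_f, delta_f / (d - d_prev)))
    return result


def smallest(rows):
    """Return the smallest pack size and every depth that produced it."""
    best = min(f for _, f in rows)
    return best, [d for d, f in rows if f == best]


if __name__ == "__main__":
    for d, f, delta_f, slope in slopes(MEASUREMENTS):
        print(f"{d:6d} | {f:9d} | {delta_f:8d} | {slope:10.4f}")
    best, depths = smallest(MEASUREMENTS)
    worst = max(f for _, f in MEASUREMENTS)
    growth = worst - best
    print(f"smallest pack: {best} bytes at depths {depths}")
    print(f"largest pack is {growth} bytes ({growth / best:.1%}) bigger")
```

Recomputed this way, the step from depth 90000 to 95000 is 793857 / 5000 = 158.7714 bytes per unit of depth, and the minimum of 19126077 bytes is shared by depths 15000, 20000 and 25000.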

* Re: git repack: --depth=100000 causing larger not smaller pack file?
From: Mike Ralphson @ 2009-03-23 10:20 UTC
To: Kjetil Barvik; +Cc: Nicolas Pitre, git

2009/3/23 Kjetil Barvik <barvik@broadpark.no>:
> PS! I have the following in my $HOME/.gitconfig file:
>
>   [repack]
>           UseDeltaBaseOffset = true
>   [gc]
>           auto = 25
>           autopacklimit = 1

Just an aside, but from my reading of how it works, there's very
little point in setting gc.auto to anything less than 257 and
statistically it won't kick in predictably unless set quite a bit
higher (say an order of magnitude).

Mike
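
A rough Python sketch of why small gc.auto values would all behave alike. It assumes (this is not stated in the thread) that the automatic check estimates the loose-object count by looking at a single one of the 256 fan-out directories under .git/objects and comparing its entry count against gc.auto rounded up to a multiple of 256 and divided by 256. The function names are hypothetical and the model should be checked against the git source before relying on it.

```python
def sampled_threshold(gc_auto):
    """Per-directory threshold under the assumed model: ceil(gc_auto / 256)."""
    return (gc_auto + 255) // 256


def gc_would_trigger(objects_in_sampled_dir, gc_auto):
    """True if the sampled directory holds more objects than the threshold.

    gc_auto <= 0 is treated as "automatic gc disabled".
    """
    if gc_auto <= 0:
        return False
    return objects_in_sampled_dir > sampled_threshold(gc_auto)


if __name__ == "__main__":
    for value in (1, 25, 256, 257, 2570):
        print(f"gc.auto = {value:5d} -> per-directory threshold {sampled_threshold(value)}")
```

Under that assumption every value from 1 to 256 gives the same per-directory threshold of 1, so gc.auto = 25 would behave no differently from gc.auto = 256, and whether the check fires depends on how many loose objects happen to land in the one sampled directory.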

* Re: git repack: --depth=100000 causing larger not smaller pack file?
From: Peter Harris @ 2009-03-23 14:05 UTC
To: Kjetil Barvik; +Cc: Nicolas Pitre, git

On Mon, Mar 23, 2009 at 6:11 AM, Kjetil Barvik wrote:
> Question: is there some low level GIT command I can run to compare 2
> pack files to maybe be able to see the reason behind the above table?
> Maybe to see some details about how many deltas there are, how big
> each is, total sizes, etc.

  git verify-pack -v <pack.idx>

The columns are:

  SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1

(the last two columns are only present for deltified objects)

Peter Harris
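
To compare two packs using the columns listed above, the verify-pack output can be reduced to a few totals. The following Python sketch parses text in that column layout; the function names and the three sample lines are invented for illustration (they are not real output from the repository discussed here), and it assumes 40-character SHA-1 object names.

```python
import re
from collections import Counter

# An object line starts with a 40-hex-digit SHA-1 followed by a space.
OBJECT_LINE = re.compile(r"^[0-9a-f]{40} ")


def parse_verify_pack(text):
    """Parse 'git verify-pack -v' style lines into a list of dicts.

    Non-delta objects have 5 columns; deltified objects have 7
    (the extra two being depth and base SHA-1). Other lines are skipped.
    """
    entries = []
    for line in text.splitlines():
        if not OBJECT_LINE.match(line):
            continue
        fields = line.split()
        if len(fields) not in (5, 7):
            continue
        entry = {
            "sha1": fields[0],
            "type": fields[1],
            "size": int(fields[2]),
            "packed_size": int(fields[3]),
            "offset": int(fields[4]),
            "depth": 0,
            "base": None,
        }
        if len(fields) == 7:
            entry["depth"] = int(fields[5])
            entry["base"] = fields[6]
        entries.append(entry)
    return entries


def summarize(entries):
    """Totals that are useful when comparing two packs side by side."""
    deltas = [e for e in entries if e["base"] is not None]
    return {
        "objects": len(entries),
        "deltas": len(deltas),
        "packed_bytes": sum(e["packed_size"] for e in entries),
        "delta_packed_bytes": sum(e["packed_size"] for e in deltas),
        "max_depth": max((e["depth"] for e in deltas), default=0),
        "depth_histogram": Counter(e["depth"] for e in entries),
    }


# Made-up sample in the documented column layout.
SAMPLE = "\n".join([
    "a" * 40 + " blob 1200 410 12",
    "b" * 40 + " blob 35 47 422 1 " + "a" * 40,
    "c" * 40 + " blob 20 31 469 2 " + "b" * 40,
    "this trailing line is not an object line and is ignored",
])

if __name__ == "__main__":
    print(summarize(parse_verify_pack(SAMPLE)))
```

Saving the output of 'git verify-pack -v' for each of the two packs to a file and running both through summarize() would show whether the larger pack has fewer deltas, larger deltas, or both.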

* Re: git repack: --depth=100000 causing larger not smaller pack file?
From: Nicolas Pitre @ 2009-03-23 14:14 UTC
To: Kjetil Barvik; +Cc: git

On Mon, 23 Mar 2009, Kjetil Barvik wrote:

> Nicolas Pitre <nico@cam.org> writes:
>
> > On Tue, 17 Mar 2009, Kjetil Barvik wrote:
> >
> >> aloha!
> >>
> >> Yesterday I ran the following command on the updated GIT repository:
> >>
> >>   git repack -adf --window=250000 --depth=100000
> >>
> >> After 280 minutes or so it finished, but the strange thing was that
> >> the resulting pack-file was larger than before. I had expected that
> >> it would be smaller, or at least the same size as before.
>
> [snip]
>
> >> I can think of one thing which is special about the "--depth=100000"
> >> number, and that is that it is now larger than the total number of
> >> objects in the pack, which is around 96000 to 97000 or so.
> >
> > No, the depth should have zero negative influence on the pack size.
> > For tight compression, the larger the better. What this will impact
> > though is runtime access to the pack data afterward. The deeper a
> > given object is, the slower its access will be. But since the object
> > recency order tends to put newer objects at the top of a delta chain,
> > this should impact older objects more than recent ones.
>
> I have done some more tests, and have copied the whole git/ directory
> to a new directory (such that I do not accidentally add or delete any
> objects/commits), and have made the following table:
>
> All pack file sizes, F (in bytes), below were computed with the
> following git command:
>
>   git repack -adf --window=250000 --depth=D
>
>      D   |     F      | (F - F_prev) / (D - D_prev)
>   -------|------------|-----------------------------
>     5000 |  19129934  |
>    10000 |  19128956  |    -978 /  5000 =  -0.1956
>    15000 |  19126077  |   -2879 /  5000 =  -0.5758
>    20000 |  19126077  |       0 /  5000 =   0
>    25000 |  19126077  |       0 /  5000 =   0
>    30000 |  19197575  |   71498 /  5000 =  14.2996
>    45000 |  19312240  |  114665 / 15000 =   7.6443
>    60000 |  19560083  |  247843 / 15000 =  16.5229
>    75000 |  19803043  |  242960 / 15000 =  16.1973
>    90000 |  19669923  | -133120 / 15000 =  -8.8747
>    95000 |  20463780  |  793857 /  5000 = 158.7714
>
> From the table it seems that you get the smallest pack file (for this
> particular repository) when the --depth value is somewhere between 15000
> and 25000. And, when the --depth value was 95000 the resulting pack
> file was 20463780 - 19126077 = 1 337 703 bytes, about 1.28 MiB, or 7%
> larger than this.

This is a bit intriguing.

Of course, before going any further, you must realize that having a
depth of 15000 is a bit excessive. With a delta chain 15000 deep,
access to the object at the end of the chain requires that 14999 other
objects be accessed before the 15000th one is retrieved. This will have
horrible runtime performance for something like a 10% reduction in the
best cases, which is probably not a good tradeoff.

This being said, I still stand by my assertion that, in theory, greater
delta depth should not make the pack bigger. And your table appears to
confirm that, even to the point of reaching a stable size as one would
expect, until a breaking point is reached after which results tend to
become rather random.

What I'm suspecting in that case is some computation overflow in
try_delta(). Consider for instance this piece:

	max_size = max_size * (max_depth - src->depth) /
					(max_depth - ref_depth + 1);

[ This is the threshold slope I was talking about, but contrary to what
  I said before, it is affected by the depth, not the window size. ]

In this case, if you have a max_depth of 95000, then any object larger
than 90461 bytes will cause a multiplication overflow, and the resulting
max_size will be capped to some random smaller value than expected
depending on the remaining bits. For example, suppose max_size = 45211,
max_depth = 95000 and src->depth = 0; then you should have max_size
still equal to 45211, but in this case it'll become 0 and no delta will
be attempted at all. The number of deltas reported at the end of the
repack process probably reflects that.

> > I doubt there is anything to debug. In this case the window size is
> > used to evaluate a threshold slope for matching objects in the delta
> > search. What we want is a broader delta tree more than a deep one in
> > order to have more deltas with a lower depth limit. Therefore a size
> > threshold is applied, based on the object distance in the delta search
> > window (see commit c83f032e and the other ones referenced therein).
> >
> > By providing a big window value, the threshold slope becomes rather flat
> > and ineffective, and this changes the delta match outcome. While delta
> > selection is based on the uncompressed delta result, the compressed size
> > of different deltas with the same size may vary. I suspect you might
> > have been unlucky in that regard and this could explain the negative
> > effect on the pack size.
>
> From the table above it seems that I have been unlucky with _all_
> --depth values above 25000 or so.

See explanation (and self correction) above.

> Question: is there some low level GIT command I can run to compare 2
> pack files to maybe be able to see the reason behind the above table?
> Maybe to see some details about how many deltas there are, how big
> each is, total sizes, etc.

Yes -- see the -v option of 'git verify-pack'.

Nicolas
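
The overflow suspected in try_delta() in this message can be reproduced with plain integer arithmetic. The Python sketch below is only a model of the quoted expression, not git's code. It assumes that max_size is held in a 32-bit unsigned type on the affected machine and that ref_depth is 1, which is the value that makes the exact result stay at 45211 as in the example; neither assumption is stated in the thread, and the function names are invented.

```python
MASK32 = 0xFFFFFFFF  # assumption: the product is computed in 32-bit unsigned arithmetic


def scale_32bit(max_size, max_depth, src_depth, ref_depth):
    """Threshold scaling with the product wrapped modulo 2**32."""
    product = (max_size * (max_depth - src_depth)) & MASK32
    return product // (max_depth - ref_depth + 1)


def scale_exact(max_size, max_depth, src_depth, ref_depth):
    """The same expression without any overflow (Python integers are unbounded)."""
    return max_size * (max_depth - src_depth) // (max_depth - ref_depth + 1)


def first_overflowing_size(max_depth, src_depth=0):
    """Smallest max_size whose product no longer fits in 32 bits."""
    return MASK32 // (max_depth - src_depth) + 1


if __name__ == "__main__":
    # Values from the example in the message; ref_depth = 1 is an assumption.
    print("exact :", scale_exact(45211, 95000, 0, 1))   # 45211
    print("32-bit:", scale_32bit(45211, 95000, 0, 1))   # 0
    for depth in (5000, 25000, 95000):
        print(f"--depth={depth}: wraps from max_size = {first_overflowing_size(depth)}")
```

With these assumptions 45211 is the first max_size for which the product with 95000 exceeds 2**32, and the wrapped result divides down to 0, matching the example. The model also shows that a smaller --depth raises the size at which wrapping starts, so fewer objects would be affected.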