* gc --aggressive
@ 2012-04-17 16:16 Jay Soffian
2012-04-17 17:53 ` Jay Soffian
2012-04-17 22:08 ` Jeff King
0 siblings, 2 replies; 26+ messages in thread
From: Jay Soffian @ 2012-04-17 16:16 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Shawn Pearce
For a couple years now I've had a maintenance script which repacks all
the repos at @dayjob thusly:
git config repack.usedeltabaseoffset true
git config pack.compression 9
git config pack.indexversion 2
git config gc.autopacklimit 4
git config gc.packrefs true
git config gc.reflogexpire never
git config gc.reflogexpireunreachable never
git gc --auto --aggressive --prune
This has worked fine on repos large and small. However, starting a
couple days ago git started running out of memory on a relatively
modest repo[*] while repacking on a Linux box with 12GB memory (+ 12GB
swap). I am able to gc the repo by either removing --aggressive or
.keep'ing the oldest pack.
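For reference, ".keep'ing" a pack just means creating a matching .keep
file next to it, after which repack/gc will neither repack that pack's
objects nor delete it. A minimal sketch, assuming a bare repository and
taking "oldest" to mean oldest mtime:

  cd /path/to/repo.git/objects/pack
  oldest=$(ls -t pack-*.pack | tail -n 1)   # ls -t sorts newest first
  touch "${oldest%.pack}.keep"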
[*] Stats:
du -hs objects
141M objects
git count-objects -v
count: 0
size: 0
in-pack: 57656
packs: 37
size-pack: 143811
prune-packable: 0
garbage: 0
git version 1.7.10
I've since found a message from Shawn recommending against using --aggressive:
http://groups.google.com/group/repo-discuss/msg/d2462eed67813571
> Junio Hamano and I looked at things a few weeks ago; it turns out the
> --aggressive flag doesn't generally provide a benefit like we thought
> it would. It would be safe to remove from your GC script, and will
> speed things up considerably.
A couple questions:
1) If --aggressive does not generally provide a benefit, should it be
made a no-op?
2) Is it expected that gc --aggressive would run out of memory on this repo?
I've posted the repo in case anyone wants to take a look:
http://dl.dropbox.com/u/2138120/WebKit-trimmed.git.zip
j.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive
2012-04-17 16:16 gc --aggressive Jay Soffian
@ 2012-04-17 17:53 ` Jay Soffian
2012-04-17 20:52 ` Matthieu Moy
2012-04-28 16:56 ` Nicolas Pitre
2012-04-17 22:08 ` Jeff King
1 sibling, 2 replies; 26+ messages in thread
From: Jay Soffian @ 2012-04-17 17:53 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Shawn Pearce, Jeff King

On Tue, Apr 17, 2012 at 12:16 PM, Jay Soffian <jaysoffian@gmail.com> wrote:
> This has worked fine on repos large and small. However, starting a
> couple days ago git started running out of memory on a relatively
> modest repo[*] while repacking on a Linux box with 12GB memory (+ 12GB
> swap). I am able to gc the repo by either removing --aggressive or
> .keep'ing the oldest pack.

Experimentally, setting pack.windowMemory = 256m keeps git memory
usage < 4.5 GB during an aggressive repack.

Ironically I end up with a slightly worse pack (63115590 bytes vs
61518628 bytes) than not using --aggressive. I assume this is because
pack-objects found a better delta chain during the previous aggressive
repack when windowMemory was not set.

> 1) If --aggressive does not generally provide a benefit, should it be
> made a no-op?

I guess I'll revise this question: perhaps --aggressive should be
better explained/discouraged. I found a message from Jeff last month
and stole his words for this patch:

<snip>
diff --git i/Documentation/git-gc.txt w/Documentation/git-gc.txt
index 815afcb922..ca5bf8b51e 100644
--- i/Documentation/git-gc.txt
+++ w/Documentation/git-gc.txt
@@ -37,9 +37,8 @@ OPTIONS
 Usually 'git gc' runs very quickly while providing good disk
 space utilization and performance. This option will cause
 'git gc' to more aggressively optimize the repository at the expense
- of taking much more time. The effects of this optimization are
- persistent, so this option only needs to be used occasionally; every
- few hundred changesets or so.
+ of taking much more time and potentially using greater memory. This
+ option is rarely needed. See Repacking below.

 --auto::
 With this option, 'git gc' checks whether any housekeeping is
@@ -138,6 +137,39 @@ If you are expecting some objects to be collected and they aren't, check
 all of those locations and decide whether it makes sense in your case to
 remove those references.

+Repacking
+---------
+
+Under the covers 'git gc' calls several commands to optimize the repository.
+The most significant of these with respect to repository size and general
+performance is linkgit:git-repack[1]. There are basically three levels of
+'gc' with respect to repacking:
+
+ 1. `git gc --auto`; if there are too many loose objects (`gc.auto`), they
+ all go into a new incremental pack. If there are already too many
+ packs (`gc.autopacklimit`), all of the existing packs are re-packed
+ together.
+
+ Making an incremental pack is by far the fastest because the speed is
+ independent of the existing repository history. If git packs
+ everything together, it should be more or less the same as (2).
+
+ 2. `git gc`; this packs everything into a single pack. It uses default
+ window and depth parameters, but importantly, it reuses existing
+ deltas. Doing so makes the delta compression phase much faster, and it
+ often makes the writing phase faster (because for older objects, git
+ is primarily streaming them right out of the existing pack). On a big
+ repository though, this does do a lot of I/O, because git has to
+ rewrite the whole pack.
+
+ 3. `git gc --aggressive`; this is often much slower than (2) because git
+ throws out all of the existing deltas and recomputes them from
+ scratch. It uses a higher window parameter meaning it will spend
+ more time computing, and it may end up with a smaller pack. However,
+ unless the repository is known to have initially been poorly packed,
+ this option is not needed and will just cause git to perform
+ extra work.
+
 HOOKS
 -----

@@ -147,6 +179,7 @@ linkgit:githooks[5] for more information.

 SEE ALSO
 --------
+linkgit:git-pack-refs[1]
 linkgit:git-prune[1]
 linkgit:git-reflog[1]
 linkgit:git-repack[1]
</snip>

Thoughts?

^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-04-17 17:53 ` Jay Soffian @ 2012-04-17 20:52 ` Matthieu Moy 2012-04-17 21:58 ` Jeff King 2012-04-28 12:25 ` Jeff King 2012-04-28 16:56 ` Nicolas Pitre 1 sibling, 2 replies; 26+ messages in thread From: Matthieu Moy @ 2012-04-17 20:52 UTC (permalink / raw) To: Jay Soffian; +Cc: git, Junio C Hamano, Shawn Pearce, Jeff King Jay Soffian <jaysoffian@gmail.com> writes: > + 3. `git gc --aggressive`; this is often much slower than (2) because git > + throws out all of the existing deltas and recomputes them from > + scratch. It uses a higher window parameter meaning it will spend > + more time computing, and it may end up with a smaller pack. However, > + unless the repository is known to have initially been poorly packed, > + this option is not needed and will just cause git to perform > + extra work. I like your patch. Maybe you should elaborate on "unless the repository is known to have initially been poorly packed". My understanding is that --aggressive was implemented to be called after an import from another VCS that would have computed very poor deltas, but I'm not sure about the details. -- Matthieu Moy http://www-verimag.imag.fr/~moy/ ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-04-17 20:52 ` Matthieu Moy @ 2012-04-17 21:58 ` Jeff King 2012-04-28 12:25 ` Jeff King 1 sibling, 0 replies; 26+ messages in thread From: Jeff King @ 2012-04-17 21:58 UTC (permalink / raw) To: Matthieu Moy; +Cc: Jay Soffian, git, Junio C Hamano, Shawn Pearce On Tue, Apr 17, 2012 at 10:52:03PM +0200, Matthieu Moy wrote: > Jay Soffian <jaysoffian@gmail.com> writes: > > > + 3. `git gc --aggressive`; this is often much slower than (2) because git > > + throws out all of the existing deltas and recomputes them from > > + scratch. It uses a higher window parameter meaning it will spend > > + more time computing, and it may end up with a smaller pack. However, > > + unless the repository is known to have initially been poorly packed, > > + this option is not needed and will just cause git to perform > > + extra work. > > I like your patch. Me too. I guess it is not surprising since I wrote the initial draft. ;) > Maybe you should elaborate on "unless the repository is known to have > initially been poorly packed". My understanding is that --aggressive was > implemented to be called after an import from another VCS that would > have computed very poor deltas, but I'm not sure about the details. Yes, that's exactly it. fast-import will generate packs, but they are often not optimal. So if you have done a big import, you should definitely "git gc --aggressive" as the final step. I don't know how something like a remote-helper would work, where it is fast-importing little bits at a time. Probably a regular repack would be fine, since it will be considering deltas between objects in different packs anyway. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
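The scenario Peff describes above boils down to a one-time import
through fast-import followed by a single aggressive repack. A sketch,
where 'some-exporter' stands in for whatever frontend produces the
fast-import stream:

  git init --bare imported.git
  some-exporter | git --git-dir=imported.git fast-import
  git --git-dir=imported.git gc --aggressive --prune=now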
* Re: gc --aggressive
2012-04-17 20:52 ` Matthieu Moy
2012-04-17 21:58 ` Jeff King
@ 2012-04-28 12:25 ` Jeff King
2012-04-28 17:11 ` Nicolas Pitre
1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-04-28 12:25 UTC (permalink / raw)
To: git; +Cc: Matthieu Moy, Nicolas Pitre, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, Apr 17, 2012 at 10:52:03PM +0200, Matthieu Moy wrote:

> Jay Soffian <jaysoffian@gmail.com> writes:
>
> > + 3. `git gc --aggressive`; this is often much slower than (2) because git
> > + throws out all of the existing deltas and recomputes them from
> > + scratch. It uses a higher window parameter meaning it will spend
> > + more time computing, and it may end up with a smaller pack. However,
> > + unless the repository is known to have initially been poorly packed,
> > + this option is not needed and will just cause git to perform
> > + extra work.
>
> I like your patch.
>
> Maybe you should elaborate on "unless the repository is known to have
> initially been poorly packed". My understanding is that --aggressive was
> implemented to be called after an import from another VCS that would
> have computed very poor deltas, but I'm not sure about the details.

Coincidentally, I came across a case last week that shows --aggressive
providing a large improvement. And it's a public repo, so I was able to
grab a snapshot of the pre-packed state to experiment on and share.

The current packfile is ~246M. It was produced over time by pushes into
the repository, which were then eventually grouped into a single pack by
"git gc" (I'm not sure of the exact history, but this may even have been
a set of "gc --auto" calls over time).

Here's a list of commands and the pack sizes they yield on the repo:

  1. `git repack -ad`: 246M
  2. `git repack -ad -f`: 376M
  3. `git repack -ad --window=250`: 246M
  4. `git repack -ad -f --window=250`: 145M

The most interesting thing is (4): repacking with a larger window size
yields a 100M (40%) space improvement. The other commands show that it
is not that the current pack is simply bad; command (2) repacks from
scratch and actually ends up with a worse pack. So the increased window
size really is important.

I haven't been able to figure out what it is about this dataset that
makes the bigger window so much better. Certainly doing the same
commands on git.git does not yield as impressive a speedup.

If anybody is interested in the repository as a packing test case, you
can download it from:

  https://gist.github.com/raw/2518074/d9c0244bf0ced690fee1edb2c88c522ecce102e4/phpmyadmin-network.tar

-Peff

^ permalink raw reply [flat|nested] 26+ messages in thread
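Each of the data points above can be reproduced on a throwaway copy of
a repository along these lines; the paths are only illustrative:

  cp -a /path/to/repo.git /tmp/repack-test.git
  cd /tmp/repack-test.git
  git repack -ad -f --window=250
  du -ch objects/pack/*.pack | tail -n 1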
* Re: gc --aggressive 2012-04-28 12:25 ` Jeff King @ 2012-04-28 17:11 ` Nicolas Pitre 2012-04-29 11:34 ` Jeff King 0 siblings, 1 reply; 26+ messages in thread From: Nicolas Pitre @ 2012-04-28 17:11 UTC (permalink / raw) To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce On Sat, 28 Apr 2012, Jeff King wrote: > On Tue, Apr 17, 2012 at 10:52:03PM +0200, Matthieu Moy wrote: > > > Jay Soffian <jaysoffian@gmail.com> writes: > > > > > + 3. `git gc --aggressive`; this is often much slower than (2) because git > > > + throws out all of the existing deltas and recomputes them from > > > + scratch. It uses a higher window parameter meaning it will spend > > > + more time computing, and it may end up with a smaller pack. However, > > > + unless the repository is known to have initially been poorly packed, > > > + this option is not needed and will just cause git to perform > > > + extra work. > > > > I like your patch. > > > > Maybe you should elaborate on "unless the repository is known to have > > initially been poorly packed". My understanding is that --aggressive was > > implemented to be called after an import from another VCS that would > > have computed very poor deltas, but I'm not sure about the details. This is somewhat subjective of course. But to be effective, you need sufficient resources to repack with --aggressive, otherwise you may potentially end up with a worse pack. > Coincidentally, I came across a case last week that shows --aggressive > providing a large improvement. And it's a public repo, so I was able to > grab a snapshot of the pre-packed state to experiment on and share. > > The current packfile is ~246M. It was produced over time by pushes into > the repository, which were then eventually grouped into a single pack by > "git gc" (I'm not sure of the exact history, but this may even have been > a set of "gc --auto" calls over time). > > Here's a list of commands and the pack sizes they yield on the repo: > > 1. `git repack -ad`: 246M > 2. `git repack -ad -f`: 376M > 3. `git repack -ad --window=250`: 246M > 4. `git repack -ad -f --window=250`: 145M > > The most interesting thing is (4): repacking with a larger window size > yields a 100M (40%) space improvement. The other commands show that it > is not that the current pack is simply bad; command (2) repacks from > scratch and actually ends up with a worse pack. So the increased window > size really is important. Absolutely. This doesn't surprises me. > I haven't been able to figure out what it is about this dataset that > makes the bigger window so much better. Certainly doing the same > commands on git.git does not yield as impressive a speedup. The default window size of 10 objects is really really small (yet if your objects are 150MB in size then it is probably too big, but I digress). When doing an incremental repack, the window is also limited by the fact that we don't redelta those already packed objects. Many things could explain the improvements with a larger window. If a lot of files were renamed for example, the larger window would allow similar objects to still delta against each other, despite the fact that we pull them in the search window according to their corresponding file names. With a smaller window this opportunity would be missed. Nicolas ^ permalink raw reply [flat|nested] 26+ messages in thread
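One way to get a feel for how well an existing pack is delta-compressed
is the chain-length histogram that verify-pack prints at the end of its
verbose output, run from inside the .git or bare repository directory:

  git verify-pack -v objects/pack/pack-*.idx | tail -n 30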
* Re: gc --aggressive
2012-04-28 17:11 ` Nicolas Pitre
@ 2012-04-29 11:34 ` Jeff King
2012-04-29 13:53 ` Nicolas Pitre
0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-04-29 11:34 UTC (permalink / raw)
To: Nicolas Pitre
Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Sat, Apr 28, 2012 at 01:11:48PM -0400, Nicolas Pitre wrote:

> > Here's a list of commands and the pack sizes they yield on the repo:
> >
> > 1. `git repack -ad`: 246M
> > 2. `git repack -ad -f`: 376M
> > 3. `git repack -ad --window=250`: 246M
> > 4. `git repack -ad -f --window=250`: 145M
> >
> > The most interesting thing is (4): repacking with a larger window size
> > yields a 100M (40%) space improvement. The other commands show that it
> > is not that the current pack is simply bad; command (2) repacks from
> > scratch and actually ends up with a worse pack. So the increased window
> > size really is important.
>
> Absolutely. This doesn't surprises me.

I was somewhat surprised, because this repo behaves very differently
from other ones as the window size increases. Our default window of 10
is somewhat arbitrary, but I think there was a sense from early tests
that you got diminishing returns from increasing it (this is my vague
recollection; I didn't actually search for old discussions).

But here are some charts showing "repack -adf" with various window
sizes on a few repositories. The first column is the window size; the
second is the resulting pack size (and its percentage of the window=10
case); the third is the number of seconds of CPU time (and again, the
percentage of the window=10 case).

Here's git.git:

    10 | 31.3M (100%) |   54s (100%)
    20 | 28.8M ( 92%) |   72s (133%)
    40 | 27.4M ( 87%) |  101s (187%)
    80 | 26.3M ( 84%) |  153s (282%)
   160 | 25.7M ( 82%) |  247s (455%)
   320 | 25.4M ( 81%) |  415s (763%)

You can see we get some benefit from increasing window size to 20 or
even 40, but we hit an asymptote around 80%. Meanwhile, CPU time keeps
jumping. Something like 20 or 40 seems like it might be a nice
compromise.

Here's linux-2.6:

    10 | 564M (100%) |  990s (100%)
    20 | 521M ( 92%) | 1323s (134%)
    40 | 495M ( 88%) | 1855s (187%)
    80 | 479M ( 85%) | 2743s (277%)
   160 | 470M ( 83%) | 4284s (432%)
   320 | 463M ( 82%) | 7064s (713%)

It's quite similar, asymptotically heading towards ~80%. And the CPU
numbers look quite similar, too.

And here's the phpmyadmin repository (the one I linked to earlier):

    10 | 386M (100%) | 1592s (100%)
    20 | 280M ( 72%) | 1947s (122%)
    40 | 209M ( 54%) | 2514s (158%)
    80 | 169M ( 44%) | 3386s (213%)
   160 | 151M ( 39%) | 4822s (303%)
   320 | 142M ( 37%) | 6948s (436%)

The packfile size improvements go on for much longer as we increase the
window size. For this repo, a window size of 80-100 is probably a good
spot.

That leads me to a few questions:

  1. Should we bump our default window size? The numbers above show that
     typical repos would benefit from jumping to 20 or even 40.

  2. Is there a heuristic or other metric we can figure out to
     differentiate the first two repositories from the third, and use a
     larger window size on the latter?

  3. Does the phpmyadmin case give us any insight into whether we can
     improve our window sorting algorithm? Looking at the repo, ~55K of
     the ~75K commits are small changes in the po/ directory (it looks
     like they were using a web-based tool to let non-committers tweak
     the translation files). In particular, I see a lot of commits in
     which most of the changes are simply line number changes as the po
     files are refreshed from the source. I wonder if that is making the
     size-sorting heuristics perform poorly, as we end up with many
     files of the same size, and the good deltas get pushed further
     along the window.

  4. What is typical? I suspect that git.git and linux-2.6 are typical,
     and the weird po-files in the phpmyadmin repository are not. But
     I'd be happy to test more repos if people have suggestions. And the
     scripts that generated the charts are included below if anybody
     wants to try it themselves.

-Peff

-- >8 --
cat >collect <<\EOF
#!/bin/sh
# usage: collect /path/to/repo >foo.out
windows='10 20 40 80 160 320'
for i in $windows; do
	echo >&2 "Repacking with window $i..."
	rm -rf tmp &&
	cp -a "$1" tmp && (
		cd tmp &&
		time=`time -f %U -o /dev/stdout git repack -adf --window=$i`
		size=`du -bc objects/pack/pack-*.pack | tail -1 | awk '{print $1}'`
		echo "$i $size $time"
	)
done
EOF

cat >chart <<\EOF
#!/usr/bin/perl
# usage: chart <foo.out
use strict;

my @base;
while (<>) {
	chomp;
	my ($window, $size, $time) = split;
	@base = ($size, $time) unless @base;
	printf '%4s', $window;
	print ' | ', humanize($size);
	printf ' (%3d%%)', int($size / $base[0] * 100 + 0.5);
	printf ' | %4ds', $time;
	printf ' (%d%%)', int($time / $base[1] * 100 + 0.5);
	print "\n";
}

sub human_digits {
	my $n = shift;
	my $digits = $n >= 100 ? 0 : $n >= 10 ? 1 : 2;
	return sprintf '%.*f', $digits, $n;
}

sub humanize {
	my $n = shift;
	my $u;
	foreach $u ('', qw(K M G)) {
		return human_digits($n) . $u if $n < 900;
		$n /= 1024;
	}
	return human_digits($n) . $u;
}
EOF

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-04-29 11:34 ` Jeff King @ 2012-04-29 13:53 ` Nicolas Pitre 2012-05-01 16:28 ` Jeff King 0 siblings, 1 reply; 26+ messages in thread From: Nicolas Pitre @ 2012-04-29 13:53 UTC (permalink / raw) To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce On Sun, 29 Apr 2012, Jeff King wrote: > On Sat, Apr 28, 2012 at 01:11:48PM -0400, Nicolas Pitre wrote: > > > > Here's a list of commands and the pack sizes they yield on the repo: > > > > > > 1. `git repack -ad`: 246M > > > 2. `git repack -ad -f`: 376M > > > 3. `git repack -ad --window=250`: 246M > > > 4. `git repack -ad -f --window=250`: 145M > > > > > > The most interesting thing is (4): repacking with a larger window size > > > yields a 100M (40%) space improvement. The other commands show that it > > > is not that the current pack is simply bad; command (2) repacks from > > > scratch and actually ends up with a worse pack. So the increased window > > > size really is important. > > > > Absolutely. This doesn't surprises me. > > I was somewhat surprised, because this repo behaves very differently > from other ones as the window size increases. Our default window of 10 > is somewhat arbitrary, but I think there was a sense from early tests > that you got diminishing returns from increasing it (this is my vague > recollection; I didn't actually search for old discussions). Yes, your numbers are very interesting. But my remark was related to the fact that you need to double the affected resources to gain marginal improvements at some point. This is true about computing hardware too: eventually you need way more gates and spend much more $$$ to gain some performance, and the added performance is never linear with the spending. > But here are some charts showing "repack -adf" with various window > sizes on a few repositories. The first column is the window size; the > second is the resulting pack size (and its percentage of the window=10 > case); the third is the number of seconds of CPU time (and again, the > percentage of the window=10 case). > > Here's git.git: > > 10 | 31.3M (100%) | 54s (100%) > 20 | 28.8M ( 92%) | 72s (133%) > 40 | 27.4M ( 87%) | 101s (187%) > 80 | 26.3M ( 84%) | 153s (282%) > 160 | 25.7M ( 82%) | 247s (455%) > 320 | 25.4M ( 81%) | 415s (763%) > > You can see we get some benefit from increasing window size to 20 or > even 40, but we hit an asymptote around 80%. Meanwhile, CPU time keeps > jumping. Something like 20 or 40 seems like it might be a nice > compromise. > > Here's linux-2.6: > > 10 | 564M (100%) | 990s (100%) > 20 | 521M ( 92%) | 1323s (134%) > 40 | 495M ( 88%) | 1855s (187%) > 80 | 479M ( 85%) | 2743s (277%) > 160 | 470M ( 83%) | 4284s (432%) > 320 | 463M ( 82%) | 7064s (713%) > > It's quite similar, asymptotically heading towards ~80%. And the CPU > numbers look quite similar, too. > > And here's the phpmyadmin repository (the one I linked to earlier): > > 10 | 386M (100%) | 1592s (100%) > 20 | 280M ( 72%) | 1947s (122%) > 40 | 209M ( 54%) | 2514s (158%) > 80 | 169M ( 44%) | 3386s (213%) > 160 | 151M ( 39%) | 4822s (303%) > 320 | 142M ( 37%) | 6948s (436%) > > The packfile size improvements go on for much longer as we increase the > window size. For this repo, a window size of 80-100 is probably a good > spot. > > That leads me to a few questions: > > 1. Should we bump our default window size? The numbers above show that > typical repos would benefit from jumping to 20 or even 40. 
I think this might be a good indication that the number of objects is a bad metric to size the window, as I mentioned previously. Given that you have the test repos already, could you re-run it with --window=1000 and play with --window-memory instead? I would be curious to see if this provides more predictable results. > 2. Is there a heuristic or other metric we can figure out to > differentiate the first two repositories from the third, and use a > larger window size on the latter? Maybe we could look at the size reduction within the delta search loop. If the reduction quickly diminishes as tested objects are further away from the target one then the window doesn't have to be very large, whereas if the reduction remains more or less constant then it might be worth searching further. That could be used to dynamically size the window at run time. > 3. Does the phpmyadmin case give us any insight into whether we can > improve our window sorting algorithm? Looking at the repo, ~55K of > the ~75K commits are small changes in the po/ directory (it looks > like they were using a web-based tool to let non-committers tweak > the translation files). In particular, I see a lot of commits in > which most of the changes are simply line number changes as the po > files are refreshed from the source. I wonder if that is making the > size-sorting heuristics perform poorly, as we end up with many > files of the same size, and the good deltas get pushed further > along the window. You could test this theory by commenting out the size comparisons in type_size_sort() and re-run the test. Linus initially introduced that criterion thinking that newer files tend to grow and it is cheaper to create a delta that removes data than one that adds data. And given that we wanted to prefer delta chains to start from newer objects this all made sense. However the last comparison in that function is meant to handling the recency ordering, and therefore the size comparison might be skewing things here. > 4. What is typical? I suspect that git.git and linux-2.6 are typical, > and the weird po-files in the phpmyadmin repository are not. But > I'd be happy to test more repos if people have suggestions. And the > scripts that generated the charts are included below if anybody > wants to try it themselves. I wouldn't give up on "non typical" data sets just yet though. Nicolas ^ permalink raw reply [flat|nested] 26+ messages in thread
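The experiment Nicolas asks for here amounts to runs like the following,
varying the memory cap instead of the object-count window; the 1m value
is just one sample point:

  git repack -adf --window=1000 --window-memory=1m
  du -ch objects/pack/*.pack | tail -n 1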
* Re: gc --aggressive
2012-04-29 13:53 ` Nicolas Pitre
@ 2012-05-01 16:28 ` Jeff King
2012-05-01 17:16 ` Jeff King
2012-05-01 17:17 ` Nicolas Pitre
0 siblings, 2 replies; 26+ messages in thread
From: Jeff King @ 2012-05-01 16:28 UTC (permalink / raw)
To: Nicolas Pitre
Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Sun, Apr 29, 2012 at 09:53:31AM -0400, Nicolas Pitre wrote:

> But my remark was related to the fact that you need to double the
> affected resources to gain marginal improvements at some point. This is
> true about computing hardware too: eventually you need way more gates
> and spend much more $$$ to gain some performance, and the added
> performance is never linear with the spending.

Right, I agree with that. The trick is just finding the right spot on
that curve for each repo to maximize the reward/effort ratio.

> > 1. Should we bump our default window size? The numbers above show that
> > typical repos would benefit from jumping to 20 or even 40.
>
> I think this might be a good indication that the number of objects is a
> bad metric to size the window, as I mentioned previously.
>
> Given that you have the test repos already, could you re-run it with
> --window=1000 and play with --window-memory instead? I would be curious
> to see if this provides more predictable results.

It doesn't help. The git.git repo does well with about a 1m window
limit. linux-2.6 is somewhere between 1m and 2m. But the phpmyadmin repo
wants more like 16m. So it runs into the same issue as using object
counts.

But it's much, much worse than that. Here are the actual numbers (same
format as before; left-hand column is either window size (if no unit) or
window-memory limit (if k/m unit), followed by resulting pack size, its
percentage of baseline --window=10 pack, the user CPU time and finally
its percentage of the baseline):

git:

    10 | 31.4M (100%) |   54s (100%)
    20 | 28.8M ( 92%) |   72s (133%)
  128k | 81.4M (260%) |   77s (142%)
  256k | 59.1M (188%) |  106s (195%)
  512k | 44.5M (142%) |  166s (306%)
    1m | 28.7M ( 91%) |  267s (491%)
    2m | 27.0M ( 86%) |  347s (637%)
    4m | 26.0M ( 83%) |  417s (767%)

linux-2.6:

    10 |  564M (100%) |  990s (100%)
    20 |  521M ( 92%) | 1323s (134%)
  128k | 1.41G (256%) | 1322s (133%)
  256k | 1.08G (196%) | 1810s (183%)
  512k |  783M (139%) | 2775s (280%)
    1m |  579M (103%) | 4620s (466%)
    2m |  504M ( 89%) | 6786s (685%)
    4m |  479M ( 85%) | 8119s (819%)

phpmyadmin:

    10 |  380M (100%) | 1617s (100%)
    80 |  163M ( 43%) | 3410s (211%)
  128k | 3.42G (921%) | 2367s (146%)
  256k | 3.36G (904%) | 2437s (151%)
  512k | 3.22G (865%) | 2589s (160%)
    1m | 3.10G (833%) | 2746s (170%)
    2m |  436M (115%) | 1674s (104%)
    4m |  299M ( 78%) | 2140s (132%)
    8m |  222M ( 58%) | 2751s (170%)
   16m |  178M ( 47%) | 3334s (206%)

I intentionally started with a too-small memory limit so we could see
the effect as the window size approached something reasonable. You can
see the pack sizes getting comparable for --window=20 around
--window-memory=1m for the git and linux-2.6 cases.

But look at the CPU usage. For a comparable resulting pack size,
limiting the window memory uses 4-5x as much CPU. I'm not sure what is
causing that behavior. I guess maybe for small objects we end up with a
really huge window (in terms of number of objects), but it doesn't end
up actually saving us a lot of space because there is not as much space
to be saved with small objects. So we spend a lot of extra time looking
at objects that don't yield big space savings.

For some of the really tiny limits, the "writing" phase ended up
dominating. For example, linux-2.6 at 128k ends up with a horribly large
pack that takes even longer to run than --window=10. These numbers don't
reflect the split between the compression and writing phases, but I
noticed while watching the progress meter that the writing phase was
quite slow in such cases. Mostly because we end up having to zlib
deflate a lot more data (which I confirmed via perf).

Interestingly, the phpmyadmin script does not have the same issue. The
CPU usage for the object and memory limits are about the same (probably
this is due to the fact that the history is dominated by similar-sized
.po files, so the two limits end up equating to each other).

> > 2. Is there a heuristic or other metric we can figure out to
> > differentiate the first two repositories from the third, and use a
> > larger window size on the latter?
>
> Maybe we could look at the size reduction within the delta search loop.
> If the reduction quickly diminishes as tested objects are further away
> from the target one then the window doesn't have to be very large,
> whereas if the reduction remains more or less constant then it might be
> worth searching further. That could be used to dynamically size the
> window at run time.

I really like the idea of dynamically sizing the window based on what we
find. If it works. I don't think there's any reason you couldn't have 50
absolutely terrible delta candidates followed by one really amazing
delta candidate. But maybe in practice the window tends to get
progressively worse due to the heuristics, and outliers are unlikely. I
guess we'd have to experiment.

> > 3. Does the phpmyadmin case give us any insight into whether we can
> > improve our window sorting algorithm?
> [...]
>
> You could test this theory by commenting out the size comparisons in
> type_size_sort() and re-run the test.

I'll try this next.

-Peff

PS Here's my updated collection script, just for reference.

-- >8 --
#!/bin/sh
windows='10 20 128k 256k 512k 1m 2m 4m'
repo=$1; shift
test $# -gt 0 && windows="$*"

cd "$repo" || exit

for i in $windows; do
	case "$i" in
	*[kmg]) opts="--window=1000 --window-memory=$i" ;;
	*) opts="--window=$i" ;;
	esac
	echo >&2 "Repacking $repo with $opts..."
	time -f %U -o time.out \
	  git pack-objects --stdout --all-progress-implied --all \
	  --no-reuse-delta $opts </dev/null | wc -c >size.out
	echo "$i `cat size.out` `cat time.out`"
	rm -f size.out time.out
done

^ permalink raw reply [flat|nested] 26+ messages in thread
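For anyone re-running these measurements, the updated script above takes
the repository path plus an optional list of window or window-memory
values, roughly like this (the file name is arbitrary since the script
is pasted inline):

  sh collect.sh /path/to/repo.git 128k 1m 4m >results.out

Unlike the earlier collect script it does not copy the repository; it
streams a fresh pack from `git pack-objects --stdout` and measures it
with wc -c, and its output is in the same three-column format the chart
script from earlier in the thread expects.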
* Re: gc --aggressive
2012-05-01 16:28 ` Jeff King
@ 2012-05-01 17:16 ` Jeff King
2012-05-01 17:59 ` Nicolas Pitre
1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-05-01 17:16 UTC (permalink / raw)
To: Nicolas Pitre
Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, May 01, 2012 at 12:28:06PM -0400, Jeff King wrote:

> > You could test this theory by commenting out the size comparisons in
> > type_size_sort() and re-run the test.
>
> I'll try this next.

Wow, it behaves horribly. I didn't even let the bigger tests run to
completion. Here is the output for git.git (the first line is from the
original, unmodified version of git with --window=10):

  orig | 31.4M (100%) |  54s (100%)
    10 | 44.0M (140%) | 169s (310%)
    20 | 37.7M (120%) | 232s (428%)
    40 | 33.6M (107%) | 331s (608%)
    80 | 30.9M ( 99%) | 473s (868%)
   160 | 29.4M ( 94%) | 696s (1279%)

Unless the window is increased a lot, the packs end up quite a bit
larger (and even still we spend a lot more CPU time).

-Peff

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-05-01 17:16 ` Jeff King @ 2012-05-01 17:59 ` Nicolas Pitre 2012-05-01 18:47 ` Junio C Hamano 2012-05-01 19:35 ` Jeff King 0 siblings, 2 replies; 26+ messages in thread From: Nicolas Pitre @ 2012-05-01 17:59 UTC (permalink / raw) To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce On Tue, 1 May 2012, Jeff King wrote: > On Tue, May 01, 2012 at 12:28:06PM -0400, Jeff King wrote: > > > > You could test this theory by commenting out the size comparisons in > > > type_size_sort() and re-run the test. > > > > I'll try this next. > > Wow, it behaves horribly. I didn't even let the bigger tests run to > completion. Here is the output for git.git (the first line is from the > original, unmodified version of git with --window=10): > > orig | 31.4M (100%) | 54s (100%) > 10 | 44.0M (140%) | 169s (310%) > 20 | 37.7M (120%) | 232s (428%) > 40 | 33.6M (107%) | 331s (608%) > 80 | 30.9M ( 99%) | 473s (868%) > 160 | 29.4M ( 94%) | 696s (1279%) > > Unless the window is increased a lot, the packs end up quite a bit > larger (and even still we spend a lot more CPU time). Bleh. Allright. One final quick test if you feel like it: I've never been sure that the last comparison in type_size_sort() is correct. Maybe it should be the other way around. Currently it reads: return a < b ? -1 : (a > b); While keeping the size comparison commented out, you could try to replace this line with: return b < a ? -1 : (b > a); If this doesn't improve things then it would be clear that this avenue should be abandoned. Nicolas ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-05-01 17:59 ` Nicolas Pitre @ 2012-05-01 18:47 ` Junio C Hamano 2012-05-01 19:22 ` Nicolas Pitre 2012-05-01 20:01 ` Jeff King 2012-05-01 19:35 ` Jeff King 1 sibling, 2 replies; 26+ messages in thread From: Junio C Hamano @ 2012-05-01 18:47 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jeff King, git, Matthieu Moy, Jay Soffian, Shawn Pearce Nicolas Pitre <nico@fluxnic.net> writes: > One final quick test if you feel like it: I've never been sure that > the last comparison in type_size_sort() is correct. Maybe it should be > the other way around. Currently it reads: > > return a < b ? -1 : (a > b); > > While keeping the size comparison commented out, you could try to > replace this line with: > > return b < a ? -1 : (b > a); > > If this doesn't improve things then it would be clear that this avenue > should be abandoned. Very interesting. The difference between the two should only matter if there are many blobs with exactly the same size, and most of them delta horribly with each other. Does the problematic repository exhibit such a characteristic? The original tie-breaks based on the address (the earlier object we read in the original input comes earlier in the output) and yours make the objects later we read (which in turn are from older parts of the history) come early, but adjacency between two objects of the same type and the same size would not change (if A and B were next to each other in this order, your updated sorter will give B and then A still next to each other), so I suspect not much would change in the candidate selection. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-05-01 18:47 ` Junio C Hamano @ 2012-05-01 19:22 ` Nicolas Pitre 2012-05-01 20:01 ` Jeff King 1 sibling, 0 replies; 26+ messages in thread From: Nicolas Pitre @ 2012-05-01 19:22 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jeff King, git, Matthieu Moy, Jay Soffian, Shawn Pearce On Tue, 1 May 2012, Junio C Hamano wrote: > Nicolas Pitre <nico@fluxnic.net> writes: > > > One final quick test if you feel like it: I've never been sure that > > the last comparison in type_size_sort() is correct. Maybe it should be > > the other way around. Currently it reads: > > > > return a < b ? -1 : (a > b); > > > > While keeping the size comparison commented out, you could try to > > replace this line with: > > > > return b < a ? -1 : (b > a); > > > > If this doesn't improve things then it would be clear that this avenue > > should be abandoned. > > Very interesting. The difference between the two should only matter if > there are many blobs with exactly the same size, and most of them delta > horribly with each other. Does the problematic repository exhibit such > a characteristic? Not precisely. This is just to verify some hypothesis that could explain the difference in behavior with the phpmyadmin repo. My hypothesis was that recency order could be skewed by the object size when many small changes are made to the same files without varying their size much. So I suggested that a repack run be performed with the object size removed from the sort criteria. However it is important that the last comparison be done in the right direction. Hence my suggestion above. > The original tie-breaks based on the address (the earlier object we read > in the original input comes earlier in the output) and yours make the > objects later we read (which in turn are from older parts of the history) > come early, but adjacency between two objects of the same type and the > same size would not change (if A and B were next to each other in this > order, your updated sorter will give B and then A still next to each > other), so I suspect not much would change in the candidate selection. Note that the size comparison is commented out in those tests. The idea was to get pure recency order. Even for objects of the same size, the delta orientation would change which might or might not provide a clue. But this is really just a wild guess without much thinking at this point, before giving up on this approach. Nicolas ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-05-01 18:47 ` Junio C Hamano 2012-05-01 19:22 ` Nicolas Pitre @ 2012-05-01 20:01 ` Jeff King 1 sibling, 0 replies; 26+ messages in thread From: Jeff King @ 2012-05-01 20:01 UTC (permalink / raw) To: Junio C Hamano Cc: Nicolas Pitre, git, Matthieu Moy, Jay Soffian, Shawn Pearce On Tue, May 01, 2012 at 11:47:26AM -0700, Junio C Hamano wrote: > > While keeping the size comparison commented out, you could try to > > replace this line with: > > > > return b < a ? -1 : (b > a); > > > > If this doesn't improve things then it would be clear that this avenue > > should be abandoned. > > Very interesting. The difference between the two should only matter if > there are many blobs with exactly the same size, and most of them delta > horribly with each other. Does the problematic repository exhibit such > a characteristic? No. Here are the objects with the same sizes: $ git rev-list --objects --all | cut -d' ' -f1 | git cat-file --batch-check | cut -d' ' -f2,3 | sort | uniq -c | sort -rn | head 19722 tree 2222 14068 tree 4393 11418 tree 2156 9994 tree 4676 9479 tree 2189 7944 tree 2255 6454 commit 251 6437 tree 4611 5328 tree 4439 4586 commit 254 So it's mostly trees and commits (the first repeated blob size is on line 332 of the output). The commits aren't all that big even without deltafication, but the trees are. They should be sorted by name_hash, but within a single name, there are going to be a lot of repetitions (I think each of those size clusters is just a repetition of the same "po" directory getting lots of tiny modifications). So we are triggering that part of the sort quite a bit. But by your reasoning here: > The original tie-breaks based on the address (the earlier object we read > in the original input comes earlier in the output) and yours make the > objects later we read (which in turn are from older parts of the history) > come early, but adjacency between two objects of the same type and the > same size would not change (if A and B were next to each other in this > order, your updated sorter will give B and then A still next to each > other), so I suspect not much would change in the candidate selection. I don't think it makes a big difference (and indeed, switching it and repacking the phpmyadmin repository yields the same-size pack, although a lot more CPU time is spent). -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive
2012-05-01 17:59 ` Nicolas Pitre
2012-05-01 18:47 ` Junio C Hamano
@ 2012-05-01 19:35 ` Jeff King
2012-05-01 20:02 ` Nicolas Pitre
1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2012-05-01 19:35 UTC (permalink / raw)
To: Nicolas Pitre
Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce

On Tue, May 01, 2012 at 01:59:08PM -0400, Nicolas Pitre wrote:

> One final quick test if you feel like it: I've never been sure that
> the last comparison in type_size_sort() is correct. Maybe it should be
> the other way around. Currently it reads:
>
> 	return a < b ? -1 : (a > b);

I think it is right. At least it should put recent things near the
front of the array, just as we are putting bigger things there.

> > orig | 31.4M (100%) |  54s (100%)
> >   10 | 44.0M (140%) | 169s (310%)
> >   20 | 37.7M (120%) | 232s (428%)
> >   40 | 33.6M (107%) | 331s (608%)
> >   80 | 30.9M ( 99%) | 473s (868%)
> >  160 | 29.4M ( 94%) | 696s (1279%)
> [...]
> While keeping the size comparison commented out, you could try to
> replace this line with:
>
> 	return b < a ? -1 : (b > a);

No, it's not better. A few of the pack sizes are better, but some of
them are worse. And the CPU times are still quite bad. Here are the
numbers:

  orig | 31.4M (100%) |  54s (100%)
    10 | 45.6M (145%) | 158s (292%)
    20 | 39.2M (125%) | 205s (377%)
    40 | 35.1M (112%) | 275s (505%)
    80 | 32.4M (103%) | 388s (713%)
   160 | 30.6M ( 98%) | 581s (1067%)

-Peff

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-05-01 19:35 ` Jeff King @ 2012-05-01 20:02 ` Nicolas Pitre 0 siblings, 0 replies; 26+ messages in thread From: Nicolas Pitre @ 2012-05-01 20:02 UTC (permalink / raw) To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce On Tue, 1 May 2012, Jeff King wrote: > On Tue, May 01, 2012 at 01:59:08PM -0400, Nicolas Pitre wrote: > > > One final quick test if you feel like it: I've never been sure that > > the last comparison in type_size_sort() is correct. Maybe it should be > > the other way around. Currently it reads: > > > > return a < b ? -1 : (a > b); > > I think it is right. At least it should put recent things near the > front of the array, just as we are putting bigger things there. Right. In fact, it seems that _I_ did think about it *five* years ago (man... time flies by) given commit adcc70950e, and then I reversed the whole order in commit b904166ccb to get what we have today. > > replace this line with: > > > > return b < a ? -1 : (b > a); > > No, it's not better. A few of the pack sizes are better, but some of > them are worse. And the CPU times are still quite bad. OK. We can scratch that. Nicolas ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-05-01 16:28 ` Jeff King 2012-05-01 17:16 ` Jeff King @ 2012-05-01 17:17 ` Nicolas Pitre 2012-05-01 17:22 ` Jeff King 1 sibling, 1 reply; 26+ messages in thread From: Nicolas Pitre @ 2012-05-01 17:17 UTC (permalink / raw) To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce On Tue, 1 May 2012, Jeff King wrote: > On Sun, Apr 29, 2012 at 09:53:31AM -0400, Nicolas Pitre wrote: > > > But my remark was related to the fact that you need to double the > > affected resources to gain marginal improvements at some point. This is > > true about computing hardware too: eventually you need way more gates > > and spend much more $$$ to gain some performance, and the added > > performance is never linear with the spending. > > Right, I agree with that. The trick is just finding the right spot on > that curve for each repo to maximize the reward/effort ratio. Absolutely, at least for the default settings. However this is not what --aggressive is meant to be. > > > 1. Should we bump our default window size? The numbers above show that > > > typical repos would benefit from jumping to 20 or even 40. > > > > I think this might be a good indication that the number of objects is a > > bad metric to size the window, as I mentioned previously. > > > > Given that you have the test repos already, could you re-run it with > > --window=1000 and play with --window-memory instead? I would be curious > > to see if this provides more predictable results. > > It doesn't help. The git.git repo does well with about a 1m window > limit. linux-2.6 is somewhere between 1m and 2m. But the phpmyadmin repo > wants more like 16m. So it runs into the same issue as using object > counts. > > But it's much, much worse than that. Here are the actual numbers (same > format as before; left-hand column is either window size (if no unit) or > window-memory limit (if k/m unit), followed by resulting pack size, its > percentage of baseline --window=10 pack, the user CPU time and finally > its percentage of the baseline): > [...] Ouch! Well... so much for good theory. I'm still really surprised and disappointed as I didn't expect such damage at all. However, this is possibly a good baseline to determine a default value for window-memory though. Given your number, we clearly see that good packing can be achieved with relatively little memory and therefore it might be a good idea not to leave this parameter unbounded by default in order to catch potential pathological cases. Maybe 64M would be a good default value? Having a repack process eating up more than 16GB of RAM because its RAM usage is unbounded is certainly not nice. > > Maybe we could look at the size reduction within the delta search loop. > > If the reduction quickly diminishes as tested objects are further away > > from the target one then the window doesn't have to be very large, > > whereas if the reduction remains more or less constant then it might be > > worth searching further. That could be used to dynamically size the > > window at run time. > > I really like the idea of dynamically sizing the window based on what we > find. If it works. I don't think there's any reason you couldn't have 50 > absolutely terrible delta candidates followed by one really amazing > delta candidate. But maybe in practice the window tends to get > progressively worse due to the heuristics, and outliers are unlikely. I > guess we'd have to experiment. Yes. 
The idea is to continue searching if results are not progressively becoming worse fast enough. Coming up with a good way to infer that is far from obvious though. Nicolas ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-05-01 17:17 ` Nicolas Pitre @ 2012-05-01 17:22 ` Jeff King 2012-05-01 17:47 ` Nicolas Pitre 0 siblings, 1 reply; 26+ messages in thread From: Jeff King @ 2012-05-01 17:22 UTC (permalink / raw) To: Nicolas Pitre Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce On Tue, May 01, 2012 at 01:17:03PM -0400, Nicolas Pitre wrote: > However, this is possibly a good baseline to determine a default value > for window-memory though. Given your number, we clearly see that good > packing can be achieved with relatively little memory and therefore it > might be a good idea not to leave this parameter unbounded by default in > order to catch potential pathological cases. Maybe 64M would be a good > default value? Having a repack process eating up more than 16GB of RAM > because its RAM usage is unbounded is certainly not nice. Would that preclude a 65M object from being delta'd at all? If you are putting a limit of 64M and it means we look at 50 delta candidates instead of 60, then that is probably not going to hurt our size too much. But if you have large objects that do happen to delta, the difference between looking at 0 and 1 objects could have a big impact. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-05-01 17:22 ` Jeff King @ 2012-05-01 17:47 ` Nicolas Pitre 0 siblings, 0 replies; 26+ messages in thread From: Nicolas Pitre @ 2012-05-01 17:47 UTC (permalink / raw) To: Jeff King; +Cc: git, Matthieu Moy, Jay Soffian, Junio C Hamano, Shawn Pearce On Tue, 1 May 2012, Jeff King wrote: > On Tue, May 01, 2012 at 01:17:03PM -0400, Nicolas Pitre wrote: > > > However, this is possibly a good baseline to determine a default value > > for window-memory though. Given your number, we clearly see that good > > packing can be achieved with relatively little memory and therefore it > > might be a good idea not to leave this parameter unbounded by default in > > order to catch potential pathological cases. Maybe 64M would be a good > > default value? Having a repack process eating up more than 16GB of RAM > > because its RAM usage is unbounded is certainly not nice. > > Would that preclude a 65M object from being delta'd at all? If you are > putting a limit of 64M and it means we look at 50 delta candidates > instead of 60, then that is probably not going to hurt our size too > much. But if you have large objects that do happen to delta, the > difference between looking at 0 and 1 objects could have a big impact. If I remember correctly (long time ago since I wrote that code) there is always a minimum of one object in the search window. So with 32M objects you'd have only 2 candidates, with 33M objects and bigger there would be only one candidate. Obviously the window could still be filled with smaller objects if the total object size is less than 64M. Nicolas ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-04-17 17:53 ` Jay Soffian 2012-04-17 20:52 ` Matthieu Moy @ 2012-04-28 16:56 ` Nicolas Pitre 1 sibling, 0 replies; 26+ messages in thread From: Nicolas Pitre @ 2012-04-28 16:56 UTC (permalink / raw) To: Jay Soffian; +Cc: git, Junio C Hamano, Shawn Pearce, Jeff King On Tue, 17 Apr 2012, Jay Soffian wrote: > On Tue, Apr 17, 2012 at 12:16 PM, Jay Soffian <jaysoffian@gmail.com> wrote: > > This has worked fine on repos large and small. However, starting a > > couple days ago git started running out of memory on a relatively > > modest repo[*] while repacking on a Linux box with 12GB memory (+ 12GB > > swap). I am able to gc the repo by either removing --aggressive or > > .keep'ing the oldest pack. > > Experimentally, setting pack.windowMemory = 256m keeps git memory > usage < 4.5 GB during an aggressive repack. How many threads are used? As mentioned elsewhere, the memory usage parameter should probably be made global rather than per thread, especially with the ever growing number of CPU cores in a system. But this also pauses a balancing problem for optimally distributing memory between threads. > Ironically I end up with a slightly worse pack (63115590 bytes vs > 61518628 bytes) than not using --aggressive. I assume this is because > pack-objects found a better delta chain during the previous aggressive > repack when windowMemory was not set. Exact. When reusing delta data, you inherit the quality of the repack run that created them in the first place. > > 1) If --aggressive does not generally provide a benefit, should it be > > made a no-op? Absolutely not. It does provide benefits, but it comes with a cost in resources. If you don't pay that cost then results won't be there. > I guess I'll revise this question: perhaps --aggressive should be > better explained/discouraged. I found a message from Jeff last month > and stole his words for this patch: > > <snip> > diff --git i/Documentation/git-gc.txt w/Documentation/git-gc.txt > index 815afcb922..ca5bf8b51e 100644 > --- i/Documentation/git-gc.txt > +++ w/Documentation/git-gc.txt > @@ -37,9 +37,8 @@ OPTIONS > Usually 'git gc' runs very quickly while providing good disk > space utilization and performance. This option will cause > 'git gc' to more aggressively optimize the repository at the expense > - of taking much more time. The effects of this optimization are > - persistent, so this option only needs to be used occasionally; every > - few hundred changesets or so. > + of taking much more time and potentially using greater memory. This Scratch "potentially" here. It definitely uses more memory. > + option is rarely needed. See Repacking below. > > --auto:: > With this option, 'git gc' checks whether any housekeeping is > @@ -138,6 +137,39 @@ If you are expecting some objects to be collected > and they aren't, check > all of those locations and decide whether it makes sense in your case to > remove those references. > > +Repacking > +--------- > + > +Under the covers 'git gc' calls several commands to optimize the repository. > +The most significant of these with respect to repository size and general > +performance is linkgit:git-repack[1]. There are basically three levels of > +'gc' with respect to repacking: > + > + 1. `git gc --auto`; if there are too many loose objects (`gc.auto`), they > + all go into a new incremental pack. If there are already too many > + packs (`gc.autopacklimit`), all of the existing packs are re-packed > + together. 
> + > + Making an incremental pack is by far the fastest because the speed is > + independent of the existing repository history. If git packs > + everything together, it should be more or less the same as (2). > + > + 2. `git gc`; this packs everything into a single pack. It uses default > + window and depth parameters, but importantly, it reuses existing > + deltas. Doing so makes the delta compression phase much faster, and it > + often makes the writing phase faster (because for older objects, git > + is primarily streaming them right out of the existing pack). On a big > + repository though, this does do a lot of I/O, because git has to > + rewrite the whole pack. > + > + 3. `git gc --aggressive`; this is often much slower than (2) because git > + throws out all of the existing deltas and recomputes them from > + scratch. It uses a higher window parameter meaning it will spend > + more time computing, and it may end up with a smaller pack. However, > + unless the repository is known to have initially been poorly packed, > + this option is not needed and will just cause git to perform > + extra work. > + > HOOKS > ----- > > @@ -147,6 +179,7 @@ linkgit:githooks[5] for more information. > > SEE ALSO > -------- > +linkgit:git-pack-refs[1] > linkgit:git-prune[1] > linkgit:git-reflog[1] > linkgit:git-repack[1] > </snip> > > Thoughts? FWIW, Acked-by: Nicolas Pitre <nico@fluxnic.net> Nicolas ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: gc --aggressive 2012-04-17 16:16 gc --aggressive Jay Soffian 2012-04-17 17:53 ` Jay Soffian @ 2012-04-17 22:08 ` Jeff King 2012-04-17 22:17 ` Junio C Hamano 1 sibling, 1 reply; 26+ messages in thread From: Jeff King @ 2012-04-17 22:08 UTC (permalink / raw) To: Jay Soffian; +Cc: git, Junio C Hamano, Shawn Pearce On Tue, Apr 17, 2012 at 12:16:15PM -0400, Jay Soffian wrote: > For a couple years now I've had a maintenance script which repacks all > the repos at @dayjob thusly: > > git config repack.usedeltabaseoffset true > git config pack.compression 9 > git config pack.indexversion 2 > git config gc.autopacklimit 4 > git config gc.packrefs true > git config gc.reflogexpire never > git config gc.reflogexpireunreachable never > git gc --auto --aggressive --prune > > This has worked fine on repos large and small. However, starting a > couple days ago git started running out of memory on a relatively > modest repo[*] while repacking on a Linux box with 12GB memory (+ 12GB > swap). I am able to gc the repo by either removing --aggressive or > .keep'ing the oldest pack. I wonder where the memory is going. In theory, the memory consumption for packing comes from keeping all of the objects for a given window in memory (so we are looking for a delta for object X, and we have a window of Y[0]..Y[$window] objects that we will consider). And for a multi-threaded pack, that's per-thread. How many cores are there on this box? Have you tried setting pack.windowMemory to (12 / # of cores) or thereabouts? > 1) If --aggressive does not generally provide a benefit, should it be > made a no-op? In your case, I think it is overkill. But it seems lame that git _can't_ do a full repack on such a beefy machine. You don't want to do it all the time, but you might want to do it at least once. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
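Peff's back-of-the-envelope suggestion corresponds to something like the
following, where 12 is the machine's RAM in GB from this report, nproc
(coreutils) is assumed to be available, and the windowMemory limit
applies per delta-search thread in this version of git:

  cores=$(nproc)                      # or: getconf _NPROCESSORS_ONLN
  git config pack.windowMemory "$((12 * 1024 / cores))m"
  git gc --aggressive --prune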
* Re: gc --aggressive
  2012-04-17 22:08 ` Jeff King
@ 2012-04-17 22:17 ` Junio C Hamano
  2012-04-17 22:18 ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2012-04-17 22:17 UTC (permalink / raw)
To: Jeff King; +Cc: Jay Soffian, git, Shawn Pearce

Jeff King <peff@peff.net> writes:

> ...
> I wonder where the memory is going. In theory, the memory consumption
> for packing comes from keeping all of the objects for a given window in
> memory (so we are looking for a delta for object X, and we have a window
> of Y[0]..Y[$window] objects that we will consider). And for a
> multi-threaded pack, that's per-thread.
>
> How many cores are there on this box? Have you tried setting
> pack.windowMemory to (12 / # of cores) or thereabouts?

Hrm, from the end-user's point of view, it appears that pack.windowMemory
ought to mean the total without having to worry about the division of it
across threads (which the implementation should be responsible for).

> ... But it seems lame that git _can't_
> do a full repack on such a beefy machine. You don't want to do it all
> the time, but you might want to do it at least once.

True.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: gc --aggressive
  2012-04-17 22:17 ` Junio C Hamano
@ 2012-04-17 22:18 ` Jeff King
  2012-04-17 22:34 ` Junio C Hamano
  2012-04-18 8:49 ` Andreas Ericsson
  0 siblings, 2 replies; 26+ messages in thread
From: Jeff King @ 2012-04-17 22:18 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jay Soffian, git, Shawn Pearce

On Tue, Apr 17, 2012 at 03:17:28PM -0700, Junio C Hamano wrote:

> > How many cores are there on this box? Have you tried setting
> > pack.windowMemory to (12 / # of cores) or thereabouts?
>
> Hrm, from the end-user's point of view, it appears that pack.windowMemory
> ought to mean the total without having to worry about the division of it
> across threads (which the implementation should be responsible for).

Agreed. I had to look in the code to check which it meant. I'm not sure
we can change it without regressing existing users, though.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: gc --aggressive
  2012-04-17 22:18 ` Jeff King
@ 2012-04-17 22:34 ` Junio C Hamano
  2012-04-28 16:42 ` Nicolas Pitre
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2012-04-17 22:34 UTC (permalink / raw)
To: Jeff King; +Cc: Jay Soffian, git, Shawn Pearce

Jeff King <peff@peff.net> writes:

> On Tue, Apr 17, 2012 at 03:17:28PM -0700, Junio C Hamano wrote:
>
>> > How many cores are there on this box? Have you tried setting
>> > pack.windowMemory to (12 / # of cores) or thereabouts?
>>
>> Hrm, from the end-user's point of view, it appears that pack.windowMemory
>> ought to mean the total without having to worry about the division of it
>> across threads (which the implementation should be responsible for).
>
> Agreed. I had to look in the code to check which it meant. I'm not sure
> we can change it without regressing existing users, though.

This is a tangent, but I noticed that the canned settings for "aggressive"
use an arbitrarily hardcoded value of depth=250 and window=250 (tweakable
with gc.aggressiveWindow).

Even though a shallower depth does cause base candidates with too long a
chain hanging off them to be evicted prematurely while they are still in
the window, which leads to smaller memory consumption, I do not think the
value of "depth" affects the pack-time memory consumption too much. But
the runtime performance of the resulting pack may not be great (in the
worst case you would have to undelta 249 times to get to the object
data). We may want to loosen it a bit.

Also it might make sense to make the window size a bit more flexible
depending on the nature of your history (you would get a bigger benefit
from a larger window when your history has fine-grained commits; if there
are not many few-liner commits, a larger window may not help you that
much).

^ permalink raw reply	[flat|nested] 26+ messages in thread
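For reference, the canned values Junio mentions can be tuned without
patching git: gc.aggressiveWindow overrides the window used by gc
--aggressive, and calling git repack directly allows choosing both window
and depth. The numbers below are arbitrary examples, not values suggested
anywhere in this thread:

    # smaller window for gc --aggressive (the canned value was 250 at the time)
    git config gc.aggressiveWindow 50
    git gc --aggressive --prune

    # or drive the underlying repack directly to control both knobs
    git repack -a -d -f --window=50 --depth=50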
* Re: gc --aggressive
  2012-04-17 22:34 ` Junio C Hamano
@ 2012-04-28 16:42 ` Nicolas Pitre
  0 siblings, 0 replies; 26+ messages in thread
From: Nicolas Pitre @ 2012-04-28 16:42 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jeff King, Jay Soffian, git, Shawn Pearce

[ coming late to this thread -- thanks to peff for drawing my attention to it ]

On Tue, 17 Apr 2012, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> > On Tue, Apr 17, 2012 at 03:17:28PM -0700, Junio C Hamano wrote:
> >
> >> > How many cores are there on this box? Have you tried setting
> >> > pack.windowMemory to (12 / # of cores) or thereabouts?
> >>
> >> Hrm, from the end-user's point of view, it appears that pack.windowMemory
> >> ought to mean the total without having to worry about the division of it
> >> across threads (which the implementation should be responsible for).
> >
> > Agreed. I had to look in the code to check which it meant. I'm not sure
> > we can change it without regressing existing users, though.
>
> This is a tangent, but I noticed that the canned settings for "aggressive"
> use an arbitrarily hardcoded value of depth=250 and window=250 (tweakable
> with gc.aggressiveWindow).
>
> Even though a shallower depth does cause base candidates with too long a
> chain hanging off them to be evicted prematurely while they are still in
> the window, which leads to smaller memory consumption, I do not think the
> value of "depth" affects the pack-time memory consumption too much. But
> the runtime performance of the resulting pack may not be great (in the
> worst case you would have to undelta 249 times to get to the object
> data). We may want to loosen it a bit.

I think people have misconceptions about the definition of the word
"aggressive". This option is, well, aggressive. By definition this is not
meant to be "nice". This is not meant to be fast, or light on memory
usage, etc. This means "do as much damage as you can" in order to reduce
the pack size. If people are using it every night then they must be
masochists, or attracted by violence, or getting a bit too casual with
word definitions.

So if being --aggressive hurts, then don't do it. If people want a
loosened version, it would be more appropriate to introduce a --mild, or
--bold, or --disruptive option. In the same vein, an --insane option
could even be introduced to go even further than --aggressive.

That said, this is no excuse for regressions. If git is eating up much
more memory than it used to, given the same repository and repacking
parameters as before, then this certainly needs fixing. But making
--aggressive less so is not a fix.

> Also it might make sense to make the window size a bit more flexible
> depending on the nature of your history (you would get a bigger benefit
> from a larger window when your history has fine-grained commits; if there
> are not many few-liner commits, a larger window may not help you that
> much).

How do you detect the nature of a repository's history? That's the hard
part, because it would have to be auto-detected; most users won't make a
good guess at the best parameter value to use.

Anyway, I think that the window size in terms of objects is a bad
parameter. Historically that is the first thing we implemented. But the
window _memory_ usage is probably a better setting to use. The delta
search cost is directly proportional to the amount of data to process,
and that can be controlled with --window-memory, with the ability to
scale the number of objects in the window up and down.
Keeping the number of objects constant makes memory usage essentially
unpredictable, since it depends on the repository content, and the
computing cost to process it is equally hard to predict. This is very
counter-intuitive for users.

Right now the window is limited by default to 10 objects, and window
memory usage is unlimited. This could be reworked so that the object
count, while still limited to avoid pathological cases, could be much
higher, with window memory usage always limited by default. That default
memory usage could be scaled according to the available resources on the
system. But if the user wants to play with this, a memory usage parameter
is much easier to understand, with a more directly observable influence
on system load.

Nicolas

^ permalink raw reply	[flat|nested] 26+ messages in thread
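The memory-based limit Nicolas argues for is already exposed as an option
on the packing side; a minimal sketch, reusing the 256m figure reported
earlier in the thread purely as an example budget (it is applied per
thread):

    # Bound the delta window by memory rather than by object count
    git repack -a -d -f --window=250 --window-memory=256m

    # or persist the limit so every future gc picks it up
    git config pack.windowMemory 256m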
* Re: gc --aggressive
  2012-04-17 22:18 ` Jeff King
  2012-04-17 22:34 ` Junio C Hamano
@ 2012-04-18 8:49 ` Andreas Ericsson
  1 sibling, 0 replies; 26+ messages in thread
From: Andreas Ericsson @ 2012-04-18 8:49 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, Jay Soffian, git, Shawn Pearce

On 04/18/2012 12:18 AM, Jeff King wrote:
> On Tue, Apr 17, 2012 at 03:17:28PM -0700, Junio C Hamano wrote:
>
>>> How many cores are there on this box? Have you tried setting
>>> pack.windowMemory to (12 / # of cores) or thereabouts?
>>
>> Hrm, from the end-user's point of view, it appears that pack.windowMemory
>> ought to mean the total without having to worry about the division of it
>> across threads (which the implementation should be responsible for).
>
> Agreed. I had to look in the code to check which it meant. I'm not sure
> we can change it without regressing existing users, though.
>

Introduce a new one. core.maxmemoryusage = <something>, which acts as
an upper bound in all codepaths where we actually keep track.

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads: [~2012-05-01 20:03 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-17 16:16 gc --aggressive Jay Soffian
2012-04-17 17:53 ` Jay Soffian
2012-04-17 20:52 ` Matthieu Moy
2012-04-17 21:58 ` Jeff King
2012-04-28 12:25 ` Jeff King
2012-04-28 17:11 ` Nicolas Pitre
2012-04-29 11:34 ` Jeff King
2012-04-29 13:53 ` Nicolas Pitre
2012-05-01 16:28 ` Jeff King
2012-05-01 17:16 ` Jeff King
2012-05-01 17:59 ` Nicolas Pitre
2012-05-01 18:47 ` Junio C Hamano
2012-05-01 19:22 ` Nicolas Pitre
2012-05-01 20:01 ` Jeff King
2012-05-01 19:35 ` Jeff King
2012-05-01 20:02 ` Nicolas Pitre
2012-05-01 17:17 ` Nicolas Pitre
2012-05-01 17:22 ` Jeff King
2012-05-01 17:47 ` Nicolas Pitre
2012-04-28 16:56 ` Nicolas Pitre
2012-04-17 22:08 ` Jeff King
2012-04-17 22:17 ` Junio C Hamano
2012-04-17 22:18 ` Jeff King
2012-04-17 22:34 ` Junio C Hamano
2012-04-28 16:42 ` Nicolas Pitre
2012-04-18 8:49 ` Andreas Ericsson