* Performance of "git gc..." is extremely bad in some cases
@ 2021-03-08 21:15 Anthony Muller
  2021-03-08 22:29 ` Bryan Turner
  0 siblings, 1 reply; 5+ messages in thread
From: Anthony Muller @ 2021-03-08 21:15 UTC
  To: git

What did you do before the bug happened? (Steps to reproduce your issue)

git clone https://github.com/notracking/hosts-blocklists
cd hosts-blocklists
git reflog expire --all --expire=now && git gc --prune=now --aggressive


What did you expect to happen? (Expected behavior)

Running gc on a ~300 MB repo should not take 1 hour 55 minutes when
running gc on a 2.6 GB repo (LLVM) only takes 24 minutes.


What happened instead? (Actual behavior)

Command took 1h 55m to complete on a ~300MB repo and used enough
resources that the machine is almost unusable.


What's different between what you expected and what actually happened?

Compression stage uses the majority of the resources and time. Compression
itself, when compared to something like zlib or lzma, should not take very long.
While more may be happening as objects are compressed, the amount of time
gc takes to compress the objects and the resources it consumed are both
unreasonable.

Memory: RSS = 3451152 KB (3.29 GB), VSZ = 29286272 KB (27.92 GB)
Time: 12902.83s user 8995.41s system 315% cpu 1:55:36.73 total

I've seen this issue with a number of repos and size of the repo does not
determine if this happens. LLVM @ 2.6 GB worked flawlessly, a 900 MB
repo never finished, this 300 MB repo takes forever, and if you test something
like chromium git will just crash.


[System Info]
hardware: 2.9Ghz Quad Core i7
git version:
git version 2.30.0
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64
compiler info: clang: 12.0.0 (clang-1200.0.32.28)
libc info: no libc information available
$SHELL (typically, interactive shell): /usr/local/bin/zsh
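The Memory and Time figures in the report above can be cross-checked with a little arithmetic. The sketch below only reuses the numbers quoted in the report; the variable names are illustrative, and the CPU percentage is truncated rather than rounded on the assumption that this is how the reporter's shell printed it.

```shell
# Figures copied from the report's Time: and Memory: lines.
user=12902.83
sys=8995.41

# Wall-clock time 1:55:36.73 converted to seconds.
real=$(awk 'BEGIN { print 1*3600 + 55*60 + 36.73 }')

# CPU utilisation = (user + system) / wall-clock, truncated to a whole percent.
cpu_pct=$(awk -v u="$user" -v s="$sys" -v r="$real" \
  'BEGIN { printf "%d", (u + s) / r * 100 }')

# Resident set size: KB -> GB using 1024-based units.
rss_gb=$(awk 'BEGIN { printf "%.2f", 3451152 / 1024 / 1024 }')

echo "wall-clock: ${real}s, CPU: ${cpu_pct}%, RSS: ${rss_gb} GB"
```

A figure above 300% means that, averaged over the run, a little more than three cores' worth of CPU time was in use.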
* Re: Performance of "git gc..." is extremely bad in some cases
  2021-03-08 21:15 Performance of "git gc..." is extremely bad in some cases Anthony Muller
@ 2021-03-08 22:29 ` Bryan Turner
       [not found]   ` <178140c3b3b.c7a29306868075.2037370475662478386@monospace.sh>
  2021-03-08 23:56   ` brian m. carlson
  0 siblings, 2 replies; 5+ messages in thread
From: Bryan Turner @ 2021-03-08 22:29 UTC
  To: Anthony Muller; +Cc: git

On Mon, Mar 8, 2021 at 1:32 PM Anthony Muller <anthony@monospace.sh> wrote:
>
> What did you do before the bug happened? (Steps to reproduce your issue)
>
> git clone https://github.com/notracking/hosts-blocklists
> cd hosts-blocklists
> git reflog expire --all --expire=now && git gc --prune=now --aggressive

--aggressive tells git gc to discard all of its existing delta chains
and go find new ones, and to be fairly aggressive in how it looks for
candidates. This is going to be the primary source of the resource
usage you see, as well as the time.

Aggressive GCs are something you do once in a (very great) while. If
you try this without the --aggressive, how does it look?

>
>
> What did you expect to happen? (Expected behavior)
>
> Running gc on a ~300 MB repo should not take 1 hour 55 minutes when
> running gc on a 2.6 GB repo (LLVM) only takes 24 minutes.
>
>
> What happened instead? (Actual behavior)
>
> Command took 1h 55m to complete on a ~300MB repo and used enough
> resources that the machine is almost unusable.
>
>
> What's different between what you expected and what actually happened?
>
> Compression stage uses the majority of the resources and time. Compression
> itself, when compared to something like zlib or lzma, should not take very long.
> While more may be happening as objects are compressed, the amount of time
> gc takes to compress the objects and the resources it consumed are both
> unreasonable.

The compression happening here is delta compression, not simple
compression like zip. Git searches across the repository for similar
objects and stores them as chains with a base object and (essentially)
instructions for converting that base object into another object.
That's significantly more resource-intensive work than zipping some
data.

>
> Memory: RSS = 3451152 KB (3.29 GB), VSZ = 29286272 KB (27.92 GB)
> Time: 12902.83s user 8995.41s system 315% cpu 1:55:36.73 total

Git offers several knobs that can be used to influence (though not
necessarily control) its resource usage. On 64-bit Linux the defaults
are 1 thread per logical CPU (so hyperthreaded CPUs use double) and
_unlimited_ memory usage per thread. You might want to investigate
some options like pack.threads and pack.windowmemory to apply some
constraints.

>
> I've seen this issue with a number of repos and size of the repo does not
> determine if this happens. LLVM @ 2.6 GB worked flawlessly, a 900 MB
> repo never finished, this 300 MB repo takes forever, and if you test something
> like chromium git will just crash.
>
>
> [System Info]
> hardware: 2.9Ghz Quad Core i7
> git version:
> git version 2.30.0
> cpu: x86_64
> no commit associated with this build
> sizeof-long: 8
> sizeof-size_t: 8
> shell-path: /bin/sh
> uname: Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64
> compiler info: clang: 12.0.0 (clang-1200.0.32.28)
> libc info: no libc information available
> $SHELL (typically, interactive shell): /usr/local/bin/zsh
>

Hope this helps!
-b
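As a minimal sketch of the pack.threads and pack.windowmemory knobs mentioned in the reply above, the following sets both in the local configuration of a throwaway repository and reads them back. The values (2 threads, 512m) are assumptions chosen purely for illustration; the thread does not recommend specific numbers at this point, and suitable limits depend on the machine.

```shell
# Scratch repository so the sketch is safe to run anywhere.
repo=$(mktemp -d)
git init -q "$repo"

# Illustrative limits (assumed values, not taken from the thread).
git -C "$repo" config pack.threads 2          # cap the number of packing threads
git -C "$repo" config pack.windowMemory 512m  # cap delta-window memory per thread

# Read the settings back to confirm they are stored in the repository config.
threads=$(git -C "$repo" config --get pack.threads)
window=$(git -C "$repo" config --get pack.windowMemory)
echo "pack.threads=$threads pack.windowMemory=$window"
```

In a real clone the same two `git config` calls would be run inside the repository before `git gc`; adding `--global` would apply them to every repository for the current user.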
[parent not found: <178140c3b3b.c7a29306868075.2037370475662478386@monospace.sh>]
* Re: Performance of "git gc..." is extremely bad in some cases
       [not found]   ` <178140c3b3b.c7a29306868075.2037370475662478386@monospace.sh>
@ 2021-03-08 23:55     ` Bryan Turner
  0 siblings, 0 replies; 5+ messages in thread
From: Bryan Turner @ 2021-03-08 23:55 UTC
  To: Anthony Muller, Git Users

Re-adding the list.

On Mon, Mar 8, 2021 at 2:54 PM Anthony Muller <anthony@monospace.sh> wrote:
>
> ---- On Mon, 08 Mar 2021 22:29:16 +0000 Bryan Turner <bturner@atlassian.com> wrote ----
> > On Mon, Mar 8, 2021 at 1:32 PM Anthony Muller <anthony@monospace.sh> wrote:
> > >
> > > What did you do before the bug happened? (Steps to reproduce your issue)
> > >
> > > git clone https://github.com/notracking/hosts-blocklists
> > > cd hosts-blocklists
> > > git reflog expire --all --expire=now && git gc --prune=now --aggressive
> >
> > --aggressive tells git gc to discard all of its existing delta chains
> > and go find new ones, and to be fairly aggressive in how it looks for
> > candidates. This is going to be the primary source of the resource
> > usage you see, as well as the time.
> >
> > Aggressive GCs are something you do once in a (very great) while. If
> > you try this without the --aggressive, how does it look?
>
> Hi Bryan,
>
> Without --aggressive it's fine and I do expect it to take longer using aggressive.
>
> I find it very odd that a repo ~8x in size and with probably 400x as many objects took 1/4 the time though. I would think size and object count would play a role in time and resources.

Looking at that blocklists repository, it looks like it's not many
files or commits, but the files are pretty large (10-25MB). For delta
compression, large files can cause a lot of pain.

Setting core.bigFileThreshold=5m (a reduction from the 512m default)
and pack.windowmemory=1g "fixes" the "problem" for me locally, at
least (which is to say it changes the behavior). The GC runs in just
under 5 minutes:

$ /usr/bin/time -l git gc --prune=now --aggressive
Enumerating objects: 10777, done.
Counting objects: 100% (10777/10777), done.
Delta compression using up to 20 threads
Compressing objects: 100% (8672/8672), done.
Writing objects: 100% (10777/10777), done.
Reusing bitmaps: 101, done.
Selecting bitmap commits: 2146, done.
Building bitmaps: 100% (126/126), done.
Total 10777 (delta 3986), reused 6784 (delta 0)
      298.00 real       996.76 user        18.84 sys
  9284980736  maximum resident set size
           0  average shared memory size
           0  average unshared data size
           0  average unshared stack size
     2861811  page reclaims
           1  page faults
           0  swaps
           0  block input operations
           0  block output operations
           0  messages sent
           0  messages received
         296  signals received
         172  voluntary context switches
      171245  involuntary context switches
    20586171  instructions retired
    28100595  cycles elapsed
      880640  peak memory footprint

Of course, that also takes the size of the repository from 367MB to
2.3GB--not exactly your desired outcome if you're trying to save
space.

From there I tried just reducing the threads from 20 to 8 and using
the 1g window memory limit, but leaving the bigFileThreshold at
default. That allows for delta compressing everything, and for me
completes in just under 12 minutes:

$ /usr/bin/time -l git gc --prune=now --aggressive
Enumerating objects: 10777, done.
Counting objects: 100% (10777/10777), done.
Delta compression using up to 8 threads
Compressing objects: 100% (10077/10077), done.
Writing objects: 100% (10777/10777), done.
Reusing bitmaps: 101, done.
Selecting bitmap commits: 2146, done.
Building bitmaps: 100% (126/126), done.
Total 10777 (delta 5387), reused 5383 (delta 0)
      713.98 real      3053.41 user        31.91 sys
 13408837632  maximum resident set size
           0  average shared memory size
           0  average unshared data size
           0  average unshared stack size
     3804319  page reclaims
           1  page faults
           0  swaps
           0  block input operations
           0  block output operations
           0  messages sent
           0  messages received
         712  signals received
          57  voluntary context switches
     1011681  involuntary context switches
    20568579  instructions retired
    31734809  cycles elapsed
      872448  peak memory footprint

That also reduced the repository from 367MB to 320MB. (Technically
from 2.3GB to 320MB, since I ran this after the earlier attempt.)

Of course, there's a machine difference to consider here as well. I'm
guessing you're on a MacBook Pro, based on the specs part of the bug
report. My testing here is on a 10 core iMac Pro with 64GB of RAM, so
some of the difference may just be that I'm on a less constrained
system.

>
> What factors would make that happen? Is it a combination of more commits with fewer objects?

Big files are the biggest issue, in my experience. The total number of
objects (it's not really about object type too much, as far as I can
tell) certainly has an impact, but having big files (where "big" here
is anything larger than a normal source code file, which is typically
well under 1MB) is likely to balloon both time and resource
consumption.

>
> I've been using aggressive after cloning repos I use primarily for reference/offline/etc to recover a lot of wasted space.

To some extent I'm not sure there's an easy answer for this. It may
come down to looking at the repositories before you do a local GC to
see what "shape" they have (starting size on disk, in-repository file
sizes, etc.) and deciding from there whether the savings is likely to
be worth the time investment.

>
> >
> > >
> > >
> > > What did you expect to happen? (Expected behavior)
> > >
> > > Running gc on a ~300 MB repo should not take 1 hour 55 minutes when
> > > running gc on a 2.6 GB repo (LLVM) only takes 24 minutes.
> > >
> > >
> > > What happened instead? (Actual behavior)
> > >
> > > Command took 1h 55m to complete on a ~300MB repo and used enough
> > > resources that the machine is almost unusable.
> > >
> > >
> > > What's different between what you expected and what actually happened?
> > >
> > > Compression stage uses the majority of the resources and time. Compression
> > > itself, when compared to something like zlib or lzma, should not take very long.
> > > While more may be happening as objects are compressed, the amount of time
> > > gc takes to compress the objects and the resources it consumed are both
> > > unreasonable.
> >
> > The compression happening here is delta compression, not simple
> > compression like zip. Git searches across the repository for similar
> > objects and stores them as chains with a base object and (essentially)
> > instructions for converting that base object into another object.
> > That's significantly more resource-intensive work than zipping some
> > data.
> >
> > >
> > > Memory: RSS = 3451152 KB (3.29 GB), VSZ = 29286272 KB (27.92 GB)
> > > Time: 12902.83s user 8995.41s system 315% cpu 1:55:36.73 total
> >
> > Git offers several knobs that can be used to influence (though not
> > necessarily control) its resource usage. On 64-bit Linux the defaults
> > are 1 thread per logical CPU (so hyperthreaded CPUs use double) and
> > _unlimited_ memory usage per thread. You might want to investigate
> > some options like pack.threads and pack.windowmemory to apply some
> > constraints.
> >
> > >
> > > I've seen this issue with a number of repos and size of the repo does not
> > > determine if this happens. LLVM @ 2.6 GB worked flawlessly, a 900 MB
> > > repo never finished, this 300 MB repo takes forever, and if you test something
> > > like chromium git will just crash.

I should add that for something like Chromium, and potentially
whatever 900MB repository you tested with, you're very likely to need
to do some explicit configuration for things like threads/window
memory unless you're on a _very_ beefy machine. The default unlimited
behavior is very likely to run afoul of the OOM killer (or something
similar).

> > >
> > >
> > > [System Info]
> > > hardware: 2.9Ghz Quad Core i7
> > > git version:
> > > git version 2.30.0
> > > cpu: x86_64
> > > no commit associated with this build
> > > sizeof-long: 8
> > > sizeof-size_t: 8
> > > shell-path: /bin/sh
> > > uname: Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64
> > > compiler info: clang: 12.0.0 (clang-1200.0.32.28)
> > > libc info: no libc information available
> > > $SHELL (typically, interactive shell): /usr/local/bin/zsh
> > >
> >

Hope this helps!
-b
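The two experiments described in the message above can be sketched as one-shot invocations using `git -c`, which applies a setting to a single command without writing it to any config file. The message does not say how the settings were actually applied, so using `git -c` is an assumption, as are the scratch repository, the file name and the commit identity, which exist only so the example runs anywhere in seconds.

```shell
# Scratch repository with a single commit (illustrative content only).
repo=$(mktemp -d)
git init -q "$repo"
printf '0.0.0.0 ads.example\n' > "$repo/hosts.txt"
git -C "$repo" add hosts.txt
git -C "$repo" -c user.name=Example -c user.email=example@example.invalid \
  commit -q -m "initial"

# Experiment 1: skip delta compression for files over 5m, cap window memory at 1g.
git -C "$repo" -c core.bigFileThreshold=5m -c pack.windowMemory=1g \
  gc --quiet --prune=now --aggressive
first_status=$?

# Experiment 2: default bigFileThreshold, 8 threads, window memory capped at 1g.
git -C "$repo" -c pack.threads=8 -c pack.windowMemory=1g \
  gc --quiet --prune=now --aggressive
second_status=$?

echo "experiment 1 exit: $first_status, experiment 2 exit: $second_status"
```

As the message notes, the first variant trades disk space for speed, because files above the threshold are stored without deltas; the second keeps full delta compression and limits only threads and memory.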
* Re: Performance of "git gc..." is extremely bad in some cases
  2021-03-08 22:29 ` Bryan Turner
       [not found]   ` <178140c3b3b.c7a29306868075.2037370475662478386@monospace.sh>
@ 2021-03-08 23:56   ` brian m. carlson
  2021-03-09  0:14     ` Anthony Muller
  1 sibling, 1 reply; 5+ messages in thread
From: brian m. carlson @ 2021-03-08 23:56 UTC
  To: Bryan Turner; +Cc: Anthony Muller, git

On 2021-03-08 at 22:29:16, Bryan Turner wrote:
> On Mon, Mar 8, 2021 at 1:32 PM Anthony Muller <anthony@monospace.sh> wrote:
> >
> > What did you do before the bug happened? (Steps to reproduce your issue)
> >
> > git clone https://github.com/notracking/hosts-blocklists
> > cd hosts-blocklists
> > git reflog expire --all --expire=now && git gc --prune=now --aggressive
>
> --aggressive tells git gc to discard all of its existing delta chains
> and go find new ones, and to be fairly aggressive in how it looks for
> candidates. This is going to be the primary source of the resource
> usage you see, as well as the time.
>
> Aggressive GCs are something you do once in a (very great) while. If
> you try this without the --aggressive, how does it look?

I should point out that this repository is also rather pathologically
structured. Almost every commit is an automatic commit updating the
same five files which are text files ranging from 5 MB to 11 MB.

When you use --aggressive, as Bryan pointed out, you're asking to throw
away all the deltas and try really hard to compute all of them fresh.
That's going to use a lot of memory because you're loading many large
text files into memory. It's also going to use a lot of CPU because
these files do indeed delta extremely well, and computing deltas on
larger files is more expensive, especially when there are many of
them.

And that's just the blobs. The trees and commits are also going to be
nearly identically structured and will also delta well with virtually
every other similar object of their type. Normally Git sorts by size
which helps pick better candidates, but since these are all going to be
identically sized, the performance is going to suffer.

Now, I have the advantage in this case of being a person who's sometimes
on call for the maintenance of Git repositories, and in that capacity
it is obvious to me that this repository is pathologically structured.
But, yeah, I would definitely not run --aggressive on this repo unless
I needed to and I would not expect it to perform well.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US
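One way to spot the kind of structure described above (a few large files rewritten in many commits) is to list the largest blobs in a repository's history. The sketch below is not from the thread: it builds a small throwaway repository, with assumed file names and contents, and then lists blobs by size using `git rev-list --objects` piped into `git cat-file --batch-check`.

```shell
# Throwaway repository: one small file and one larger file changed twice.
repo=$(mktemp -d)
git init -q "$repo"
commit() {
  git -C "$repo" -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "$1"
}

printf 'readme\n' > "$repo/README"
awk 'BEGIN { for (i = 0; i < 2000; i++) print "0.0.0.0 host" i ".example" }' \
  > "$repo/hosts.txt"
git -C "$repo" add README hosts.txt
commit "initial"

printf '0.0.0.0 extra.example\n' >> "$repo/hosts.txt"
git -C "$repo" add hosts.txt
commit "automatic update"

# List every blob reachable from any ref as "type size path", largest first.
report=$(git -C "$repo" rev-list --objects --all |
  git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob"' | sort -k2,2nr | head -n 5)
echo "$report"

largest=$(printf '%s\n' "$report" | head -n 1)
```

When the same path appears many times near the top of such a listing with similar sizes, the repository has the shape described in the message. Run against a real clone, only the final pipeline is needed.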
* Re: Performance of "git gc..." is extremely bad in some cases
  2021-03-08 23:56   ` brian m. carlson
@ 2021-03-09  0:14     ` Anthony Muller
  0 siblings, 0 replies; 5+ messages in thread
From: Anthony Muller @ 2021-03-09 0:14 UTC
  To: brian m. carlson; +Cc: Bryan Turner, git

Thank you Brian and Bryan. You both clarified what was happening and now I know what to look for.

I can use a shallow clone for most repos, but there are some I want to keep history for.

I don't need a full copy of this repo, but it was a good repo to show the issue I was facing.

Thanks again!

---- On Mon, 08 Mar 2021 23:56:53 +0000 brian m. carlson <sandals@crustytoothpaste.net> wrote ----
> On 2021-03-08 at 22:29:16, Bryan Turner wrote:
> > On Mon, Mar 8, 2021 at 1:32 PM Anthony Muller <anthony@monospace.sh> wrote:
> > >
> > > What did you do before the bug happened? (Steps to reproduce your issue)
> > >
> > > git clone https://github.com/notracking/hosts-blocklists
> > > cd hosts-blocklists
> > > git reflog expire --all --expire=now && git gc --prune=now --aggressive
> >
> > --aggressive tells git gc to discard all of its existing delta chains
> > and go find new ones, and to be fairly aggressive in how it looks for
> > candidates. This is going to be the primary source of the resource
> > usage you see, as well as the time.
> >
> > Aggressive GCs are something you do once in a (very great) while. If
> > you try this without the --aggressive, how does it look?
>
> I should point out that this repository is also rather pathologically
> structured. Almost every commit is an automatic commit updating the
> same five files which are text files ranging from 5 MB to 11 MB.
>
> When you use --aggressive, as Bryan pointed out, you're asking to throw
> away all the deltas and try really hard to compute all of them fresh.
> That's going to use a lot of memory because you're loading many large
> text files into memory. It's also going to use a lot of CPU because
> these files do indeed delta extremely well, and computing deltas on
> larger files is more expensive, especially when there are many of
> them.
>
> And that's just the blobs. The trees and commits are also going to be
> nearly identically structured and will also delta well with virtually
> every other similar object of their type. Normally Git sorts by size
> which helps pick better candidates, but since these are all going to be
> identically sized, the performance is going to suffer.
>
> Now, I have the advantage in this case of being a person who's sometimes
> on call for the maintenance of Git repositories, and in that capacity
> it is obvious to me that this repository is pathologically structured.
> But, yeah, I would definitely not run --aggressive on this repo unless
> I needed to and I would not expect it to perform well.
> --
> brian m. carlson (he/him or they/them)
> Houston, Texas, US
>
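The shallow clone mentioned in the final reply can be sketched with `git clone --depth 1`, which fetches only the most recent commit. To stay runnable without network access, the example clones a throwaway local repository through a `file://` URL; the repository, file name and commit identity are assumptions for illustration. For a real remote, the URL would simply replace the `file://` path.

```shell
# Source repository with three commits of history (illustrative content).
src=$(mktemp -d)
git init -q "$src"
for n in 1 2 3; do
  echo "revision $n" > "$src/file.txt"
  git -C "$src" add file.txt
  git -C "$src" -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "revision $n"
done

# Shallow clone: only the newest commit is transferred.
dst=$(mktemp -d)/shallow
git clone -q --depth 1 "file://$src" "$dst"

full=$(git -C "$src" rev-list --count HEAD)
shallow=$(git -C "$dst" rev-list --count HEAD)
echo "full history: $full commits, shallow clone: $shallow commit(s)"
```

Because a shallow clone carries no older versions of each file, there is far less for delta compression to do, which is why it sidesteps the problem for reference-only copies; repositories whose history matters still need a full clone.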
Thread overview: 5+ messages
2021-03-08 21:15 Performance of "git gc..." is extremely bad in some cases Anthony Muller
2021-03-08 22:29 ` Bryan Turner
[not found] ` <178140c3b3b.c7a29306868075.2037370475662478386@monospace.sh>
2021-03-08 23:55 ` Bryan Turner
2021-03-08 23:56 ` brian m. carlson
2021-03-09 0:14 ` Anthony Muller