* Re: [RFH] Why do osx CI jobs so unreliable?
@ 2026-06-20 15:33 Michael Montalbo
2026-06-21 21:34 ` Jeff King
0 siblings, 1 reply; 20+ messages in thread
From: Michael Montalbo @ 2026-06-20 15:33 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: git, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> So I strongly suspect that it must be one of the t555* tests.
> [...]
> Maybe this is something that's specific to GitHub's environment...
I think you're right it's t5551/t5559. The runs Junio linked:
osx-clang cancelled 360min
osx-gcc cancelled 360min
osx-reftable success 35min
osx-meson success 61min
All four run the same t5551/t5559 under EXPENSIVE. The two that
finished differ in just two ways, which look like the levers:
osx-reftable generates the 100k-ref advertisement in ~24ms vs ~1.2s
for loose refs on macOS (so much less time mid-response), and
osx-meson runs tests at nproc while the prove jobs hardcode --jobs=10
on a 3-core runner (over recent master/next the prove jobs hang ~40%,
meson ~10%).
When it is wedged the whole chain sits at 0% CPU. upload-pack is
blocked in write() on the ls-refs advertisement, curl blocked in
select(). So it looks like an HTTP/2 flow-control stall on the
response side. The same stall resets itself after ~60-85s on my Linux
box and on a bare-metal Mac, but not on the GitHub runner; I haven't
pinned down why yet.
On the chance those two levers are the fix, a branch off master:
https://github.com/mmontalbo/git/tree/mm/macos-ci-hang-fix
- pack the refs in t5551's enormous-ref-negotiation test (doesn't
change what it checks on the wire, just avoids re-reading 100k loose
files to advertise them, like reftable already does)
- use the core count for $JOBS on the GitHub macOS path, matching the
GitLab branch in the same ci/lib.sh and what meson does
I ran the two macOS jobs under EXPENSIVE about eight times with these
and they all finished in ~30-44min instead of hanging. Happy to send
out a patch if it's helpful.
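To illustrate the second lever, here is a minimal sketch of deriving the job count from the core count instead of hardcoding it. This is not the change on the branch above: `detect_cores` is a hypothetical helper, and it assumes either `nproc` or `sysctl -n hw.ncpu` (macOS/BSD) is available, falling back to 1 otherwise.

```shell
# Hypothetical helper (not taken from ci/lib.sh): print the number of
# CPU cores, preferring nproc and falling back to the macOS/BSD sysctl.
detect_cores () {
	if command -v nproc >/dev/null 2>&1
	then
		nproc
	elif sysctl -n hw.ncpu >/dev/null 2>&1
	then
		sysctl -n hw.ncpu
	else
		# Unknown platform: stay conservative.
		echo 1
	fi
}

JOBS=$(detect_cores)
echo "would run: prove --jobs=$JOBS"
```

On a 3-core runner this would yield --jobs=3 rather than the hardcoded 10, so fewer tests compete for the same cores while the advertisement is being generated.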
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-20 15:33 [RFH] Why do osx CI jobs so unreliable? Michael Montalbo
@ 2026-06-21 21:34 ` Jeff King
2026-06-22 4:42 ` Patrick Steinhardt
2026-06-22 5:05 ` Junio C Hamano
0 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2026-06-21 21:34 UTC (permalink / raw)
To: Michael Montalbo; +Cc: Patrick Steinhardt, git, Junio C Hamano

On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:

> Patrick Steinhardt <ps@pks.im> writes:
> > So I strongly suspect that it must be one of the t555* tests.
> > [...]
> > Maybe this is something that's specific to GitHub's environment...
>
> I think you're right it's t5551/t5559. The runs Junio linked:
>
> osx-clang cancelled 360min
> osx-gcc cancelled 360min
> osx-reftable success 35min
> osx-meson success 61min
>
> All four run the same t5551/t5559 under EXPENSIVE. The two that
> finished differ in just two ways, which look like the levers:
> osx-reftable generates the 100k-ref advertisement in ~24ms vs ~1.2s
> for loose refs on macOS (so much less time mid-response), and
> osx-meson runs tests at nproc while the prove jobs hardcode --jobs=10
> on a 3-core runner (over recent master/next the prove jobs hang ~40%,
> meson ~10%).

If the problem is a racy deadlock, there is a reasonable chance that
some jobs may simply be lucky. Even if things like packing refs help, I
suspect the problem may still be lurking. Maybe I'm just a pessimist,
though. ;)

> When it is wedged the whole chain sits at 0% CPU. upload-pack is
> blocked in write() on the ls-refs advertisement, curl blocked in
> select(). So it looks like an HTTP/2 flow-control stall on the
> response side. The same stall resets itself after ~60-85s on my Linux
> box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> pinned down why yet.

We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
on libcurl and apache (actually h2_mod) versions. IIRC some of the
non-TLS code paths for HTTP/2 were not well tested, which led to
8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
after that commit those cleartext code paths should not be a problem, so
that is probably not exactly the issue now.

But it might be worth checking the versions you're running locally
versus what's in the GitHub runner.

-Peff

^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-21 21:34 ` Jeff King
@ 2026-06-22 4:42 ` Patrick Steinhardt
2026-06-22 9:47 ` Patrick Steinhardt
2026-06-22 5:05 ` Junio C Hamano
1 sibling, 1 reply; 20+ messages in thread
From: Patrick Steinhardt @ 2026-06-22 4:42 UTC (permalink / raw)
To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
>
> > Patrick Steinhardt <ps@pks.im> writes:
> > > So I strongly suspect that it must be one of the t555* tests.
> > > [...]
> > > Maybe this is something that's specific to GitHub's environment...
> >
> > I think you're right it's t5551/t5559. The runs Junio linked:
> >
> > osx-clang cancelled 360min
> > osx-gcc cancelled 360min
> > osx-reftable success 35min
> > osx-meson success 61min
> >
> > All four run the same t5551/t5559 under EXPENSIVE. The two that
> > finished differ in just two ways, which look like the levers:
> > osx-reftable generates the 100k-ref advertisement in ~24ms vs ~1.2s
> > for loose refs on macOS (so much less time mid-response), and
> > osx-meson runs tests at nproc while the prove jobs hardcode --jobs=10
> > on a 3-core runner (over recent master/next the prove jobs hang ~40%,
> > meson ~10%).
>
> If the problem is a racy deadlock, there is a reasonable chance that
> some jobs may simply be lucky. Even if things like packing refs help, I
> suspect the problem may still be lurking. Maybe I'm just a pessimist,
> though. ;)

I had the same thought.

> > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > blocked in write() on the ls-refs advertisement, curl blocked in
> > select(). So it looks like an HTTP/2 flow-control stall on the
> > response side. The same stall resets itself after ~60-85s on my Linux
> > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > pinned down why yet.
>
> We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> on libcurl and apache (actually h2_mod) versions. IIRC some of the
> non-TLS code paths for HTTP/2 were not well tested, which led to
> 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> after that commit those cleartext code paths should not be a problem, so
> that is probably not exactly the issue now.
>
> But it might be worth checking the versions you're running locally
> versus what's in the GitHub runner.

I didn't observe any similar hangs in GitLab's CI systems, so I wonder
whether this is because of different versions of curl. And indeed we use
different versions:

  - On GitHub we use 8.6.0.

  - On GitLab we use 8.7.1.

Now this of course doesn't mean that updating the curl version is the
fix to this whole issue, as there's a ton of other factors that could
play a role in whether or not the test hangs. So while we could just
upgrade parts of the stack and cross our fingers, that feels rather
unsatisfactory. Still, one place to start could be to update our build
images to macOS 15.

But the big question to me is whether the hang is because of a bug in
Git with how we drive curl, a bug in curl itself, or a bug in Apache.

Patrick

^ permalink raw reply [flat|nested] 20+ messages in thread
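Comparing versions across environments can be scripted. The sketch below pulls the version number out of the first line of `curl --version`; `curl_version_from` is a hypothetical helper, and it assumes that first line has the usual "curl X.Y.Z ..." shape.

```shell
# Hypothetical helper: read `curl --version` output on stdin and print
# only the version number from its first line ("curl X.Y.Z ...").
curl_version_from () {
	sed -n '1s/^curl \([0-9][0-9.]*\).*/\1/p'
}

# Sample first line in the usual shape; the platform string is made up.
sample='curl 8.6.0 (x86_64-apple-darwin) libcurl/8.6.0'
printf '%s\n' "$sample" | curl_version_from
```

On a runner, `curl --version | curl_version_from` would print the version of the curl binary on PATH. Git links against libcurl, which may not be the same build, so this is only a first approximation.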

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-22 4:42 ` Patrick Steinhardt
@ 2026-06-22 9:47 ` Patrick Steinhardt
2026-06-22 9:55 ` Patrick Steinhardt
0 siblings, 1 reply; 20+ messages in thread
From: Patrick Steinhardt @ 2026-06-22 9:47 UTC (permalink / raw)
To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Mon, Jun 22, 2026 at 06:42:24AM +0200, Patrick Steinhardt wrote:
> On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> > On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
[snip]
> > > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > > blocked in write() on the ls-refs advertisement, curl blocked in
> > > select(). So it looks like an HTTP/2 flow-control stall on the
> > > response side. The same stall resets itself after ~60-85s on my Linux
> > > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > > pinned down why yet.
> >
> > We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> > on libcurl and apache (actually h2_mod) versions. IIRC some of the
> > non-TLS code paths for HTTP/2 were not well tested, which led to
> > 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> > after that commit those cleartext code paths should not be a problem, so
> > that is probably not exactly the issue now.
> >
> > But it might be worth checking the versions you're running locally
> > versus what's in the GitHub runner.
>
> I didn't observe any similar hangs in GitLab's CI systems, so I wonder
> whether this is because of different versions of curl. And indeed we use
> different versions:
>
>   - On GitHub we use 8.6.0.
>
>   - On GitLab we use 8.7.1.
>
> Now this of course doesn't mean that updating the curl version is the
> fix to this whole issue, as there's a ton of other factors that could
> play a role in whether or not the test hangs. So while we could just
> upgrade parts of the stack and cross our fingers, that feels rather
> unsatisfactory. Still, one place to start could be to update our build
> images to macOS 15.
>
> But the big question to me is whether the hang is because of a bug in
> Git with how we drive curl, a bug in curl itself, or a bug in Apache.

I noticed that an osx-clang job failed today in t5551 [1]. This time it
didn't hang, but produced an actual error:

    2026-06-22T09:25:45.1984230Z ++ git -C too-many-refs fetch -q --tags
    2026-06-22T09:25:45.1984420Z error: RPC failed; curl 18 transfer closed with outstanding read data remaining
    2026-06-22T09:25:45.1984520Z fatal: expected flush after ref listing
    2026-06-22T09:25:45.1984610Z error: last command exited with $?=128
    2026-06-22T09:25:45.1984660Z ++ rm -f tags
    2026-06-22T09:25:45.1984710Z ++ :
    2026-06-22T09:25:45.1984830Z not ok 35 - http can handle enormous ref negotiation

There was a second test failing similarly.

Patrick

[1]: https://github.com/git/git/actions/runs/27940620478/job/82672854726

^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-22 9:47 ` Patrick Steinhardt
@ 2026-06-22 9:55 ` Patrick Steinhardt
2026-06-22 10:29 ` Patrick Steinhardt
0 siblings, 1 reply; 20+ messages in thread
From: Patrick Steinhardt @ 2026-06-22 9:55 UTC (permalink / raw)
To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Mon, Jun 22, 2026 at 11:48:01AM +0200, Patrick Steinhardt wrote:
> On Mon, Jun 22, 2026 at 06:42:24AM +0200, Patrick Steinhardt wrote:
> > On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> > > On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
> [snip]
> > > > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > > > blocked in write() on the ls-refs advertisement, curl blocked in
> > > > select(). So it looks like an HTTP/2 flow-control stall on the
> > > > response side. The same stall resets itself after ~60-85s on my Linux
> > > > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > > > pinned down why yet.
> > >
> > > We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> > > on libcurl and apache (actually h2_mod) versions. IIRC some of the
> > > non-TLS code paths for HTTP/2 were not well tested, which led to
> > > 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> > > after that commit those cleartext code paths should not be a problem, so
> > > that is probably not exactly the issue now.
> > >
> > > But it might be worth checking the versions you're running locally
> > > versus what's in the GitHub runner.
> >
> > I didn't observe any similar hangs in GitLab's CI systems, so I wonder
> > whether this is because of different versions of curl. And indeed we use
> > different versions:
> >
> >   - On GitHub we use 8.6.0.
> >
> >   - On GitLab we use 8.7.1.
> >
> > Now this of course doesn't mean that updating the curl version is the
> > fix to this whole issue, as there's a ton of other factors that could
> > play a role in whether or not the test hangs. So while we could just
> > upgrade parts of the stack and cross our fingers, that feels rather
> > unsatisfactory. Still, one place to start could be to update our build
> > images to macOS 15.
> >
> > But the big question to me is whether the hang is because of a bug in
> > Git with how we drive curl, a bug in curl itself, or a bug in Apache.
>
> I noticed that an osx-clang job failed today in t5551 [1]. This time it
> didn't hang, but produced an actual error:
>
>     2026-06-22T09:25:45.1984230Z ++ git -C too-many-refs fetch -q --tags
>     2026-06-22T09:25:45.1984420Z error: RPC failed; curl 18 transfer closed with outstanding read data remaining
>     2026-06-22T09:25:45.1984520Z fatal: expected flush after ref listing
>     2026-06-22T09:25:45.1984610Z error: last command exited with $?=128
>     2026-06-22T09:25:45.1984660Z ++ rm -f tags
>     2026-06-22T09:25:45.1984710Z ++ :
>     2026-06-22T09:25:45.1984830Z not ok 35 - http can handle enormous ref negotiation
>
> There was a second test failing similarly.

Oh, and Linux is also failing in the same test suite [1], even though
the job logs are truncated, so it's hard to say whether it's the same
failure or not.

There certainly seems to be a deeper issue here. We could of course just
disable the test again, but by now I do wonder whether this would paper
over an actual bug.

Patrick

[1]: https://github.com/git/git/actions/runs/27940620478/job/82672854864

^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-22 9:55 ` Patrick Steinhardt
@ 2026-06-22 10:29 ` Patrick Steinhardt
2026-06-26 3:27 ` Michael Montalbo
0 siblings, 1 reply; 20+ messages in thread
From: Patrick Steinhardt @ 2026-06-22 10:29 UTC (permalink / raw)
To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Mon, Jun 22, 2026 at 11:55:31AM +0200, Patrick Steinhardt wrote:
> On Mon, Jun 22, 2026 at 11:48:01AM +0200, Patrick Steinhardt wrote:
> > On Mon, Jun 22, 2026 at 06:42:24AM +0200, Patrick Steinhardt wrote:
> > > On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> > > > On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
> > [snip]
> > > > > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > > > > blocked in write() on the ls-refs advertisement, curl blocked in
> > > > > select(). So it looks like an HTTP/2 flow-control stall on the
> > > > > response side. The same stall resets itself after ~60-85s on my Linux
> > > > > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > > > > pinned down why yet.
> > > >
> > > > We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> > > > on libcurl and apache (actually h2_mod) versions. IIRC some of the
> > > > non-TLS code paths for HTTP/2 were not well tested, which led to
> > > > 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> > > > after that commit those cleartext code paths should not be a problem, so
> > > > that is probably not exactly the issue now.
> > > >
> > > > But it might be worth checking the versions you're running locally
> > > > versus what's in the GitHub runner.
> > >
> > > I didn't observe any similar hangs in GitLab's CI systems, so I wonder
> > > whether this is because of different versions of curl. And indeed we use
> > > different versions:
> > >
> > >   - On GitHub we use 8.6.0.
> > >
> > >   - On GitLab we use 8.7.1.
> > >
> > > Now this of course doesn't mean that updating the curl version is the
> > > fix to this whole issue, as there's a ton of other factors that could
> > > play a role in whether or not the test hangs. So while we could just
> > > upgrade parts of the stack and cross our fingers, that feels rather
> > > unsatisfactory. Still, one place to start could be to update our build
> > > images to macOS 15.
> > >
> > > But the big question to me is whether the hang is because of a bug in
> > > Git with how we drive curl, a bug in curl itself, or a bug in Apache.
> >
> > I noticed that an osx-clang job failed today in t5551 [1]. This time it
> > didn't hang, but produced an actual error:
> >
> >     2026-06-22T09:25:45.1984230Z ++ git -C too-many-refs fetch -q --tags
> >     2026-06-22T09:25:45.1984420Z error: RPC failed; curl 18 transfer closed with outstanding read data remaining
> >     2026-06-22T09:25:45.1984520Z fatal: expected flush after ref listing
> >     2026-06-22T09:25:45.1984610Z error: last command exited with $?=128
> >     2026-06-22T09:25:45.1984660Z ++ rm -f tags
> >     2026-06-22T09:25:45.1984710Z ++ :
> >     2026-06-22T09:25:45.1984830Z not ok 35 - http can handle enormous ref negotiation
> >
> > There was a second test failing similarly.
>
> Oh, and Linux is also failing in the same test suite [1], even though
> the job logs are truncated, so it's hard to say whether it's the same
> failure or not.
>
> There certainly seems to be a deeper issue here. We could of course just
> disable the test again, but by now I do wonder whether this would paper
> over an actual bug.
>
> Patrick
>
> [1]: https://github.com/git/git/actions/runs/27940620478/job/82672854864

Sorry for the repeated spam. I think the issue is rather simple: we're
hitting timeouts in Apache.
If you apply the following diff:

    diff --git a/t/lib-httpd/apache.conf b/t/lib-httpd/apache.conf
    index 40a690b0bb..4054fe008f 100644
    --- a/t/lib-httpd/apache.conf
    +++ b/t/lib-httpd/apache.conf
    @@ -302,3 +302,5 @@ RewriteRule ^/half-auth-complete/ - [E=AUTHREQUIRED:yes]
     	SVNPath "${LIB_HTTPD_SVNPATH}"
     </Location>
     </IfDefine>
    +
    +Timeout 1

Then you'll see the same errors locally:

    $ GIT_TEST_LONG=Yes meson test t5551-http-fetch-smart --test-args=-ix -i
    Failed to clone 'sub'. Retry scheduled
    Cloning into '/home/pks/Development/git/build/test-output/trash directory.t5551-http-fetch-smart/sub'...
    error: RPC failed; curl 18 transfer closed with outstanding read data remaining
    fatal: early EOF
    fatal: fetch-pack: invalid index-pack output
    fatal: clone of 'http://127.0.0.1:5551/smart_headers/repo.git' into submodule path '/home/pks/Development/git/build/test-output/trash directory.t5551-http-fetch-smart/sub' failed
    Failed to clone 'sub' a second time, aborting
    error: last command exited with $?=1
    not ok 36 - custom http headers
    #
    #	test_must_fail git -c http.extraheader="x-magic-two: cadabra" \
    #	    fetch "$HTTPD_URL/smart_headers/repo.git" &&
    #	git -c http.extraheader="x-magic-one: abra" \
    #	    -c http.extraheader="x-magic-two: cadabra" \
    #	    fetch "$HTTPD_URL/smart_headers/repo.git" &&
    #	git update-index --add --cacheinfo 160000,$(git rev-parse HEAD),sub &&
    #	git config -f .gitmodules submodule.sub.path sub &&
    #	git config -f .gitmodules submodule.sub.url \
    #	    "$HTTPD_URL/smart_headers/repo.git" &&
    #	git submodule init sub &&
    #	test_must_fail git submodule update sub &&
    #	git -c http.extraheader="x-magic-one: abra" \
    #	    -c http.extraheader="x-magic-two: cadabra" \
    #	    submodule update sub
    #
    1..36

And Apache also logs this as a timeout:

    [Mon Jun 22 10:26:52.115717 2026] [cgi:warn] [pid 3686957:tid 3686957] [client 127.0.0.1:55114] AH01220: Timeout waiting for output from CGI script /home/pks/Development/git/build/git-http-backend
    [Mon Jun 22 10:26:52.115748 2026] [core:error] [pid 3686957:tid 3686957] (70007)The timeout specified has expired: [client 127.0.0.1:55114] AH00574: ap_content_length_filter: apr_bucket_read() failed
    [Mon Jun 22 10:27:01.567533 2026] [cgi:warn] [pid 3686958:tid 3686958] [client 127.0.0.1:54384] AH01220: Timeout waiting for output from CGI script /home/pks/Development/git/build/git-http-backend
    [Mon Jun 22 10:27:01.567559 2026] [core:error] [pid 3686958:tid 3686958] (70007)The timeout specified has expired: [client 127.0.0.1:54384] AH00574: ap_content_length_filter: apr_bucket_read() failed

This is because our keepalive mechanisms aren't helping:

  - The TCP-level keepalives don't help with Apache.

  - The application-level sideband keepalives don't apply to the
    "ls-refs" endpoint.

Whether that's the same issue as the one we sometimes see on macOS is a
different question.

Patrick

^ permalink raw reply related [flat|nested] 20+ messages in thread
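The failure mode above, a CGI that produces no output for longer than Apache's `Timeout`, can be imitated without Git. A minimal sketch, assuming a plain-sh CGI; `stall_cgi` and its delay argument are hypothetical and not part of the test suite:

```shell
# Hypothetical stand-in for a slow backend: emit CGI headers and a
# first chunk, go quiet for "$1" seconds, then finish the response.
stall_cgi () {
	printf 'Content-Type: text/plain\r\n\r\n'
	printf 'first chunk\n'
	# With a delay longer than the server's Timeout, the server is
	# expected to give up while we sleep here.
	sleep "$1"
	printf 'second chunk\n'
}
```

Called as `stall_cgi 5` from a CGI script behind a server configured with `Timeout 1`, this should provoke the same kind of timeout; with a delay of 0 it simply returns both chunks.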

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-22 10:29 ` Patrick Steinhardt
@ 2026-06-26 3:27 ` Michael Montalbo
2026-06-26 5:16 ` Jeff King
0 siblings, 1 reply; 20+ messages in thread
From: Michael Montalbo @ 2026-06-26 3:27 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Jeff King, git, Junio C Hamano

Patrick Steinhardt <ps@pks.im> writes:
> I think the issue is rather simple: we're hitting timeouts in Apache.
> [...]
> This is because our keepalive mechanisms aren't helping [...]
> Whether that's the same issue as the one we sometimes see on macOS is a
> different question.

I think that is the trigger for issues we've been seeing. I spent
some time investigating the Apache side over the last week and maybe
found a mod_http2 bug, which I filed upstream with a potential fix:

    bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=70131
    fix: https://github.com/mmontalbo/httpd/pull/2

To Patrick's earlier question of whether this is a Git, curl, or
Apache bug: as best I can tell it's Apache. I could reproduce it with
no Git involved at all (just Apache and a small CGI that goes quiet
past the Timeout), and across several curl versions (8.6.0, which is
what the GitHub runners use, up to 8.20.0), so I don't think bumping
curl would help.

It also seems to wear two faces from the same trigger: over HTTP/1.1
Apache closes the connection and curl bails with the "transfer closed"
error (which looks like what you hit with Timeout=1, and the recent
failures on both macOS and Linux), and over HTTP/2 it does not
reliably reset the stream, so the client just waits, which is the
six-hour macOS hang.

I share the pessimism from earlier in the thread, though: I think the
real fix is upstream in Apache, and anything we do on our side mostly
just bounds the symptom in the meantime.

Given there could be a potential reliability issue with an upstream
dependency like Apache, I was considering what mitigation strategies
might help:

  - Enforce some kind of lower bound speed limit and a client-side
    timeout so runs that wedge fail fast (and loudly) instead of
    hanging.

  - Potentially provide some affordance for retrying flaky tests
    that might fail due to upstream dependencies. Git already has
    some HTTP retry support (http.maxRetries and friends, added
    recently), but as far as I can tell it only triggers on HTTP 429
    rate limiting, so it would not catch a stall like this on its
    own. A test-level retry is not something I like that much, since
    it might encourage papering over flakiness that should be
    resolved, but it was a consideration vs requiring a fresh CI run
    to resolve the flake.

  - Make slow tests faster by optimizing the test itself and/or
    the test runner configuration (e.g., job number matching
    cores) so wedges become less likely.

For the first one, I think Git already provides some affordances.
There is a stall-based timeout that just ships disabled: as I
understand it http.lowSpeedLimit sets a bytes/sec floor and
http.lowSpeedTime how long a transfer can sit below it before curl
gives up, so it would catch a wedged connection without punishing one
that is just slow. Enabling it for the http tests might look something
like:

    diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh
    @@ GIT_TRACE=$GIT_TRACE; export GIT_TRACE
    +# Abort a transfer that makes essentially no progress for a while,
    +# so a wedged connection fails in seconds instead of hanging to the
    +# job cap. Tiny limit, generous window, so it only trips on a true
    +# stall; override either var, or set the limit to 0, to disable.
    +GIT_HTTP_LOW_SPEED_LIMIT=${GIT_HTTP_LOW_SPEED_LIMIT-1}
    +GIT_HTTP_LOW_SPEED_TIME=${GIT_HTTP_LOW_SPEED_TIME-60}
    +export GIT_HTTP_LOW_SPEED_LIMIT GIT_HTTP_LOW_SPEED_TIME

I went conservative on the values on purpose: a floor of 1 byte/sec
should only really fire on a true zero-progress stall, not on
something that is just crawling on a slow runner, and the 60s window
is generous for the same reason. When I tried it locally against a
stall-proxy it did turn an otherwise indefinite hang into a bounded
abort (a tighter limit/window brings that down to single-digit
seconds).

It probably does not need to be suite-wide either; it could be scoped
per-command with git -c, which the http tests already lean on for this
kind of thing (t5551 passes http.postbuffer and http.extraheader that
way), if a narrower blast radius feels safer.

I only dug into the first option in any depth, since I wanted to
sanity-check the direction before writing patches. Does turning on a
stall timeout for the http tests seem reasonable? Are there other
strategies that we should implement?

^ permalink raw reply [flat|nested] 20+ messages in thread
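The `${VAR-default}` form used in that diff substitutes the default only when the variable is unset, which is what lets a caller export an explicit 0 and have it respected. A small sketch of how it differs from `${VAR:-default}`, using a throwaway variable named `LIMIT` that is not a Git setting:

```shell
# Throwaway demonstration variable; not a real Git setting.
unset LIMIT
echo "unset: '${LIMIT-1}'"	# default applies: '1'

LIMIT=0
echo "zero:  '${LIMIT-1}'"	# caller's 0 is kept: '0'

LIMIT=
echo "empty: '${LIMIT-1}'"	# empty value is kept as-is: ''
echo "empty: '${LIMIT:-1}'"	# the ':-' form would replace it: '1'
```

With the `:-` form, an empty value exported by a caller would be overwritten by the default; the plain `-` form leaves any caller-provided value alone.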

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-26 3:27 ` Michael Montalbo
@ 2026-06-26 5:16 ` Jeff King
2026-06-26 10:50 ` Patrick Steinhardt
0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2026-06-26 5:16 UTC (permalink / raw)
To: Michael Montalbo; +Cc: Patrick Steinhardt, git, Junio C Hamano

On Thu, Jun 25, 2026 at 08:27:35PM -0700, Michael Montalbo wrote:

> I think that is the trigger for issues we've been seeing. I spent
> some time investigating the Apache side over the last week and maybe
> found a mod_http2 bug, which I filed upstream with a potential fix:
>
> bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=70131
> fix: https://github.com/mmontalbo/httpd/pull/2

Thanks both of you for digging into this. I'm not familiar enough with
Apache's code to pass confident judgement, but your findings certainly
convinced me that this is just an apache bug.

> Given there could be a potential reliability issue with an upstream
> dependency like Apache, I was considering what mitigation strategies
> might help:
> [...]

Depending on how widespread the Apache bug is, another option might just
be: do nothing and wait for it to get fixed.

Trying to make the wedged state fail fast and loudly is mostly just
punting on the problem. We'd still see spurious failures. We've so far
resisted the urge to do any automatic flaky-test retries, preferring
instead to just try to root out the flakes. I'm a little hesitant to
start now, because I think our strategy has mostly been good so far, and
I've seen some horrible counter-examples where flakes and retries become
a routine drag on development (and I'm afraid that accommodating flakes
might make them more common).

> - Make slow tests faster by optimizing the test itself and/or
> the test runner configuration (e.g., job number matching
> cores) so wedges become less likely.

It sounds like the bad state is triggered when Apache hits a timeout,
and we hit that timeout because the system is slow or busy. We could try
to make things less slow, but would it work equally well to increase
that timeout?

-Peff

^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-26 5:16 ` Jeff King
@ 2026-06-26 10:50 ` Patrick Steinhardt
2026-06-26 13:45 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Patrick Steinhardt @ 2026-06-26 10:50 UTC (permalink / raw)
To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Fri, Jun 26, 2026 at 01:16:57AM -0400, Jeff King wrote:
> On Thu, Jun 25, 2026 at 08:27:35PM -0700, Michael Montalbo wrote:
>
> > I think that is the trigger for issues we've been seeing. I spent
> > some time investigating the Apache side over the last week and maybe
> > found a mod_http2 bug, which I filed upstream with a potential fix:
> >
> > bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=70131
> > fix: https://github.com/mmontalbo/httpd/pull/2
>
> Thanks both of you for digging into this. I'm not familiar enough with
> Apache's code to pass confident judgement, but your findings certainly
> convinced me that this is just an apache bug.

The bug manifests both with HTTP/1.1 and HTTP/2 though, so this wouldn't
fully fix the flakes we see, right?

> > Given there could be a potential reliability issue with an upstream
> > dependency like Apache, I was considering what mitigation strategies
> > might help:
> > [...]
>
> Depending on how widespread the Apache bug is, another option might just
> be: do nothing and wait for it to get fixed.
>
> Trying to make the wedged state fail fast and loudly is mostly just
> punting on the problem. We'd still see spurious failures. We've so far
> resisted the urge to do any automatic flaky-test retries, preferring
> instead to just try to root out the flakes. I'm a little hesitant to
> start now, because I think our strategy has mostly been good so far, and
> I've seen some horrible counter-examples where flakes and retries become
> a routine drag on development (and I'm afraid that accommodating flakes
> might make them more common).

I agree. I'm not a fan of retry logic, as every flaky test may mask an
actual bug that we haven't fully investigated yet.

> > - Make slow tests faster by optimizing the test itself and/or
> > the test runner configuration (e.g., job number matching
> > cores) so wedges become less likely.
>
> It sounds like the bad state is triggered when Apache hits a timeout,
> and we hit that timeout because the system is slow or busy. We could try
> to make things less slow, but would it work equally well to increase
> that timeout?

I was also wondering whether we can maybe work around the issue by
increasing the Apache timeout value. That sounds like an easy potential
solution to try, and from all we've discovered so far it doesn't feel
like this is something we can address on the Git side.

Patrick

^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-26 10:50 ` Patrick Steinhardt
@ 2026-06-26 13:45 ` Junio C Hamano
2026-06-26 23:26 ` Michael Montalbo
2026-06-26 23:43 ` [RFH] Why do osx CI jobs so unreliable? Jeff King
2 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2026-06-26 13:45 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Jeff King, Michael Montalbo, git

Patrick Steinhardt <ps@pks.im> writes:

>> Trying to make the wedged state fail fast and loudly is mostly just
>> punting on the problem. We'd still see spurious failures. We've so far
>> resisted the urge to do any automatic flaky-test retries, preferring
>> instead to just try to root out the flakes. I'm a little hesitant to
>> start now, because I think our strategy has mostly been good so far, and
>> I've seen some horrible counter-examples where flakes and retries become
>> a routine drag on development (and I'm afraid that accommodating flakes
>> might make them more common).
>
> I agree. I'm not a fan of retry logic, as every flaky test may mask an
> actual bug that we haven't fully investigated yet.

Can't agree more.

> I was also wondering whether we can maybe work around the issue by
> increasing the Apache timeout value. That sounds like an easy potential
> solution to try, and from all we've discovered so far it doesn't feel
> like this is something we can address on the Git side.

Thanks, all, for looking into this.

^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-26 10:50 ` Patrick Steinhardt
2026-06-26 13:45 ` Junio C Hamano
@ 2026-06-26 23:26 ` Michael Montalbo
2026-06-28 7:57 ` [PATCH 0/3] fixing expensive http test timeouts Jeff King
2026-06-26 23:43 ` [RFH] Why do osx CI jobs so unreliable? Jeff King
2 siblings, 1 reply; 20+ messages in thread
From: Michael Montalbo @ 2026-06-26 23:26 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Jeff King, git, Junio C Hamano

On Fri, Jun 26, 2026 at 3:50 AM Patrick Steinhardt <ps@pks.im> wrote:
> The bug manifests both with HTTP/1.1 and HTTP/2 though, so this wouldn't
> fully fix the flakes we see, right?

Yes, you are right. The linked fix would just prevent the hang after a
timeout for HTTP/2 tests, but still leaves the HTTP/1.1 flakes.

> I was also wondering whether we can maybe work around the issue by
> increasing the Apache timeout value. That sounds like an easy potential
> solution to try, and from all we've discovered so far it doesn't feel
> like this is something we can address on the Git side.

I think Peff and Patrick's suggestion to just increase the Apache timeout
makes sense. I ran some experiments using a really long timeout with an
artificially slowed down CI runner and all the jobs made progress
(if slowly) without stalling, and eventually completed successfully:

    https://github.com/mmontalbo/git/actions/runs/28267019651

I haven't spent a lot of time trying to figure out what the right timeout
value should be. An hour definitely seems like overkill, with something
on the order of 5-10 minutes seeming more reasonable, but I don't
have a principled number.

^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 0/3] fixing expensive http test timeouts
2026-06-26 23:26 ` Michael Montalbo
@ 2026-06-28 7:57 ` Jeff King
2026-06-28 8:00 ` [PATCH 1/3] t/lib-httpd: bump apache timeout Jeff King
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Jeff King @ 2026-06-28 7:57 UTC (permalink / raw)
To: Michael Montalbo; +Cc: Patrick Steinhardt, git, Junio C Hamano

On Fri, Jun 26, 2026 at 04:26:28PM -0700, Michael Montalbo wrote:

> I think Peff and Patrick's suggestion to just increase the Apache timeout
> makes sense. I ran some experiments using a really long timeout with an
> artificially slowed down CI runner and all the jobs made progress
> (if slowly) without stalling, and eventually completed successfully:
>
> https://github.com/mmontalbo/git/actions/runs/28267019651
>
> I haven't spent a lot of time trying to figure out what the right timeout
> value should be. An hour definitely seems like overkill, with something
> on the order of 5-10 minutes seeming more reasonable, but I don't
> have a principled number.

Here are some patches to keep things moving along. I arbitrarily picked
10 minutes, because multiplying the 1-minute default by 10 felt right. ;)

The first one just bumps the timeout and should make our problems go
away. The other two are optimizations, but I'm on the fence on whether
the final patch is worth it.

Thanks again for all of the digging.

  [1/3]: t/lib-httpd: bump apache timeout
  [2/3]: t5551: put many-tags case into its own repo
  [3/3]: t5551: pack refs after creating many tags

 t/lib-httpd/apache.conf     |  1 +
 t/t5551-http-fetch-smart.sh | 10 ++++++----
 2 files changed, 7 insertions(+), 4 deletions(-)

-Peff

^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 1/3] t/lib-httpd: bump apache timeout
2026-06-28 7:57 ` [PATCH 0/3] fixing expensive http test timeouts Jeff King
@ 2026-06-28 8:00 ` Jeff King
2026-06-28 8:03 ` [PATCH 2/3] t5551: put many-tags case into its own repo Jeff King
2026-06-28 8:07 ` [PATCH 3/3] t5551: pack refs after creating many tags Jeff King
2 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2026-06-28 8:00 UTC (permalink / raw)
To: Michael Montalbo; +Cc: Patrick Steinhardt, git, Junio C Hamano

Since enabling more tests with 7a094d68a2 (ci: run expensive tests on
push builds to integration branches, 2026-05-08), we sometimes see test
failures or timeouts in GitHub CI.

The culprit seems to be the "enormous ref negotiation" test in t5551,
which creates ~100k tag refs in our http server-side repo. Iterating
through the loose refs of this repo to generate a ref advertisement can
take a long time, especially on a platform with slow I/O. On my
otherwise unloaded local machine, a cold cache ref advertisement takes
~10s. On a busy CI machine running tests in parallel, it can presumably
top 60s, which runs afoul of Apache's default CGI timeout.

The result in t5551 is a test failure, where Apache simply hangs up the
connection and the client reports an error. But worse, t5559 runs the
same test with HTTP/2, and a bug in Apache causes the connection to hang
indefinitely! We eventually see this as a CI timeout after 6 hours.

Let's bump Apache's timeout to something much larger: 600 seconds. This
doesn't eliminate the possibility of a timeout, but it makes it much
less likely. It should eliminate both the test failures and the CI
timeouts in practice, and it protects us from running into similar
problems with other tests in the future.

There are two counter-arguments to consider.

One, could/should we just make the test faster? Probably yes. The
biggest mistake here is having such an absurd number of unpacked refs on
a system which is bottle-necked on I/O. But I think it's worth bumping
the timeout so that we can fix this (and possibly other) correctness
issues, and then consider performance separately (which we'll do in
subsequent patches).

And two, is this just papering over a problem that users might see in
the real world? We could teach Git to handle this case more gracefully
with optimizations or keep-alives. But I think it's really an artificial
situation. You need a combination of this silly number of loose refs,
plus a very heavily loaded system. If you were trying to run a real
server and it took more than 60s to generate the ref advertisement, I
don't think the timeout is your biggest problem. Your crappy service is,
and you should adjust your resources to match your load. I.e., it is
probably reasonable for Git to assume that advertisements happen
fast-ish and don't need protocol-level keepalives.

Though the patch here is small, tons of work went into analyzing the
problem. Many thanks to the contributors credited below.

Helped-by: Michael Montalbo <mmontalbo@gmail.com>
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Jeff King <peff@peff.net>
---
I didn't reference Michael's bugzilla report directly, because you can't
read it without a login. :( Maybe it's worth doing anyway?

 t/lib-httpd/apache.conf | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/lib-httpd/apache.conf b/t/lib-httpd/apache.conf
index 40a690b0bb..4149fc1078 100644
--- a/t/lib-httpd/apache.conf
+++ b/t/lib-httpd/apache.conf
@@ -4,6 +4,7 @@ DocumentRoot www
 LogFormat "%h %l %u %t \"%r\" %>s %b" common
 CustomLog access.log common
 ErrorLog error.log
+Timeout 600
 <IfModule !mod_log_config.c>
 	LoadModule log_config_module modules/mod_log_config.so
 </IfModule>
-- 
2.55.0.rc2.353.gf769b6597e

^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH 2/3] t5551: put many-tags case into its own repo
2026-06-28 7:57 ` [PATCH 0/3] fixing expensive http test timeouts Jeff King
2026-06-28 8:00 ` [PATCH 1/3] t/lib-httpd: bump apache timeout Jeff King
@ 2026-06-28 8:03 ` Jeff King
2026-06-28 21:44 ` Junio C Hamano
2026-06-28 8:07 ` [PATCH 3/3] t5551: pack refs after creating many tags Jeff King
2 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2026-06-28 8:03 UTC (permalink / raw)
To: Michael Montalbo; +Cc: Patrick Steinhardt, git, Junio C Hamano

Most of the t5551 http fetch tests use a handful of refs. But there are
a few test cases which check our handling of large numbers of refs.
These tests use the same server-side repo, so all subsequent tests end
up having to consider those extra refs, too.

The result is that the test script is a bit slower than it needs to be.
In a normal run, moving the "2,000 tags" test into its own repo drops my
runtime for the whole script from ~2.7s to ~1.9s.

This is a modest gain, but when we add the "--long" flag it gets much
bigger. There we trigger a test (marked with EXPENSIVE) that adds
100,000 tags, and the script runtime jumps to ~95s. But if we use the
same "many tags" repo for that, our runtime drops to just ~37s.

This is a pretty easy win to drop the cost of the script. It may even be
a larger gain on a heavily loaded system, since one of the main costs
here is unpacked refs, which are heavy on system time and I/O costs.

It's possible we are reducing test coverage, since all of those other
tests were inadvertently using large ref advertisements (and thus could
have uncovered some unexpected interaction). But that seems somewhat
unlikely; the tests targeted at the large number of refs are doing
roughly similar things to the other tests.

Note that the real performance culprit is the 100k-tag --long test, not
the 2k-tag one. So we could just let the 100k one use its own repo, and
keep the 2k tags in the main repo. But since these two tests are
somewhat interlinked, it's easier to just move them both (and it does
provide a small gain even for the 2000-tag test).

I also notice that the 2000-tag test is gated on the CMDLINE_LIMIT
prereq, and without that the later EXPENSIVE test will fail (since we
won't have a too-many-refs clone). Nobody seems to have noticed or
complained after many years, and I left it alone for this patch.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5551-http-fetch-smart.sh | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
index e236e526f0..cd851f24b8 100755
--- a/t/t5551-http-fetch-smart.sh
+++ b/t/t5551-http-fetch-smart.sh
@@ -397,15 +397,16 @@ create_tags () {
 }
 
 test_expect_success 'create 2,000 tags in the repo' '
+	git init "$HTTPD_DOCUMENT_ROOT_PATH/many-tags.git" &&
 	(
-		cd "$HTTPD_DOCUMENT_ROOT_PATH/repo.git" &&
+		cd "$HTTPD_DOCUMENT_ROOT_PATH/many-tags.git" &&
 		create_tags 1 2000
 	)
 '
 
 test_expect_success CMDLINE_LIMIT \
 	'clone the 2,000 tag repo to check OS command line overflow' '
-	run_with_limited_cmdline git clone $HTTPD_URL/smart/repo.git too-many-refs &&
+	run_with_limited_cmdline git clone $HTTPD_URL/smart/many-tags.git too-many-refs &&
 	(
 		cd too-many-refs &&
 		git for-each-ref refs/tags >actual &&
@@ -483,12 +484,12 @@ test_expect_success 'test allowanysha1inwant with unreachable' '
 test_expect_success EXPENSIVE 'http can handle enormous ref negotiation' '
 	test_when_finished "rm -f tags" &&
 	(
-		cd "$HTTPD_DOCUMENT_ROOT_PATH/repo.git" &&
+		cd "$HTTPD_DOCUMENT_ROOT_PATH/many-tags.git" &&
 		create_tags 2001 50000
 	) &&
 	git -C too-many-refs fetch -q --tags &&
 	(
-		cd "$HTTPD_DOCUMENT_ROOT_PATH/repo.git" &&
+		cd "$HTTPD_DOCUMENT_ROOT_PATH/many-tags.git" &&
 		create_tags 50001 100000
 	) &&
 	git -C too-many-refs fetch -q --tags &&
-- 
2.55.0.rc2.353.gf769b6597e

^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 2/3] t5551: put many-tags case into its own repo
2026-06-28 8:03 ` [PATCH 2/3] t5551: put many-tags case into its own repo Jeff King
@ 2026-06-28 21:44 ` Junio C Hamano
2026-06-29 0:34 ` Jeff King
0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2026-06-28 21:44 UTC (permalink / raw)
To: Jeff King; +Cc: Michael Montalbo, Patrick Steinhardt, git

Jeff King <peff@peff.net> writes:

> diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
> index e236e526f0..cd851f24b8 100755
> --- a/t/t5551-http-fetch-smart.sh
> +++ b/t/t5551-http-fetch-smart.sh
> @@ -397,15 +397,16 @@ create_tags () {
>  }
>
>  test_expect_success 'create 2,000 tags in the repo' '
> +	git init "$HTTPD_DOCUMENT_ROOT_PATH/many-tags.git" &&
>  	(
> -		cd "$HTTPD_DOCUMENT_ROOT_PATH/repo.git" &&
> +		cd "$HTTPD_DOCUMENT_ROOT_PATH/many-tags.git" &&
>  		create_tags 1 2000
>  	)
>  '

While all the other repositories used in these tests are bare
repositories, this new one is a non-bare repository.

It shouldn't make any difference, but since I noticed it...

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 2/3] t5551: put many-tags case into its own repo
2026-06-28 21:44 ` Junio C Hamano
@ 2026-06-29 0:34 ` Jeff King
0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2026-06-29 0:34 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Michael Montalbo, Patrick Steinhardt, git

On Sun, Jun 28, 2026 at 02:44:32PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> > diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
> > index e236e526f0..cd851f24b8 100755
> > --- a/t/t5551-http-fetch-smart.sh
> > +++ b/t/t5551-http-fetch-smart.sh
> > @@ -397,15 +397,16 @@ create_tags () {
> >  }
> >
> >  test_expect_success 'create 2,000 tags in the repo' '
> > +	git init "$HTTPD_DOCUMENT_ROOT_PATH/many-tags.git" &&
> >  	(
> > -		cd "$HTTPD_DOCUMENT_ROOT_PATH/repo.git" &&
> > +		cd "$HTTPD_DOCUMENT_ROOT_PATH/many-tags.git" &&
> >  		create_tags 1 2000
> >  	)
> >  '
>
> While all the other repositories used in these tests are bare
> repositories, this new one is a non-bare repository.
>
> It shouldn't make any difference, but since I noticed it...

Ah, yeah. It should work either way, but it is slightly confusing for it
to be non-bare. I'll wait to re-send (though if nothing else comes up,
it may be simpler for you to just amend on your side).

-Peff

^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH 3/3] t5551: pack refs after creating many tags
2026-06-28 7:57 ` [PATCH 0/3] fixing expensive http test timeouts Jeff King
2026-06-28 8:00 ` [PATCH 1/3] t/lib-httpd: bump apache timeout Jeff King
2026-06-28 8:03 ` [PATCH 2/3] t5551: put many-tags case into its own repo Jeff King
@ 2026-06-28 8:07 ` Jeff King
2026-06-28 21:25 ` Junio C Hamano
2 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2026-06-28 8:07 UTC (permalink / raw)
To: Michael Montalbo; +Cc: Patrick Steinhardt, git, Junio C Hamano

We have two tests that create 2,000 and 100,000 tags respectively. After
doing so, the resulting state can be a bit slow to work with when using
the "files" ref backend, as each of those refs is in its own file.

This isn't a very realistic scenario, as we'd expect most of those refs
to be packed. If they accrue over time along with objects, they'd get
packed by maintenance/gc runs. And if you have a process that creates a
ton of refs at once (like a big fast-import), the usual recommendation
is to run maintenance afterwards.

So let's follow that recommendation and pack the refs ourselves.
Unfortunately, this does not seem to produce an improvement to the
run-time of the test script! That's because after producing this state,
we perform only a few fetches of it. And packing the refs costs at least
as much as serving a ref advertisement (both have to iterate the refs,
but packing additionally must write .lock files as we pack).

My wall-clock time was slightly improved (but within the noise) with
this patch, but my user and system CPU time were slightly worse!
However, on a loaded system with I/O bottlenecks, it may be a net win.
That's somewhat of a guess, though.

It would be nice if we had a way to generate all of these refs without
writing so many individual files. But even if we taught the ref code to
write large cases directly to the packed-refs file, we'd still need to
take individual locks. The real solution is a backend like reftable,
which shaves ~30% off of the test runtime.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm iffy on whether this one is worth it.

If you apply just this patch without patch 2, then the run-time does
improve quite a bit. The cost of packing is amortized by the improved
performance for all of those subsequent tests (but after patch 2, they
never even see the unpacked state).

Likewise, I suspect this would make our timeout problems go away even
without patch 1. So the whole series _could_ be reduced to just this one
patch. But hopefully the reasoning given in the earlier patches makes
sense, at which point this one is kind of superfluous.

 t/t5551-http-fetch-smart.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/t5551-http-fetch-smart.sh b/t/t5551-http-fetch-smart.sh
index cd851f24b8..e2e729216f 100755
--- a/t/t5551-http-fetch-smart.sh
+++ b/t/t5551-http-fetch-smart.sh
@@ -393,6 +393,7 @@ create_tags () {
 	tag=$(perl -e "print \"bla\" x 30") &&
 	sed -e "s|^:\([^ ]*\) \(.*\)$|create refs/tags/$tag-\1 \2|" <marks >input &&
 	git update-ref --stdin <input &&
+	git pack-refs --all &&
 	rm input
 }
 
-- 
2.55.0.rc2.353.gf769b6597e

^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH 3/3] t5551: pack refs after creating many tags
2026-06-28 8:07 ` [PATCH 3/3] t5551: pack refs after creating many tags Jeff King
@ 2026-06-28 21:25 ` Junio C Hamano
0 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2026-06-28 21:25 UTC (permalink / raw)
To: Jeff King; +Cc: Michael Montalbo, Patrick Steinhardt, git

Jeff King <peff@peff.net> writes:

> So let's follow that recommendation and pack the refs ourselves.
> Unfortunately, this does not seem to produce an improvement to the
> run-time of the test script! That's because after producing this state,
> we perform only a few fetches of it. And packing the refs costs at least
> as much as serving a ref advertisement (both have to iterate the refs,
> but packing additionally must write .lock files as we pack).

Testing a pathological set-up with too many loose refs may have
extra value, as long as we are also testing the recommended set-up,
so ideally we should have both ;-) but if we have to pick only one
and drop the other, we probably should be testing the packed case.

> I'm iffy on whether this one is worth it.

I am ambivalent, too, about this change for the purpose of the
"yeek, apache times out while enumerating refs" issue. But see
above ;-)

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-26 10:50 ` Patrick Steinhardt
2026-06-26 13:45 ` Junio C Hamano
2026-06-26 23:26 ` Michael Montalbo
@ 2026-06-26 23:43 ` Jeff King
2 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2026-06-26 23:43 UTC (permalink / raw)
To: Patrick Steinhardt; +Cc: Michael Montalbo, git, Junio C Hamano

On Fri, Jun 26, 2026 at 12:50:17PM +0200, Patrick Steinhardt wrote:

> > Thanks both of you for digging into this. I'm not familiar enough with
> > Apache's code to pass confident judgement, but your findings certainly
> > convinced me that this is just an apache bug.
>
> The bug manifests both with HTTP/1.1 and HTTP/2 though, so this wouldn't
> fully fix the flakes we see, right?

If I understand the situation correctly, there are really two problems.

First, the EXPENSIVE tests in t5551 sometimes trigger a timeout with
Apache's stock settings. This presumably became a problem recently due
to 7a094d68a2 (ci: run expensive tests on push builds to integration
branches, 2026-05-08). The same problem exists in t5559, which just
wraps t5551 but tells us to use http2.

This timeout will cause test failures in t5551, because we aren't able
to complete a request we expected to. Obviously bad and annoying.

The second problem is that when Apache hits the timeout in HTTP/2 mode,
it hangs forever. And then the CI job hangs for 6 hours until it's
killed, which is an even more annoying failure.

So the root cause is the same (a timeout), but the effect depends on
HTTP/1.1 vs HTTP/2.

I was able to reproduce both cases on my local Debian unstable system by
dropping the timeout as you suggested. Running t5551 with GIT_TEST_LONG
yields a failure, whereas running t5559 yields a hang.

We can mitigate both cases by bumping the timeout value, since it's
addressing the root cause.

There's an open question of whether this is just papering over a problem
that real users might experience, and whether Git should be doing more
to keep the connection alive. I think it's probably OK to ignore this in
practice. This is an intentionally large request being served by a very
underpowered platform. The default apache timeout is 60s. If a
real-world server is seeing ls-refs requests take that long then they
probably need to reconsider some other decisions, from ref packing to
better hardware to dropping some users. ;) I don't think trying to
insert keepalives at the Git layer here is worth the trouble.

To give a sense of the time options, here are a few timings from my
local machine, timing "git upload-pack . </dev/null >/dev/null" in
t5551's big repo.git (that's a v0 advertisement, but it should be
roughly the same work as the v2 ls-refs).

  cold-cache, refs not packed:
    real 0m9.973s
    user 0m0.354s
    sys  0m1.364s

  warm cache, refs not packed:
    real 0m0.410s
    user 0m0.153s
    sys  0m0.257s

  cold-cache, refs packed:
    real 0m0.149s
    user 0m0.086s
    sys  0m0.035s

  warm cache, refs packed:
    real 0m0.069s
    user 0m0.054s
    sys  0m0.016s

So 10s is pretty abysmal (and on an SSD, no less). I would expect the
cache to be warm (we just wrote these refs!) but I could also believe
that CI systems are under heavy I/O and memory pressure, so we sometimes
end up crossing the 60s mark.

So bumping Apache's timeout to 600s or something would probably be a
fine mitigation. That's still not _solving_ the problem, but presumably
an order of magnitude is enough for it to never come up in practice.

Michael suggested packing the refs as a mitigation. I was lukewarm on
that in my previous email, because it wasn't clear to me how close we
were on the timeout budget, and if it would just make the race less
frequent (rather than never happen). But seeing those cold-cache numbers
makes me think it might be worth doing just on principle to make the
tests more efficient, and any timeout mitigation is a bonus.

Of course the pack-refs process (and the initial ref writes) will still
have to touch all of those loose files, so those will still be slow. But
they're not on a timeout, and I suspect we read the result many more
times than we write/pack (the test failures we are seeing are not in the
expensive tests, but just "normal" tests that are stuck with the
gigantic ref state).

-Peff

^ permalink raw reply [flat|nested] 20+ messages in thread
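As a rough illustration of the loose-versus-packed difference discussed in the message above, the following is a minimal sketch, not part of any posted patch. It builds a throwaway bare repository, creates N tags in one "git update-ref --stdin" transaction, and counts loose ref files before and after "git pack-refs --all". The repository path, the "demo-" tag prefix, the blob used as the tag target, and the small default N=200 are all assumptions made for the example; the thread's test uses 100,000 tags inside the httpd document root.

```shell
#!/bin/sh
# Sketch: compare loose vs packed ref storage in a throwaway bare repo.
# N is deliberately small so this runs quickly (assumption; the real
# test in t5551 creates 100,000 tags).
set -e
N=${N:-200}
repo=$(mktemp -d)/many-tags.git

# Ask for the "files" backend so that loose ref files exist at all.
# Git versions that do not know this variable ignore it and use "files".
GIT_DEFAULT_REF_FORMAT=files git init -q --bare "$repo"
cd "$repo"

# Any object works as a tag target; a blob avoids needing an identity.
oid=$(echo demo | git hash-object -w --stdin)

# Create N tags in a single update-ref transaction.
i=1
while [ "$i" -le "$N" ]; do
	echo "create refs/tags/demo-$i $oid"
	i=$((i + 1))
done | git update-ref --stdin

# Each ref is its own file until the refs are packed.
loose_before=$(find refs -type f | wc -l | tr -d ' ')
git pack-refs --all
loose_after=$(find refs -type f | wc -l | tr -d ' ')
packed=$(grep -c ' refs/tags/demo-' packed-refs)

echo "loose before: $loose_before, loose after: $loose_after, packed: $packed"
```

With a larger N, wrapping the "git upload-pack . </dev/null >/dev/null" command quoted above in "time" before and after the pack-refs step should give a feel for the advertisement cost on a given machine, though cold-cache numbers like the ones reported need the page cache dropped first.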
* Re: [RFH] Why do osx CI jobs so unreliable?
2026-06-21 21:34 ` Jeff King
2026-06-22 4:42 ` Patrick Steinhardt
@ 2026-06-22 5:05 ` Junio C Hamano
1 sibling, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2026-06-22 5:05 UTC (permalink / raw)
To: Jeff King; +Cc: Michael Montalbo, Patrick Steinhardt, git

Jeff King <peff@peff.net> writes:

> If the problem is a racy deadlock, there is a reasonable chance that
> some jobs may simply be lucky. Even if things like packing refs help, I
> suspect the problem may still be lurking. Maybe I'm just a pessimist,
> though. ;)

I share the pessimism X-<.

> We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> on libcurl and apache (actually h2_mod) versions. IIRC some of the
> non-TLS code paths for HTTP/2 were not well tested, which led to
> 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> after that commit those cleartext code paths should not be a problem, so
> that is probably not exactly the issue now.
>
> But it might be worth checking the versions you're running locally
> versus what's in the GitHub runner.

True.

^ permalink raw reply [flat|nested] 20+ messages in thread