[RFH] Why do osx CI jobs so unreliable?

Git development

* [RFH] Why do osx CI jobs so unreliable?
@ 2026-06-19  0:35 Junio C Hamano
  2026-06-19 14:03 ` Patrick Steinhardt
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2026-06-19  0:35 UTC (permalink / raw)
  To: git

I've been observing that in recent push-outs to 'master' and 'next',
osx-* jobs in GitHub Actions CI keep running for 6 hours and then get
killed.

What is troubling is that this seems to be very flaky.  For example,
https://github.com/git/git/actions/runs/27778820659 is testing
95e20213 (Hopefully final batch before -rc2, 2026-06-17) which got
killed after wasting 6 hours in osx-clang and osx-gcc jobs.

https://github.com/git/git/actions/runs/27790036076 is testing
the same 'master', with a patch to .github/workflows/main.yml to
remove everything except for config and osx-* jobs, which succeeded
within 30 minutes.

Stumped...

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-19  0:35 Junio C Hamano
@ 2026-06-19 14:03 ` Patrick Steinhardt
  0 siblings, 0 replies; 13+ messages in thread
From: Patrick Steinhardt @ 2026-06-19 14:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Jun 18, 2026 at 05:35:23PM -0700, Junio C Hamano wrote:
> I've been observing that in recent push-outs to 'master' and 'next',
> osx-* jobs in GitHub Actions CI keep running for 6 hours and then get
> killed.
> 
> What is troubling is that this seems to be very flaky.  For example,
> https://github.com/git/git/actions/runs/27778820659 is testing
> 95e20213 (Hopefully final batch before -rc2, 2026-06-17) which got
> killed after wasting 6 hours in osx-clang and osx-gcc jobs.
> 
> https://github.com/git/git/actions/runs/27790036076 is testing
> the same 'master', with a patch to .github/workflows/main.yml to
> remove everything except for config and osx-* jobs, which succeeded
> within 30 minutes.
> 
> Stumped...

So the raw logs have the following trailer:

  2026-06-18T23:53:33.2996180Z Cleaning up orphan processes
  2026-06-18T23:53:33.7900380Z Terminate orphan process: pid (34022) (git-remote-http)
  2026-06-18T23:53:33.9848670Z Terminate orphan process: pid (15488) (httpd)
  2026-06-18T23:53:34.0321490Z Terminate orphan process: pid (13146) (httpd)
  2026-06-18T23:53:34.0808280Z Terminate orphan process: pid (13145) (httpd)
  2026-06-18T23:53:34.1212760Z Terminate orphan process: pid (13144) (httpd)
  2026-06-18T23:53:34.1570160Z Terminate orphan process: pid (13141) (httpd)
  2026-06-18T23:53:34.1924140Z Terminate orphan process: pid (12553) (bash)
  2026-06-18T23:53:34.2472970Z Terminate orphan process: pid (12552) (tee)
  2026-06-18T23:53:34.6547890Z Terminate orphan process: pid (21209) (bash)

So I strongly suspect that it must be one of the t555* tests.
Furthermore, t5551 and t5559 (which are actually the same test) are the
only test suites that use lib-httpd.sh and are missing from the job
logs.

I have not been able to reproduce this hang on my macOS virtual machine
though, and on GitLab I didn't notice a similar hang recently. Maybe
this is something that's specific to GitHub's environment...? No idea.

Patrick

* Re: [RFH] Why do osx CI jobs so unreliable?
@ 2026-06-20 15:33 Michael Montalbo
  2026-06-21 21:34 ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Montalbo @ 2026-06-20 15:33 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano

Patrick Steinhardt <ps@pks.im> writes:
> So I strongly suspect that it must be one of the t555* tests.
> [...]
> Maybe this is something that's specific to GitHub's environment...

I think you're right it's t5551/t5559. The runs Junio linked:

  osx-clang     cancelled  360min
  osx-gcc       cancelled  360min
  osx-reftable  success     35min
  osx-meson     success     61min

All four run the same t5551/t5559 under EXPENSIVE. The two that
finished differ in just two ways, which look like the levers:
osx-reftable generates the 100k-ref advertisement in ~24ms vs ~1.2s
for loose refs on macOS (so much less time mid-response), and
osx-meson runs tests at nproc while the prove jobs hardcode --jobs=10
on a 3-core runner (over recent master/next the prove jobs hang ~40%,
meson ~10%).

When it is wedged the whole chain sits at 0% CPU. upload-pack is
blocked in write() on the ls-refs advertisement, curl blocked in
select(). So it looks like an HTTP/2 flow-control stall on the
response side. The same stall resets itself after ~60-85s on my Linux
box and on a bare-metal Mac, but not on the GitHub runner; I haven't
pinned down why yet.

On the chance those two levers are the fix, a branch off master:

  https://github.com/mmontalbo/git/tree/mm/macos-ci-hang-fix

  - pack the refs in t5551's enormous-ref-negotiation test (doesn't
    change what it checks on the wire, just avoids re-reading 100k loose
    files to advertise them, like reftable already does)
  - use the core count for $JOBS on the GitHub macOS path, matching the
    GitLab branch in the same ci/lib.sh and what meson does

I ran the two macOS jobs under EXPENSIVE about eight times with these
and they all finished in ~30-44min instead of hanging. Happy to send
out a patch if it's helpful.
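
For illustration, the second item could look roughly like the sketch
below. This is not the code on the branch; detect_jobs is a hypothetical
helper name, and the nproc / sysctl / getconf fallbacks are an
assumption about what is available on the runner.

```shell
# Hypothetical helper: print the number of online cores, falling back
# to 1 when no known tool can report it.
detect_jobs () {
	if command -v nproc >/dev/null 2>&1
	then
		nproc
	elif command -v sysctl >/dev/null 2>&1 &&
	     sysctl -n hw.ncpu >/dev/null 2>&1
	then
		# macOS reports the core count via sysctl
		sysctl -n hw.ncpu
	else
		getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
	fi
}

JOBS=$(detect_jobs)
echo "prove --jobs=$JOBS"
```

On a 3-core runner this would pass --jobs=3 to prove instead of the
hardcoded 10.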

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-20 15:33 [RFH] Why do osx CI jobs so unreliable? Michael Montalbo
@ 2026-06-21 21:34 ` Jeff King
  2026-06-22  4:42   ` Patrick Steinhardt
  2026-06-22  5:05   ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2026-06-21 21:34 UTC (permalink / raw)
  To: Michael Montalbo; +Cc: Patrick Steinhardt, git, Junio C Hamano

On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:

> Patrick Steinhardt <ps@pks.im> writes:
> > So I strongly suspect that it must be one of the t555* tests.
> > [...]
> > Maybe this is something that's specific to GitHub's environment...
> 
> I think you're right it's t5551/t5559. The runs Junio linked:
> 
>   osx-clang     cancelled  360min
>   osx-gcc       cancelled  360min
>   osx-reftable  success     35min
>   osx-meson     success     61min
> 
> All four run the same t5551/t5559 under EXPENSIVE. The two that
> finished differ in just two ways, which look like the levers:
> osx-reftable generates the 100k-ref advertisement in ~24ms vs ~1.2s
> for loose refs on macOS (so much less time mid-response), and
> osx-meson runs tests at nproc while the prove jobs hardcode --jobs=10
> on a 3-core runner (over recent master/next the prove jobs hang ~40%,
> meson ~10%).

If the problem is a racy deadlock, there is a reasonable chance that
some jobs may simply be lucky. Even if things like packing refs help, I
suspect the problem may still be lurking. Maybe I'm just a pessimist,
though. ;)

> When it is wedged the whole chain sits at 0% CPU. upload-pack is
> blocked in write() on the ls-refs advertisement, curl blocked in
> select(). So it looks like an HTTP/2 flow-control stall on the
> response side. The same stall resets itself after ~60-85s on my Linux
> box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> pinned down why yet.

We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
on libcurl and Apache (actually mod_http2) versions. IIRC some of the
non-TLS code paths for HTTP/2 were not well tested, which led to
8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
after that commit those cleartext code paths should not be a problem, so
that is probably not exactly the issue now.

But it might be worth checking the versions you're running locally
versus what's in the GitHub runner.

-Peff

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-21 21:34 ` Jeff King
@ 2026-06-22  4:42   ` Patrick Steinhardt
  2026-06-22  9:47     ` Patrick Steinhardt
  2026-06-22  5:05   ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Patrick Steinhardt @ 2026-06-22  4:42 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
> 
> > Patrick Steinhardt <ps@pks.im> writes:
> > > So I strongly suspect that it must be one of the t555* tests.
> > > [...]
> > > Maybe this is something that's specific to GitHub's environment...
> > 
> > I think you're right it's t5551/t5559. The runs Junio linked:
> > 
> >   osx-clang     cancelled  360min
> >   osx-gcc       cancelled  360min
> >   osx-reftable  success     35min
> >   osx-meson     success     61min
> > 
> > All four run the same t5551/t5559 under EXPENSIVE. The two that
> > finished differ in just two ways, which look like the levers:
> > osx-reftable generates the 100k-ref advertisement in ~24ms vs ~1.2s
> > for loose refs on macOS (so much less time mid-response), and
> > osx-meson runs tests at nproc while the prove jobs hardcode --jobs=10
> > on a 3-core runner (over recent master/next the prove jobs hang ~40%,
> > meson ~10%).
> 
> If the problem is a racy deadlock, there is a reasonable chance that
> some jobs may simply be lucky. Even if things like packing refs help, I
> suspect the problem may still be lurking. Maybe I'm just a pessimist,
> though. ;)

I had the same thought.

> > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > blocked in write() on the ls-refs advertisement, curl blocked in
> > select(). So it looks like an HTTP/2 flow-control stall on the
> > response side. The same stall resets itself after ~60-85s on my Linux
> > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > pinned down why yet.
> 
> We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> on libcurl and apache (actually h2_mod) versions. IIRC some of the
> non-TLS code paths for HTTP/2 were not well tested, which led to
> 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> after that commit those cleartext code paths should not be a problem, so
> that is probably not exactly the issue now.
> 
> But it might be worth checking the versions you're running locally
> versus what's in the GitHub runner.

I didn't observe any similar hangs in GitLab's CI systems, so I wonder
whether this is because of different versions of curl. And indeed we use
different versions:

  - On GitHub we use 8.6.0.

  - On GitLab we use 8.7.1.

Now this of course doesn't mean that updating the curl version is the
fix to this whole issue, as there's a ton of other factors that could
play a role in whether or not the test hangs. So while we could just
upgrade parts of the stack and cross our fingers, that feels rather
unsatisfactory. Still, one place to start could be to update our build
images to macOS 15.

But the big question to me is whether the hang is because of a bug in
Git with how we drive curl, a bug in curl itself, or a bug in Apache.

Patrick

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-21 21:34 ` Jeff King
  2026-06-22  4:42   ` Patrick Steinhardt
@ 2026-06-22  5:05   ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2026-06-22  5:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Montalbo, Patrick Steinhardt, git

Jeff King <peff@peff.net> writes:

> If the problem is a racy deadlock, there is a reasonable chance that
> some jobs may simply be lucky. Even if things like packing refs help, I
> suspect the problem may still be lurking. Maybe I'm just a pessimist,
> though. ;)

I share the pessimism X-<.

> We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> on libcurl and apache (actually h2_mod) versions. IIRC some of the
> non-TLS code paths for HTTP/2 were not well tested, which led to
> 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> after that commit those cleartext code paths should not be a problem, so
> that is probably not exactly the issue now.
>
> But it might be worth checking the versions you're running locally
> versus what's in the GitHub runner.

True.

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-22  4:42   ` Patrick Steinhardt
@ 2026-06-22  9:47     ` Patrick Steinhardt
  2026-06-22  9:55       ` Patrick Steinhardt
  0 siblings, 1 reply; 13+ messages in thread
From: Patrick Steinhardt @ 2026-06-22  9:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Mon, Jun 22, 2026 at 06:42:24AM +0200, Patrick Steinhardt wrote:
> On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> > On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
[snip]
> > > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > > blocked in write() on the ls-refs advertisement, curl blocked in
> > > select(). So it looks like an HTTP/2 flow-control stall on the
> > > response side. The same stall resets itself after ~60-85s on my Linux
> > > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > > pinned down why yet.
> > 
> > We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> > on libcurl and apache (actually h2_mod) versions. IIRC some of the
> > non-TLS code paths for HTTP/2 were not well tested, which led to
> > 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> > after that commit those cleartext code paths should not be a problem, so
> > that is probably not exactly the issue now.
> > 
> > But it might be worth checking the versions you're running locally
> > versus what's in the GitHub runner.
> 
> I didn't observe any similar hangs in GitLab's CI systems, so I wonder
> whether this is because of different versions of curl. And indeed we use
> different versions:
> 
>   - On GitHub we use 8.6.0.
> 
>   - On GitLab we use 8.7.1.
> 
> Now this of course doesn't mean that updating the curl version is the
> fix to this whole issue, as there's a ton of other factors that could
> play a role in whether or not the test hangs. So while we could just
> upgrade parts of the stack and cross our fingers, that feels rather
> unsatisfactory. Still, one place to start could be to update our build
> images to macOS 15.
> 
> But the big question to me is whether the hang is because of a bug in
> Git with how we drive curl, a bug in curl itself, or a bug in Apache.

I noticed that an osx-clang job failed today in t5551 [1]. This time it
didn't hang, but produced an actual error:

    2026-06-22T09:25:45.1984230Z ++ git -C too-many-refs fetch -q --tags
    2026-06-22T09:25:45.1984420Z error: RPC failed; curl 18 transfer closed with outstanding read data remaining
    2026-06-22T09:25:45.1984520Z fatal: expected flush after ref listing
    2026-06-22T09:25:45.1984610Z error: last command exited with $?=128
    2026-06-22T09:25:45.1984660Z ++ rm -f tags
    2026-06-22T09:25:45.1984710Z ++ :
    2026-06-22T09:25:45.1984830Z not ok 35 - http can handle enormous ref negotiation

There was a second test failing similarly.

Patrick

[1]: https://github.com/git/git/actions/runs/27940620478/job/82672854726

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-22  9:47     ` Patrick Steinhardt
@ 2026-06-22  9:55       ` Patrick Steinhardt
  2026-06-22 10:29         ` Patrick Steinhardt
  0 siblings, 1 reply; 13+ messages in thread
From: Patrick Steinhardt @ 2026-06-22  9:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Mon, Jun 22, 2026 at 11:48:01AM +0200, Patrick Steinhardt wrote:
> On Mon, Jun 22, 2026 at 06:42:24AM +0200, Patrick Steinhardt wrote:
> > On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> > > On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
> [snip]
> > > > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > > > blocked in write() on the ls-refs advertisement, curl blocked in
> > > > select(). So it looks like an HTTP/2 flow-control stall on the
> > > > response side. The same stall resets itself after ~60-85s on my Linux
> > > > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > > > pinned down why yet.
> > > 
> > > We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> > > on libcurl and apache (actually h2_mod) versions. IIRC some of the
> > > non-TLS code paths for HTTP/2 were not well tested, which led to
> > > 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> > > after that commit those cleartext code paths should not be a problem, so
> > > that is probably not exactly the issue now.
> > > 
> > > But it might be worth checking the versions you're running locally
> > > versus what's in the GitHub runner.
> > 
> > I didn't observe any similar hangs in GitLab's CI systems, so I wonder
> > whether this is because of different versions of curl. And indeed we use
> > different versions:
> > 
> >   - On GitHub we use 8.6.0.
> > 
> >   - On GitLab we use 8.7.1.
> > 
> > Now this of course doesn't mean that updating the curl version is the
> > fix to this whole issue, as there's a ton of other factors that could
> > play a role in whether or not the test hangs. So while we could just
> > upgrade parts of the stack and cross our fingers, that feels rather
> > unsatisfactory. Still, one place to start could be to update our build
> > images to macOS 15.
> > 
> > But the big question to me is whether the hang is because of a bug in
> > Git with how we drive curl, a bug in curl itself, or a bug in Apache.
> 
> I noticed that an osx-clang job failed today in t5551 [1]. This time it
> didn't hang, but produced an actual error:
> 
>     2026-06-22T09:25:45.1984230Z ++ git -C too-many-refs fetch -q --tags
>     2026-06-22T09:25:45.1984420Z error: RPC failed; curl 18 transfer closed with outstanding read data remaining
>     2026-06-22T09:25:45.1984520Z fatal: expected flush after ref listing
>     2026-06-22T09:25:45.1984610Z error: last command exited with $?=128
>     2026-06-22T09:25:45.1984660Z ++ rm -f tags
>     2026-06-22T09:25:45.1984710Z ++ :
>     2026-06-22T09:25:45.1984830Z not ok 35 - http can handle enormous ref negotiation
> 
> There was a second test failing similarly.

Oh, and Linux is also failing in the same test suite [1], though the
job logs are truncated, so it's hard to say whether it's the same
failure or not.

There certainly seems to be a deeper issue here. We could of course just
disable the test again, but by now I do wonder whether this would paper
over an actual bug.

Patrick

[1]: https://github.com/git/git/actions/runs/27940620478/job/82672854864

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-22  9:55       ` Patrick Steinhardt
@ 2026-06-22 10:29         ` Patrick Steinhardt
  2026-06-26  3:27           ` Michael Montalbo
  0 siblings, 1 reply; 13+ messages in thread
From: Patrick Steinhardt @ 2026-06-22 10:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Mon, Jun 22, 2026 at 11:55:31AM +0200, Patrick Steinhardt wrote:
> On Mon, Jun 22, 2026 at 11:48:01AM +0200, Patrick Steinhardt wrote:
> > On Mon, Jun 22, 2026 at 06:42:24AM +0200, Patrick Steinhardt wrote:
> > > On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> > > > On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
> > [snip]
> > > > > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > > > > blocked in write() on the ls-refs advertisement, curl blocked in
> > > > > select(). So it looks like an HTTP/2 flow-control stall on the
> > > > > response side. The same stall resets itself after ~60-85s on my Linux
> > > > > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > > > > pinned down why yet.
> > > > 
> > > > We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> > > > on libcurl and apache (actually h2_mod) versions. IIRC some of the
> > > > non-TLS code paths for HTTP/2 were not well tested, which led to
> > > > 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> > > > after that commit those cleartext code paths should not be a problem, so
> > > > that is probably not exactly the issue now.
> > > > 
> > > > But it might be worth checking the versions you're running locally
> > > > versus what's in the GitHub runner.
> > > 
> > > I didn't observe any similar hangs in GitLab's CI systems, so I wonder
> > > whether this is because of different versions of curl. And indeed we use
> > > different versions:
> > > 
> > >   - On GitHub we use 8.6.0.
> > > 
> > >   - On GitLab we use 8.7.1.
> > > 
> > > Now this of course doesn't mean that updating the curl version is the
> > > fix to this whole issue, as there's a ton of other factors that could
> > > play a role in whether or not the test hangs. So while we could just
> > > upgrade parts of the stack and cross our fingers, that feels rather
> > > unsatisfactory. Still, one place to start could be to update our build
> > > images to macOS 15.
> > > 
> > > But the big question to me is whether the hang is because of a bug in
> > > Git with how we drive curl, a bug in curl itself, or a bug in Apache.
> > 
> > I noticed that an osx-clang job failed today in t5551 [1]. This time it
> > didn't hang, but produced an actual error:
> > 
> >     2026-06-22T09:25:45.1984230Z ++ git -C too-many-refs fetch -q --tags
> >     2026-06-22T09:25:45.1984420Z error: RPC failed; curl 18 transfer closed with outstanding read data remaining
> >     2026-06-22T09:25:45.1984520Z fatal: expected flush after ref listing
> >     2026-06-22T09:25:45.1984610Z error: last command exited with $?=128
> >     2026-06-22T09:25:45.1984660Z ++ rm -f tags
> >     2026-06-22T09:25:45.1984710Z ++ :
> >     2026-06-22T09:25:45.1984830Z not ok 35 - http can handle enormous ref negotiation
> > 
> > There was a second test failing similarly.
> 
> Oh, and Linux is also failing in the same test suite [1], though the
> job logs are truncated, so it's hard to say whether it's the same
> failure or not.
> 
> There certainly seems to be a deeper issue here. We could of course just
> disable the test again, but by now I do wonder whether this would paper
> over an actual bug.
> 
> Patrick
> 
> [1]: https://github.com/git/git/actions/runs/27940620478/job/82672854864

Sorry for the repeated spam.

I think the issue is rather simple: we're hitting timeouts in Apache. If
you apply the following diff:

diff --git a/t/lib-httpd/apache.conf b/t/lib-httpd/apache.conf
index 40a690b0bb..4054fe008f 100644
--- a/t/lib-httpd/apache.conf
+++ b/t/lib-httpd/apache.conf
@@ -302,3 +302,5 @@ RewriteRule ^/half-auth-complete/ - [E=AUTHREQUIRED:yes]
 		SVNPath "${LIB_HTTPD_SVNPATH}"
 	</Location>
 </IfDefine>
+
+Timeout 1

Then you'll see the same errors locally:

    $ GIT_TEST_LONG=Yes meson test t5551-http-fetch-smart --test-args=-ix -i
    Failed to clone 'sub'. Retry scheduled
    Cloning into '/home/pks/Development/git/build/test-output/trash directory.t5551-http-fetch-smart/sub'...
    error: RPC failed; curl 18 transfer closed with outstanding read data remaining
    fatal: early EOF
    fatal: fetch-pack: invalid index-pack output
    fatal: clone of 'http://127.0.0.1:5551/smart_headers/repo.git' into submodule path '/home/pks/Development/git/build/test-output/trash directory.t5551-http-fetch-smart/sub' failed
    Failed to clone 'sub' a second time, aborting
    error: last command exited with $?=1
    not ok 36 - custom http headers
    #	
    #		test_must_fail git -c http.extraheader="x-magic-two: cadabra" \
    #			fetch "$HTTPD_URL/smart_headers/repo.git" &&
    #		git -c http.extraheader="x-magic-one: abra" \
    #		    -c http.extraheader="x-magic-two: cadabra" \
    #		    fetch "$HTTPD_URL/smart_headers/repo.git" &&
    #		git update-index --add --cacheinfo 160000,$(git rev-parse HEAD),sub &&
    #		git config -f .gitmodules submodule.sub.path sub &&
    #		git config -f .gitmodules submodule.sub.url \
    #			"$HTTPD_URL/smart_headers/repo.git" &&
    #		git submodule init sub &&
    #		test_must_fail git submodule update sub &&
    #		git -c http.extraheader="x-magic-one: abra" \
    #		    -c http.extraheader="x-magic-two: cadabra" \
    #			submodule update sub
    #	
    1..36

And Apache also logs this as a timeout:

    [Mon Jun 22 10:26:52.115717 2026] [cgi:warn] [pid 3686957:tid 3686957] [client 127.0.0.1:55114] AH01220: Timeout waiting for output from CGI script /home/pks/Development/git/build/git-http-backend
    [Mon Jun 22 10:26:52.115748 2026] [core:error] [pid 3686957:tid 3686957] (70007)The timeout specified has expired: [client 127.0.0.1:55114] AH00574: ap_content_length_filter: apr_bucket_read() failed
    [Mon Jun 22 10:27:01.567533 2026] [cgi:warn] [pid 3686958:tid 3686958] [client 127.0.0.1:54384] AH01220: Timeout waiting for output from CGI script /home/pks/Development/git/build/git-http-backend
    [Mon Jun 22 10:27:01.567559 2026] [core:error] [pid 3686958:tid 3686958] (70007)The timeout specified has expired: [client 127.0.0.1:54384] AH00574: ap_content_length_filter: apr_bucket_read() failed

This is because our keepalive mechanisms aren't helping:

  - The TCP-level keepalives don't help with Apache.

  - The application-level sideband keepalives don't apply to the
    "ls-refs" endpoint.

Whether that's the same issue as the one we sometimes see on macOS is
a different question.
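
A quick way to check whether a given failure matches this pattern is to
look for the AH01220 warning in Apache's error log. A minimal sketch,
assuming the log path is passed in by hand (count_cgi_timeouts is a
made-up helper, not part of the test suite):

```shell
# Hypothetical helper: count "Timeout waiting for output from CGI
# script" warnings (AH01220) in the Apache error log given as $1.
count_cgi_timeouts () {
	# grep -c prints 0 but exits non-zero when nothing matches, so
	# swallow the exit status to keep "set -e" callers alive.
	grep -c 'AH01220' "$1" || :
}
```

Pass it whichever error log the test's httpd instance writes; a
non-zero count suggests the CGI timeout was hit.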

Patrick

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-22 10:29         ` Patrick Steinhardt
@ 2026-06-26  3:27           ` Michael Montalbo
  2026-06-26  5:16             ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Montalbo @ 2026-06-26  3:27 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jeff King, git, Junio C Hamano

Patrick Steinhardt <ps@pks.im> writes:
> I think the issue is rather simple: we're hitting timeouts in Apache.
> [...]
> This is because our keepalive mechanisms aren't helping [...]
> Whether that's the same issue as the one we sometimes see on macOS is
> a different question.

I think that is the trigger for the issues we've been seeing. I spent
some time investigating the Apache side over the last week and may have
found a mod_http2 bug, which I filed upstream with a potential fix:

  bug:  https://bz.apache.org/bugzilla/show_bug.cgi?id=70131
  fix:  https://github.com/mmontalbo/httpd/pull/2

To Patrick's earlier question of whether this is a Git, curl, or Apache
bug: as best I can tell it's Apache. I could reproduce it with no Git
involved at all (just Apache and a small CGI that goes quiet past the
Timeout), and across several curl versions (8.6.0, which is what the
GitHub runners use, up to 8.20.0), so I don't think bumping curl would
help. It also seems to wear two faces from the same trigger: over
HTTP/1.1 Apache closes the connection and curl bails with the
"transfer closed" error (which looks like what you hit with Timeout=1,
and the recent failures on both macOS and Linux), and over HTTP/2 it
does not reliably reset the stream, so the client just waits, which is
the six-hour macOS hang. I share the pessimism from earlier in the
thread, though: I think the real fix is upstream in Apache, and
anything we do on our side mostly just bounds the symptom in the
meantime.
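
The Git-free reproduction can be approximated with a CGI along these
lines. This is a sketch of the idea, not the exact script from the bug
report; quiet_cgi and QUIET_SECS are made-up names.

```shell
#!/bin/sh
# Hypothetical CGI: send headers and a first chunk, go quiet for $1
# seconds, then send the rest. To provoke the timeout, the quiet period
# has to exceed Apache's Timeout setting.
quiet_cgi () {
	printf 'Content-Type: text/plain\r\n\r\n'
	printf 'first chunk\n'
	sleep "${1:-0}"
	printf 'second chunk\n'
}

# QUIET_SECS defaults to 0 (no stall), so the script is harmless unless
# it is configured otherwise.
quiet_cgi "${QUIET_SECS:-0}"
```

With the "Timeout 1" diff from earlier in the thread applied, a
QUIET_SECS of a few seconds should be enough to cross the limit.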

Given there could be a potential reliability issue with an upstream
dependency like Apache, I was considering what mitigation strategies
might help:

  - Enforce some kind of lower bound speed limit and a client-side
    timeout so runs that wedge fail fast (and loudly) instead of
    hanging.

  - Potentially provide some affordance for retrying flaky tests
    that might fail due to upstream dependencies. Git already has
    some HTTP retry support (http.maxRetries and friends, added
    recently), but as far as I can tell it only triggers on HTTP 429
    rate limiting, so it would not catch a stall like this on its
    own. A test-level retry is not something I like that much, since
    it might encourage papering over flakiness that should be
    resolved, but it was a consideration vs requiring a fresh CI run
    to resolve the flake.

  - Make slow tests faster by optimizing the test itself and/or
    the test runner configuration (e.g., job number matching
    cores) so wedges become less likely.

For the first one, I think Git already provides some affordances. There
is a stall-based timeout that just ships disabled: as I understand it
http.lowSpeedLimit sets a bytes/sec floor and http.lowSpeedTime how long
a transfer can sit below it before curl gives up, so it would catch a
wedged connection without punishing one that is just slow. Enabling it
for the http tests might look something like:

    diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh
    @@ GIT_TRACE=$GIT_TRACE; export GIT_TRACE
    +# Abort a transfer that makes essentially no progress for a while,
    +# so a wedged connection fails in seconds instead of hanging to the
    +# job cap. Tiny limit, generous window, so it only trips on a true
    +# stall; override either var, or set the limit to 0, to disable.
    +GIT_HTTP_LOW_SPEED_LIMIT=${GIT_HTTP_LOW_SPEED_LIMIT-1}
    +GIT_HTTP_LOW_SPEED_TIME=${GIT_HTTP_LOW_SPEED_TIME-60}
    +export GIT_HTTP_LOW_SPEED_LIMIT GIT_HTTP_LOW_SPEED_TIME

I went conservative on the values on purpose: a floor of 1 byte/sec
should only really fire on a true zero-progress stall, not on something
that is just crawling on a slow runner, and the 60s window is generous
for the same reason. When I tried it locally against a stall-proxy it
did turn an otherwise indefinite hang into a bounded abort (a tighter
limit/window brings that down to single-digit seconds). It probably does
not need to be suite-wide either; it could be scoped per-command with
git -c, which the http tests already lean on for this kind of thing
(t5551 passes http.postbuffer and http.extraheader that way), if a
narrower blast radius feels safer.
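
Scoped per command, that could look like the sketch below.
git_stall_bounded is a hypothetical wrapper name; http.lowSpeedLimit and
http.lowSpeedTime are the existing config keys, with the same 1 byte/sec
and 60s values as in the diff.

```shell
# Hypothetical wrapper: run one git command with the stall timeout
# enabled, leaving every other command in the test untouched.
git_stall_bounded () {
	git -c http.lowSpeedLimit=1 -c http.lowSpeedTime=60 "$@"
}

# Example use inside a test (not executed here):
#   git_stall_bounded fetch "$HTTPD_URL/smart_headers/repo.git"
```

Because the settings ride on the command line, they never reach the
repository's config file and need no cleanup afterwards.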

I only dug into the first option in any depth, since I wanted to
sanity-check the direction before writing patches. Does turning on a
stall timeout for the http tests seem reasonable? Are there other
strategies that we should implement?

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-26  3:27           ` Michael Montalbo
@ 2026-06-26  5:16             ` Jeff King
  2026-06-26 10:50               ` Patrick Steinhardt
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2026-06-26  5:16 UTC (permalink / raw)
  To: Michael Montalbo; +Cc: Patrick Steinhardt, git, Junio C Hamano

On Thu, Jun 25, 2026 at 08:27:35PM -0700, Michael Montalbo wrote:

> I think that is the trigger for issues we've been seeing. I spent
> some time investigating the Apache side over the last week and maybe
> found a mod_http2 bug, which I filed upstream with a potential fix:
> 
>   bug:  https://bz.apache.org/bugzilla/show_bug.cgi?id=70131
>   fix:  https://github.com/mmontalbo/httpd/pull/2

Thanks both of you for digging into this. I'm not familiar enough with
Apache's code to pass confident judgement, but your findings certainly
convinced me that this is just an apache bug.

> Given there could be a potential reliability issue with an upstream
> dependency like Apache, I was considering what mitigation strategies
> might help:
> [...]

Depending on how widespread the Apache bug is, another option might just
be: do nothing and wait for it to get fixed.

Trying to make the wedged state fail fast and loudly is mostly just
punting on the problem. We'd still see spurious failures. We've so far
resisted the urge to do any automatic flaky-test retries, preferring
instead to just try to root out the flakes. I'm a little hesitant to
start now, because I think our strategy has mostly been good so far, and
I've seen some horrible counter-examples where flakes and retries become
a routine drag on development (and I'm afraid that accommodating flakes
might make them more common).

>   - Make slow tests faster by optimizing the test itself and/or
>     the test runner configuration (e.g., job number matching
>     cores) so wedges become less likely.

It sounds like the bad state is triggered when Apache hits a timeout,
and we hit that timeout because the system is slow or busy. We could try
to make things less slow, but would it work equally well to increase
that timeout?

-Peff

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-26  5:16             ` Jeff King
@ 2026-06-26 10:50               ` Patrick Steinhardt
  2026-06-26 13:45                 ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Patrick Steinhardt @ 2026-06-26 10:50 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Montalbo, git, Junio C Hamano

On Fri, Jun 26, 2026 at 01:16:57AM -0400, Jeff King wrote:
> On Thu, Jun 25, 2026 at 08:27:35PM -0700, Michael Montalbo wrote:
> 
> > I think that is the trigger for issues we've been seeing. I spent
> > some time investigating the Apache side over the last week and maybe
> > found a mod_http2 bug, which I filed upstream with a potential fix:
> > 
> >   bug:  https://bz.apache.org/bugzilla/show_bug.cgi?id=70131
> >   fix:  https://github.com/mmontalbo/httpd/pull/2
> 
> Thanks both of you for digging into this. I'm not familiar enough with
> Apache's code to pass confident judgement, but your findings certainly
> convinced me that this is just an Apache bug.

The bug manifests with both HTTP/1.1 and HTTP/2 though, so a fix
confined to mod_http2 wouldn't fully fix the flakes we see, right?

> > Given there could be a potential reliability issue with an upstream
> > dependency like Apache, I was considering what mitigation strategies
> > might help:
> > [...]
> 
> Depending on how widespread the Apache bug is, another option might just
> be: do nothing and wait for it to get fixed.
> 
> Trying to make the wedged state fail fast and loudly is mostly just
> punting on the problem. We'd still see spurious failures. We've so far
> resisted the urge to do any automatic flaky-test retries, preferring
> instead to just try to root out the flakes. I'm a little hesitant to
> start now, because I think our strategy has mostly been good so far, and
> I've seen some horrible counter-examples where flakes and retries become
> a routine drag on development (and I'm afraid that accommodating flakes
> might make them more common).

I agree. I'm not a fan of retry logic, as every flaky test may mask an
actual bug that we haven't fully investigated yet.

> >   - Make slow tests faster by optimizing the test itself and/or
> >     the test runner configuration (e.g., job number matching
> >     cores) so wedges become less likely.
> 
> It sounds like the bad state is triggered when Apache hits a timeout,
> and we hit that timeout because the system is slow or busy. We could try
> to make things less slow, but would it work equally well to increase
> that timeout?

I was also wondering whether we can maybe work around the issue by
increasing the Apache timeout value. That sounds like an easy potential
solution to try, and from all we've discovered so far it doesn't feel
like this is something we can address on the Git side.

Patrick

* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-26 10:50               ` Patrick Steinhardt
@ 2026-06-26 13:45                 ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2026-06-26 13:45 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jeff King, Michael Montalbo, git

Patrick Steinhardt <ps@pks.im> writes:

>> Trying to make the wedged state fail fast and loudly is mostly just
>> punting on the problem. We'd still see spurious failures. We've so far
>> resisted the urge to do any automatic flaky-test retries, preferring
>> instead to just try to root out the flakes. I'm a little hesitant to
>> start now, because I think our strategy has mostly been good so far, and
>> I've seen some horrible counter-examples where flakes and retries become
>> a routine drag on development (and I'm afraid that accommodating flakes
>> might make them more common).
>
> I agree. I'm not a fan of retry logic, as every flaky test may mask an
> actual bug that we haven't fully investigated yet.

Can't agree more.

> I was also wondering whether we can maybe work around the issue by
> increasing the Apache timeout value. That sounds like an easy potential
> solution to try, and from all we've discovered so far it doesn't feel
> like this is something we can address on the Git side.

Thanks, all, for looking into this.

Thread overview: 13+ messages
2026-06-20 15:33 [RFH] Why do osx CI jobs so unreliable? Michael Montalbo
2026-06-21 21:34 ` Jeff King
2026-06-22  4:42   ` Patrick Steinhardt
2026-06-22  9:47     ` Patrick Steinhardt
2026-06-22  9:55       ` Patrick Steinhardt
2026-06-22 10:29         ` Patrick Steinhardt
2026-06-26  3:27           ` Michael Montalbo
2026-06-26  5:16             ` Jeff King
2026-06-26 10:50               ` Patrick Steinhardt
2026-06-26 13:45                 ` Junio C Hamano
2026-06-22  5:05   ` Junio C Hamano
  -- strict thread matches above, loose matches on Subject: below --
2026-06-19  0:35 Junio C Hamano
2026-06-19 14:03 ` Patrick Steinhardt
