* [RFH] Why do osx CI jobs so unreliable?
From: Junio C Hamano @ 2026-06-19 0:35 UTC
To: git
I've been observing that in recent push-outs to 'master' and 'next',
osx-* jobs in GitHub Actions CI keep running for 6 hours and get
killed.
What is troubling is that this seems to be very flaky. For example,
https://github.com/git/git/actions/runs/27778820659 is testing
95e20213 (Hopefully final batch before -rc2, 2026-06-17) which got
killed after wasting 6 hours in osx-clang and osx-gcc jobs.
https://github.com/git/git/actions/runs/27790036076 is testing
the same 'master', with a patch to .github/workflows/main.yml to
remove everything except for config and osx-* jobs, which succeeded
within 30 minutes.
Stumped...
* Re: [RFH] Why do osx CI jobs so unreliable?
From: Patrick Steinhardt @ 2026-06-19 14:03 UTC
To: Junio C Hamano; +Cc: git
On Thu, Jun 18, 2026 at 05:35:23PM -0700, Junio C Hamano wrote:
> I've been observing that in recent push-outs to 'master' and 'next',
> osx-* jobs in GitHub Actions CI keep running for 6 hours and get
> killed.
>
> What is troubling is that this seems to be very flaky. For example,
> https://github.com/git/git/actions/runs/27778820659 is testing
> 95e20213 (Hopefully final batch before -rc2, 2026-06-17) which got
> killed after wasting 6 hours in osx-clang and osx-gcc jobs.
>
> https://github.com/git/git/actions/runs/27790036076 is testing
> the same 'master', with a patch to .github/workflows/main.yml to
> remove everything except for config and osx-* jobs, which succeeded
> within 30 minutes.
>
> Stumped...
So the raw logs have the following trailer:

    2026-06-18T23:53:33.2996180Z Cleaning up orphan processes
    2026-06-18T23:53:33.7900380Z Terminate orphan process: pid (34022) (git-remote-http)
    2026-06-18T23:53:33.9848670Z Terminate orphan process: pid (15488) (httpd)
    2026-06-18T23:53:34.0321490Z Terminate orphan process: pid (13146) (httpd)
    2026-06-18T23:53:34.0808280Z Terminate orphan process: pid (13145) (httpd)
    2026-06-18T23:53:34.1212760Z Terminate orphan process: pid (13144) (httpd)
    2026-06-18T23:53:34.1570160Z Terminate orphan process: pid (13141) (httpd)
    2026-06-18T23:53:34.1924140Z Terminate orphan process: pid (12553) (bash)
    2026-06-18T23:53:34.2472970Z Terminate orphan process: pid (12552) (tee)
    2026-06-18T23:53:34.6547890Z Terminate orphan process: pid (21209) (bash)

So I strongly suspect that it must be one of the t555* tests.
Furthermore, t5551 and t5559 (which are actually the same test) are
the only test suites that use lib-httpd.sh and are missing from the
job logs.
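
A minimal sketch of how the orphan-process trailer above could be tallied from a downloaded raw log. The helper name `summarize_orphans` is hypothetical and not part of git's CI scripts; it only assumes the "Terminate orphan process: pid (N) (name)" line format shown in the trailer.

```shell
# Hypothetical helper (not part of git's CI scripts): tally the orphan
# processes that the runner reports at the end of a raw job log.
# Reads the named files, or standard input when no file is given.
summarize_orphans () {
	# Keep only the process name from the trailing parentheses, then
	# count how often each name occurs, most frequent first.
	sed -n 's/.*Terminate orphan process: pid ([0-9]*) (\(.*\))$/\1/p' "$@" |
	sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

Fed the trailer quoted above, this would rank httpd first with five entries, which is what points at the lib-httpd.sh users.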
I have not been able to reproduce this hang on my macOS virtual machine
though, and on GitLab I didn't notice a similar hang recently. Maybe
this is something that's specific to GitHub's environment...? No idea.
Patrick
* Re: [RFH] Why do osx CI jobs so unreliable?
From: Michael Montalbo @ 2026-06-20 15:33 UTC
To: Patrick Steinhardt; +Cc: git, Junio C Hamano
Patrick Steinhardt <ps@pks.im> writes:
> So I strongly suspect that it must be one of the t555* tests.
> [...]
> Maybe this is something that's specific to GitHub's environment...
I think you're right that it's t5551/t5559. The runs Junio linked:

    job           result     duration
    osx-clang     cancelled  360min
    osx-gcc       cancelled  360min
    osx-reftable  success    35min
    osx-meson     success    61min

All four run the same t5551/t5559 under EXPENSIVE. The two that
finished differ in just two ways, which look like the levers:

 - osx-reftable generates the 100k-ref advertisement in ~24ms, vs
   ~1.2s for loose refs on macOS, so it spends much less time
   mid-response.

 - osx-meson runs tests at nproc, while the prove jobs hardcode
   --jobs=10 on a 3-core runner. Over recent master/next the prove
   jobs hang ~40% of the time, meson ~10%.
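
The loose-versus-packed difference described above can be explored in a scratch repository. This is a rough sketch, not taken from t5551: the helper names `make_loose_refs` and `count_loose_tags`, the tag names, and the demo identity are all made up for illustration, and it assumes the "files" ref backend.

```shell
# Hypothetical demo, not taken from t5551: create many loose tags in a
# scratch repository so that their cost can be compared with packed refs.
make_loose_refs () {
	dir=$1 n=$2
	# Ask for the "files" backend explicitly; Git versions that do not
	# know this variable use that backend anyway.
	GIT_DEFAULT_REF_FORMAT=files git init -q "$dir" &&
	git -C "$dir" -c user.name=demo -c user.email=demo@example.com \
		commit -q --allow-empty -m base &&
	oid=$(git -C "$dir" rev-parse HEAD) &&
	i=1 &&
	# Feed one "create" command per tag to a single update-ref process.
	while [ "$i" -le "$n" ]
	do
		echo "create refs/tags/demo-$i $oid"
		i=$((i + 1))
	done | git -C "$dir" update-ref --stdin
}

# Count the tags that still exist as individual files on disk.
count_loose_tags () {
	find "$1/.git/refs/tags" -type f 2>/dev/null | wc -l | tr -d ' '
}
```

After `make_loose_refs demo 100000`, timing `git -C demo for-each-ref` before and after `git -C demo pack-refs --all` would be one way to see how much of the delay comes from reading loose files; `count_loose_tags demo` shows whether the tags were actually packed.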
When it is wedged, the whole chain sits at 0% CPU: upload-pack is
blocked in write() on the ls-refs advertisement, and curl is blocked
in select(). So it looks like an HTTP/2 flow-control stall on the
response side. The same stall resets itself after ~60-85s on my Linux
box and on a bare-metal Mac, but not on the GitHub runner; I haven't
pinned down why yet.
On the chance those two levers are the fix, a branch off master:
https://github.com/mmontalbo/git/tree/mm/macos-ci-hang-fix

 - pack the refs in t5551's enormous-ref-negotiation test (doesn't
   change what it checks on the wire, just avoids re-reading 100k loose
   files to advertise them, like reftable already does)

 - use the core count for $JOBS on the GitHub macOS path, matching the
   GitLab branch in the same ci/lib.sh and what meson does

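
The second item could look roughly like the sketch below. It is not the code in ci/lib.sh or on the branch above; the helper name `detect_jobs` is hypothetical, and using `sysctl -n hw.ncpu` on macOS and `getconf _NPROCESSORS_ONLN` elsewhere is an assumption about how the core count would be probed.

```shell
# Hypothetical helper, not the code in ci/lib.sh: derive the number of
# parallel test jobs from the core count instead of a fixed value.
detect_jobs () {
	case "$(uname -s)" in
	Darwin)
		n=$(sysctl -n hw.ncpu 2>/dev/null)
		;;
	*)
		n=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
		;;
	esac
	# Fall back to a single job if the probe failed or printed junk.
	case "$n" in
	''|*[!0-9]*|0)
		n=1
		;;
	esac
	echo "$n"
}

JOBS=$(detect_jobs)
```

On the 3-core runner mentioned above this would yield 3 rather than 10, so the test harness would no longer run more than three times as many jobs as there are cores.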
I ran the two macOS jobs under EXPENSIVE about eight times with these
and they all finished in ~30-44min instead of hanging. Happy to send
out a patch if it's helpful.