Git development
* [RFH] Why do osx CI jobs so unreliable?
@ 2026-06-19  0:35 Junio C Hamano
  2026-06-19 14:03 ` Patrick Steinhardt
From: Junio C Hamano @ 2026-06-19  0:35 UTC
  To: git

I've been observing that in recent push-outs to 'master' and 'next',
osx-* jobs in GitHub Actions CI keep running for 6 hours and then get
killed.

What is troubling is that this seems to be very flaky.  For example,
https://github.com/git/git/actions/runs/27778820659 is testing
95e20213 (Hopefully final batch before -rc2, 2026-06-17) which got
killed after wasting 6 hours in osx-clang and osx-gcc jobs.

https://github.com/git/git/actions/runs/27790036076 is testing
the same 'master', with a patch to .github/workflows/main.yml to
remove everything except for config and osx-* jobs, which succeeded
within 30 minutes.

Stumped...

* Re: [RFH] Why do osx CI jobs so unreliable?
@ 2026-06-20 15:33 Michael Montalbo
From: Michael Montalbo @ 2026-06-20 15:33 UTC
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano

Patrick Steinhardt <ps@pks.im> writes:
> So I strongly suspect that it must be one of the t555* tests.
> [...]
> Maybe this is something that's specific to GitHub's environment...

I think you're right that it's t5551/t5559. The runs Junio linked:

  osx-clang     cancelled  360min
  osx-gcc       cancelled  360min
  osx-reftable  success     35min
  osx-meson     success     61min

All four run the same t5551/t5559 under EXPENSIVE. The two that
finished each differ from the hanging jobs in one way, and those two
differences look like the levers:

  - osx-reftable generates the 100k-ref advertisement in ~24ms, vs
    ~1.2s for loose refs on macOS, so it spends much less time
    mid-response
  - osx-meson runs tests at nproc, while the prove jobs hardcode
    --jobs=10 on a 3-core runner (over recent master/next the prove
    jobs hang ~40% of the time, meson ~10%)
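
To make the loose-vs-packed lever concrete, here is a minimal shell
sketch. It is not taken from t5551 or from any patch in this thread.
It assumes a Git that accepts `git init --ref-format=files` (falling
back to plain `git init` otherwise), and both the ref count `n` and the
`refs/tags/demo-*` names are made up for illustration. It creates loose
refs, packs them, and reports that the visible refs stay the same while
the per-ref files go away.

```shell
#!/bin/sh
# Illustrative only: compare loose refs with packed refs in a scratch repo.
set -e

n=1000                       # hypothetical count; the thread talks about 100k refs
tmp=$(mktemp -d)
repo="$tmp/repo"

# Ask for the "files" backend where supported so that refs start out loose.
git init -q --ref-format=files "$repo" 2>/dev/null || git init -q "$repo"
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
	commit -q --allow-empty -m base
oid=$(git -C "$repo" rev-parse HEAD)

# Create n tags in one transaction; with the files backend each tag
# ends up as its own file under .git/refs/.
i=1
while [ "$i" -le "$n" ]; do
	echo "create refs/tags/demo-$i $oid"
	i=$((i + 1))
done | git -C "$repo" update-ref --stdin

count_loose () { find "$repo/.git/refs" -type f | wc -l | tr -d ' '; }
count_refs () { git -C "$repo" for-each-ref | wc -l | tr -d ' '; }

loose_before=$(count_loose)
refs_before=$(count_refs)

# Move every ref into the single packed-refs file; loose copies are pruned.
git -C "$repo" pack-refs --all

loose_after=$(count_loose)
refs_after=$(count_refs)

echo "loose files:  $loose_before -> $loose_after"
echo "visible refs: $refs_before -> $refs_after"
rm -rf "$tmp"
```

The sketch only shows the storage difference. It does not try to
reproduce the timings quoted above, which depend on the runner.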

When it is wedged, the whole chain sits at 0% CPU: upload-pack is
blocked in write() on the ls-refs advertisement, and curl is blocked
in select(). So it looks like an HTTP/2 flow-control stall on the
response side. The same stall resets itself after ~60-85s on my Linux
box and on a bare-metal Mac, but not on the GitHub runner; I haven't
pinned down why yet.

On the chance those two levers are the fix, a branch off master:

  https://github.com/mmontalbo/git/tree/mm/macos-ci-hang-fix

  - pack the refs in t5551's enormous-ref-negotiation test (doesn't
    change what it checks on the wire, just avoids re-reading 100k loose
    files to advertise them, like reftable already does)
  - use the core count for $JOBS on the GitHub macOS path, matching the
    GitLab branch in the same ci/lib.sh and what meson does
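
As a rough illustration of the second item, a job count derived from
the core count could look like the sketch below. This is not the actual
ci/lib.sh change from the branch above. `detect_cores` is a
hypothetical helper, and the probe order (`nproc`, then
`sysctl -n hw.ncpu`, then `getconf _NPROCESSORS_ONLN`) is an assumption
about which tools a given runner provides.

```shell
#!/bin/sh
# Hypothetical helper: print the number of available cores, or 1 if unknown.
detect_cores () {
	if command -v nproc >/dev/null 2>&1; then
		# GNU coreutils; typically present on Linux runners.
		nproc
	elif n=$(sysctl -n hw.ncpu 2>/dev/null) && [ -n "$n" ]; then
		# macOS and the BSDs.
		echo "$n"
	else
		# POSIX-ish fallback; default to a single job if that fails too.
		getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
	fi
}

JOBS=$(detect_cores)
echo "test harness would get --jobs=$JOBS"
```

On a runner that reports 3 cores, this yields 3 rather than a fixed 10,
so the number of concurrent test scripts tracks the hardware.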

I ran the two macOS jobs under EXPENSIVE about eight times with these
changes and they all finished in ~30-44min instead of hanging. Happy
to send out a patch if it's helpful.
