[RFH] Why do osx CI jobs so unreliable?


* [RFH] Why do osx CI jobs so unreliable?
@ 2026-06-19  0:35 Junio C Hamano
  2026-06-19 14:03 ` Patrick Steinhardt
  0 siblings, 1 reply; 3+ messages in thread
From: Junio C Hamano @ 2026-06-19  0:35 UTC (permalink / raw)
  To: git

I've been observing that in recent push-outs to 'master' and 'next',
osx-* jobs in GitHub Actions CI keep running for 6 hours and get
killed.

What is troubling is that this seems to be very flaky.  For example,
https://github.com/git/git/actions/runs/27778820659 is testing
95e20213 (Hopefully final batch before -rc2, 2026-06-17) which got
killed after wasting 6 hours in osx-clang and osx-gcc jobs.

https://github.com/git/git/actions/runs/27790036076 is testing
the same 'master', with a patch to .github/workflows/main.yml to
remove everything except for config and osx-* jobs, which succeeded
within 30 minutes.

Stumped...


* Re: [RFH] Why do osx CI jobs so unreliable?
  2026-06-19  0:35 [RFH] Why do osx CI jobs so unreliable? Junio C Hamano
@ 2026-06-19 14:03 ` Patrick Steinhardt
  0 siblings, 0 replies; 3+ messages in thread
From: Patrick Steinhardt @ 2026-06-19 14:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Jun 18, 2026 at 05:35:23PM -0700, Junio C Hamano wrote:
> I've been observing that in recent push-outs to 'master' and 'next',
> osx-* jobs in GitHub Actions CI keep running for 6 hours and get
> killed.
> 
> What is troubling is that this seems to be very flaky.  For example,
> https://github.com/git/git/actions/runs/27778820659 is testing
> 95e20213 (Hopefully final batch before -rc2, 2026-06-17) which got
> killed after wasting 6 hours in osx-clang and osx-gcc jobs.
> 
> https://github.com/git/git/actions/runs/27790036076 is testing
> the same 'master', with a patch to .github/workflows/main.yml to
> remove everything except for config and osx-* jobs, which succeeded
> within 30 minutes.
> 
> Stumped...

So the raw logs have the following trailer:

  2026-06-18T23:53:33.2996180Z Cleaning up orphan processes
  2026-06-18T23:53:33.7900380Z Terminate orphan process: pid (34022) (git-remote-http)
  2026-06-18T23:53:33.9848670Z Terminate orphan process: pid (15488) (httpd)
  2026-06-18T23:53:34.0321490Z Terminate orphan process: pid (13146) (httpd)
  2026-06-18T23:53:34.0808280Z Terminate orphan process: pid (13145) (httpd)
  2026-06-18T23:53:34.1212760Z Terminate orphan process: pid (13144) (httpd)
  2026-06-18T23:53:34.1570160Z Terminate orphan process: pid (13141) (httpd)
  2026-06-18T23:53:34.1924140Z Terminate orphan process: pid (12553) (bash)
  2026-06-18T23:53:34.2472970Z Terminate orphan process: pid (12552) (tee)
  2026-06-18T23:53:34.6547890Z Terminate orphan process: pid (21209) (bash)

So I strongly suspect that it must be one of the t555* tests.
Furthermore, t5551 and t5559 (which are actually the same test) are
the only test suites using lib-httpd.sh that are missing from the job
logs.
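
That cross-check could be scripted roughly as below. This is only a
sketch: missing_httpd_tests is a hypothetical helper, and it assumes
that a test which ran to completion shows up in the saved raw log by
its script name, which may not match how the CI output is actually
laid out.

```shell
#!/bin/sh
# Hypothetical helper: list test scripts that source lib-httpd.sh but
# whose name never appears in a saved job log.
# Assumption: a finished test is mentioned in the log by script name.
missing_httpd_tests () {
	test_dir=$1
	log=$2
	# Find every test script that mentions lib-httpd.sh ...
	grep -l 'lib-httpd\.sh' "$test_dir"/t[0-9]*.sh 2>/dev/null |
	while read -r script
	do
		name=$(basename "$script" .sh)
		# ... and report the ones the log never mentions.
		grep -q -F "$name" "$log" || echo "$name"
	done
}
```

Called as `missing_httpd_tests t raw-log.txt` from the top of a
checkout, it prints one candidate test name per line.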

I have not been able to reproduce this hang on my macOS virtual machine
though, and on GitLab I didn't notice a similar hang recently. Maybe
this is something that's specific to GitHub's environment...? No idea.

Patrick


* Re: [RFH] Why do osx CI jobs so unreliable?
@ 2026-06-20 15:33 Michael Montalbo
  0 siblings, 0 replies; 3+ messages in thread
From: Michael Montalbo @ 2026-06-20 15:33 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Junio C Hamano

Patrick Steinhardt <ps@pks.im> writes:
> So I strongly suspect that it must be one of the t555* tests.
> [...]
> Maybe this is something that's specific to GitHub's environment...

I think you're right that it's t5551/t5559. From the runs Junio linked:

  osx-clang     cancelled  360min
  osx-gcc       cancelled  360min
  osx-reftable  success     35min
  osx-meson     success     61min

All four run the same t5551/t5559 under EXPENSIVE. The two that
finished each differ from the hanging jobs in one way, and those two
differences look like the levers. osx-reftable generates the 100k-ref
advertisement in ~24ms vs ~1.2s for loose refs on macOS, so it spends
much less time mid-response. osx-meson runs tests at nproc, while the
prove jobs hardcode --jobs=10 on a 3-core runner. Over recent
master/next the prove jobs hang ~40% of the time, meson ~10%.

When it is wedged, the whole chain sits at 0% CPU: upload-pack is
blocked in write() on the ls-refs advertisement and curl is blocked in
select(). So it looks like an HTTP/2 flow-control stall on the
response side. The same stall resets itself after ~60-85s on my Linux
box and on a bare-metal Mac, but not on the GitHub runner; I haven't
pinned down why yet.

On the chance those two levers are the fix, a branch off master:

  https://github.com/mmontalbo/git/tree/mm/macos-ci-hang-fix

  - pack the refs in t5551's enormous-ref-negotiation test (doesn't
    change what it checks on the wire, just avoids re-reading 100k loose
    files to advertise them, like reftable already does)
  - use the core count for $JOBS on the GitHub macOS path, matching the
    GitLab branch in the same ci/lib.sh and what meson does
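
To make the two levers concrete, here is a minimal sketch. It is not
the code on the branch: core_count and pack_all_refs are hypothetical
names, and the real changes live in t5551 and ci/lib.sh. The only
commands relied on are `git pack-refs --all`, `sysctl -n hw.ncpu`
(macOS), `getconf _NPROCESSORS_ONLN` and `nproc`.

```shell
#!/bin/sh
# Illustrative sketch only; core_count and pack_all_refs are hypothetical
# names, not what the branch or ci/lib.sh actually contains.

# Print the number of online cores. Try the macOS sysctl first, then
# getconf and nproc, and fall back to a conservative default of 1.
core_count () {
	for cmd in "sysctl -n hw.ncpu" "getconf _NPROCESSORS_ONLN" "nproc"
	do
		n=$($cmd 2>/dev/null) || continue
		case "$n" in
		''|*[!0-9]*) continue ;;	# ignore empty or non-numeric output
		esac
		echo "$n"
		return 0
	done
	echo 1
}

# Collapse the loose refs of the repository at "$1" into packed-refs, so
# that advertising them does not need to open one file per ref.
pack_all_refs () {
	git -C "$1" pack-refs --all
}

JOBS=$(core_count)
echo "would run tests with --jobs=$JOBS"
```

On a 3-core runner this would give --jobs=3 rather than the hardcoded
10, which is the oversubscription the second bullet is meant to avoid.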

I ran the two macOS jobs under EXPENSIVE about eight times with these
and they all finished in ~30-44min instead of hanging. Happy to send
out a patch if it's helpful.

