Re: [RFH] Why do osx CI jobs so unreliable?

From: Jeff King <peff@peff.net>
To: Patrick Steinhardt <ps@pks.im>
Cc: Michael Montalbo <mmontalbo@gmail.com>,
	git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [RFH] Why do osx CI jobs so unreliable?
Date: Fri, 26 Jun 2026 19:43:12 -0400
Message-ID: <20260626234312.GA3156205@coredump.intra.peff.net>
In-Reply-To: <aj5ZaZK7xylfs4Xw@pks.im>

On Fri, Jun 26, 2026 at 12:50:17PM +0200, Patrick Steinhardt wrote:

> > Thanks both of you for digging into this. I'm not familiar enough with
> > Apache's code to pass confident judgement, but your findings certainly
> > convinced me that this is just an apache bug.
> 
> The bug manifests both with HTTP/1.1 and HTTP/2 though, so this wouldn't
> fully fix the flakes we see, right?

If I understand the situation correctly, there are really two problems.

First, the EXPENSIVE tests in t5551 sometimes trigger a timeout with
Apache's stock settings. This presumably became a problem recently due
to 7a094d68a2 (ci: run expensive tests on push builds to integration
branches, 2026-05-08). The same problem exists in t5559, which just
wraps t5551 but tells us to use http2.

This timeout will cause test failures in t5551, because we aren't able
to complete a request we expected to. Obviously bad and annoying.

The second problem is that when Apache hits the timeout in HTTP/2 mode,
it hangs forever. And then the CI job hangs for 6 hours until it's
killed, which is an even more annoying failure.

So the root cause is the same (a timeout), but the effect depends on
HTTP/1.1 vs HTTP/2. I was able to reproduce both cases on my local
Debian unstable system by dropping the timeout as you suggested. Running
t5551 with GIT_TEST_LONG yields a failure, whereas running t5559 yields
a hang.

We can mitigate both cases by bumping the timeout value, since that
addresses the shared root cause.

There's an open question of whether this is just papering over a problem
that real users might experience, and whether Git should be doing more
to keep the connection alive. I think it's probably OK to ignore this in
practice. This is an intentionally large request being served by a very
underpowered platform. The default Apache timeout is 60s. If a
real-world server is seeing ls-refs requests take that long then they
probably need to reconsider some other decisions, from ref packing to
better hardware to dropping some users. ;) I don't think trying to
insert keepalives at the Git layer here is worth the trouble.

To give a sense of the time scales, here are a few timings from my
local machine, timing "git upload-pack . </dev/null >/dev/null" in
t5551's big repo.git (that's a v0 advertisement, but it should be
roughly the same work as the v2 ls-refs).

  cold-cache, refs not packed:
  real	0m9.973s
  user	0m0.354s
  sys	0m1.364s

  warm cache, refs not packed:
  real	0m0.410s
  user	0m0.153s
  sys	0m0.257s

  cold-cache, refs packed:
  real	0m0.149s
  user	0m0.086s
  sys	0m0.035s

  warm cache, refs packed:
  real	0m0.069s
  user	0m0.054s
  sys	0m0.016s

So 10s is pretty abysmal (and on an SSD, no less). I would expect the
cache to be warm (we just wrote these refs!) but I could also believe
that CI systems are under heavy I/O and memory pressure, so we sometimes
end up crossing the 60s mark.

So bumping Apache's timeout to 600s or something would probably be a
fine mitigation. That's still not _solving_ the problem, but presumably
an order of magnitude is enough for it to never come up in practice.
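
As a rough sketch of what that bump could look like: Timeout is Apache's
core directive for this, and 600 is the value floated above. The scratch
file below is hypothetical; where the fragment would actually live in a
given test setup is an assumption left open here.

```shell
#!/bin/sh
# Sketch only: write an Apache config fragment that raises the I/O timeout.
# The temp file stands in for whatever config the test server really loads.
set -e

conf=$(mktemp)
cat >"$conf" <<\EOF
# Seconds Apache waits on I/O before giving up on a request (default: 60).
Timeout 600
EOF

cat "$conf"
```

The fragment would still need to be included from the config that the
server instance under test reads.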

Michael suggested packing the refs as a mitigation. I was lukewarm on
that in my previous email, because it wasn't clear to me how close we
were on the timeout budget, and if it would just make the race less
frequent (rather than never happen). But seeing those cold-cache numbers
makes me think it might be worth doing just on principle to make the
tests more efficient, and any timeout mitigation is a bonus.

Of course the pack-refs process (and the initial ref writes) will still
have to touch all of those loose files, so those will still be slow. But
they're not on a timeout, and I suspect we read the result many more
times than we write/pack (the test failures we are seeing are not in the
expensive tests, but just "normal" tests that are stuck with the
gigantic ref state).
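
To illustrate what packing changes on disk, here is a minimal sketch in a
throwaway repository. It assumes the "files" ref backend; the ref names,
the count of 1000, and the use of a blob as the tag target are all made up
for illustration and are not how t5551 builds its repository.

```shell
#!/bin/sh
# Sketch: compare loose vs. packed ref storage in a scratch bare repo.
set -e

# Count loose ref files under refs/ (tr strips the padding some wc's add).
count_loose_refs () {
	find "$1/refs" -type f | wc -l | tr -d ' '
}

repo=$(mktemp -d)
# Request the files backend explicitly; Git versions that do not know
# this config key ignore it and use the files backend anyway.
git -c init.defaultRefFormat=files init -q --bare "$repo"

# Any object can be a tag target; a blob avoids needing a committer identity.
oid=$(echo content | git -C "$repo" hash-object -w --stdin)

# Create 1000 refs in one transaction; each one lands as a loose file.
i=1
while [ "$i" -le 1000 ]; do
	echo "create refs/tags/t-$i $oid"
	i=$((i + 1))
done | git -C "$repo" update-ref --stdin

loose_before=$(count_loose_refs "$repo")
git -C "$repo" pack-refs --all
loose_after=$(count_loose_refs "$repo")
packed=$(grep -c refs/tags/ "$repo/packed-refs")

echo "loose before: $loose_before, loose after: $loose_after, packed: $packed"
rm -rf "$repo"
```

After packing, a reader of the ref advertisement opens one packed-refs
file instead of one file per ref, which is where the cold-cache difference
in the timings above would come from.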

-Peff

