From: Patrick Steinhardt <ps@pks.im>
To: Jeff King <peff@peff.net>
Cc: Michael Montalbo <mmontalbo@gmail.com>,
git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [RFH] Why do osx CI jobs so unreliable?
Date: Mon, 22 Jun 2026 12:29:53 +0200
Message-ID: <ajkOoRhqaAcy6gBg@pks.im>
In-Reply-To: <ajkGkB2ckf3p43QR@pks.im>

On Mon, Jun 22, 2026 at 11:55:31AM +0200, Patrick Steinhardt wrote:
> On Mon, Jun 22, 2026 at 11:48:01AM +0200, Patrick Steinhardt wrote:
> > On Mon, Jun 22, 2026 at 06:42:24AM +0200, Patrick Steinhardt wrote:
> > > On Sun, Jun 21, 2026 at 05:34:07PM -0400, Jeff King wrote:
> > > > On Sat, Jun 20, 2026 at 08:33:13AM -0700, Michael Montalbo wrote:
> > [snip]
> > > > > When it is wedged the whole chain sits at 0% CPU. upload-pack is
> > > > > blocked in write() on the ls-refs advertisement, curl blocked in
> > > > > select(). So it looks like an HTTP/2 flow-control stall on the
> > > > > response side. The same stall resets itself after ~60-85s on my Linux
> > > > > box and on a bare-metal Mac, but not on the GitHub runner; I haven't
> > > > > pinned down why yet.
> > > >
> > > > We had some HTTP/2 stalls/deadlocks in the past, and they were dependent
> > > > on libcurl and apache (actually mod_http2) versions. IIRC some of the
> > > > non-TLS code paths for HTTP/2 were not well tested, which led to
> > > > 8f2146dbf1 (t5559: make SSL/TLS the default, 2023-02-23). Of course
> > > > after that commit those cleartext code paths should not be a problem, so
> > > > that is probably not exactly the issue now.
> > > >
> > > > But it might be worth checking the versions you're running locally
> > > > versus what's in the GitHub runner.
> > >
> > > I didn't observe any similar hangs in GitLab's CI systems, so I wonder
> > > whether this is because of different versions of curl. And indeed we use
> > > different versions:
> > >
> > > - On GitHub we use 8.6.0.
> > >
> > > - On GitLab we use 8.7.1.
> > >
> > > Now this of course doesn't mean that updating the curl version is the
> > > fix to this whole issue, as there's a ton of other factors that could
> > > play a role in whether or not the test hangs. So while we could just
> > > upgrade parts of the stack and cross our fingers, that feels rather
> > > unsatisfactory. Still, one place to start could be to update our build
> > > images to macOS 15.
> > >
> > > But the big question to me is whether the hang is because of a bug in
> > > Git with how we drive curl, a bug in curl itself, or a bug in Apache.
> >
> > I noticed that an osx-clang job failed today in t5551 [1]. This time it
> > didn't hang, but produced an actual error:
> >
> > 2026-06-22T09:25:45.1984230Z ++ git -C too-many-refs fetch -q --tags
> > 2026-06-22T09:25:45.1984420Z error: RPC failed; curl 18 transfer closed with outstanding read data remaining
> > 2026-06-22T09:25:45.1984520Z fatal: expected flush after ref listing
> > 2026-06-22T09:25:45.1984610Z error: last command exited with $?=128
> > 2026-06-22T09:25:45.1984660Z ++ rm -f tags
> > 2026-06-22T09:25:45.1984710Z ++ :
> > 2026-06-22T09:25:45.1984830Z not ok 35 - http can handle enormous ref negotiation
> >
> > There was a second test failing similarly.
>
> Oh, and Linux is also failing in the same test suite [1]. The job logs
> are truncated though, so it's hard to say whether it's the same failure
> or not.
>
> There certainly seems to be a deeper issue here. We could of course just
> disable the test again, but by now I do wonder whether this would paper
> over an actual bug.
>
> Patrick
>
> [1]: https://github.com/git/git/actions/runs/27940620478/job/82672854864

Sorry for the repeated spam.

I think the issue is rather simple: we're hitting timeouts in Apache. If
you apply the following diff:

diff --git a/t/lib-httpd/apache.conf b/t/lib-httpd/apache.conf
index 40a690b0bb..4054fe008f 100644
--- a/t/lib-httpd/apache.conf
+++ b/t/lib-httpd/apache.conf
@@ -302,3 +302,5 @@ RewriteRule ^/half-auth-complete/ - [E=AUTHREQUIRED:yes]
 SVNPath "${LIB_HTTPD_SVNPATH}"
 </Location>
 </IfDefine>
+
+Timeout 1

Then you'll see the same errors locally:

$ GIT_TEST_LONG=Yes meson test t5551-http-fetch-smart --test-args=-ix -i
Failed to clone 'sub'. Retry scheduled
Cloning into '/home/pks/Development/git/build/test-output/trash directory.t5551-http-fetch-smart/sub'...
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
fatal: clone of 'http://127.0.0.1:5551/smart_headers/repo.git' into submodule path '/home/pks/Development/git/build/test-output/trash directory.t5551-http-fetch-smart/sub' failed
Failed to clone 'sub' a second time, aborting
error: last command exited with $?=1
not ok 36 - custom http headers
#
# test_must_fail git -c http.extraheader="x-magic-two: cadabra" \
# fetch "$HTTPD_URL/smart_headers/repo.git" &&
# git -c http.extraheader="x-magic-one: abra" \
# -c http.extraheader="x-magic-two: cadabra" \
# fetch "$HTTPD_URL/smart_headers/repo.git" &&
# git update-index --add --cacheinfo 160000,$(git rev-parse HEAD),sub &&
# git config -f .gitmodules submodule.sub.path sub &&
# git config -f .gitmodules submodule.sub.url \
# "$HTTPD_URL/smart_headers/repo.git" &&
# git submodule init sub &&
# test_must_fail git submodule update sub &&
# git -c http.extraheader="x-magic-one: abra" \
# -c http.extraheader="x-magic-two: cadabra" \
# submodule update sub
#
1..36

And Apache also logs this as a timeout:

[Mon Jun 22 10:26:52.115717 2026] [cgi:warn] [pid 3686957:tid 3686957] [client 127.0.0.1:55114] AH01220: Timeout waiting for output from CGI script /home/pks/Development/git/build/git-http-backend
[Mon Jun 22 10:26:52.115748 2026] [core:error] [pid 3686957:tid 3686957] (70007)The timeout specified has expired: [client 127.0.0.1:55114] AH00574: ap_content_length_filter: apr_bucket_read() failed
[Mon Jun 22 10:27:01.567533 2026] [cgi:warn] [pid 3686958:tid 3686958] [client 127.0.0.1:54384] AH01220: Timeout waiting for output from CGI script /home/pks/Development/git/build/git-http-backend
[Mon Jun 22 10:27:01.567559 2026] [core:error] [pid 3686958:tid 3686958] (70007)The timeout specified has expired: [client 127.0.0.1:54384] AH00574: ap_content_length_filter: apr_bucket_read() failed

This is because our keepalive mechanisms aren't helping:

- The TCP-level keepalives don't help with Apache.

- The application-level sideband keepalives don't apply to the
  "ls-refs" endpoint.
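
To illustrate the mechanism, here is a rough sketch in Python. It is not
Apache's or Git's actual code, only a toy reader that gives up when a child
process stays silent for longer than an idle timeout, which is what the
AH01220 log lines above suggest is happening with the CGI script. The names
`read_with_idle_timeout`, `SILENT_CHILD` and `CHATTY_CHILD` are made up for
this example, and the periodic "keepalive" line only stands in for what an
application-level keepalive would do on an endpoint that supports one:

```python
import os
import select
import subprocess
import sys


def read_with_idle_timeout(argv, idle_timeout):
    """Collect a child's stdout, giving up if it stays silent for too long.

    Returns (data, timed_out). The timer restarts whenever bytes arrive, so
    only the gaps between outputs count, not the total runtime.
    Uses select() on a pipe, so this sketch assumes a Unix-like system.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    fd = proc.stdout.fileno()
    chunks = []
    timed_out = False
    try:
        while True:
            ready, _, _ = select.select([fd], [], [], idle_timeout)
            if not ready:
                # Nothing arrived within the idle window: abort the transfer.
                timed_out = True
                break
            chunk = os.read(fd, 65536)
            if not chunk:
                # EOF: the child closed its stdout normally.
                break
            chunks.append(chunk)
    finally:
        if timed_out:
            proc.kill()
        proc.stdout.close()
        proc.wait()
    return b"".join(chunks), timed_out


# Hypothetical stand-in for a backend that computes for a long time before
# writing anything at all.
SILENT_CHILD = [
    sys.executable, "-c",
    "import time; time.sleep(10); print('refs')",
]

# Hypothetical stand-in for a backend that emits periodic keepalives while
# it is still working.
CHATTY_CHILD = [
    sys.executable, "-c",
    "import time\n"
    "for _ in range(5):\n"
    "    time.sleep(0.2)\n"
    "    print('keepalive', flush=True)\n"
    "print('refs')\n",
]

if __name__ == "__main__":
    for name, argv in (("silent", SILENT_CHILD), ("chatty", CHATTY_CHILD)):
        data, timed_out = read_with_idle_timeout(argv, idle_timeout=2.0)
        print(name, "timed out" if timed_out else "completed", len(data))
```

Both children would finish on their own eventually, but only the one that
keeps writing survives the idle timeout. The reader in this sketch cannot
tell a slow producer from a dead one, so it drops the silent child.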
Whether that's the same issue as the one we sometimes see on macOS is a
different question.

Patrick