Date: Fri, 26 Jun 2026 19:43:12 -0400
From: Jeff King
To: Patrick Steinhardt
Cc: Michael Montalbo, git@vger.kernel.org, Junio C Hamano
Subject: Re: [RFH] Why do osx CI jobs so unreliable?
Message-ID: <20260626234312.GA3156205@coredump.intra.peff.net>
References: <20260621213407.GC2297179@coredump.intra.peff.net>
 <20260626051657.GB3138423@coredump.intra.peff.net>

On Fri, Jun 26, 2026 at 12:50:17PM +0200, Patrick Steinhardt wrote:

> > Thanks both of you for digging into this. I'm not familiar enough with
> > Apache's code to pass confident judgement, but your findings certainly
> > convinced me that this is just an apache bug.
>
> The bug manifests both with HTTP/1.1 and HTTP/2 though, so this wouldn't
> fully fix the flakes we see, right?

If I understand the situation correctly, there are really two problems.

First, the EXPENSIVE tests in t5551 sometimes trigger a timeout with
Apache's stock settings.
This presumably became a problem recently due to 7a094d68a2 (ci: run
expensive tests on push builds to integration branches, 2026-05-08). The
same problem exists in t5559, which just wraps t5551 but tells us to use
http2. This timeout will cause test failures in t5551, because we aren't
able to complete a request we expected to. Obviously bad and annoying.

The second problem is that when Apache hits the timeout in HTTP/2 mode,
it hangs forever. And then the CI job hangs for 6 hours until it's
killed, which is an even more annoying failure.

So the root cause is the same (a timeout), but the effect depends on
HTTP/1.1 vs HTTP/2. I was able to reproduce both cases on my local
Debian unstable system by dropping the timeout as you suggested. Running
t5551 with GIT_TEST_LONG yields a failure, whereas running t5559 yields
a hang.

We can mitigate both cases by bumping the timeout value, since it's
addressing the root cause.

There's an open question of whether this is just papering over a problem
that real users might experience, and whether Git should be doing more
to keep the connection alive.

I think it's probably OK to ignore this in practice. This is an
intentionally large request being served by a very underpowered
platform. The default apache timeout is 60s. If a real-world server is
seeing ls-refs requests take that long then they probably need to
reconsider some other decisions, from ref packing to better hardware to
dropping some users. ;) I don't think trying to insert keepalives at the
Git layer here is worth the trouble.

To give a sense of the time options, here are a few timings from my
local machine, timing "git upload-pack . </dev/null >/dev/null" in
t5551's big repo.git (that's a v0 advertisement, but it should be
roughly the same work as the v2 ls-refs).

cold-cache, refs not packed:

  real    0m9.973s
  user    0m0.354s
  sys     0m1.364s

warm cache, refs not packed:

  real    0m0.410s
  user    0m0.153s
  sys     0m0.257s

cold-cache, refs packed:

  real    0m0.149s
  user    0m0.086s
  sys     0m0.035s

warm cache, refs packed:

  real    0m0.069s
  user    0m0.054s
  sys     0m0.016s

So 10s is pretty abysmal (and on an SSD, no less). I would expect the
cache to be warm (we just wrote these refs!) but I could also believe
that CI systems are under heavy I/O and memory pressure, so we sometimes
end up crossing the 60s mark.

So bumping Apache's timeout to 600s or something would probably be a
fine mitigation. That's still not _solving_ the problem, but presumably
an order of magnitude is enough for it to never come up in practice.

Michael suggested packing the refs as a mitigation. I was lukewarm on
that in my previous email, because it wasn't clear to me how close we
were on the timeout budget, and if it would just make the race less
frequent (rather than never happen). But seeing those cold-cache numbers
makes me think it might be worth doing just on principle to make the
tests more efficient, and any timeout mitigation is a bonus.

Of course the pack-refs process (and the initial ref writes) will still
have to touch all of those loose files, so those will still be slow. But
they're not on a timeout, and I suspect we read the result many more
times than we write/pack (the test failures we are seeing are not in the
expensive tests, but just "normal" tests that are stuck with the
gigantic ref state).

-Peff
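
For concreteness, the timeout bump discussed above might look roughly
like the fragment below. This is only a sketch: it assumes the test
server's Apache configuration accepts the stock core directive, and
where exactly the line belongs in the test setup is not something this
message establishes. The value 600 is just the order-of-magnitude
figure floated above, not a tested setting.

```apache
# Core Apache directive; the stock default is 60 seconds.
# 600 is the illustrative "order of magnitude" value from the
# discussion above, not a measured or recommended number.
Timeout 600
```

A larger value only widens the budget for slow ref advertisements; it
does not change how much work each request does, which is what packing
the refs would address.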