From: Patrick Steinhardt <ps@pks.im>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
Date: Thu, 16 May 2024 14:36:19 +0200
Message-ID: <ZkX9w6etjDVAh-ln@tanuki>
In-Reply-To: <Zj8blb0QqC2zdOAC@framework>

On Sat, May 11, 2024 at 09:17:41AM +0200, Patrick Steinhardt wrote:
> On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> > On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
> > 
> > > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > > [snip]
> > > > I'd like to report that this let me get a successful CI run, but I'm
> > > > running into the thing where osx jobs seem to randomly hang sometimes
> > > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > > the actual build/test, and not barf while installing dependencies.
> > > 
> > > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > > and so far I haven't figured out why that is.
> > 
> > Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> > would eventually fix it. ;)
> > 
> > It feels like a deadlock somewhere, though whether it is in our code, or
> > in our tests, or some system-ish issue with prove, perl, etc, I don't
> > know. It would be nice to catch it in the act and see what the process
> > tree looks like. I guess poking around in the test environment with
> > tmate might work, though I don't know if there's a way to get tmate
> > running simultaneously with the hung step (so you'd probably have to
> > connect, kick off the "make test" manually and hope it hangs).
> 
> My hunch tells me that it's the Perforce tests -- after all, this is
> where the jobs get stuck, too. In "lib-git-p4.sh" we already document
> that p4d is known to crash at times, and overall the logic to spawn the
> server is quite convoluted.
> 
> I did try to get more useful logs yesterday. But as usual, once you
> _want_ to reproduce a failure like this it doesn't happen anymore.

I spent (or rather wasted?) some more time on this. With the diff
below I was able to get a list of the processes still running after ~50
minutes:

    diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
    index 98dda42045..d5570b59d3 100755
    --- a/ci/run-build-and-tests.sh
    +++ b/ci/run-build-and-tests.sh
    @@ -51,8 +51,15 @@ esac
     group Build make
     if test -n "$run_tests"
     then
    -	group "Run tests" make test ||
    +	(
    +		sleep 3200 &&
    +		mkdir -p t/failed-test-artifacts &&
    +		ps -A >t/failed-test-artifacts/ps 2>&1
    +	) &
    +	pid=$!
    +	group "Run tests" gtimeout 1h make test ||
     	handle_failed_tests
    +	kill "$pid"
     fi
     check_unignored_build_artifacts
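
As an aside: if the process listing alone turns out to be insufficient,
the watchdog could also capture stack traces via macOS's sample(1).
This is an untested sketch on top of the diff above; it assumes that
sample(1) and pgrep(1) are available on the runner, and the "t9211"
pattern is only a guess at what to match:

    (
        sleep 3200 &&
        mkdir -p t/failed-test-artifacts &&
        ps -A >t/failed-test-artifacts/ps 2>&1 &&
        # Sample each process belonging to the stuck test for five
        # seconds to capture its user-space stacks.
        for pid in $(pgrep -f t9211)
        do
            sample "$pid" 5 -file "t/failed-test-artifacts/sample-$pid.txt"
        done
    ) &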

I trimmed that process list to the following set of relevant processes:

  PID TTY           TIME CMD
 5196 ??         0:00.01 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5242 ??         0:00.00 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5244 ??         0:00.00 tee -a /Volumes/RAMDisk/test-results/t9211-scalar-clone.out
 5245 ??         0:00.09 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 7235 ??         0:00.02 /Users/gitlab/builds/gitlab-org/git/scalar clone file:///Volumes/RAMDisk/trash directory.t9211-scalar-clone/to-clone maint-fail
 7265 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fetch --quiet --no-progress origin
 7276 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fsmonitor--daemon run --detach --ipc-threads=8

So it seems like the issue is t9211, and the hang happens in "scalar
clone warns when background maintenance fails" specifically. I have no
clue what the root cause is, though. Maybe an fsmonitor race, maybe
something else entirely. Hard to say, as I have never seen this happen
on any platform other than macOS, and I do not have access to a Mac
myself.
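
If somebody with access to a Mac catches the hang live (for example via
the tmate approach Peff mentioned above), one thing worth checking is
whether the daemon still responds at all. A rough, untested sketch --
the repository path is guessed from the process listing above plus the
fact that Scalar puts the actual worktree into a "src" subdirectory of
the enlistment:

    cd "/Volumes/RAMDisk/trash directory.t9211-scalar-clone/maint-fail/src" &&
    git fsmonitor--daemon status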

The issue also doesn't seem to occur when running t9211 on its own, but
only when running the full test suite. This may further indicate that
there is a race condition, where the additional load increases the
likelihood of hitting it. Or there is a bad interaction with another test.
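
If it is indeed load-dependent, the test suite's "--stress" mode may be
the best bet for reproducing it, as it runs the script repeatedly in
multiple parallel jobs until one of them fails. Untested on macOS, but
something like:

    cd t &&
    ./t9211-scalar-clone.sh --stress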

Patrick
