All of lore.kernel.org
 help / color / mirror / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc
Date: Thu, 16 May 2024 14:36:19 +0200	[thread overview]
Message-ID: <ZkX9w6etjDVAh-ln@tanuki> (raw)
In-Reply-To: <Zj8blb0QqC2zdOAC@framework>

[-- Attachment #1: Type: text/plain, Size: 3989 bytes --]

On Sat, May 11, 2024 at 09:17:41AM +0200, Patrick Steinhardt wrote:
> On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> > On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
> > 
> > > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > > [snip]
> > > > I'd like to report that this let me get a successful CI run, but I'm
> > > > running into the thing where osx jobs seem to randomly hang sometimes
> > > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > > the actual build/test, and not barf while installing dependencies.
> > > 
> > > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > > and until now I haven't yet figured out why that is.
> > 
> > Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> > would eventually fix it. ;)
> > 
> > It feels like a deadlock somewhere, though whether it is in our code, or
> > in our tests, or some system-ish issue with prove, perl, etc, I don't
> > know. It would be nice to catch it in the act and see what the process
> > tree looks like. I guess poking around in the test environment with
> > tmate might work, though I don't know if there's a way to get tmate
> > running simultaneously with the hung step (so you'd probably have to
> > connect, kick off the "make test" manually and hope it hangs).
> 
> My hunch tells me that it's the Perforce tests -- after all, this is
> where the jobs get stuck, too. In "lib-git-p4.sh" we already document
> that p4d is known to crash at times, and overall the logic to spawn the
> server is quite convoluted.
> 
> I did try to get more useful logs yesterday. But as usual, once you
> _want_ to reproduce a failure like this is doesn't happen anymore.

I was spending (or rather wasting?) some more time on this. With the
below diff I was able to get a list of processes running after ~50
minutes:

    diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
    index 98dda42045..d5570b59d3 100755
    --- a/ci/run-build-and-tests.sh
    +++ b/ci/run-build-and-tests.sh
    @@ -51,8 +51,15 @@ esac
     group Build make
     if test -n "$run_tests"
     then
    -	group "Run tests" make test ||
    +	(
    +		sleep 3200 &&
    +		mkdir -p t/failed-test-artifacts &&
    +		ps -A >t/failed-test-artifacts/ps 2>&1
    +	) &
    +	pid=$!
    +	group "Run tests" gtimeout 1h make test ||
        handle_failed_tests
    +	kill "$pid"
     fi
     check_unignored_build_artifacts

I trimmed that process list to the following set of relevant processes:

  PID TTY           TIME CMD
 5196 ??         0:00.01 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5242 ??         0:00.00 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5244 ??         0:00.00 tee -a /Volumes/RAMDisk/test-results/t9211-scalar-clone.out
 5245 ??         0:00.09 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 7235 ??         0:00.02 /Users/gitlab/builds/gitlab-org/git/scalar clone file:///Volumes/RAMDisk/trash directory.t9211-scalar-clone/to-clone maint-fail
 7265 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fetch --quiet --no-progress origin
 7276 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fsmonitor--daemon run --detach --ipc-threads=8

So it seems like the issue is t9211, and the hang happens in "scalar
clone warns when background maintenance fails" specifically. What
exactly the root cause is I have no clue though. Maybe an fsmonitor
race, maybe something else entirely. Hard to say as I have never seen
this happen on any other platform than macOS, and I do not have access
to a Mac myself.

The issue also doesn't seem to occur when running t9211 on its own, but
only when running the full test suite. This may further indicate that
there is a race condition, where the additional load improves the
likelihood of it. Or there is bad interaction with another test.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2024-05-16 12:36 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-09 16:22 [PATCH 0/3] un-breaking osx-gcc ci job Jeff King
2024-05-09 16:23 ` [PATCH 1/3] ci: drop mention of BREW_INSTALL_PACKAGES variable Jeff King
2024-05-09 16:24 ` [PATCH 2/3] ci: avoid bare "gcc" for osx-gcc job Jeff King
2024-05-10  7:00   ` Patrick Steinhardt
2024-05-10 20:16     ` Jeff King
2024-05-10 20:32   ` Kyle Lippincott
2024-05-10 20:48     ` Junio C Hamano
2024-05-10 22:02     ` Jeff King
2024-05-10 22:47       ` Junio C Hamano
2024-05-11 17:21         ` Patrick Steinhardt
2024-05-16  7:19           ` Jeff King
2024-05-16  7:27             ` Jeff King
2024-05-16  9:54             ` Patrick Steinhardt
2024-05-17  8:19               ` Jeff King
2024-05-17  8:33                 ` Patrick Steinhardt
2024-05-17 16:59                 ` Junio C Hamano
2024-05-23  9:10                   ` Jeff King
2024-05-23 15:35                     ` Junio C Hamano
2024-05-09 16:25 ` [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc Jeff King
2024-05-10  7:00   ` Patrick Steinhardt
2024-05-10 20:13     ` Jeff King
2024-05-11  7:17       ` Patrick Steinhardt
2024-05-16 12:36         ` Patrick Steinhardt [this message]
2024-05-17  8:11           ` Jeff King
2024-05-17  8:25             ` Patrick Steinhardt
2024-05-17 11:30               ` Patrick Steinhardt
2024-05-26  6:34                 ` Philip
2024-05-26 19:23                   ` Junio C Hamano
2024-05-27  5:12                     ` Patrick Steinhardt
2024-05-29  9:27                   ` Jeff King
2024-05-09 16:52 ` [PATCH 0/3] un-breaking osx-gcc ci job Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZkX9w6etjDVAh-ln@tanuki \
    --to=ps@pks.im \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.