From: Jeff King <peff@peff.net>
To: Koji Nakamaru <koji.nakamaru@gree.net>
Cc: git@vger.kernel.org, Patrick Steinhardt <ps@pks.im>,
Johannes Schindelin <Johannes.Schindelin@gmx.de>
Subject: Re: fsmonitor deadlock / macOS CI hangs
Date: Wed, 2 Oct 2024 02:25:39 -0400 [thread overview]
Message-ID: <20241002062539.GA2863841@coredump.intra.peff.net> (raw)
In-Reply-To: <CAOTNsDykrEBGBnQJqmCghWBTgzirxHZQM25ohsCBysp8ZO1qLg@mail.gmail.com>
On Wed, Oct 02, 2024 at 10:46:03AM +0900, Koji Nakamaru wrote:
> Perhaps I found the cause. fsmonitor_run_daemon_1() starts the fsevent
> Normally FSEventStreamStart() is called before
> with_lock__wait_for_cookie() creates a cookie file, but this is not
> guaranteed. We can reproduce the issue easily if we modify
> fsm_listen__loop() as below:
>
> --- a/compat/fsmonitor/fsm-listen-darwin.c
> +++ b/compat/fsmonitor/fsm-listen-darwin.c
> @@ -510,6 +510,7 @@ void fsm_listen__loop(struct
> fsmonitor_daemon_state *state)
> FSEventStreamSetDispatchQueue(data->stream, data->dq);
> data->stream_scheduled = 1;
>
> + sleep(1);
> if (!FSEventStreamStart(data->stream)) {
> error(_("Failed to start the FSEventStream"));
> goto force_error_stop_without_loop;
Ah, good catch. Yes, with that sleep it is much easier to create the
situation in the debugger.
So if I understand (and bear with me if you already know this, but I am
just figuring out how fsmonitor is supposed to work), the expected
order of events is:
- the thread serving the client creates a cookie file (something like
".git/fsmonitor--daemon/cookies/1234-0", based on the pid) and then
waits for the pthread_cond to be triggered
- the thread listening for fs events should see that cookie file
creation, which gets us to with_lock__mark_cookies_seen(), which
then triggers the cond
But with the sleep, the file creation event happens before the fs event
is actually listening, so we never hear it. We can "unstick" it by
doing:
# where "1234-0" is the unique cookie file, which you can get by
# looking at the output from GIT_TRACE_FSMONITOR=1
echo foo >trash directory.t9211-scalar-clone/cloned/src/.git/fsmonitor--daemon/cookies/1234-0
which then triggers us to mark the cookies seen. I have no clue why
there is this elaborate communication for two threads which are already
sharing a memory space, but let's assume for a moment that it's
required.
The solution then is to avoid letting any client threads run (or at
least create cookies) until we know we're listening for fs events.
Naively I would say we should just have the fs event thread signal us
that it's ready before we spawn any of the client threads. But it looks
like we explicitly start those threads first in fsmonitor_run_daemon_1():
/*
* Start the IPC thread pool before the we've started the file
* system event listener thread so that we have the IPC handle
* before we need it.
*/
if (ipc_server_run_async(&state->ipc_server_data,
state->path_ipc.buf, &ipc_opts,
handle_client, state))
return error_errno(
_("could not start IPC thread pool on '%s'"),
state->path_ipc.buf);
So I guess we need to let the ipc server initialize itself and then wait
to be told to start serving requests. And then the fs event thread can
trigger that "go ahead" once the fs event stream is started. Which is a
little tricky, but possible with an extra mutex.
But hmm. I wonder if we can hack around it like this:
diff --git a/compat/fsmonitor/fsm-listen-darwin.c b/compat/fsmonitor/fsm-listen-darwin.c
index 2fc6744..6b06faf 100644
--- a/compat/fsmonitor/fsm-listen-darwin.c
+++ b/compat/fsmonitor/fsm-listen-darwin.c
@@ -510,11 +510,17 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
FSEventStreamSetDispatchQueue(data->stream, data->dq);
data->stream_scheduled = 1;
+ sleep(1);
if (!FSEventStreamStart(data->stream)) {
error(_("Failed to start the FSEventStream"));
goto force_error_stop_without_loop;
}
data->stream_started = 1;
+ /*
+ * In case any clients showed up before we initialized the event stream,
+ * abort all in-progress ops and start fresh.
+ */
+ fsmonitor_force_resync(state);
pthread_mutex_lock(&data->dq_lock);
pthread_cond_wait(&data->dq_finished, &data->dq_lock);
The forced resync aborts the cookies, which triggers the waiting
clients. I'm not sure how "bad" it is to do a resync like this. It's
something that can happen in the normal course of events, so in theory
it's just non-optimal and nobody will do the wrong thing. And this would
only happen once at the start of fsmonitor, so in most cases nobody
would have connected yet, and we wouldn't have processed any events.
So I dunno. It does make the hang go away in this case.
-Peff
next prev parent reply other threads:[~2024-10-02 6:25 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-30 19:46 fsmonitor deadlock / macOS CI hangs Jeff King
2024-10-02 1:46 ` Koji Nakamaru
2024-10-02 6:25 ` Jeff King [this message]
2024-10-02 9:51 ` Koji Nakamaru
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241002062539.GA2863841@coredump.intra.peff.net \
--to=peff@peff.net \
--cc=Johannes.Schindelin@gmx.de \
--cc=git@vger.kernel.org \
--cc=koji.nakamaru@gree.net \
--cc=ps@pks.im \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).