From: Jeff King <peff@peff.net>
To: Jan Palus <jpalus@fastmail.com>
Cc: git@vger.kernel.org, Andrew Au <cshung@gmail.com>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: regression 2.54.0: test suite hangs indefinitely with mksh
Date: Wed, 22 Apr 2026 18:35:42 -0400 [thread overview]
Message-ID: <20260422223542.GA110382@coredump.intra.peff.net> (raw)
In-Reply-To: <aekVSi4iu9YLaPLQ@rock.grzadka>
On Wed, Apr 22, 2026 at 09:00:30PM +0200, Jan Palus wrote:
> If /bin/sh points to mksh, git's test suite hangs forever and never
> completes. An example of process tree for such hung test:
> [...]
Thanks, I can reproduce the problem easily here with mksh.
> bisect points to the following commit:
>
> commit dd3693eb0859274d62feac8047e1d486b3beaf31 (HEAD)
> Author: Andrew Au <cshung@gmail.com>
> Date: Thu Mar 12 22:49:37 2026
>
> transport-helper, connect: use clean_on_exit to reap children on abnormal exit
Yup, same here.
> From commit message alone it seems the change tries to avoid reparenting
> child processes to PID 1, however this does not seem to work for
> processes which were spawned with "sh -c".
>
> For example if /bin/sh points to bash tests pass because bash appears to
> react to SIGTERM differently -- shell process ends but its children are
> reparented anyway.
>
> mksh on the other hand seems to ignore SIGTERM (or queues it?) but
> waits for child process and so test suite never ends.
Yeah, I agree with your analysis. It looks mksh's signal handler waits
for its children. If I run:
# note start of script
date
# mksh with a slow child
mksh -c 'echo before; sleep 3; echo after' &
# wait for it to have started child, then kill
sleep 1
kill $!
# and now wait and mark the time
wait $!
date
then we see "before" but not "after", and the script takes 3 seconds to
run (since we wait for mksh to exit). Replacing "mksh" with bash or
dash, it finishes in 1 second.
So what to do?
I think we could alleviate the symptom by teaching clean_on_exit to
follow-up SIGTERM with SIGKILL. But as you note, that patch is not
really solving the original problem in the first place. If we use dash
or bash in the above script and add "ps u | grep sleep" to the end, we
can see that "sleep 3" is still running even after we've returned. So we
have not really accomplished our goal anyway.
So I think just reverting dd3693eb08 is probably the most immediate fix.
We can try again at the same goal later, though I'm not quite sure how
to do so. From Git's perspective, we do not even know the pid of the
child that the shell spawned, so we cannot really kill it or wait on it.
Conceptually we'd want the shell to propagate the signal to its
children, but doing that automatically in the general case is rather
tricky. I don't think we want to get into prepending trap commands at
the start of each shell we invoke.
For the specific cases in dd3693eb08, we probably want to signal to the
child to exit by closing the pipe its reading from. But we can't do that
via the run-command system, so it would have to be a specific signal
handler within the transport system. I think that would avoid true
deadlocks, but it still isn't quite what we want. The process won't die
until it tries to read from the parent, so if it's waiting on a long
network timeout, for example, it could mean that the parent sits around
for quite a while even after having gotten a signal.
The fallback position to what dd3693eb08 was trying to do is roughly:
let them get reparented to pid 1 and don't worry about (and people in
containers should really use "tini" or some other pid-1 replacement).
That doesn't seem like the worst place to be, and we can get there
easily with a revert.
This all does point to some general weaknesses in run-command's
clean_on_exit. If a shell is involved, we're probably not killing who we
meant to. That's usually not the end of the world (if we take "clean" to
be best-effort), but with wait_after_clean tacked on it's a recipe for a
potential hang if the shell doesn't exit. So I tried this:
diff --git a/run-command.c b/run-command.c
index c146a56532..97c223bdf1 100644
--- a/run-command.c
+++ b/run-command.c
@@ -680,6 +680,9 @@ int start_command(struct child_process *cmd)
int failed_errno;
const char *str;
+ if (cmd->use_shell && cmd->wait_after_clean)
+ BUG("combining use_shell and wait_after_clean is not reliable");
+
/*
* In case of errors we must keep the promise to close FDs
* that have been passed in via ->in and ->out.
and ran the test suite, which results in several failures. In particular
we'll run a "!" alias as a shell command with those flags, but also for
things like exec-ing external commands (like "git-foo"). So we should be
able to produce a "hang" like this (built with SHELL_PATH=/bin/mksh):
./git -c alias.foo='!echo before; sleep 10; echo after' foo &
sleep 1
kill $!
wait
and we can see that the parent "git" process waits 10 seconds, even
though it was killed. There are two saving graces here:
1. This isn't really a deadlock, but just a slow process. We are
waiting on the alias to exit anyway, and we do not expect that the
alias command is waiting on us. This is very different from the
cases in dd3693eb08, where we are spawning something with a pipe
which could be waiting to read from us. And it looks like all of
the existing uses of wait_after_clean are like this; the parent is
conceptually doing an "exec" of the child, but for various reasons
we want it to hang around.
The exception is in http-backend, where I think we really are
interacting with the child (so if we got a signal and the child
didn't, I think we could possibly deadlock as it waits for us to
feed more data).
2. mksh seems eager to automatically exec its child when fed only a
single, simple command (like "mksh -c 'sleep 10'). And that
alleviates the problem, since the signal goes straight to the child
command, then. So for things like git-foo externals, the shell may
have gotten out of the way anyway.
So I think the existing, pre-dd3693eb08 cases are _probably_ fine, even
with mksh, though the signal propagation may not be doing quite what we
want it to do. In practice I think most signals end up delivered to the
whole process group anyway (e.g., hitting ^C in a terminal) and that
part doesn't matter so much.
-Peff
next prev parent reply other threads:[~2026-04-22 22:35 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-22 19:00 regression 2.54.0: test suite hangs indefinitely with mksh Jan Palus
2026-04-22 22:35 ` Jeff King [this message]
2026-04-22 23:00 ` [PATCH] Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit" Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260422223542.GA110382@coredump.intra.peff.net \
--to=peff@peff.net \
--cc=cshung@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jpalus@fastmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox