* regression 2.54.0: test suite hangs indefinitely with mksh
@ 2026-04-22 19:00 Jan Palus
2026-04-22 22:35 ` Jeff King
0 siblings, 1 reply; 3+ messages in thread
From: Jan Palus @ 2026-04-22 19:00 UTC (permalink / raw)
To: git; +Cc: Andrew Au, Junio C Hamano
If /bin/sh points to mksh, git's test suite hangs forever and never
completes. An example of process tree for such hung test:
$ pstree -a `pgrep -f ./t5702-protocol-v2.sh`
t5702-protocol- ./t5702-protocol-v2.sh
└─git -c protocol.version=0 ls-remote -o hello -o worldfile:///home/users/builder/rpm/BUILD/gi
└─sh -c git-upload-pack '/home/users/builder/rpm/BUILD/git/t/trash directory.t5702-protocol-v2/file_parent'...
└─git-upload-pack /home/users/builder/rpm/BUILD/git/t/trash directory.t5702-protocol-v2/file_parent
bisect points to the following commit:
commit dd3693eb0859274d62feac8047e1d486b3beaf31 (HEAD)
Author: Andrew Au <cshung@gmail.com>
Date: Thu Mar 12 22:49:37 2026
transport-helper, connect: use clean_on_exit to reap children on abnormal exit
When a long-running service (e.g., a source indexer) runs as PID 1
inside a container and repeatedly spawns git, git may in turn spawn
child processes such as git-remote-https or ssh. If git exits abnormally
(e.g., via exit(128) on a transport error), the normal cleanup paths
(disconnect_helper, finish_connect) are bypassed, and these children are
never waited on. The children are reparented to PID 1, which does not
reap them, so they accumulate as zombies over time.
Set clean_on_exit and wait_after_clean on child_process structs in both
transport-helper.c and connect.c so that the existing run-command
cleanup infrastructure handles reaping on any exit path. This avoids
rolling custom atexit handlers that call finish_command(), which could
deadlock if the child is blocked waiting for the parent to close a pipe.
The clean_on_exit mechanism sends SIGTERM first, then waits, ensuring
the child terminates promptly. It also handles signal-based exits, not
just atexit.
Signed-off-by: Andrew Au <cshung@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
From commit message alone it seems the change tries to avoid reparenting
child processes to PID 1, however this does not seem to work for
processes which were spawned with "sh -c".
For example if /bin/sh points to bash tests pass because bash appears to
react to SIGTERM differently -- shell process ends but its children are
reparented anyway.
mksh on the other hand seems to ignore SIGTERM (or queues it?) but
waits for child process and so test suite never ends.
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: regression 2.54.0: test suite hangs indefinitely with mksh
2026-04-22 19:00 regression 2.54.0: test suite hangs indefinitely with mksh Jan Palus
@ 2026-04-22 22:35 ` Jeff King
2026-04-22 23:00 ` [PATCH] Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit" Jeff King
0 siblings, 1 reply; 3+ messages in thread
From: Jeff King @ 2026-04-22 22:35 UTC (permalink / raw)
To: Jan Palus; +Cc: git, Andrew Au, Junio C Hamano
On Wed, Apr 22, 2026 at 09:00:30PM +0200, Jan Palus wrote:
> If /bin/sh points to mksh, git's test suite hangs forever and never
> completes. An example of process tree for such hung test:
> [...]
Thanks, I can reproduce the problem easily here with mksh.
> bisect points to the following commit:
>
> commit dd3693eb0859274d62feac8047e1d486b3beaf31 (HEAD)
> Author: Andrew Au <cshung@gmail.com>
> Date: Thu Mar 12 22:49:37 2026
>
> transport-helper, connect: use clean_on_exit to reap children on abnormal exit
Yup, same here.
> From commit message alone it seems the change tries to avoid reparenting
> child processes to PID 1, however this does not seem to work for
> processes which were spawned with "sh -c".
>
> For example if /bin/sh points to bash tests pass because bash appears to
> react to SIGTERM differently -- shell process ends but its children are
> reparented anyway.
>
> mksh on the other hand seems to ignore SIGTERM (or queues it?) but
> waits for child process and so test suite never ends.
Yeah, I agree with your analysis. It looks mksh's signal handler waits
for its children. If I run:
# note start of script
date
# mksh with a slow child
mksh -c 'echo before; sleep 3; echo after' &
# wait for it to have started child, then kill
sleep 1
kill $!
# and now wait and mark the time
wait $!
date
then we see "before" but not "after", and the script takes 3 seconds to
run (since we wait for mksh to exit). Replacing "mksh" with bash or
dash, it finishes in 1 second.
So what to do?
I think we could alleviate the symptom by teaching clean_on_exit to
follow-up SIGTERM with SIGKILL. But as you note, that patch is not
really solving the original problem in the first place. If we use dash
or bash in the above script and add "ps u | grep sleep" to the end, we
can see that "sleep 3" is still running even after we've returned. So we
have not really accomplished our goal anyway.
So I think just reverting dd3693eb08 is probably the most immediate fix.
We can try again at the same goal later, though I'm not quite sure how
to do so. From Git's perspective, we do not even know the pid of the
child that the shell spawned, so we cannot really kill it or wait on it.
Conceptually we'd want the shell to propagate the signal to its
children, but doing that automatically in the general case is rather
tricky. I don't think we want to get into prepending trap commands at
the start of each shell we invoke.
For the specific cases in dd3693eb08, we probably want to signal to the
child to exit by closing the pipe its reading from. But we can't do that
via the run-command system, so it would have to be a specific signal
handler within the transport system. I think that would avoid true
deadlocks, but it still isn't quite what we want. The process won't die
until it tries to read from the parent, so if it's waiting on a long
network timeout, for example, it could mean that the parent sits around
for quite a while even after having gotten a signal.
The fallback position to what dd3693eb08 was trying to do is roughly:
let them get reparented to pid 1 and don't worry about (and people in
containers should really use "tini" or some other pid-1 replacement).
That doesn't seem like the worst place to be, and we can get there
easily with a revert.
This all does point to some general weaknesses in run-command's
clean_on_exit. If a shell is involved, we're probably not killing who we
meant to. That's usually not the end of the world (if we take "clean" to
be best-effort), but with wait_after_clean tacked on it's a recipe for a
potential hang if the shell doesn't exit. So I tried this:
diff --git a/run-command.c b/run-command.c
index c146a56532..97c223bdf1 100644
--- a/run-command.c
+++ b/run-command.c
@@ -680,6 +680,9 @@ int start_command(struct child_process *cmd)
int failed_errno;
const char *str;
+ if (cmd->use_shell && cmd->wait_after_clean)
+ BUG("combining use_shell and wait_after_clean is not reliable");
+
/*
* In case of errors we must keep the promise to close FDs
* that have been passed in via ->in and ->out.
and ran the test suite, which results in several failures. In particular
we'll run a "!" alias as a shell command with those flags, but also for
things like exec-ing external commands (like "git-foo"). So we should be
able to produce a "hang" like this (built with SHELL_PATH=/bin/mksh):
./git -c alias.foo='!echo before; sleep 10; echo after' foo &
sleep 1
kill $!
wait
and we can see that the parent "git" process waits 10 seconds, even
though it was killed. There are two saving graces here:
1. This isn't really a deadlock, but just a slow process. We are
waiting on the alias to exit anyway, and we do not expect that the
alias command is waiting on us. This is very different from the
cases in dd3693eb08, where we are spawning something with a pipe
which could be waiting to read from us. And it looks like all of
the existing uses of wait_after_clean are like this; the parent is
conceptually doing an "exec" of the child, but for various reasons
we want it to hang around.
The exception is in http-backend, where I think we really are
interacting with the child (so if we got a signal and the child
didn't, I think we could possibly deadlock as it waits for us to
feed more data).
2. mksh seems eager to automatically exec its child when fed only a
single, simple command (like "mksh -c 'sleep 10'). And that
alleviates the problem, since the signal goes straight to the child
command, then. So for things like git-foo externals, the shell may
have gotten out of the way anyway.
So I think the existing, pre-dd3693eb08 cases are _probably_ fine, even
with mksh, though the signal propagation may not be doing quite what we
want it to do. In practice I think most signals end up delivered to the
whole process group anyway (e.g., hitting ^C in a terminal) and that
part doesn't matter so much.
-Peff
^ permalink raw reply related [flat|nested] 3+ messages in thread
* [PATCH] Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit"
2026-04-22 22:35 ` Jeff King
@ 2026-04-22 23:00 ` Jeff King
0 siblings, 0 replies; 3+ messages in thread
From: Jeff King @ 2026-04-22 23:00 UTC (permalink / raw)
To: Jan Palus; +Cc: git, Andrew Au, Junio C Hamano
On Wed, Apr 22, 2026 at 06:35:43PM -0400, Jeff King wrote:
> So I think just reverting dd3693eb08 is probably the most immediate fix.
Here it is in patch form.
-- >8 --
Subject: Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit"
This reverts commit dd3693eb0859274d62feac8047e1d486b3beaf31.
The goal of that commit was to avoid zombie child processes hanging
around when the parent git process is killed. But it doesn't quite work
when the child command is run by the shell:
1. If there is a shell, then we kill and wait for the shell, not the
process spawned by the shell. And so the child process, even if it
eventually exits, will hang around as a zombie forever. And this is
true of most (all?) shells: bash, dash, etc.
So we are not really accomplishing our goal in the first place.
2. Not all shells will exit immediately upon receiving a signal. In
particular, mksh will wait for its children to exit (but not
actually propagate the signal to them!) leaving us with a potential
deadlock: git is wait()ing on mksh, which is wait()ing on a child
process, but that child process is waiting on git to produce more
input (or EOF) over a pipe.
You can see several examples of this deadlock in the test suite,
for example by running:
make SHELL_PATH=/bin/mksh
cd t
./t5702-protocol-v2.sh
Because this is a regression for mksh users, and because we did not
achieve our goal even with other shells, let's revert the commit for
now. If there is a more clever way of doing the same thing, we can
consider applying it separately on top (or do nothing and just accept
the zombies and rely on PID 1 to reap them).
Reported-by: Jan Palus <jpalus@fastmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
connect.c | 4 ----
transport-helper.c | 2 --
2 files changed, 6 deletions(-)
diff --git a/connect.c b/connect.c
index fcd35c5539..a02583a102 100644
--- a/connect.c
+++ b/connect.c
@@ -1054,8 +1054,6 @@ static struct child_process *git_proxy_connect(int fd[2], char *host)
strvec_push(&proxy->args, port);
proxy->in = -1;
proxy->out = -1;
- proxy->clean_on_exit = 1;
- proxy->wait_after_clean = 1;
if (start_command(proxy))
die(_("cannot start proxy %s"), git_proxy_command);
fd[0] = proxy->out; /* read from proxy stdout */
@@ -1517,8 +1515,6 @@ struct child_process *git_connect(int fd[2], const char *url,
}
strvec_push(&conn->args, cmd.buf);
- conn->clean_on_exit = 1;
- conn->wait_after_clean = 1;
if (start_command(conn))
die(_("unable to fork"));
diff --git a/transport-helper.c b/transport-helper.c
index 4e5d1d914f..4614036c99 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -154,8 +154,6 @@ static struct child_process *get_helper(struct transport *transport)
helper->trace2_child_class = helper->args.v[0]; /* "remote-<name>" */
- helper->clean_on_exit = 1;
- helper->wait_after_clean = 1;
code = start_command(helper);
if (code < 0 && errno == ENOENT)
die(_("unable to find remote helper for '%s'"), data->name);
--
2.54.0.232.gea82d9008b
^ permalink raw reply related [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-04-22 23:00 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-22 19:00 regression 2.54.0: test suite hangs indefinitely with mksh Jan Palus
2026-04-22 22:35 ` Jeff King
2026-04-22 23:00 ` [PATCH] Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit" Jeff King
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox