* regression 2.54.0: test suite hangs indefinitely with mksh
@ 2026-04-22 19:00 Jan Palus
2026-04-22 22:35 ` Jeff King
0 siblings, 1 reply; 3+ messages in thread
From: Jan Palus @ 2026-04-22 19:00 UTC (permalink / raw)
To: git; +Cc: Andrew Au, Junio C Hamano
If /bin/sh points to mksh, git's test suite hangs forever and never
completes. An example of process tree for such hung test:
$ pstree -a `pgrep -f ./t5702-protocol-v2.sh`
t5702-protocol- ./t5702-protocol-v2.sh
└─git -c protocol.version=0 ls-remote -o hello -o worldfile:///home/users/builder/rpm/BUILD/gi
└─sh -c git-upload-pack '/home/users/builder/rpm/BUILD/git/t/trash directory.t5702-protocol-v2/file_parent'...
└─git-upload-pack /home/users/builder/rpm/BUILD/git/t/trash directory.t5702-protocol-v2/file_parent
bisect points to the following commit:
commit dd3693eb0859274d62feac8047e1d486b3beaf31 (HEAD)
Author: Andrew Au <cshung@gmail.com>
Date: Thu Mar 12 22:49:37 2026
transport-helper, connect: use clean_on_exit to reap children on abnormal exit
When a long-running service (e.g., a source indexer) runs as PID 1
inside a container and repeatedly spawns git, git may in turn spawn
child processes such as git-remote-https or ssh. If git exits abnormally
(e.g., via exit(128) on a transport error), the normal cleanup paths
(disconnect_helper, finish_connect) are bypassed, and these children are
never waited on. The children are reparented to PID 1, which does not
reap them, so they accumulate as zombies over time.
Set clean_on_exit and wait_after_clean on child_process structs in both
transport-helper.c and connect.c so that the existing run-command
cleanup infrastructure handles reaping on any exit path. This avoids
rolling custom atexit handlers that call finish_command(), which could
deadlock if the child is blocked waiting for the parent to close a pipe.
The clean_on_exit mechanism sends SIGTERM first, then waits, ensuring
the child terminates promptly. It also handles signal-based exits, not
just atexit.
Signed-off-by: Andrew Au <cshung@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
From commit message alone it seems the change tries to avoid reparenting
child processes to PID 1, however this does not seem to work for
processes which were spawned with "sh -c".
For example if /bin/sh points to bash tests pass because bash appears to
react to SIGTERM differently -- shell process ends but its children are
reparented anyway.
mksh on the other hand seems to ignore SIGTERM (or queues it?) but
waits for child process and so test suite never ends.
^ permalink raw reply [flat|nested] 3+ messages in thread* Re: regression 2.54.0: test suite hangs indefinitely with mksh 2026-04-22 19:00 regression 2.54.0: test suite hangs indefinitely with mksh Jan Palus @ 2026-04-22 22:35 ` Jeff King 2026-04-22 23:00 ` [PATCH] Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit" Jeff King 0 siblings, 1 reply; 3+ messages in thread From: Jeff King @ 2026-04-22 22:35 UTC (permalink / raw) To: Jan Palus; +Cc: git, Andrew Au, Junio C Hamano On Wed, Apr 22, 2026 at 09:00:30PM +0200, Jan Palus wrote: > If /bin/sh points to mksh, git's test suite hangs forever and never > completes. An example of process tree for such hung test: > [...] Thanks, I can reproduce the problem easily here with mksh. > bisect points to the following commit: > > commit dd3693eb0859274d62feac8047e1d486b3beaf31 (HEAD) > Author: Andrew Au <cshung@gmail.com> > Date: Thu Mar 12 22:49:37 2026 > > transport-helper, connect: use clean_on_exit to reap children on abnormal exit Yup, same here. > From commit message alone it seems the change tries to avoid reparenting > child processes to PID 1, however this does not seem to work for > processes which were spawned with "sh -c". > > For example if /bin/sh points to bash tests pass because bash appears to > react to SIGTERM differently -- shell process ends but its children are > reparented anyway. > > mksh on the other hand seems to ignore SIGTERM (or queues it?) but > waits for child process and so test suite never ends. Yeah, I agree with your analysis. It looks mksh's signal handler waits for its children. If I run: # note start of script date # mksh with a slow child mksh -c 'echo before; sleep 3; echo after' & # wait for it to have started child, then kill sleep 1 kill $! # and now wait and mark the time wait $! date then we see "before" but not "after", and the script takes 3 seconds to run (since we wait for mksh to exit). Replacing "mksh" with bash or dash, it finishes in 1 second. So what to do? I think we could alleviate the symptom by teaching clean_on_exit to follow-up SIGTERM with SIGKILL. But as you note, that patch is not really solving the original problem in the first place. If we use dash or bash in the above script and add "ps u | grep sleep" to the end, we can see that "sleep 3" is still running even after we've returned. So we have not really accomplished our goal anyway. So I think just reverting dd3693eb08 is probably the most immediate fix. We can try again at the same goal later, though I'm not quite sure how to do so. From Git's perspective, we do not even know the pid of the child that the shell spawned, so we cannot really kill it or wait on it. Conceptually we'd want the shell to propagate the signal to its children, but doing that automatically in the general case is rather tricky. I don't think we want to get into prepending trap commands at the start of each shell we invoke. For the specific cases in dd3693eb08, we probably want to signal to the child to exit by closing the pipe its reading from. But we can't do that via the run-command system, so it would have to be a specific signal handler within the transport system. I think that would avoid true deadlocks, but it still isn't quite what we want. The process won't die until it tries to read from the parent, so if it's waiting on a long network timeout, for example, it could mean that the parent sits around for quite a while even after having gotten a signal. The fallback position to what dd3693eb08 was trying to do is roughly: let them get reparented to pid 1 and don't worry about (and people in containers should really use "tini" or some other pid-1 replacement). That doesn't seem like the worst place to be, and we can get there easily with a revert. This all does point to some general weaknesses in run-command's clean_on_exit. If a shell is involved, we're probably not killing who we meant to. That's usually not the end of the world (if we take "clean" to be best-effort), but with wait_after_clean tacked on it's a recipe for a potential hang if the shell doesn't exit. So I tried this: diff --git a/run-command.c b/run-command.c index c146a56532..97c223bdf1 100644 --- a/run-command.c +++ b/run-command.c @@ -680,6 +680,9 @@ int start_command(struct child_process *cmd) int failed_errno; const char *str; + if (cmd->use_shell && cmd->wait_after_clean) + BUG("combining use_shell and wait_after_clean is not reliable"); + /* * In case of errors we must keep the promise to close FDs * that have been passed in via ->in and ->out. and ran the test suite, which results in several failures. In particular we'll run a "!" alias as a shell command with those flags, but also for things like exec-ing external commands (like "git-foo"). So we should be able to produce a "hang" like this (built with SHELL_PATH=/bin/mksh): ./git -c alias.foo='!echo before; sleep 10; echo after' foo & sleep 1 kill $! wait and we can see that the parent "git" process waits 10 seconds, even though it was killed. There are two saving graces here: 1. This isn't really a deadlock, but just a slow process. We are waiting on the alias to exit anyway, and we do not expect that the alias command is waiting on us. This is very different from the cases in dd3693eb08, where we are spawning something with a pipe which could be waiting to read from us. And it looks like all of the existing uses of wait_after_clean are like this; the parent is conceptually doing an "exec" of the child, but for various reasons we want it to hang around. The exception is in http-backend, where I think we really are interacting with the child (so if we got a signal and the child didn't, I think we could possibly deadlock as it waits for us to feed more data). 2. mksh seems eager to automatically exec its child when fed only a single, simple command (like "mksh -c 'sleep 10'). And that alleviates the problem, since the signal goes straight to the child command, then. So for things like git-foo externals, the shell may have gotten out of the way anyway. So I think the existing, pre-dd3693eb08 cases are _probably_ fine, even with mksh, though the signal propagation may not be doing quite what we want it to do. In practice I think most signals end up delivered to the whole process group anyway (e.g., hitting ^C in a terminal) and that part doesn't matter so much. -Peff ^ permalink raw reply related [flat|nested] 3+ messages in thread
* [PATCH] Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit" 2026-04-22 22:35 ` Jeff King @ 2026-04-22 23:00 ` Jeff King 0 siblings, 0 replies; 3+ messages in thread From: Jeff King @ 2026-04-22 23:00 UTC (permalink / raw) To: Jan Palus; +Cc: git, Andrew Au, Junio C Hamano On Wed, Apr 22, 2026 at 06:35:43PM -0400, Jeff King wrote: > So I think just reverting dd3693eb08 is probably the most immediate fix. Here it is in patch form. -- >8 -- Subject: Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit" This reverts commit dd3693eb0859274d62feac8047e1d486b3beaf31. The goal of that commit was to avoid zombie child processes hanging around when the parent git process is killed. But it doesn't quite work when the child command is run by the shell: 1. If there is a shell, then we kill and wait for the shell, not the process spawned by the shell. And so the child process, even if it eventually exits, will hang around as a zombie forever. And this is true of most (all?) shells: bash, dash, etc. So we are not really accomplishing our goal in the first place. 2. Not all shells will exit immediately upon receiving a signal. In particular, mksh will wait for its children to exit (but not actually propagate the signal to them!) leaving us with a potential deadlock: git is wait()ing on mksh, which is wait()ing on a child process, but that child process is waiting on git to produce more input (or EOF) over a pipe. You can see several examples of this deadlock in the test suite, for example by running: make SHELL_PATH=/bin/mksh cd t ./t5702-protocol-v2.sh Because this is a regression for mksh users, and because we did not achieve our goal even with other shells, let's revert the commit for now. If there is a more clever way of doing the same thing, we can consider applying it separately on top (or do nothing and just accept the zombies and rely on PID 1 to reap them). Reported-by: Jan Palus <jpalus@fastmail.com> Signed-off-by: Jeff King <peff@peff.net> --- connect.c | 4 ---- transport-helper.c | 2 -- 2 files changed, 6 deletions(-) diff --git a/connect.c b/connect.c index fcd35c5539..a02583a102 100644 --- a/connect.c +++ b/connect.c @@ -1054,8 +1054,6 @@ static struct child_process *git_proxy_connect(int fd[2], char *host) strvec_push(&proxy->args, port); proxy->in = -1; proxy->out = -1; - proxy->clean_on_exit = 1; - proxy->wait_after_clean = 1; if (start_command(proxy)) die(_("cannot start proxy %s"), git_proxy_command); fd[0] = proxy->out; /* read from proxy stdout */ @@ -1517,8 +1515,6 @@ struct child_process *git_connect(int fd[2], const char *url, } strvec_push(&conn->args, cmd.buf); - conn->clean_on_exit = 1; - conn->wait_after_clean = 1; if (start_command(conn)) die(_("unable to fork")); diff --git a/transport-helper.c b/transport-helper.c index 4e5d1d914f..4614036c99 100644 --- a/transport-helper.c +++ b/transport-helper.c @@ -154,8 +154,6 @@ static struct child_process *get_helper(struct transport *transport) helper->trace2_child_class = helper->args.v[0]; /* "remote-<name>" */ - helper->clean_on_exit = 1; - helper->wait_after_clean = 1; code = start_command(helper); if (code < 0 && errno == ENOENT) die(_("unable to find remote helper for '%s'"), data->name); -- 2.54.0.232.gea82d9008b ^ permalink raw reply related [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-04-22 23:00 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-04-22 19:00 regression 2.54.0: test suite hangs indefinitely with mksh Jan Palus 2026-04-22 22:35 ` Jeff King 2026-04-22 23:00 ` [PATCH] Revert "transport-helper, connect: use clean_on_exit to reap children on abnormal exit" Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox