git.vger.kernel.org archive mirror
* [bug] git clone command leaves orphaned ssh process
@ 2023-09-10  6:38 Max Amelchenko
  2023-09-10  8:50 ` Bagas Sanjaya
  0 siblings, 1 reply; 9+ messages in thread
From: Max Amelchenko @ 2023-09-10  6:38 UTC (permalink / raw)
  To: git

What did you do before the bug happened? (Steps to reproduce your issue)

Run the command:
ps aux
Observe no ssh processes running on system.

Run git clone against a non-existent hostname:
git clone -v --depth=1 -b 3.23.66 ssh://*****@*****lab-prod.server.sim.cloud/terraform/modules/aws-eks /tmp/dest
Observe the command fails with:

Could not resolve hostname *****lab-prod.server.sim.cloud: Name or service not known

Run:
ps aux

Observe a defunct ssh process is left behind.


What did you expect to happen? (Expected behavior)
I expected the command to quit without leaving any processes behind.

What happened instead? (Actual behavior)
The command quit and left a defunct ssh process on the system.

What's different between what you expected and what actually happened?
I don't want zombie processes left after any git command (either failed or not).

Anything else you want to add:
These are orphaned zombie processes, meaning we're stuck with them
until the system reboots (which is bad).

[System Info]
git version:
git version 2.40.1
cpu: aarch64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
compiler info: gnuc: 7.3
libc info: glibc: 2.26
$SHELL (typically, interactive shell): <unset>

[Enabled Hooks]
not run from a git repository - no hooks to show


* Re: [bug] git clone command leaves orphaned ssh process
  2023-09-10  6:38 [bug] git clone command leaves orphaned ssh process Max Amelchenko
@ 2023-09-10  8:50 ` Bagas Sanjaya
  2023-09-10  9:47   ` Max Amelchenko
  0 siblings, 1 reply; 9+ messages in thread
From: Bagas Sanjaya @ 2023-09-10  8:50 UTC (permalink / raw)
  To: Max Amelchenko, git; +Cc: Hideaki Yoshifuji, Junio C Hamano

On Sun, Sep 10, 2023 at 09:38:54AM +0300, Max Amelchenko wrote:
> What did you do before the bug happened? (Steps to reproduce your issue)
> 
> Run the command:
> ps aux
> Observe no ssh processes running on system.
> 
> Run git clone against a non-existent hostname:
> git clone -v --depth=1 -b 3.23.66
> ssh://*****@*****lab-prod.server.sim.cloud/terraform/modules/aws-eks
> /tmp/dest
> Observe the command fails with:
> 
> Could not resolve hostname *****lab-prod.server.sim.cloud: Name or
> service not known
> 
> Run:
> ps aux
> 
> Observe a defunct ssh process is left behind.

With Git built from current master on my system, I got only sshd (server) and ssh-agent processes instead:

```
root         835  0.0  0.0  15500  3584 ?        Ss   14:38   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
165536      3865  0.0  0.0   8488  1408 ?        Ss   14:39   0:00 sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups
165536      4039  0.0  0.0  11308  1920 ?        Ss   14:40   0:00 sshd: /usr/bin/sshd -D [listener] 0 of 10-100 startups
165536      4374  0.0  0.0  15404  1920 ?        Ss   14:40   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
165536      4399  0.0  0.0  15404  1792 ?        Ss   14:40   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
165536      4732  0.0  0.0  15404  2048 ?        Ss   14:41   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
165536      4943  0.0  0.0  18004   848 ?        Ss   14:41   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
bagas       6841  0.0  0.0   7668  1092 ?        Ss   14:43   0:00 /usr/bin/ssh-agent /usr/bin/im-launch /usr/bin/gnome-session
bagas       6908  0.0  0.1 162780  5488 ?        Ssl  14:43   0:00 /usr/libexec/gcr-ssh-agent /run/user/1000/gcr

```

What is your ps output then?

Thanks.

-- 
An old man doll... just what I always wanted! - Clara


* Re: [bug] git clone command leaves orphaned ssh process
  2023-09-10  8:50 ` Bagas Sanjaya
@ 2023-09-10  9:47   ` Max Amelchenko
  2023-09-10 18:47     ` Taylor Blau
  0 siblings, 1 reply; 9+ messages in thread
From: Max Amelchenko @ 2023-09-10  9:47 UTC (permalink / raw)
  To: Bagas Sanjaya; +Cc: git, Hideaki Yoshifuji, Junio C Hamano

Output of first ps aux command:

bash-4.2# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 715708  5144 pts/0    Ssl+ 09:43   0:00 /usr/local/bin/aws-lambda-rie /var/runtime/bootstrap
root        14  0.1  0.0 114096  3088 pts/1    Ss   09:43   0:00 bash
root       165  0.0  0.0 118296  3392 pts/1    R+   09:45   0:00 ps aux

Output of second ps aux command (after running git clone):

bash-4.2# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 715708  5144 pts/0    Ssl+ 09:43   0:00 /usr/local/bin/aws-lambda-rie /var/runtime/bootstrap
root        14  0.0  0.0 114096  3088 pts/1    Ss   09:43   0:00 bash
root       167  0.5  0.0      0     0 pts/1    Z    09:46   0:00 [ssh] <defunct>
root       168  0.0  0.0 118296  3408 pts/1    R+   09:46   0:00 ps aux

See the added ssh defunct process.
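
For anyone who would rather script this check than eyeball the listing,
here is a minimal Python sketch that picks the zombie rows out of
`ps aux` text. The name `find_zombies` is made up for illustration, and
the column positions (PID in column 2, STAT in column 8) are an
assumption based on the procps-style output shown in this thread.

```python
def find_zombies(ps_output):
    """Return (pid, command) for every row whose STAT column starts with 'Z'."""
    zombies = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so COMMAND keeps its internal spaces.
        fields = line.split(None, 10)
        # Skip the header, blank lines, and wrapped continuation lines.
        if len(fields) < 10 or not fields[1].isdigit():
            continue
        if fields[7].startswith("Z"):
            command = fields[10].rstrip() if len(fields) > 10 else ""
            zombies.append((int(fields[1]), command))
    return zombies
```

Feeding it the second listing from this message should report only
PID 167, the defunct ssh.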



* Re: [bug] git clone command leaves orphaned ssh process
  2023-09-10  9:47   ` Max Amelchenko
@ 2023-09-10 18:47     ` Taylor Blau
  2023-09-11 10:11       ` Max Amelchenko
  0 siblings, 1 reply; 9+ messages in thread
From: Taylor Blau @ 2023-09-10 18:47 UTC (permalink / raw)
  To: Max Amelchenko; +Cc: Bagas Sanjaya, git, Hideaki Yoshifuji, Junio C Hamano

On Sun, Sep 10, 2023 at 12:47:14PM +0300, Max Amelchenko wrote:
> Output of second ps aux command (after running git clone):
>
> bash-4.2# ps aux
>
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>
> root         1  0.0  0.0 715708  5144 pts/0    Ssl+ 09:43   0:00
> /usr/local/bin/aws-lambda-rie /var/runtime/bootstrap
>
> root        14  0.0  0.0 114096  3088 pts/1    Ss   09:43   0:00 bash
>
> root       167  0.5  0.0      0     0 pts/1    Z    09:46   0:00 [ssh] <defunct>
>
> root       168  0.0  0.0 118296  3408 pts/1    R+   09:46   0:00 ps aux
>
> See the added ssh defunct process.

Hmm... I wasn't quite able to reproduce this locally. Below
`git.compile` points to a Git executable built from the v2.40.1 tag
corresponding to your bug report:

    $ host='ssh://*****@*****lab-prod.server.sim.cloud/terraform/modules/aws-eks'
    $ git.compile clone "$host" /tmp/x
    Cloning into '/tmp/x'...
    ssh: Could not resolve hostname *****lab-prod.server.sim.cloud: Name or service not known
    fatal: Could not read from remote repository.

    Please make sure you have the correct access rights
    and the repository exists.

and then:

    $ ps aux | grep defunct
    ttaylorr 3688844  0.0  0.0   6340  2180 pts/1    S+   14:45   0:00 grep --color defunct

Thanks,
Taylor


* Re: [bug] git clone command leaves orphaned ssh process
  2023-09-10 18:47     ` Taylor Blau
@ 2023-09-11 10:11       ` Max Amelchenko
  2023-09-12  0:40         ` Aaron Schrab
  0 siblings, 1 reply; 9+ messages in thread
From: Max Amelchenko @ 2023-09-11 10:11 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Bagas Sanjaya, git, Hideaki Yoshifuji, Junio C Hamano

Maybe it's also connected to the underlying infrastructure? We are
seeing this in AWS Lambda jobs, and we're hitting the system's
max-process limit because of it.
Can you try running this inside the public.ecr.aws/lambda/python image?



* Re: [bug] git clone command leaves orphaned ssh process
  2023-09-11 10:11       ` Max Amelchenko
@ 2023-09-12  0:40         ` Aaron Schrab
  2023-09-12  4:33           ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Aaron Schrab @ 2023-09-12  0:40 UTC (permalink / raw)
  To: Max Amelchenko
  Cc: Taylor Blau, Bagas Sanjaya, git, Hideaki Yoshifuji,
	Junio C Hamano

At 13:11 +0300 11 Sep 2023, Max Amelchenko <maxamel2002@gmail.com> wrote:
>Maybe it's connected also to the underlying infrastructure? We are
>getting this in AWS lambda jobs and we're hitting a system limit of
>max processes because of it.

Running as a lambda, or in a container, could definitely be why you're 
seeing a difference. Normally when a process is orphaned it gets adopted 
by `init` (PID 1), and that will take care of cleaning up after orphaned 
zombie processes.

But most of the time containers just run the configured process
directly, without an init process. That leaves nothing to clean up
orphaned processes.

Although for that to really be a problem, you would have to hit that
max process limit inside a single container invocation. Of course, since
containers usually aren't meant to spawn a lot of processes, that
limit might be a lot lower than on a normal system.

I know that Docker provides a way to include an init process in the 
started container (`docker run --init`), but I don't think that AWS 
Lambda does.
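
To make "cleaning up" concrete, the core of what an init process does
for the children it has adopted can be sketched in a few lines of
Python. This is only an illustration, assuming a POSIX system;
`reap_exited_children` is a hypothetical name, and a real init also
forwards signals and does a good deal more.

```python
import os


def reap_exited_children():
    """Collect every already-exited child without blocking.

    Returns {pid: exit_code}; a child killed by a signal is reported
    as the negated signal number.
    """
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        else:
            reaped[pid] = -os.WTERMSIG(status)
    return reaped
```

An init typically runs a loop like this whenever it receives SIGCHLD.
Once a child's status has been collected, its zombie entry disappears
from the process table.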


* Re: [bug] git clone command leaves orphaned ssh process
  2023-09-12  0:40         ` Aaron Schrab
@ 2023-09-12  4:33           ` Jeff King
  2023-09-24 10:25             ` Max Amelchenko
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff King @ 2023-09-12  4:33 UTC (permalink / raw)
  To: Aaron Schrab
  Cc: Max Amelchenko, Taylor Blau, Bagas Sanjaya, git,
	Hideaki Yoshifuji, Junio C Hamano

On Mon, Sep 11, 2023 at 08:40:49PM -0400, Aaron Schrab wrote:

> At 13:11 +0300 11 Sep 2023, Max Amelchenko <maxamel2002@gmail.com> wrote:
> > Maybe it's connected also to the underlying infrastructure? We are
> > getting this in AWS lambda jobs and we're hitting a system limit of
> > max processes because of it.
> 
> Running as a lambda, or in a container, could definitely be why you're
> seeing a difference. Normally when a process is orphaned it gets adopted by
> `init` (PID 1), and that will take care of cleaning up after orphaned zombie
> processes.
> 
> But most of the time containers just run the configured process directly,
> without an init process. That leaves nothing to clean orphan processes.

Yeah, that seems like the culprit. If the clone finishes successfully,
we do end up in finish_connect(), where we wait() for the process. But
if we exit early (in this case, ssh bails and we get EOF on the pipe
reading from it), then we may call die() and exit immediately.

We _could_ take special care to add every spawned process to a global
list, set up handlers via atexit() and signal(), and then reap the
processes. But traditionally it's not a big deal to exit with un-reaped
children, and this is the responsibility of init. I'm not sure it makes
sense for Git to basically reimplement that catch-all (and of course we
cannot even do it reliably if we are killed by certain signals).
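
For illustration only, the bookkeeping described above might look
roughly like this in Python (Git itself is written in C, and this is not
Git's code). `spawn`, `reap_spawned` and the `_spawned` list are
hypothetical names. Python's atexit handlers share the gap noted above:
they do not run when the process dies from an unhandled signal.

```python
import atexit
import subprocess

_spawned = []  # every child started through spawn()


def spawn(argv):
    """Start a child process and remember it so it can be reaped later."""
    proc = subprocess.Popen(argv)
    _spawned.append(proc)
    return proc


def reap_spawned():
    """Wait for every remembered child; return {pid: exit_code}."""
    results = {}
    while _spawned:
        proc = _spawned.pop()
        # wait() blocks until the child exits and collects its status,
        # so the child cannot linger as a zombie.
        results[proc.pid] = proc.wait()
    return results


# Runs on normal interpreter exit (including sys.exit), but not after
# os._exit() or death by an unhandled signal such as SIGKILL.
atexit.register(reap_spawned)
```

A real version would likely need a timeout or a kill before waiting,
since wait() blocks for as long as a child keeps running.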

> Although for that to really be a problem, would require hitting that max
> process limit inside a single container invocation. Of course since
> containers usually aren't meant to be spawning a lot of processes, that
> limit might be a lot lower than on a normal system.
> 
> I know that Docker provides a way to include an init process in the started
> container (`docker run --init`), but I don't think that AWS Lambda does.

I don't know anything about Lambda, but if you are running arbitrary
commands, then it seems like you could insert something like this:

  https://github.com/krallin/tini

into the mix. I much prefer that to teaching Git to try to do the same
thing in-process.

-Peff


* Re: [bug] git clone command leaves orphaned ssh process
  2023-09-12  4:33           ` Jeff King
@ 2023-09-24 10:25             ` Max Amelchenko
  2023-09-25 12:29               ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Max Amelchenko @ 2023-09-24 10:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Aaron Schrab, Taylor Blau, Bagas Sanjaya, git, Hideaki Yoshifuji,
	Junio C Hamano

Thanks,
Just wanted to clarify something. AWS will not handle this (we had a
support ticket about that case), since they do not interfere with the
processes running on their infrastructure. If a problematic process is
causing this pile-up of orphaned processes, it needs to be handled by
that process.
The question is, doesn't Git want to ensure a clean exit in all cases?
This is a clear example of a non-clean exit.



* Re: [bug] git clone command leaves orphaned ssh process
  2023-09-24 10:25             ` Max Amelchenko
@ 2023-09-25 12:29               ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2023-09-25 12:29 UTC (permalink / raw)
  To: Max Amelchenko
  Cc: Aaron Schrab, Taylor Blau, Bagas Sanjaya, git, Hideaki Yoshifuji,
	Junio C Hamano

On Sun, Sep 24, 2023 at 01:25:08PM +0300, Max Amelchenko wrote:

> Thanks,
> Just wanted to clarify something. This will not be handled by AWS (we
> had a support ticket re. that case), since they do not interfere with
> the running processes on its infrastructure, and if there is a
> problematic process causing this overflowing in orphaned processes, it
> needs to be handled by that process.
> The question is, doesn't Git want to ensure a clean exit in all cases?
> This is a clear example of a non-clean exit.

Git does ensure a clean exit if we run the clone process to completion.
In your case we hit a fatal error midway through and are aborting. At
that point we do not care what the exit code of ssh is.

We _could_ set up a signal/atexit handler combo to call waitpid(), but
we would just be throwing away the result code. And that is a catch-all
I would rather see done by PID 1 than by git. It can serve all
processes, not just git. And it can do so more robustly, since git may
be killed without a chance to run cleanup code (e.g., signal 9).
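
Where replacing PID 1 is not possible, one option on Linux might be for
the program that launches git to mark itself as a "child subreaper" via
prctl(2). Orphaned descendants are then reparented to it instead of
PID 1, and it can reap them. The Python sketch below is hedged
accordingly: `become_subreaper` and `run_and_reap` are hypothetical
helper names, 36 is PR_SET_CHILD_SUBREAPER from <linux/prctl.h>, and
whether AWS Lambda permits the call is an assumption that would need
testing.

```python
import ctypes
import os
import subprocess
import time

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper():
    """Ask Linux to reparent orphaned descendants to this process."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        zero = ctypes.c_ulong(0)
        rc = libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                        zero, zero, zero)
        return rc == 0
    except (OSError, AttributeError, TypeError):
        return False  # not Linux, or prctl is unavailable


def run_and_reap(argv, grace=0.5):
    """Run argv, then reap descendants that were reparented to us.

    Returns (exit_code, [reaped_pids]). Note that waitpid(-1) collects
    any child of this process, so do not mix it with other Popen objects.
    """
    result = subprocess.run(argv)
    orphans = []
    deadline = time.monotonic() + grace
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # nothing left to reap
        if pid:
            orphans.append(pid)
        elif time.monotonic() >= deadline:
            break  # remaining descendants are still running; leave them
        else:
            time.sleep(0.01)
    return result.returncode, orphans
```

A caller would invoke `become_subreaper()` once at startup and then use
`run_and_reap(["git", "clone", ...])` in place of a plain subprocess
call. This is the same catch-all a real init provides, just scoped to
one launcher.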

-Peff
