public inbox for linux-kernel@vger.kernel.org
From: Demi Marie Obenour <demiobenour@gmail.com>
To: Oleg Nesterov <oleg@redhat.com>,
	Christian Brauner <brauner@kernel.org>,
	Mateusz Guzik <mjguzik@gmail.com>
Cc: Linux kernel mailing list <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: PID namespace init releases its file locks before its children die
Date: Fri, 3 Oct 2025 13:09:27 -0400
Message-ID: <d2fa498d-1acf-4c92-ae8c-2d91be1449df@gmail.com>
In-Reply-To: <20251003123828.GA26441@redhat.com>


On 10/3/25 08:38, Oleg Nesterov wrote:
> Add CCs.
> 
> I can't really help, just my 2 cents...
> 
> I don't think we can change do_exit() to call exit_files() after
> exit_notify().

Not surprised.

> At first glance, technically it is possible to change do_exit() so
> that the exiting reaper does zap_pid_ns_processes() earlier... But
> even if this is possible, I think that this complication needs more
> justification.

I have a service that must not be run more than once concurrently.
I'm using s6 [1] as the service manager.  s6 doesn't support cgroups,
but it does support running the child in a PID namespace.
I was hoping that if the init process in the PID namespace took an
exclusive file lock, all the children in the PID namespace would be
guaranteed to have stopped running before the lock was released.
Unfortunately, with the current implementation that is not the case.

Right now, I'm leaking the file descriptor into the child processes and
relying on them to not close it.  This is somewhat fragile, though.
For instance, anything using GSubprocess breaks this assumption,
because GSubprocess closes all file descriptors not explicitly passed
into the child.
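
To illustrate the descriptor-leaking workaround and why it is fragile,
here is a minimal userspace sketch.  It assumes flock(2)-style locks,
which belong to the open file description and therefore stay held as
long as any process still has a copy of the descriptor.  The helper
names lock_is_held() and spawn_with_lock() are made up for this
example, and Python's subprocess module merely stands in for any
spawner that closes descriptors it was not told to keep.

```python
import fcntl
import os
import subprocess


def lock_is_held(path):
    """Return True if some other open file description holds a flock() on path."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        return False
    finally:
        # Closing the probe descriptor also drops the lock if we got it.
        os.close(fd)


def spawn_with_lock(path, argv, leak_fd):
    """Take an exclusive flock() on path, start argv, then close our own fd."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.flock(fd, fcntl.LOCK_EX)
    # pass_fds keeps fd open in the child.  Without it, the descriptor is
    # closed in the child, as a close-everything spawner would do.
    child = subprocess.Popen(argv, pass_fds=(fd,) if leak_fd else ())
    # Stands in for the lock holder exiting while the child keeps running.
    os.close(fd)
    return child
```

With leak_fd=True, lock_is_held() should keep reporting the lock as
taken until the child exits.  With leak_fd=False, the lock is gone as
soon as the parent closes its descriptor, even though the child is
still running.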

It is definitely possible to implement this with cgroups: wait for
the cgroup to become empty before spawning the child.  It is also
possible for the supervisor to ensure that the child is dead before
spawning a new one, though s6's architecture makes this non-trivial.
The parent of the child is not PID 1, so it would need to inform
PID 1 to kill the child (and wait for it) if the actual supervisor
dies.
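
For comparison, the cgroup variant could look roughly like the sketch
below.  It assumes cgroup v2, where each non-root cgroup has a
cgroup.events file containing a "populated" key that is 1 while the
cgroup or its descendants contain live processes and 0 otherwise.  The
function names and the polling loop are assumptions for illustration
only; a real supervisor would more likely wait for the file-modified
notification on cgroup.events than poll on a timer.

```python
import time
from pathlib import Path


def cgroup_is_populated(events_text):
    """Parse the content of cgroup.events and return the 'populated' flag."""
    for line in events_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "populated":
            return value.strip() == "1"
    raise ValueError("no 'populated' key in cgroup.events")


def wait_until_empty(cgroup_dir, timeout=10.0, interval=0.05):
    """Poll <cgroup_dir>/cgroup.events until the cgroup has no live processes.

    Returns True once the cgroup is empty, False if the timeout expires first.
    """
    events = Path(cgroup_dir) / "cgroup.events"
    deadline = time.monotonic() + timeout
    while cgroup_is_populated(events.read_text()):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

The supervisor would call wait_until_empty() on the service's cgroup
directory and only spawn the next instance once it returns True.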

> Oleg.
> 
> On 10/02, Demi Marie Obenour wrote:
>>
>> I noticed that PID 1 in a PID namespace can release file locks (due
>> to exiting) while its children are still running for a bit.  If the
>> locks held by PID 1 were relied on to serialize the execution of its
>> child processes, this could result in data corruption.
>>
>> Specifically, the child processes are killed via exit_notify() ->
>> forget_original_parent() -> find_child_reaper() ->
>> zap_pid_ns_processes().  That comes *after* exit_files(), which
>> releases the file locks.
>>
>> While it is possible to implement this with cgroups, cgroups
>> are quite a bit more complicated to use, at least compared to
>> a single call to unshare() before fork().
>>
>> Is this intentional?  Changing the behavior would make supervision
>> trees significantly easier to properly implement.
>> --
>> Sincerely,
>> Demi Marie Obenour (she/her/hers)
> 


-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

