PID namespace init releases its file locks before its children die

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* PID namespace init releases its file locks before its children die
@ 2025-10-02 18:22 Demi Marie Obenour
  2025-10-03 12:38 ` Oleg Nesterov
  0 siblings, 1 reply; 4+ messages in thread
From: Demi Marie Obenour @ 2025-10-02 18:22 UTC (permalink / raw)
  To: Linux kernel mailing list, Andrew Morton

[-- Attachment #1.1.1: Type: text/plain, Size: 818 bytes --]

I noticed that PID 1 in a PID namespace can release file locks (due
to exiting) while its children are still running for a bit.  If the
locks held by PID 1 were relied to serialize the execution of its
child processes, this could result in data corruption.

Specifically, the child processes are killed via exit_notify() ->
forget_original_parent() -> find_child_reaper() ->
zap_pid_ns_processes().  That comes *after* exit_files(), which
releases the file locks.

While it is possible to implement this with cgroups, cgroups
are quite a bit more complicated to use, at least compared to
a single call to unshare() before fork().

Is this intentional?  Changing the behavior would make supervision
trees significantly easier to properly implement.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: PID namespace init releases its file locks before its children die
  2025-10-02 18:22 PID namespace init releases its file locks before its children die Demi Marie Obenour
@ 2025-10-03 12:38 ` Oleg Nesterov
  2025-10-03 17:09   ` Demi Marie Obenour
  2025-10-07 12:02   ` Christian Brauner
  0 siblings, 2 replies; 4+ messages in thread
From: Oleg Nesterov @ 2025-10-03 12:38 UTC (permalink / raw)
  To: Demi Marie Obenour, Christian Brauner, Mateusz Guzik
  Cc: Linux kernel mailing list, Andrew Morton

Add CCs.

I can't really help, just my 2 cents...

I don't think we can change do_exit() to call exit_files() after
exit_notify().

At first glance, technically it is possible to change do_exit() so
that the exiting reaper does zap_pid_ns_processes() earlier... But
even if this is possible, I think that this complication needs more
justification.

Oleg.

On 10/02, Demi Marie Obenour wrote:
>
> I noticed that PID 1 in a PID namespace can release file locks (due
> to exiting) while its children are still running for a bit.  If the
> locks held by PID 1 were relied to serialize the execution of its
> child processes, this could result in data corruption.
>
> Specifically, the child processes are killed via exit_notify() ->
> forget_original_parent() -> find_child_reaper() ->
> zap_pid_ns_processes().  That comes *after* exit_files(), which
> releases the file locks.
>
> While it is possible to implement this with cgroups, cgroups
> are quite a bit more complicated to use, at least compared to
> a single call to unshare() before fork().
>
> Is this intentional?  Changing the behavior would make supervision
> trees significantly easier to properly implement.
> --
> Sincerely,
> Demi Marie Obenour (she/her/hers)


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: PID namespace init releases its file locks before its children die
  2025-10-03 12:38 ` Oleg Nesterov
@ 2025-10-03 17:09   ` Demi Marie Obenour
  2025-10-07 12:02   ` Christian Brauner
  1 sibling, 0 replies; 4+ messages in thread
From: Demi Marie Obenour @ 2025-10-03 17:09 UTC (permalink / raw)
  To: Oleg Nesterov, Christian Brauner, Mateusz Guzik
  Cc: Linux kernel mailing list, Andrew Morton

[-- Attachment #1.1.1: Type: text/plain, Size: 2596 bytes --]

On 10/3/25 08:38, Oleg Nesterov wrote:
> Add CCs.
> 
> I can't really help, just my 2 cents...
> 
> I don't think we can change do_exit() to call exit_files() after
> exit_notify().

Not surprised.

> At first glance, technically it is possible to change do_exit() so
> that the exiting reaper does zap_pid_ns_processes() earlier... But
> even if this is possible, I think that this complication needs more
> justification.

I have a service that must not be run more than once concurrently.
I'm using s6 [1] as the service manager.  s6 doesn't support cgroups,
but it does support running the child in a PID namespace.
I was hoping that if the init process in the PID namespace took an
exclusive file lock, it would ensure that all the children in the PID
namespace stopped running before the lock is released.  Unfortunately,
with the current implementation that is not the case.

Right now, I'm leaking the file descriptor into the child processes and
relying on them to not close it.  This is somewhat fragile, though.
For instance, anything using GSubprocess breaks this assumption.
GSubprocess closes all file descriptors not explicitly passed into
the child.

It is definitely possible to implement this with cgroups: wait for
the cgroup to become empty before spawning the child.  It is also
possible for the supervisor to ensure that the child is dead before
spawning a new one, though s6's architecture makes this non-trivial.
The parent of the child is not PID 1, so it would need to inform
PID 1 to kill the child (and wait for it) if the actual supervisor
dies.

> Oleg.
> 
> On 10/02, Demi Marie Obenour wrote:
>>
>> I noticed that PID 1 in a PID namespace can release file locks (due
>> to exiting) while its children are still running for a bit.  If the
>> locks held by PID 1 were relied to serialize the execution of its
>> child processes, this could result in data corruption.
>>
>> Specifically, the child processes are killed via exit_notify() ->
>> forget_original_parent() -> find_child_reaper() ->
>> zap_pid_ns_processes().  That comes *after* exit_files(), which
>> releases the file locks.
>>
>> While it is possible to implement this with cgroups, cgroups
>> are quite a bit more complicated to use, at least compared to
>> a single call to unshare() before fork().
>>
>> Is this intentional?  Changing the behavior would make supervision
>> trees significantly easier to properly implement.
>> --
>> Sincerely,
>> Demi Marie Obenour (she/her/hers)
> 

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: PID namespace init releases its file locks before its children die
  2025-10-03 12:38 ` Oleg Nesterov
  2025-10-03 17:09   ` Demi Marie Obenour
@ 2025-10-07 12:02   ` Christian Brauner
  1 sibling, 0 replies; 4+ messages in thread
From: Christian Brauner @ 2025-10-07 12:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Demi Marie Obenour, Mateusz Guzik, Linux kernel mailing list,
	Andrew Morton

On Fri, Oct 03, 2025 at 02:38:28PM +0200, Oleg Nesterov wrote:
> Add CCs.
> 
> I can't really help, just my 2 cents...
> 
> I don't think we can change do_exit() to call exit_files() after
> exit_notify().
> 
> At first glance, technically it is possible to change do_exit() so
> that the exiting reaper does zap_pid_ns_processes() earlier... But
> even if this is possible, I think that this complication needs more
> justification.

I agree. Relying on side-effect of file locks being released isn't exactly
a great idea. I'm certainly don't want to give any guarantee there. It's
doable with cgroups which is the correct solution for this imho.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-10-07 12:03 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-02 18:22 PID namespace init releases its file locks before its children die Demi Marie Obenour
2025-10-03 12:38 ` Oleg Nesterov
2025-10-03 17:09   ` Demi Marie Obenour
2025-10-07 12:02   ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox