On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > Nothing has changed in that regard and I'm not interested in stuffing
> > the VFS APIs full of special-purpose behavior to work around the fact
> > that this is work that needs to be done in userspace. Change the apps,
> > stop pushing more and more cruft into the VFS that has no business
> > there.
> 
> It would be interesting to know how to patch user space to get the same
> guarantees...  Do you think I would propose a kernel patch otherwise?

You could mmap the script file with MAP_PRIVATE. This is the *actual*
protection the kernel uses against overwriting binaries (yes, ETXTBSY is
nice but IIRC there are ways to get around it anyway). Of course, most
interpreters don't mmap their scripts, but this is a potential solution.
If the security policy is based on validating the script text in some
way, this avoids the TOCTOU.

Now, in cases where you have IMA or something and you only permit signed
binaries to execute, you could argue there is a different race here (an
attacker creates a malicious script, runs it, and then replaces it with
a valid script's contents and metadata after the fact to get
AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
this is even possible with IMA (can an unprivileged user even set
security.ima?). But even then, I would expect users that really need
this would also probably use fs-verity or dm-verity that would block
this kind of attack since it would render the files read-only anyway.

This is why a more detailed threat model of what kinds of attacks are
relevant is useful. I was there for the talk you gave and subsequent
discussion at last year's LPC, but I felt that your threat model was
not really fleshed out at all. I am still not sure what capabilities you
expect the attacker to have nor what is being used to authenticate
binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above
assumptions, but I can't know without knowing what threat model you have
in mind, *in detail*.

For example, if you are dealing with an attacker that has CAP_SYS_ADMIN,
there are plenty of ways for an attacker to execute their own code
without using interpreters (create a new tmpfs with fsopen(2) for
instance). Executable memfds are even easier and don't require
privileges on most systems (yes, you can block them with vm.memfd_noexec
but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or
mount(2)).

(As an aside, it's a shame that AT_EXECVE_CHECK burned one of the
top-level AT_* bits for a per-syscall flag -- the block comment I added
in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be
allocated") was meant to avoid this happening but it seems you and the
reviewers missed that...)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/