On 2025-08-26, Mickaël Salaün wrote:
> On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > Nothing has changed in that regard and I'm not interested in stuffing
> > the VFS APIs full of special-purpose behavior to work around the fact
> > that this is work that needs to be done in userspace. Change the apps,
> > stop pushing more and more cruft into the VFS that has no business
> > there.
>
> It would be interesting to know how to patch user space to get the same
> guarantees... Do you think I would propose a kernel patch otherwise?

You could mmap the script file with MAP_PRIVATE. This is the *actual*
protection the kernel uses against overwriting binaries (yes, ETXTBSY is
nice but IIRC there are ways to get around it anyway). Of course, most
interpreters don't mmap their scripts, but this is a potential solution.
If the security policy is based on validating the script text in some
way, this avoids the TOCTOU.

Now, in cases where you have IMA or something and you only permit signed
binaries to execute, you could argue there is a different race here (an
attacker creates a malicious script, runs it, and then replaces it with
a valid script's contents and metadata after the fact to get
AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
this is even possible with IMA (can an unprivileged user even set
security.ima?). But even then, I would expect users that really need
this would also probably use fs-verity or dm-verity that would block
this kind of attack since it would render the files read-only anyway.

This is why a more detailed threat model of what kinds of attacks are
relevant is useful. I was there for the talk you gave and subsequent
discussion at last year's LPC, but I felt that your threat model was not
really fleshed out at all. I am still not sure what capabilities you
expect the attacker to have nor what is being used to authenticate
binaries (other than AT_EXECVE_CHECK).
Maybe I'm wrong with my above assumptions, but I can't know without
knowing what threat model you have in mind, *in detail*.

For example, if you are dealing with an attacker that has CAP_SYS_ADMIN,
there are plenty of ways for an attacker to execute their own code
without using interpreters (create a new tmpfs with fsopen(2) for
instance). Executable memfds are even easier and don't require
privileges on most systems (yes, you can block them with
vm.memfd_noexec but CAP_SYS_ADMIN can disable that -- and there's always
fsopen(2) or mount(2)).

(As an aside, it's a shame that AT_EXECVE_CHECK burned one of the
top-level AT_* bits for a per-syscall flag -- the block comment I added
in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be
allocated") was meant to avoid this happening but it seems you and the
reviewers missed that...)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/