* [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) @ 2025-08-22 17:07 Mickaël Salaün 2025-08-22 17:07 ` [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE Mickaël Salaün ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Mickaël Salaün @ 2025-08-22 17:07 UTC (permalink / raw) To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module Hi, Script interpreters can check if a file would be allowed to be executed by the kernel using the new AT_EXECVE_CHECK flag. This approach works well on systems with write-xor-execute policies, where scripts cannot be modified by malicious processes. However, this protection may not be available on more generic distributions. The key difference between `./script.sh` and `sh script.sh` (when using AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened for writing while it's being executed. To achieve parity, the kernel should provide a mechanism for script interpreters to deny write access during script interpretation. While interpreters can copy script content into a buffer, a race condition remains possible after AT_EXECVE_CHECK. This patch series introduces a new O_DENY_WRITE flag for use with open*(2) and fcntl(2). Both interfaces are necessary since script interpreters may receive either a file path or file descriptor. For backward compatibility, open(2) with O_DENY_WRITE will not fail on unsupported systems, while users requiring explicit support guarantees can use openat2(2). 
The check_exec.rst documentation and related examples do not mention this
new feature yet.

Regards,

Mickaël Salaün (2):
  fs: Add O_DENY_WRITE
  selftests/exec: Add O_DENY_WRITE tests

 fs/fcntl.c                                |  26 ++-
 fs/file_table.c                           |   2 +
 fs/namei.c                                |   6 +
 include/linux/fcntl.h                     |   2 +-
 include/uapi/asm-generic/fcntl.h          |   4 +
 tools/testing/selftests/exec/check-exec.c | 219 ++++++++++++++++++++++
 6 files changed, 256 insertions(+), 3 deletions(-)

base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
-- 
2.50.1

^ permalink raw reply [flat|nested] 40+ messages in thread
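A minimal sketch of how an interpreter might use the two interfaces the cover letter describes, assuming the RFC were applied. O_DENY_WRITE is not in mainline; the fallback value below is copied from the asm-generic definition in patch 1/2, and both helper names are hypothetical. On a kernel without the patch, open(2) and F_SETFL ignore the unknown bit, so these calls succeed without providing any protection.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Proposed by this RFC only; value taken from patch 1/2, not in mainline. */
#ifndef O_DENY_WRITE
#define O_DENY_WRITE 040000000
#endif

/* Hypothetical helper: the interpreter was given a script path. */
int open_script_deny_write(const char *path)
{
	return open(path, O_RDONLY | O_CLOEXEC | O_DENY_WRITE);
}

/* Hypothetical helper: the interpreter was handed an already open fd. */
int set_deny_write(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	/* Keep the existing status flags and add the proposed one. */
	return fcntl(fd, F_SETFL, flags | O_DENY_WRITE);
}
```

An interpreter that needs to know whether the flag is really supported would have to use openat2(2), which rejects unknown flags, as the cover letter notes.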
* [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
  2025-08-22 17:07 [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) Mickaël Salaün
@ 2025-08-22 17:07 ` Mickaël Salaün
  2025-08-22 19:45   ` Jann Horn
  2025-08-27 10:29   ` Aleksa Sarai
  2025-08-22 17:08 ` [RFC PATCH v1 2/2] selftests/exec: Add O_DENY_WRITE tests Mickaël Salaün
  2025-08-26  9:07 ` [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) Christian Brauner
  2 siblings, 2 replies; 40+ messages in thread
From: Mickaël Salaün @ 2025-08-22 17:07 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn
  Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn,
	Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu

Add a new O_DENY_WRITE flag usable at open time and on an already opened
file (e.g. a passed file descriptor).  This changes the state of the
opened file by making it read-only until it is closed.  The main use case
is for script interpreters to get the guarantee that a script's content
cannot be altered while it is being read and interpreted.  This is useful
for generic distros that may not have a write-xor-execute policy.  See
commit a5874fde3c08 ("exec: Add a new AT_EXECVE_CHECK flag to
execveat(2)")

Both execve(2) and the IOCTL to enable fsverity can already set this
property on files with deny_write_access().  This new O_DENY_WRITE makes
it widely available.  This is similar to what other OSs may provide,
e.g. opening a file with only FILE_SHARE_READ on Windows.

Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Serge Hallyn <serge@hallyn.com> Reported-by: Robert Waite <rowait@microsoft.com> Signed-off-by: Mickaël Salaün <mic@digikod.net> Link: https://lore.kernel.org/r/20250822170800.2116980-2-mic@digikod.net --- fs/fcntl.c | 26 ++++++++++++++++++++++++-- fs/file_table.c | 2 ++ fs/namei.c | 6 ++++++ include/linux/fcntl.h | 2 +- include/uapi/asm-generic/fcntl.h | 4 ++++ 5 files changed, 37 insertions(+), 3 deletions(-) diff --git a/fs/fcntl.c b/fs/fcntl.c index 5598e4d57422..0c80c0fbc706 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -34,7 +34,8 @@ #include "internal.h" -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME) +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \ + O_DENY_WRITE) static int setfl(int fd, struct file * filp, unsigned int arg) { @@ -80,8 +81,29 @@ static int setfl(int fd, struct file * filp, unsigned int arg) error = 0; } spin_lock(&filp->f_lock); + + if (arg & O_DENY_WRITE) { + /* Only regular files. */ + if (!S_ISREG(inode->i_mode)) { + error = -EINVAL; + goto unlock; + } + + /* Only sets once. */ + if (!(filp->f_flags & O_DENY_WRITE)) { + error = exe_file_deny_write_access(filp); + if (error) + goto unlock; + } + } else { + if (filp->f_flags & O_DENY_WRITE) + exe_file_allow_write_access(filp); + } + filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK); filp->f_iocb_flags = iocb_flags(filp); + +unlock: spin_unlock(&filp->f_lock); out: @@ -1158,7 +1180,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. 
*/ - BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC)); diff --git a/fs/file_table.c b/fs/file_table.c index 81c72576e548..6ba896b6a53f 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -460,6 +460,8 @@ static void __fput(struct file *file) locks_remove_file(file); security_file_release(file); + if (unlikely(file->f_flags & O_DENY_WRITE)) + exe_file_allow_write_access(file); if (unlikely(file->f_flags & FASYNC)) { if (file->f_op->fasync) file->f_op->fasync(-1, file, 0); diff --git a/fs/namei.c b/fs/namei.c index cd43ff89fbaa..366530bf937d 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3885,6 +3885,12 @@ static int do_open(struct nameidata *nd, error = may_open(idmap, &nd->path, acc_mode, open_flag); if (!error && !(file->f_mode & FMODE_OPENED)) error = vfs_open(&nd->path, file); + if (!error && (open_flag & O_DENY_WRITE)) { + if (S_ISREG(file_inode(file)->i_mode)) + error = exe_file_deny_write_access(file); + else + error = -EINVAL; + } if (!error) error = security_file_post_open(file, op->acc_mode); if (!error && do_truncate) diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index a332e79b3207..dad14101686f 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -10,7 +10,7 @@ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_DENY_WRITE) /* List of all valid flags for the how->resolve argument: */ #define VALID_RESOLVE_FLAGS \ diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 613475285643..facd9136f5af 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -91,6 +91,10 @@ /* a horrid kludge trying to make sure 
that this will fail on old kernels */ #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) +#ifndef O_DENY_WRITE +#define O_DENY_WRITE 040000000 +#endif + #ifndef O_NDELAY #define O_NDELAY O_NONBLOCK #endif -- 2.50.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
  2025-08-22 17:07 ` [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE Mickaël Salaün
@ 2025-08-22 19:45   ` Jann Horn
  2025-08-24 11:03     ` Mickaël Salaün
  2025-08-27 10:18     ` Aleksa Sarai
  2025-08-27 10:29   ` Aleksa Sarai
  1 sibling, 2 replies; 40+ messages in thread
From: Jann Horn @ 2025-08-22 19:45 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module, Andy Lutomirski, Jeff Xu

On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> passed file descriptors).  This changes the state of the opened file by
> making it read-only until it is closed.  The main use case is for script
> interpreters to get the guarantee that script' content cannot be altered
> while being read and interpreted.  This is useful for generic distros
> that may not have a write-xor-execute policy.  See commit a5874fde3c08
> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
>
> Both execve(2) and the IOCTL to enable fsverity can already set this
> property on files with deny_write_access().  This new O_DENY_WRITE make

The kernel actually tried to get rid of this behavior on execve() in
commit 2a010c41285345da60cece35575b4e0af7e7bf44; but sadly that had
to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
because it broke userspace assumptions.

> it widely available.  This is similar to what other OSs may provide
> e.g., opening a file with only FILE_SHARE_READ on Windows.

We used to have the analogous mmap() flag MAP_DENYWRITE, and that was removed for security reasons; as https://man7.org/linux/man-pages/man2/mmap.2.html says: | MAP_DENYWRITE | This flag is ignored. (Long ago—Linux 2.0 and earlier—it | signaled that attempts to write to the underlying file | should fail with ETXTBSY. But this was a source of denial- | of-service attacks.)" It seems to me that the same issue applies to your patch - it would allow unprivileged processes to essentially lock files such that other processes can't write to them anymore. This might allow unprivileged users to prevent root from updating config files or stuff like that if they're updated in-place. ^ permalink raw reply [flat|nested] 40+ messages in thread
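For reference, the existing execve() deny-write behavior discussed above can be observed from userspace: while a binary is running, an attempt to open it for writing normally fails with ETXTBSY. The sketch below is only an illustration; the function name is made up, and the result depends on the kernel version (the behavior was briefly removed and then restored, per the commits cited above) and on file permissions.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to open the currently running executable
 * for writing.  Returns 0 if the open succeeded, otherwise the errno
 * value reported by open(2) (typically ETXTBSY).
 */
int try_write_open_self(void)
{
	int fd = open("/proc/self/exe", O_WRONLY | O_CLOEXEC);

	if (fd >= 0) {
		/* Nothing is written; the fd is closed right away. */
		close(fd);
		return 0;
	}
	return errno;
}
```

The denial-of-service concern is that O_DENY_WRITE would let any process with read access trigger the same refusal on arbitrary regular files, not only on binaries it is executing.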
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-22 19:45 ` Jann Horn @ 2025-08-24 11:03 ` Mickaël Salaün 2025-08-24 18:04 ` Andy Lutomirski 2025-08-27 10:18 ` Aleksa Sarai 1 sibling, 1 reply; 40+ messages in thread From: Mickaël Salaün @ 2025-08-24 11:03 UTC (permalink / raw) To: Jann Horn Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Andy Lutomirski, Jeff Xu On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > passed file descriptors). This changes the state of the opened file by > > making it read-only until it is closed. The main use case is for script > > interpreters to get the guarantee that script' content cannot be altered > > while being read and interpreted. This is useful for generic distros > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > property on files with deny_write_access(). This new O_DENY_WRITE make > > The kernel actually tried to get rid of this behavior on execve() in > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > because it broke userspace assumptions. Oh, good to know. > > > it widely available. 
This is similar to what other OSs may provide
> > e.g., opening a file with only FILE_SHARE_READ on Windows.
>
> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> removed for security reasons; as
> https://man7.org/linux/man-pages/man2/mmap.2.html says:
>
> |        MAP_DENYWRITE
> |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> |               signaled that attempts to write to the underlying file
> |               should fail with ETXTBSY.  But this was a source of denial-
> |               of-service attacks.)"
>
> It seems to me that the same issue applies to your patch - it would
> allow unprivileged processes to essentially lock files such that other
> processes can't write to them anymore. This might allow unprivileged
> users to prevent root from updating config files or stuff like that if
> they're updated in-place.

Yes, I agree, but since it is already the case for executed files I
thought it was worth starting a discussion on this topic.  This new flag
could be restricted to executable files, but we should avoid system-wide
locks like this.  I'm not sure how Windows handles these issues though.

Anyway, we should rely on the access control policy to control write and
execute access in a consistent way (e.g. write-xor-execute).  Thanks for
the references and the background!

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-24 11:03 ` Mickaël Salaün @ 2025-08-24 18:04 ` Andy Lutomirski 2025-08-25 9:31 ` Mickaël Salaün 0 siblings, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2025-08-24 18:04 UTC (permalink / raw) To: Mickaël Salaün Cc: Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > passed file descriptors). This changes the state of the opened file by > > > making it read-only until it is closed. The main use case is for script > > > interpreters to get the guarantee that script' content cannot be altered > > > while being read and interpreted. This is useful for generic distros > > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > property on files with deny_write_access(). This new O_DENY_WRITE make > > > > The kernel actually tried to get rid of this behavior on execve() in > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > because it broke userspace assumptions. > > Oh, good to know. > > > > > > it widely available. 
This is similar to what other OSs may provide > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > removed for security reasons; as > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > | MAP_DENYWRITE > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > | signaled that attempts to write to the underlying file > > | should fail with ETXTBSY. But this was a source of denial- > > | of-service attacks.)" > > > > It seems to me that the same issue applies to your patch - it would > > allow unprivileged processes to essentially lock files such that other > > processes can't write to them anymore. This might allow unprivileged > > users to prevent root from updating config files or stuff like that if > > they're updated in-place. > > Yes, I agree, but since it is the case for executed files I though it > was worth starting a discussion on this topic. This new flag could be > restricted to executable files, but we should avoid system-wide locks > like this. I'm not sure how Windows handle these issues though. > > Anyway, we should rely on the access control policy to control write and > execute access in a consistent way (e.g. write-xor-execute). Thanks for > the references and the background! I'm confused. I understand that there are many contexts in which one would want to prevent execution of unapproved content, which might include preventing a given process from modifying some code and then executing it. I don't understand what these deny-write features have to do with it. These features merely prevent someone from modifying code *that is currently in use*, which is not at all the same thing as preventing modifying code that might get executed -- one can often modify contents *before* executing those contents. In any case, IMO it's rather sad that the elimination of ETXTBSY had to be reverted -- it's really quite a nasty feature. 
But it occurs to me that Linux can more or less do what is IMO the actually desired thing: snapshot the contents of a file and execute the snapshot. The hack at the end of the email works! (Well, it works if the chosen filesystem supports it.) $ ./silly_tmp /tmp/test /tmp vim /proc/self/fd/3 emacs is apparently far, far too clever and can't save if you do: $ ./silly_tmp /tmp/test /tmp emacs /proc/self/fd/3 I'm not seriously suggesting that anyone should execute binaries or scripts on Linux exactly like this, for a whole bunch of reasons: - It needs filesystem support (but maybe this isn't so bad) - It needs write access to a directory on the correct filesystem (a showstopper for serious use) - It is wildly incompatible with write-xor-execute, so this would be a case of one step forward, ten steps back. - It would defeat a lot of tools that inspect /proc, which would be quite annoying to say the least. But maybe a less kludgy version could be used for real. What if there was a syscall that would take an fd and make a snapshot of the file? It would, at least by default, produce a *read-only* snapshot (fully sealed a la F_SEAL_*), inherit any integrity data that came with the source (e.g. LSMs could understand it), would not require a writable directory on the filesystem, and would maybe even come with an extra seal-like thing that prevents it from being linkat-ed. (I'm not sure that linkat would actually be a problem, but I'm also not immediately sure that LSMs would be as comfortable with it if linkat were allowed.) And there could probably be an extremely efficient implementation that might even reuse the existing deny-write mechanism to optimize the common case where the file is never written. For that matter, the actual common case would be to execute stuff in /usr or similar, and those files really ought never to be modified. 
So there could be a file attribute or something that means "this file
CANNOT be modified, but it can still be unlinked or replaced as
usual", and snapshotting such a file would be a no-op.  Distributions
and container tools could set that attribute.  Overlayfs could also
provide an efficient implementation if the file currently comes from
an immutable source.

Hmm, maybe it's not strictly necessary that it be immutable -- maybe
it's sometimes okay if reads start to fail if the contents change.
Let's call this a "weak snapshot" -- reads of a weak snapshot either
return the original contents or fail.  fsverity would give weak
snapshots at no additional cost.

It's worth noting that the common case doesn't actually need an fd.
We have mmap(..., MAP_PRIVATE, ...).  What we would actually want for
mmap use cases is mmap(..., MAP_SNAPSHOT, ...), with the semantics
that the kernel promises that future writes to the source would either
not be reflected in the mapping or would cause SIGBUS.  One might
reasonably debate what forced-writes would do (I think forced-writes
should be allowed just like they currently are, since anyone who can
force-write to process memory is already assumed to be permitted to
bypass write-xor-execute).

---

/* Written by Claude Sonnet 4 with a surprisingly small amount of help
from Andy */

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc < 4) {
        fprintf(stderr, "Usage: %s <source_file> <temp_dir> [exec_args...]\n", argv[0]);
        exit(1);
    }

    const char *source_file = argv[1];
    const char *temp_dir = argv[2];

    // Open source file
    int source_fd = open(source_file, O_RDONLY);
    if (source_fd == -1) {
        perror("Failed to open source file");
        exit(1);
    }

    // Create temporary file
    int temp_fd = open(temp_dir, O_TMPFILE | O_RDWR, 0600);
    if (temp_fd == -1) {
        perror("Failed to create temporary file");
        close(source_fd);
        exit(1);
    }

    // Clone the file contents using FICLONE
    if (ioctl(temp_fd, FICLONE, source_fd) == -1) {
        perror("Failed to clone file");
        close(source_fd);
        close(temp_fd);
        exit(1);
    }

    // Close source file
    close(source_fd);

    // Make sure temp file is on fd 3
    if (temp_fd != 3) {
        if (dup2(temp_fd, 3) == -1) {
            perror("Failed to move temp file to fd 3");
            close(temp_fd);
            exit(1);
        }
        close(temp_fd);
    }

    // Execute the remaining arguments
    if (argc >= 3) {
        execvp(argv[3], &argv[3]);
        perror("Failed to execute command");
        exit(1);
    }

    return 0;
}

^ permalink raw reply [flat|nested] 40+ messages in thread
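To illustrate why MAP_PRIVATE alone is not the snapshot described in the message above: mmap(2) leaves it unspecified whether changes made to the file after the mmap() call are visible in a private mapping. The sketch below (hypothetical helper name) writes to a file after mapping it and reports whether the mapping observed the new bytes. MAP_SNAPSHOT is only an idea floated in this thread and does not exist, so it is not used here.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a write made to the file after
 * mmap() is visible through an untouched MAP_PRIVATE mapping, 0 if the
 * mapping still shows the old bytes, and -1 on error.
 */
int private_mapping_sees_later_write(const char *path)
{
	int visible = -1;
	char *map;
	int fd;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return -1;
	if (write(fd, "old", 3) != 3)
		goto out_close;

	map = mmap(NULL, 3, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Modify the file behind the mapping's back. */
	if (pwrite(fd, "new", 3, 0) == 3)
		visible = memcmp(map, "new", 3) == 0;

	munmap(map, 3);
out_close:
	close(fd);
	return visible;
}
```

Either result is permitted by the interface, which is the gap a snapshot-style mapping would close.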
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-24 18:04 ` Andy Lutomirski @ 2025-08-25 9:31 ` Mickaël Salaün 2025-08-25 9:39 ` Florian Weimer ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Mickaël Salaün @ 2025-08-25 9:31 UTC (permalink / raw) To: Andy Lutomirski Cc: Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > > passed file descriptors). This changes the state of the opened file by > > > > making it read-only until it is closed. The main use case is for script > > > > interpreters to get the guarantee that script' content cannot be altered > > > > while being read and interpreted. This is useful for generic distros > > > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > > property on files with deny_write_access(). 
This new O_DENY_WRITE make > > > > > > The kernel actually tried to get rid of this behavior on execve() in > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > > because it broke userspace assumptions. > > > > Oh, good to know. > > > > > > > > > it widely available. This is similar to what other OSs may provide > > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > > removed for security reasons; as > > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > > > | MAP_DENYWRITE > > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > > | signaled that attempts to write to the underlying file > > > | should fail with ETXTBSY. But this was a source of denial- > > > | of-service attacks.)" > > > > > > It seems to me that the same issue applies to your patch - it would > > > allow unprivileged processes to essentially lock files such that other > > > processes can't write to them anymore. This might allow unprivileged > > > users to prevent root from updating config files or stuff like that if > > > they're updated in-place. > > > > Yes, I agree, but since it is the case for executed files I though it > > was worth starting a discussion on this topic. This new flag could be > > restricted to executable files, but we should avoid system-wide locks > > like this. I'm not sure how Windows handle these issues though. > > > > Anyway, we should rely on the access control policy to control write and > > execute access in a consistent way (e.g. write-xor-execute). Thanks for > > the references and the background! > > I'm confused. I understand that there are many contexts in which one > would want to prevent execution of unapproved content, which might > include preventing a given process from modifying some code and then > executing it. 
>
> I don't understand what these deny-write features have to do with it.
> These features merely prevent someone from modifying code *that is
> currently in use*, which is not at all the same thing as preventing
> modifying code that might get executed -- one can often modify
> contents *before* executing those contents.

The order of checks would be:
1. open script with O_DENY_WRITE
2. check executability with AT_EXECVE_CHECK
3. read the content and interpret it

The deny-write feature was to guarantee that there is no race condition
between steps 2 and 3.  All these checks are supposed to be done by a
trusted interpreter (which is allowed to be executed).  The
AT_EXECVE_CHECK call enables the caller to know if the kernel (and
associated security policies) allowed the *current* content of the file
to be executed.  Whatever happens before or after that (wrt.
O_DENY_WRITE) should be covered by the security policy.

>
> In any case, IMO it's rather sad that the elimination of ETXTBSY had
> to be reverted -- it's really quite a nasty feature.  But it occurs to
> me that Linux can more or less do what is IMO the actually desired
> thing: snapshot the contents of a file and execute the snapshot.  The
> hack at the end of the email works!  (Well, it works if the chosen
> filesystem supports it.)
>
> $ ./silly_tmp /tmp/test /tmp vim /proc/self/fd/3
>
> emacs is apparently far, far too clever and can't save if you do:
>
> $ ./silly_tmp /tmp/test /tmp emacs /proc/self/fd/3
>
>
> I'm not seriously suggesting that anyone should execute binaries or
> scripts on Linux exactly like this, for a whole bunch of reasons:
>
> - It needs filesystem support (but maybe this isn't so bad)
>
> - It needs write access to a directory on the correct filesystem (a
> showstopper for serious use)
>
> - It is wildly incompatible with write-xor-execute, so this would be a
> case of one step forward, ten steps back.
> > - It would defeat a lot of tools that inspect /proc, which would be > quite annoying to say the least. > > > But maybe a less kludgy version could be used for real. What if there > was a syscall that would take an fd and make a snapshot of the file? Yes, that would be a clean solution. I don't think this is achievable in an efficient way without involving filesystem implementations though. > It would, at least by default, produce a *read-only* snapshot (fully > sealed a la F_SEAL_*), inherit any integrity data that came with the > source (e.g. LSMs could understand it), would not require a writable > directory on the filesystem, and would maybe even come with an extra > seal-like thing that prevents it from being linkat-ed. (I'm not sure > that linkat would actually be a problem, but I'm also not immediately > sure that LSMs would be as comfortable with it if linkat were > allowed.) And there could probably be an extremely efficient > implementation that might even reuse the existing deny-write mechanism > to optimize the common case where the file is never written. > > For that matter, the actual common case would be to execute stuff in > /usr or similar, and those files really ought never to be modified. > So there could be a file attribute or something that means "this file > CANNOT be modified, but it can still be unlinked or replaced as > usual", and snapshotting such a file would be a no-op. Distributions > and container tools could set that attribute. Overlayfs could also > provide an efficient implementation if the file currently comes from > an immutable source. > > Hmm, maybe it's not strictly necessary that it be immutable -- maybe > it's sometimes okay if reads start to fail if the contents change. > Let's call this a "weak snapshot" -- reads of a weak snapshot either > return the original contents or fail. fsverity would give weak > snapshots for at no additional cost. > > > It's worth noting that the common case doesn't actually need an fd. 
> We have mmap(..., MAP_PRIVATE, ...). What we would actually want for
> mmap use cases is mmap(..., MAP_SNAPSHOT, ...), with the semantics
> that the kernel promises that future writes to the source would either
> not be reflected in the mapping or would cause SIGBUS. One might
> reasonably debate what forced-writes would do (I think forced-writes
> should be allowed just like they currently are, since anyone who can
> force-write to process memory is already assumed to be permitted to
> bypass write-xor-execute).
>
>
> ---
>
> /* Written by Claude Sonnet 4 with a surprisingly small amount of help
>    from Andy */
>
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/ioctl.h>
> #include <linux/fs.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <string.h>
>
> int main(int argc, char *argv[]) {
>     if (argc < 4) {
>         fprintf(stderr, "Usage: %s <source_file> <temp_dir> [exec_args...]\n", argv[0]);
>         exit(1);
>     }
>
>     const char *source_file = argv[1];
>     const char *temp_dir = argv[2];
>
>     // Open source file
>     int source_fd = open(source_file, O_RDONLY);
>     if (source_fd == -1) {
>         perror("Failed to open source file");
>         exit(1);
>     }
>
>     // Create temporary file
>     int temp_fd = open(temp_dir, O_TMPFILE | O_RDWR, 0600);
>     if (temp_fd == -1) {
>         perror("Failed to create temporary file");
>         close(source_fd);
>         exit(1);
>     }
>
>     // Clone the file contents using FICLONE
>     if (ioctl(temp_fd, FICLONE, source_fd) == -1) {
>         perror("Failed to clone file");
>         close(source_fd);
>         close(temp_fd);
>         exit(1);
>     }
>
>     // Close source file
>     close(source_fd);
>
>     // Make sure temp file is on fd 3
>     if (temp_fd != 3) {
>         if (dup2(temp_fd, 3) == -1) {
>             perror("Failed to move temp file to fd 3");
>             close(temp_fd);
>             exit(1);
>         }
>         close(temp_fd);
>     }
>
>     // Execute the remaining arguments
>     if (argc >= 3) {
>         execvp(argv[3], &argv[3]);
>         perror("Failed to execute command");
>         exit(1);
>     }
>
>     return 0;
> }

As you said, this doesn't work if temp_dir is not allowed for execution,
and it doesn't allow the kernel to check/track the content of the
script, which is the purpose of AT_EXECVE_CHECK.
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
  2025-08-25  9:31 ` Mickaël Salaün
@ 2025-08-25  9:39   ` Florian Weimer
  2025-08-26 12:35     ` Mickaël Salaün
  2025-08-25 16:43   ` Andy Lutomirski
  2025-08-25 17:57   ` Jeff Xu
  2 siblings, 1 reply; 40+ messages in thread
From: Florian Weimer @ 2025-08-25 9:39 UTC
To: Mickaël Salaün
Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu

* Mickaël Salaün:

> The order of checks would be:
> 1. open script with O_DENY_WRITE
> 2. check executability with AT_EXECVE_CHECK
> 3. read the content and interpret it
>
> The deny-write feature was to guarantee that there is no race condition
> between step 2 and 3. All these checks are supposed to be done by a
> trusted interpreter (which is allowed to be executed). The
> AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> associated security policies) allowed the *current* content of the file
> to be executed. Whatever happens before or after that (wrt.
> O_DENY_WRITE) should be covered by the security policy.

Why isn't it an improper system configuration if the script file is
writable?

In the past, the argument was that making a file (writable and)
executable was an auditable event, and that provided enough coverage for
those people who are interested in this.

Thanks,
Florian
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
  2025-08-25  9:39 ` Florian Weimer
@ 2025-08-26 12:35   ` Mickaël Salaün
  0 siblings, 0 replies; 40+ messages in thread
From: Mickaël Salaün @ 2025-08-26 12:35 UTC
To: Florian Weimer
Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu

On Mon, Aug 25, 2025 at 11:39:11AM +0200, Florian Weimer wrote:
> * Mickaël Salaün:
>
> > The order of checks would be:
> > 1. open script with O_DENY_WRITE
> > 2. check executability with AT_EXECVE_CHECK
> > 3. read the content and interpret it
> >
> > The deny-write feature was to guarantee that there is no race condition
> > between step 2 and 3. All these checks are supposed to be done by a
> > trusted interpreter (which is allowed to be executed). The
> > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > associated security policies) allowed the *current* content of the file
> > to be executed. Whatever happens before or after that (wrt.
> > O_DENY_WRITE) should be covered by the security policy.
>
> Why isn't it an improper system configuration if the script file is
> writable?

It is, except if the system only wants to track executions (e.g. record
checksum of scripts) without restricting file modifications.

>
> In the past, the argument was that making a file (writable and)
> executable was an auditable event, and that provided enough coverage for
> those people who are interested in this.

Yes, but in this case there is a race condition that this patch tried to
fix.
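[Editor's sketch] The three-step order quoted in this exchange could look roughly like the following from an interpreter's point of view. This is a minimal sketch under stated assumptions, not code from the patch series: the helper name open_script_checked() is hypothetical, O_DENY_WRITE is only used if the build headers happen to define it (it comes from this RFC and is not in mainline headers), and the fallback value for AT_EXECVE_CHECK is what the Linux 6.14 uapi headers are believed to use, so prefer the system header when it is available. On kernels that do not know AT_EXECVE_CHECK, execveat(2) is expected to reject the unknown flag with EINVAL rather than execute anything.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EXECVE_CHECK
/* Assumption: value believed to match the Linux 6.14 uapi headers. */
#define AT_EXECVE_CHECK 0x10000
#endif

/*
 * Hypothetical helper: returns a read-only fd for the script if the
 * kernel would allow executing it, or -1 with errno set.
 */
static int open_script_checked(const char *path)
{
    assert(path != NULL);

    int flags = O_RDONLY | O_CLOEXEC;
#ifdef O_DENY_WRITE
    /* Step 1: only compiled in when the RFC's headers are installed. */
    flags |= O_DENY_WRITE;
#endif
    int fd = open(path, flags);
    if (fd < 0)
        return -1;

    /* Step 2: ask the kernel whether executing this file is allowed. */
    static char *const empty[] = { NULL };
    if (syscall(SYS_execveat, fd, "", empty, empty,
                AT_EMPTY_PATH | AT_EXECVE_CHECK) != 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* Step 3 is left to the caller: read from fd and interpret. */
    return fd;
}
```

Without O_DENY_WRITE (or an equivalent snapshot), the window between step 2 and step 3 is exactly the race discussed in this thread.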
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-25 9:31 ` Mickaël Salaün 2025-08-25 9:39 ` Florian Weimer @ 2025-08-25 16:43 ` Andy Lutomirski 2025-08-25 18:10 ` Jeff Xu 2025-08-25 17:57 ` Jeff Xu 2 siblings, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2025-08-25 16:43 UTC (permalink / raw) To: Mickaël Salaün Cc: Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > > > passed file descriptors). This changes the state of the opened file by > > > > > making it read-only until it is closed. The main use case is for script > > > > > interpreters to get the guarantee that script' content cannot be altered > > > > > while being read and interpreted. This is useful for generic distros > > > > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > > > property on files with deny_write_access(). 
This new O_DENY_WRITE make > > > > > > > > The kernel actually tried to get rid of this behavior on execve() in > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > > > because it broke userspace assumptions. > > > > > > Oh, good to know. > > > > > > > > > > > > it widely available. This is similar to what other OSs may provide > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > > > removed for security reasons; as > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > > > > > | MAP_DENYWRITE > > > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > > > | signaled that attempts to write to the underlying file > > > > | should fail with ETXTBSY. But this was a source of denial- > > > > | of-service attacks.)" > > > > > > > > It seems to me that the same issue applies to your patch - it would > > > > allow unprivileged processes to essentially lock files such that other > > > > processes can't write to them anymore. This might allow unprivileged > > > > users to prevent root from updating config files or stuff like that if > > > > they're updated in-place. > > > > > > Yes, I agree, but since it is the case for executed files I though it > > > was worth starting a discussion on this topic. This new flag could be > > > restricted to executable files, but we should avoid system-wide locks > > > like this. I'm not sure how Windows handle these issues though. > > > > > > Anyway, we should rely on the access control policy to control write and > > > execute access in a consistent way (e.g. write-xor-execute). Thanks for > > > the references and the background! > > > > I'm confused. 
> > I understand that there are many contexts in which one
> > would want to prevent execution of unapproved content, which might
> > include preventing a given process from modifying some code and then
> > executing it.
> >
> > I don't understand what these deny-write features have to do with it.
> > These features merely prevent someone from modifying code *that is
> > currently in use*, which is not at all the same thing as preventing
> > modifying code that might get executed -- one can often modify
> > contents *before* executing those contents.
>
> The order of checks would be:
> 1. open script with O_DENY_WRITE
> 2. check executability with AT_EXECVE_CHECK
> 3. read the content and interpret it

Hmm. Common LSM configurations should be able to handle this without
deny write, I think. If you don't want a program to be able to make
its own scripts, then don't allow AT_EXECVE_CHECK to succeed on a
script that the program can write.

Keep in mind that trying to lock this down too hard is pointless for
users who are allowed to ptrace-write to their own processes. Or
for users who can do JIT, or for users who can run a REPL, etc.

> > But maybe a less kludgy version could be used for real. What if there
> > was a syscall that would take an fd and make a snapshot of the file?
>
> Yes, that would be a clean solution. I don't think this is achievable
> in an efficient way without involving filesystem implementations though.

It wouldn't be so terrible to involve filesystem implementations. Most
of the filesystems that people who care at all about security run their
binaries from either support reflinks or are immutable. Things like OCI
implementations may already meet those criteria, and it would be pretty
nifty if the kernel was actually aware that OCI layers are intended to
be immutable. We could even have an API to generically query the hash
of an immutable file and to ask the kernel if it's validating the hash
on reads.
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-25 16:43 ` Andy Lutomirski @ 2025-08-25 18:10 ` Jeff Xu 0 siblings, 0 replies; 40+ messages in thread From: Jeff Xu @ 2025-08-25 18:10 UTC (permalink / raw) To: Andy Lutomirski Cc: Mickaël Salaün, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu On Mon, Aug 25, 2025 at 9:43 AM Andy Lutomirski <luto@amacapital.net> wrote: > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > > > > passed file descriptors). This changes the state of the opened file by > > > > > > making it read-only until it is closed. The main use case is for script > > > > > > interpreters to get the guarantee that script' content cannot be altered > > > > > > while being read and interpreted. This is useful for generic distros > > > > > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > > > > property on files with deny_write_access(). 
This new O_DENY_WRITE make > > > > > > > > > > The kernel actually tried to get rid of this behavior on execve() in > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > > > > because it broke userspace assumptions. > > > > > > > > Oh, good to know. > > > > > > > > > > > > > > > it widely available. This is similar to what other OSs may provide > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > > > > removed for security reasons; as > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > > > > > > > | MAP_DENYWRITE > > > > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > > > > | signaled that attempts to write to the underlying file > > > > > | should fail with ETXTBSY. But this was a source of denial- > > > > > | of-service attacks.)" > > > > > > > > > > It seems to me that the same issue applies to your patch - it would > > > > > allow unprivileged processes to essentially lock files such that other > > > > > processes can't write to them anymore. This might allow unprivileged > > > > > users to prevent root from updating config files or stuff like that if > > > > > they're updated in-place. > > > > > > > > Yes, I agree, but since it is the case for executed files I though it > > > > was worth starting a discussion on this topic. This new flag could be > > > > restricted to executable files, but we should avoid system-wide locks > > > > like this. I'm not sure how Windows handle these issues though. > > > > > > > > Anyway, we should rely on the access control policy to control write and > > > > execute access in a consistent way (e.g. write-xor-execute). Thanks for > > > > the references and the background! > > > > > > I'm confused. 
> > > I understand that there are many contexts in which one
> > > would want to prevent execution of unapproved content, which might
> > > include preventing a given process from modifying some code and then
> > > executing it.
> > >
> > > I don't understand what these deny-write features have to do with it.
> > > These features merely prevent someone from modifying code *that is
> > > currently in use*, which is not at all the same thing as preventing
> > > modifying code that might get executed -- one can often modify
> > > contents *before* executing those contents.
> >
> > The order of checks would be:
> > 1. open script with O_DENY_WRITE
> > 2. check executability with AT_EXECVE_CHECK
> > 3. read the content and interpret it
>
> Hmm. Common LSM configurations should be able to handle this without
> deny write, I think. If you don't want a program to be able to make
> their own scripts, then don't allow AT_EXECVE_CHECK to succeed on a
> script that the program can write.
>
Yes, common LSM configurations could handle this. However, for
historical and application backward-compatibility reasons, it is
sometimes impossible to enforce that kind of policy in practice, so a
mechanism such as AT_EXECVE_CHECK is really useful as an alternative.

> Keep in mind that trying to lock this down too hard is pointless for
> users who are allowed to ptrace-write to their own processes. Or
> for users who can do JIT, or for users who can run a REPL, etc.
>
The ptrace-write and /proc/pid/mem write paths are on my radar, at
least for ChromeOS and Android. AT_EXECVE_CHECK is orthogonal to those
IMO; I hope all of those paths will eventually be hardened.

Thanks and regards,
-Jeff
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-25 9:31 ` Mickaël Salaün 2025-08-25 9:39 ` Florian Weimer 2025-08-25 16:43 ` Andy Lutomirski @ 2025-08-25 17:57 ` Jeff Xu 2025-08-26 12:39 ` Mickaël Salaün 2 siblings, 1 reply; 40+ messages in thread From: Jeff Xu @ 2025-08-25 17:57 UTC (permalink / raw) To: Mickaël Salaün Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu Hi Mickaël On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > > > passed file descriptors). This changes the state of the opened file by > > > > > making it read-only until it is closed. The main use case is for script > > > > > interpreters to get the guarantee that script' content cannot be altered > > > > > while being read and interpreted. This is useful for generic distros > > > > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > > > property on files with deny_write_access(). 
This new O_DENY_WRITE make > > > > > > > > The kernel actually tried to get rid of this behavior on execve() in > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > > > because it broke userspace assumptions. > > > > > > Oh, good to know. > > > > > > > > > > > > it widely available. This is similar to what other OSs may provide > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > > > removed for security reasons; as > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > > > > > | MAP_DENYWRITE > > > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > > > | signaled that attempts to write to the underlying file > > > > | should fail with ETXTBSY. But this was a source of denial- > > > > | of-service attacks.)" > > > > > > > > It seems to me that the same issue applies to your patch - it would > > > > allow unprivileged processes to essentially lock files such that other > > > > processes can't write to them anymore. This might allow unprivileged > > > > users to prevent root from updating config files or stuff like that if > > > > they're updated in-place. > > > > > > Yes, I agree, but since it is the case for executed files I though it > > > was worth starting a discussion on this topic. This new flag could be > > > restricted to executable files, but we should avoid system-wide locks > > > like this. I'm not sure how Windows handle these issues though. > > > > > > Anyway, we should rely on the access control policy to control write and > > > execute access in a consistent way (e.g. write-xor-execute). Thanks for > > > the references and the background! > > > > I'm confused. 
I understand that there are many contexts in which one > > would want to prevent execution of unapproved content, which might > > include preventing a given process from modifying some code and then > > executing it. > > > > I don't understand what these deny-write features have to do with it. > > These features merely prevent someone from modifying code *that is > > currently in use*, which is not at all the same thing as preventing > > modifying code that might get executed -- one can often modify > > contents *before* executing those contents. > > The order of checks would be: > 1. open script with O_DENY_WRITE > 2. check executability with AT_EXECVE_CHECK > 3. read the content and interpret it > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving. AT_EXECVE_CHECK is not just for scripting languages. It could also work with bytecodes like Java, for example. If we let the Java runtime call AT_EXECVE_CHECK before loading the bytecode, the LSM could develop a policy based on that. > The deny-write feature was to guarantee that there is no race condition > between step 2 and 3. All these checks are supposed to be done by a > trusted interpreter (which is allowed to be executed). The > AT_EXECVE_CHECK call enables the caller to know if the kernel (and > associated security policies) allowed the *current* content of the file > to be executed. Whatever happen before or after that (wrt. > O_DENY_WRITE) should be covered by the security policy. > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK. Enforcing non-write for the path that stores scripts or bytecodes can be challenging due to historical or backward compatibility reasons. Since AT_EXECVE_CHECK provides a mechanism to check the file right before it is used, we can assume it will detect any "problem" that happened before that, (e.g. the file was overwritten). 
However, that also imposes two additional requirements:
1> The file doesn't change while AT_EXECVE_CHECK does the check.
2> The file content kept by the process remains unchanged after passing
the AT_EXECVE_CHECK.

I imagine the complete solution for AT_EXECVE_CHECK would include
those two guarantees.

Thanks
-Jeff
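[Editor's sketch] As a rough illustration of the second requirement, an interpreter without any kernel support could at least try to detect that the file changed while it was being read into a private buffer, by comparing fstat(2) results taken before and after the read. This is a best-effort heuristic under stated assumptions, not a guarantee: the helper names same_version() and read_if_stable() are hypothetical, and a writer that leaves size and timestamps unchanged (for example within the filesystem's timestamp granularity) would go unnoticed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: true if both stat results describe the same
 * inode with the same size, mtime and ctime. */
static bool same_version(const struct stat *a, const struct stat *b)
{
    return a->st_dev == b->st_dev && a->st_ino == b->st_ino &&
           a->st_size == b->st_size &&
           a->st_mtim.tv_sec == b->st_mtim.tv_sec &&
           a->st_mtim.tv_nsec == b->st_mtim.tv_nsec &&
           a->st_ctim.tv_sec == b->st_ctim.tv_sec &&
           a->st_ctim.tv_nsec == b->st_ctim.tv_nsec;
}

/* Hypothetical helper: reads up to cap bytes of path into buf.
 * Returns the byte count, or -1 with errno set (EAGAIN if the file
 * appears to have changed during the read). */
static ssize_t read_if_stable(const char *path, char *buf, size_t cap)
{
    assert(path != NULL && buf != NULL);

    struct stat before, after;
    size_t off = 0;
    int saved;

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    if (fstat(fd, &before) != 0)
        goto fail;

    while (off < cap) {
        ssize_t n = pread(fd, buf + off, cap - off, (off_t)off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        if (n == 0)
            break; /* end of file */
        off += (size_t)n;
    }

    if (fstat(fd, &after) != 0)
        goto fail;
    if (!same_version(&before, &after)) {
        errno = EAGAIN; /* content may be inconsistent, caller can retry */
        goto fail;
    }
    close(fd);
    return (ssize_t)off;

fail:
    saved = errno;
    close(fd);
    errno = saved;
    return -1;
}
```

Once the content sits in the interpreter's own buffer, later writes to the file no longer affect it; the check only narrows the window during the read itself and does nothing about the gap between AT_EXECVE_CHECK and the read.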
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-25 17:57 ` Jeff Xu @ 2025-08-26 12:39 ` Mickaël Salaün 2025-08-26 20:29 ` Jeff Xu 0 siblings, 1 reply; 40+ messages in thread From: Mickaël Salaün @ 2025-08-26 12:39 UTC (permalink / raw) To: Jeff Xu Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote: > Hi Mickaël > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > > > > passed file descriptors). This changes the state of the opened file by > > > > > > making it read-only until it is closed. The main use case is for script > > > > > > interpreters to get the guarantee that script' content cannot be altered > > > > > > while being read and interpreted. This is useful for generic distros > > > > > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > > > > property on files with deny_write_access(). 
This new O_DENY_WRITE make > > > > > > > > > > The kernel actually tried to get rid of this behavior on execve() in > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > > > > because it broke userspace assumptions. > > > > > > > > Oh, good to know. > > > > > > > > > > > > > > > it widely available. This is similar to what other OSs may provide > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > > > > removed for security reasons; as > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > > > > > > > | MAP_DENYWRITE > > > > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > > > > | signaled that attempts to write to the underlying file > > > > > | should fail with ETXTBSY. But this was a source of denial- > > > > > | of-service attacks.)" > > > > > > > > > > It seems to me that the same issue applies to your patch - it would > > > > > allow unprivileged processes to essentially lock files such that other > > > > > processes can't write to them anymore. This might allow unprivileged > > > > > users to prevent root from updating config files or stuff like that if > > > > > they're updated in-place. > > > > > > > > Yes, I agree, but since it is the case for executed files I though it > > > > was worth starting a discussion on this topic. This new flag could be > > > > restricted to executable files, but we should avoid system-wide locks > > > > like this. I'm not sure how Windows handle these issues though. > > > > > > > > Anyway, we should rely on the access control policy to control write and > > > > execute access in a consistent way (e.g. write-xor-execute). Thanks for > > > > the references and the background! > > > > > > I'm confused. 
I understand that there are many contexts in which one > > > would want to prevent execution of unapproved content, which might > > > include preventing a given process from modifying some code and then > > > executing it. > > > > > > I don't understand what these deny-write features have to do with it. > > > These features merely prevent someone from modifying code *that is > > > currently in use*, which is not at all the same thing as preventing > > > modifying code that might get executed -- one can often modify > > > contents *before* executing those contents. > > > > The order of checks would be: > > 1. open script with O_DENY_WRITE > > 2. check executability with AT_EXECVE_CHECK > > 3. read the content and interpret it > > > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving. > > AT_EXECVE_CHECK is not just for scripting languages. It could also > work with bytecodes like Java, for example. If we let the Java runtime > call AT_EXECVE_CHECK before loading the bytecode, the LSM could > develop a policy based on that. Sure, I'm using "script" to make it simple, but this applies to other use cases. > > > The deny-write feature was to guarantee that there is no race condition > > between step 2 and 3. All these checks are supposed to be done by a > > trusted interpreter (which is allowed to be executed). The > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and > > associated security policies) allowed the *current* content of the file > > to be executed. Whatever happen before or after that (wrt. > > O_DENY_WRITE) should be covered by the security policy. > > > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK. > > Enforcing non-write for the path that stores scripts or bytecodes can > be challenging due to historical or backward compatibility reasons. 
> Since AT_EXECVE_CHECK provides a mechanism to check the file right
> before it is used, we can assume it will detect any "problem" that
> happened before that, (e.g. the file was overwritten). However, that
> also imposes two additional requirements:
> 1> the file doesn't change while AT_EXECVE_CHECK does the check.

This is already the case, so any kind of LSM check is fine.

> 2>The file content kept by the process remains unchanged after passing
> the AT_EXECVE_CHECK.

The goal of this patch was to avoid such a race condition in the case
where executable files can be updated. In most cases it should not be a
security issue (because processes allowed to write to executable files
should be trusted), but it could still lead to bugs (because of
inconsistent, half-updated file content).

>
> I imagine, the complete solution for AT_EXECVE_CHECK would include
> those two guarantees.

There is no issue directly with AT_EXECVE_CHECK, but depending on the
system configuration, interpreters could read a file that was updated
after the AT_EXECVE_CHECK. This should not be an issue for secure
systems where executable files are only updated with trusted code,
except if the update mechanism is not atomic. The main use case for
this patch series was for generic distros that may not have the
write-xor-execute guarantees, e.g., for developers. The only viable
solution I see would be to have some kind of snapshot of files,
requested by interpreters, but I'm not sure if it is worth it.
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-26 12:39 ` Mickaël Salaün @ 2025-08-26 20:29 ` Jeff Xu 2025-08-27 8:19 ` Mickaël Salaün 0 siblings, 1 reply; 40+ messages in thread From: Jeff Xu @ 2025-08-26 20:29 UTC (permalink / raw) To: Mickaël Salaün Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module Hi Mickaël On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote: > > On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote: > > Hi Mickaël > > > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > > > > > passed file descriptors). This changes the state of the opened file by > > > > > > > making it read-only until it is closed. The main use case is for script > > > > > > > interpreters to get the guarantee that script' content cannot be altered > > > > > > > while being read and interpreted. This is useful for generic distros > > > > > > > that may not have a write-xor-execute policy. 
See commit a5874fde3c08 > > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > > > > > property on files with deny_write_access(). This new O_DENY_WRITE make > > > > > > > > > > > > The kernel actually tried to get rid of this behavior on execve() in > > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > > > > > because it broke userspace assumptions. > > > > > > > > > > Oh, good to know. > > > > > > > > > > > > > > > > > > it widely available. This is similar to what other OSs may provide > > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > > > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > > > > > removed for security reasons; as > > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > > > > > > > > > | MAP_DENYWRITE > > > > > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > > > > > | signaled that attempts to write to the underlying file > > > > > > | should fail with ETXTBSY. But this was a source of denial- > > > > > > | of-service attacks.)" > > > > > > > > > > > > It seems to me that the same issue applies to your patch - it would > > > > > > allow unprivileged processes to essentially lock files such that other > > > > > > processes can't write to them anymore. This might allow unprivileged > > > > > > users to prevent root from updating config files or stuff like that if > > > > > > they're updated in-place. > > > > > > > > > > Yes, I agree, but since it is the case for executed files I though it > > > > > was worth starting a discussion on this topic. This new flag could be > > > > > restricted to executable files, but we should avoid system-wide locks > > > > > like this. I'm not sure how Windows handle these issues though. 
> > > > > > > > > > Anyway, we should rely on the access control policy to control write and > > > > > execute access in a consistent way (e.g. write-xor-execute). Thanks for > > > > > the references and the background! > > > > > > > > I'm confused. I understand that there are many contexts in which one > > > > would want to prevent execution of unapproved content, which might > > > > include preventing a given process from modifying some code and then > > > > executing it. > > > > > > > > I don't understand what these deny-write features have to do with it. > > > > These features merely prevent someone from modifying code *that is > > > > currently in use*, which is not at all the same thing as preventing > > > > modifying code that might get executed -- one can often modify > > > > contents *before* executing those contents. > > > > > > The order of checks would be: > > > 1. open script with O_DENY_WRITE > > > 2. check executability with AT_EXECVE_CHECK > > > 3. read the content and interpret it > > > > > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving. > > > > AT_EXECVE_CHECK is not just for scripting languages. It could also > > work with bytecodes like Java, for example. If we let the Java runtime > > call AT_EXECVE_CHECK before loading the bytecode, the LSM could > > develop a policy based on that. > > Sure, I'm using "script" to make it simple, but this applies to other > use cases. > That makes sense. > > > > > The deny-write feature was to guarantee that there is no race condition > > > between step 2 and 3. All these checks are supposed to be done by a > > > trusted interpreter (which is allowed to be executed). The > > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and > > > associated security policies) allowed the *current* content of the file > > > to be executed. Whatever happen before or after that (wrt. > > > O_DENY_WRITE) should be covered by the security policy. 
> > > > > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK. > > > > Enforcing non-write for the path that stores scripts or bytecodes can > > be challenging due to historical or backward compatibility reasons. > > Since AT_EXECVE_CHECK provides a mechanism to check the file right > > before it is used, we can assume it will detect any "problem" that > > happened before that, (e.g. the file was overwritten). However, that > > also imposes two additional requirements: > > 1> the file doesn't change while AT_EXECVE_CHECK does the check. > > This is already the case, so any kind of LSM checks are good. > May I ask how this is done? Does some code in do_open_execat() do this? Apologies if this is a basic question. > > 2>The file content kept by the process remains unchanged after passing > > the AT_EXECVE_CHECK. > > The goal of this patch was to avoid such race condition in the case > where executable files can be updated. But in most cases it should not > be a security issue (because processes allowed to write to executable > files should be trusted), but this could still lead to bugs (because of > inconsistent file content, half-updated). > There is also a time gap between: a> the time of AT_EXECVE_CHECK b> the time that the app opens the file for execution. Right? That is another potential attack path (though this is not the case I mentioned previously). For the case I mentioned previously, I have to think more about whether the race condition is a bug or a security issue. IIUC, two solutions have been discussed so far: 1> the process could write to the fs to update the script. However, for execution, the process still uses the copy that passed the AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski) or 2> the process blocks writes while opening the file read-only and executing the script. (this seems to be the approach of this patch). I wonder if there are other ideas. Thanks and regards, -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-26 20:29 ` Jeff Xu @ 2025-08-27 8:19 ` Mickaël Salaün 2025-08-28 20:17 ` Jeff Xu 0 siblings, 1 reply; 40+ messages in thread From: Mickaël Salaün @ 2025-08-27 8:19 UTC (permalink / raw) To: Jeff Xu Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Tue, Aug 26, 2025 at 01:29:55PM -0700, Jeff Xu wrote: > Hi Mickaël > > On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote: > > > Hi Mickaël > > > > > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > > > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > > > > > > passed file descriptors). This changes the state of the opened file by > > > > > > > > making it read-only until it is closed. The main use case is for script > > > > > > > > interpreters to get the guarantee that script' content cannot be altered > > > > > > > > while being read and interpreted. This is useful for generic distros > > > > > > > > that may not have a write-xor-execute policy. 
See commit a5874fde3c08 > > > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > > > > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > > > > > > property on files with deny_write_access(). This new O_DENY_WRITE make > > > > > > > > > > > > > > The kernel actually tried to get rid of this behavior on execve() in > > > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > > > > > > because it broke userspace assumptions. > > > > > > > > > > > > Oh, good to know. > > > > > > > > > > > > > > > > > > > > > it widely available. This is similar to what other OSs may provide > > > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > > > > > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > > > > > > removed for security reasons; as > > > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > > > > > > > > > > > | MAP_DENYWRITE > > > > > > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > > > > > > | signaled that attempts to write to the underlying file > > > > > > > | should fail with ETXTBSY. But this was a source of denial- > > > > > > > | of-service attacks.)" > > > > > > > > > > > > > > It seems to me that the same issue applies to your patch - it would > > > > > > > allow unprivileged processes to essentially lock files such that other > > > > > > > processes can't write to them anymore. This might allow unprivileged > > > > > > > users to prevent root from updating config files or stuff like that if > > > > > > > they're updated in-place. > > > > > > > > > > > > Yes, I agree, but since it is the case for executed files I though it > > > > > > was worth starting a discussion on this topic. 
This new flag could be > > > > > > restricted to executable files, but we should avoid system-wide locks > > > > > > like this. I'm not sure how Windows handle these issues though. > > > > > > > > > > > > Anyway, we should rely on the access control policy to control write and > > > > > > execute access in a consistent way (e.g. write-xor-execute). Thanks for > > > > > > the references and the background! > > > > > > > > > > I'm confused. I understand that there are many contexts in which one > > > > > would want to prevent execution of unapproved content, which might > > > > > include preventing a given process from modifying some code and then > > > > > executing it. > > > > > > > > > > I don't understand what these deny-write features have to do with it. > > > > > These features merely prevent someone from modifying code *that is > > > > > currently in use*, which is not at all the same thing as preventing > > > > > modifying code that might get executed -- one can often modify > > > > > contents *before* executing those contents. > > > > > > > > The order of checks would be: > > > > 1. open script with O_DENY_WRITE > > > > 2. check executability with AT_EXECVE_CHECK > > > > 3. read the content and interpret it > > > > > > > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving. > > > > > > AT_EXECVE_CHECK is not just for scripting languages. It could also > > > work with bytecodes like Java, for example. If we let the Java runtime > > > call AT_EXECVE_CHECK before loading the bytecode, the LSM could > > > develop a policy based on that. > > > > Sure, I'm using "script" to make it simple, but this applies to other > > use cases. > > > That makes sense. > > > > > > > > The deny-write feature was to guarantee that there is no race condition > > > > between step 2 and 3. All these checks are supposed to be done by a > > > > trusted interpreter (which is allowed to be executed). 
The > > > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and > > > > associated security policies) allowed the *current* content of the file > > > > to be executed. Whatever happen before or after that (wrt. > > > > O_DENY_WRITE) should be covered by the security policy. > > > > > > > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK. > > > > > > Enforcing non-write for the path that stores scripts or bytecodes can > > > be challenging due to historical or backward compatibility reasons. > > > Since AT_EXECVE_CHECK provides a mechanism to check the file right > > > before it is used, we can assume it will detect any "problem" that > > > happened before that, (e.g. the file was overwritten). However, that > > > also imposes two additional requirements: > > > 1> the file doesn't change while AT_EXECVE_CHECK does the check. > > > > This is already the case, so any kind of LSM checks are good. > > > May I ask how this is done? some code in do_open_execat() does this ? > Apologies if this is a basic question. do_open_execat() calls exe_file_deny_write_access() > > > > 2>The file content kept by the process remains unchanged after passing > > > the AT_EXECVE_CHECK. > > > > The goal of this patch was to avoid such race condition in the case > > where executable files can be updated. But in most cases it should not > > be a security issue (because processes allowed to write to executable > > files should be trusted), but this could still lead to bugs (because of > > inconsistent file content, half-updated). > > > There is also a time gap between: > a> the time of AT_EXECVE_CHECK > b> the time that the app opens the file for execution. > right ? another potential attack path (though this is not the case I > mentioned previously). As explained in the documentation, to avoid this specific race condition, interpreters should open the script once, check the FD with AT_EXECVE_CHECK, and then read the content with the same FD. 
> > For the case I mentioned previously, I have to think more if the race > condition is a bug or security issue. > IIUC, two solutions are discussed so far: > 1> the process could write to fs to update the script. However, for > execution, the process still uses the copy that passed the > AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski) Yes, the snapshot solution would be the best, but I guess it would rely on filesystems to support this feature. > or 2> the process blocks the write while opening the file as read only > and executing the script. (this seems to be the approach of this > patch). Yes, and this is not something we want anymore. > > I wonder if there are other ideas. I don't see other efficient ways to give the same guarantees. ^ permalink raw reply [flat|nested] 40+ messages in thread
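The open-once pattern described above (open the script, check the FD with AT_EXECVE_CHECK, read from the same FD) can be sketched roughly as follows. This is only an illustration, not code from the patch series or from check_exec.rst: load_checked_script() is a hypothetical helper name, and the fallback value of AT_EXECVE_CHECK is an assumption taken from the Linux 6.14 UAPI headers. A real interpreter would also consult the relevant securebits before deciding whether a failed check is fatal.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000 /* assumed value, from the Linux 6.14 UAPI headers */
#endif

/*
 * Hypothetical helper: open the script once, ask the kernel whether this
 * very file may be executed, then read from the same FD.
 * Returns the number of bytes read, or a negative errno value.
 */
ssize_t load_checked_script(const char *path, char *buf, size_t len)
{
	char *const argv[] = { (char *)path, NULL };
	char *const envp[] = { NULL };
	ssize_t n;
	int fd, err;

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* With AT_EXECVE_CHECK nothing is executed; only the checks run. */
	if (syscall(SYS_execveat, fd, "", argv, envp,
		    AT_EMPTY_PATH | AT_EXECVE_CHECK) != 0) {
		err = errno; /* EINVAL on kernels without AT_EXECVE_CHECK */
		close(fd);
		return -err;
	}

	/* Read through the FD that was checked, never by reopening the path. */
	do {
		n = read(fd, buf, len);
	} while (n < 0 && errno == EINTR);
	err = errno;
	close(fd);
	return n < 0 ? -err : n;
}
```

Using a single FD closes the gap between the check and the open, but it does not stop another process from rewriting the file between the check and the read; that remaining window is what the rest of this thread is about.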
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-27 8:19 ` Mickaël Salaün @ 2025-08-28 20:17 ` Jeff Xu 0 siblings, 0 replies; 40+ messages in thread From: Jeff Xu @ 2025-08-28 20:17 UTC (permalink / raw) To: Mickaël Salaün Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module Hi Mickaël On Wed, Aug 27, 2025 at 1:19 AM Mickaël Salaün <mic@digikod.net> wrote: > > On Tue, Aug 26, 2025 at 01:29:55PM -0700, Jeff Xu wrote: > > Hi Mickaël > > > > On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote: > > > > Hi Mickaël > > > > > > > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > > > > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > > > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > > > > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > > > > > > > > passed file descriptors). This changes the state of the opened file by > > > > > > > > > making it read-only until it is closed. The main use case is for script > > > > > > > > > interpreters to get the guarantee that script' content cannot be altered > > > > > > > > > while being read and interpreted. 
This is useful for generic distros > > > > > > > > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > > > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > > > > > > > > > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > > > > > > > > property on files with deny_write_access(). This new O_DENY_WRITE make > > > > > > > > > > > > > > > > The kernel actually tried to get rid of this behavior on execve() in > > > > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > > > > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > > > > > > > > because it broke userspace assumptions. > > > > > > > > > > > > > > Oh, good to know. > > > > > > > > > > > > > > > > > > > > > > > > it widely available. This is similar to what other OSs may provide > > > > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > > > > > > > > > > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > > > > > > > > removed for security reasons; as > > > > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > > > > > > > > > > > > > > > | MAP_DENYWRITE > > > > > > > > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > > > > > > > > | signaled that attempts to write to the underlying file > > > > > > > > | should fail with ETXTBSY. But this was a source of denial- > > > > > > > > | of-service attacks.)" > > > > > > > > > > > > > > > > It seems to me that the same issue applies to your patch - it would > > > > > > > > allow unprivileged processes to essentially lock files such that other > > > > > > > > processes can't write to them anymore. This might allow unprivileged > > > > > > > > users to prevent root from updating config files or stuff like that if > > > > > > > > they're updated in-place. 
> > > > > > > > > > > > > > Yes, I agree, but since it is the case for executed files I though it > > > > > > > was worth starting a discussion on this topic. This new flag could be > > > > > > > restricted to executable files, but we should avoid system-wide locks > > > > > > > like this. I'm not sure how Windows handle these issues though. > > > > > > > > > > > > > > Anyway, we should rely on the access control policy to control write and > > > > > > > execute access in a consistent way (e.g. write-xor-execute). Thanks for > > > > > > > the references and the background! > > > > > > > > > > > > I'm confused. I understand that there are many contexts in which one > > > > > > would want to prevent execution of unapproved content, which might > > > > > > include preventing a given process from modifying some code and then > > > > > > executing it. > > > > > > > > > > > > I don't understand what these deny-write features have to do with it. > > > > > > These features merely prevent someone from modifying code *that is > > > > > > currently in use*, which is not at all the same thing as preventing > > > > > > modifying code that might get executed -- one can often modify > > > > > > contents *before* executing those contents. > > > > > > > > > > The order of checks would be: > > > > > 1. open script with O_DENY_WRITE > > > > > 2. check executability with AT_EXECVE_CHECK > > > > > 3. read the content and interpret it > > > > > > > > > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving. > > > > > > > > AT_EXECVE_CHECK is not just for scripting languages. It could also > > > > work with bytecodes like Java, for example. If we let the Java runtime > > > > call AT_EXECVE_CHECK before loading the bytecode, the LSM could > > > > develop a policy based on that. > > > > > > Sure, I'm using "script" to make it simple, but this applies to other > > > use cases. > > > > > That makes sense. 
> > > > > > > > > > > The deny-write feature was to guarantee that there is no race condition > > > > > between step 2 and 3. All these checks are supposed to be done by a > > > > > trusted interpreter (which is allowed to be executed). The > > > > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and > > > > > associated security policies) allowed the *current* content of the file > > > > > to be executed. Whatever happen before or after that (wrt. > > > > > O_DENY_WRITE) should be covered by the security policy. > > > > > > > > > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK. > > > > > > > > Enforcing non-write for the path that stores scripts or bytecodes can > > > > be challenging due to historical or backward compatibility reasons. > > > > Since AT_EXECVE_CHECK provides a mechanism to check the file right > > > > before it is used, we can assume it will detect any "problem" that > > > > happened before that, (e.g. the file was overwritten). However, that > > > > also imposes two additional requirements: > > > > 1> the file doesn't change while AT_EXECVE_CHECK does the check. > > > > > > This is already the case, so any kind of LSM checks are good. > > > > > May I ask how this is done? some code in do_open_execat() does this ? > > Apologies if this is a basic question. > > do_open_execat() calls exe_file_deny_write_access() > Thanks for the pointer. With that, I have now read the full history of the discussion regarding this :-) > > > > > > 2>The file content kept by the process remains unchanged after passing > > > > the AT_EXECVE_CHECK. > > > > > > The goal of this patch was to avoid such race condition in the case > > > where executable files can be updated. But in most cases it should not > > > be a security issue (because processes allowed to write to executable > > > files should be trusted), but this could still lead to bugs (because of > > > inconsistent file content, half-updated).
> > > > > There is also a time gap between: > > a> the time of AT_EXECVE_CHECK > > b> the time that the app opens the file for execution. > > right ? another potential attack path (though this is not the case I > > mentioned previously). > > As explained in the documentation, to avoid this specific race > condition, interpreters should open the script once, check the FD with > AT_EXECVE_CHECK, and then read the content with the same FD. > Yes, now I see that in the description of this patch; sorry that I missed it previously. > > > > For the case I mentioned previously, I have to think more if the race > > condition is a bug or security issue. > > IIUC, two solutions are discussed so far: > > 1> the process could write to fs to update the script. However, for > > execution, the process still uses the copy that passed the > > AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski) > > Yes, the snapshot solution would be the best, but I guess it would rely > on filesystems to support this feature. > A snapshot seems to be the reasonable direction to go. Is this related to the VMA, e.g. preserving the in-memory copy of the file when the file on the fs is updated? According to man mmap: MAP_PRIVATE Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region. So MAP_PRIVATE covers one direction: the process updates the VMA and the change is not carried to the file. What we want is the reverse direction (the part the man page leaves unspecified): the file is updated on the fs and the change is not carried to the VMA of this process. > > or 2> the process blocks the write while opening the file as read only > > and executing the script. (this seems to be the approach of this > > patch). > > Yes, and this is not something we want anymore. > Right. Thank you for clarifying this.
> > > > I wonder if there are other ideas. > > I don't see other efficient ways to give the same guarantees. right, me neither. Thanks and regards, -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-22 19:45 ` Jann Horn 2025-08-24 11:03 ` Mickaël Salaün @ 2025-08-27 10:18 ` Aleksa Sarai 1 sibling, 0 replies; 40+ messages in thread From: Aleksa Sarai @ 2025-08-27 10:18 UTC (permalink / raw) To: Jann Horn Cc: Mickaël Salaün, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Andy Lutomirski, Jeff Xu [-- Attachment #1: Type: text/plain, Size: 2558 bytes --] On 2025-08-22, Jann Horn <jannh@google.com> wrote: > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > > passed file descriptors). This changes the state of the opened file by > > making it read-only until it is closed. The main use case is for script > > interpreters to get the guarantee that script' content cannot be altered > > while being read and interpreted. This is useful for generic distros > > that may not have a write-xor-execute policy. See commit a5874fde3c08 > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > > > Both execve(2) and the IOCTL to enable fsverity can already set this > > property on files with deny_write_access(). This new O_DENY_WRITE make > > The kernel actually tried to get rid of this behavior on execve() in > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > because it broke userspace assumptions. Also the ETXTBSY behaviour for binaries is not always guaranteed to block writes to the file. 
When we were discussing this back in 2021 and when we initially removed it, I remember there being some fairly trivial ways to get around it anyway (but because process mm is mapped with MAP_PRIVATE, writes aren't seen by the actual process). > > it widely available. This is similar to what other OSs may provide > > e.g., opening a file with only FILE_SHARE_READ on Windows. > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > removed for security reasons; as > https://man7.org/linux/man-pages/man2/mmap.2.html says: > > | MAP_DENYWRITE > | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > | signaled that attempts to write to the underlying file > | should fail with ETXTBSY. But this was a source of denial- > | of-service attacks.)" > > It seems to me that the same issue applies to your patch - it would > allow unprivileged processes to essentially lock files such that other > processes can't write to them anymore. This might allow unprivileged > users to prevent root from updating config files or stuff like that if > they're updated in-place. Agreed, and this was one of the major issues with the also-now-removed mandatory locking as well. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 265 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-22 17:07 ` [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE Mickaël Salaün 2025-08-22 19:45 ` Jann Horn @ 2025-08-27 10:29 ` Aleksa Sarai 1 sibling, 0 replies; 40+ messages in thread From: Aleksa Sarai @ 2025-08-27 10:29 UTC (permalink / raw) To: Mickaël Salaün Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Andy Lutomirski, Jeff Xu [-- Attachment #1: Type: text/plain, Size: 6799 bytes --] On 2025-08-22, Mickaël Salaün <mic@digikod.net> wrote: > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > passed file descriptors). This changes the state of the opened file by > making it read-only until it is closed. The main use case is for script > interpreters to get the guarantee that script' content cannot be altered > while being read and interpreted. This is useful for generic distros > that may not have a write-xor-execute policy. See commit a5874fde3c08 > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > > Both execve(2) and the IOCTL to enable fsverity can already set this > property on files with deny_write_access(). This new O_DENY_WRITE make > it widely available. This is similar to what other OSs may provide > e.g., opening a file with only FILE_SHARE_READ on Windows. 
> > Cc: Al Viro <viro@zeniv.linux.org.uk> > Cc: Andy Lutomirski <luto@amacapital.net> > Cc: Christian Brauner <brauner@kernel.org> > Cc: Jeff Xu <jeffxu@chromium.org> > Cc: Kees Cook <keescook@chromium.org> > Cc: Paul Moore <paul@paul-moore.com> > Cc: Serge Hallyn <serge@hallyn.com> > Reported-by: Robert Waite <rowait@microsoft.com> > Signed-off-by: Mickaël Salaün <mic@digikod.net> > Link: https://lore.kernel.org/r/20250822170800.2116980-2-mic@digikod.net > --- > fs/fcntl.c | 26 ++++++++++++++++++++++++-- > fs/file_table.c | 2 ++ > fs/namei.c | 6 ++++++ > include/linux/fcntl.h | 2 +- > include/uapi/asm-generic/fcntl.h | 4 ++++ > 5 files changed, 37 insertions(+), 3 deletions(-) > > diff --git a/fs/fcntl.c b/fs/fcntl.c > index 5598e4d57422..0c80c0fbc706 100644 > --- a/fs/fcntl.c > +++ b/fs/fcntl.c > @@ -34,7 +34,8 @@ > > #include "internal.h" > > -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME) > +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \ > + O_DENY_WRITE) > > static int setfl(int fd, struct file * filp, unsigned int arg) > { > @@ -80,8 +81,29 @@ static int setfl(int fd, struct file * filp, unsigned int arg) > error = 0; > } > spin_lock(&filp->f_lock); > + > + if (arg & O_DENY_WRITE) { > + /* Only regular files. */ > + if (!S_ISREG(inode->i_mode)) { > + error = -EINVAL; > + goto unlock; > + } > + > + /* Only sets once. */ > + if (!(filp->f_flags & O_DENY_WRITE)) { > + error = exe_file_deny_write_access(filp); > + if (error) > + goto unlock; > + } > + } else { > + if (filp->f_flags & O_DENY_WRITE) > + exe_file_allow_write_access(filp); > + } I appreciate the goal of making this something that can be cleared (presumably for interpreters that mmap(MAP_PRIVATE) their scripts), but making a security-related flag this easy to clear seems like a footgun (any library function could mask O_DENY_WRITE or forget to copy the old flag values). 
> + > filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK); > filp->f_iocb_flags = iocb_flags(filp); > + > +unlock: > spin_unlock(&filp->f_lock); > > out: > @@ -1158,7 +1180,7 @@ static int __init fcntl_init(void) > * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY > * is defined as O_NONBLOCK on some platforms and not on others. > */ > - BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ != > + BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != > HWEIGHT32( > (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | > __FMODE_EXEC)); > diff --git a/fs/file_table.c b/fs/file_table.c > index 81c72576e548..6ba896b6a53f 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -460,6 +460,8 @@ static void __fput(struct file *file) > locks_remove_file(file); > > security_file_release(file); > + if (unlikely(file->f_flags & O_DENY_WRITE)) > + exe_file_allow_write_access(file); > if (unlikely(file->f_flags & FASYNC)) { > if (file->f_op->fasync) > file->f_op->fasync(-1, file, 0); > diff --git a/fs/namei.c b/fs/namei.c > index cd43ff89fbaa..366530bf937d 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3885,6 +3885,12 @@ static int do_open(struct nameidata *nd, > error = may_open(idmap, &nd->path, acc_mode, open_flag); > if (!error && !(file->f_mode & FMODE_OPENED)) > error = vfs_open(&nd->path, file); > + if (!error && (open_flag & O_DENY_WRITE)) { > + if (S_ISREG(file_inode(file)->i_mode)) > + error = exe_file_deny_write_access(file); > + else > + error = -EINVAL; > + } > if (!error) > error = security_file_post_open(file, op->acc_mode); > if (!error && do_truncate) > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h > index a332e79b3207..dad14101686f 100644 > --- a/include/linux/fcntl.h > +++ b/include/linux/fcntl.h > @@ -10,7 +10,7 @@ > (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ > O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \ > FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ > - 
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) > + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_DENY_WRITE) I don't like this patch for the same reasons Christian has already said, but in addition -- you cannot just add new open(2) flags like this. Unlike openat2(2), classic open(2) does not verify invalid flag bits, so any new flag must be designed so that old kernels will return an error for that flag combination, which ensures that: * No old programs set those bits inadvertently, which lets us avoid breaking userspace in some really fun and hard-to-debug ways. * For security-related bits, that new programs running on old kernels do not think they are getting a security property that they aren't actually getting. O_TMPFILE's bitflag soup is an example of how you can resolve this issue for open(2), but I would suggest that authors of new O_* flags seriously consider making their flags openat2(2)-only unless it's trivial to get the above behaviour. > /* List of all valid flags for the how->resolve argument: */ > #define VALID_RESOLVE_FLAGS \ > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h > index 613475285643..facd9136f5af 100644 > --- a/include/uapi/asm-generic/fcntl.h > +++ b/include/uapi/asm-generic/fcntl.h > @@ -91,6 +91,10 @@ > /* a horrid kludge trying to make sure that this will fail on old kernels */ > #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) > > +#ifndef O_DENY_WRITE > +#define O_DENY_WRITE 040000000 > +#endif > + > #ifndef O_NDELAY > #define O_NDELAY O_NONBLOCK > #endif -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 265 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
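The difference between open(2) and openat2(2) described above can be probed from userspace. The following is a rough sketch under stated assumptions: open_strict() and open_legacy() are hypothetical helper names, struct open_how_compat is a local mirror of the kernel's struct open_how, 437 is assumed as the openat2 syscall number when the libc headers lack it, and DEMO_UNKNOWN_FLAG reuses the value this RFC proposes for O_DENY_WRITE, which is assumed to be unassigned on the running kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437 /* assumed: the number used by most architectures */
#endif

/* Value proposed for O_DENY_WRITE in this RFC; assumed unknown to the kernel. */
#define DEMO_UNKNOWN_FLAG 040000000

/* Local mirror of the kernel's struct open_how (three 64-bit fields). */
struct open_how_compat {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Hypothetical helper: openat2(2) rejects flag bits it does not know. */
int open_strict(const char *path, uint64_t flags)
{
	struct open_how_compat how = { .flags = flags };
	long fd = syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));

	return fd < 0 ? -errno : (int)fd;
}

/* Same request through classic open(2), which drops unknown bits silently. */
int open_legacy(const char *path, int flags)
{
	int fd = open(path, flags);

	return fd < 0 ? -errno : fd;
}
```

On a kernel without the flag, open_legacy() is expected to succeed even though no deny-write guarantee was granted, while open_strict() fails, which gives the caller a reliable way to detect missing support.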
* [RFC PATCH v1 2/2] selftests/exec: Add O_DENY_WRITE tests 2025-08-22 17:07 [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) Mickaël Salaün 2025-08-22 17:07 ` [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE Mickaël Salaün @ 2025-08-22 17:08 ` Mickaël Salaün 2025-08-26 9:07 ` [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) Christian Brauner 2 siblings, 0 replies; 40+ messages in thread From: Mickaël Salaün @ 2025-08-22 17:08 UTC (permalink / raw) To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Andy Lutomirski, Jeff Xu Add 6 tests to check O_DENY_WRITE when used through open(2) and fcntl(2). Check that it fulfills its purpose, that it only applies to regular files, and that setting this flag several times is not an issue. The O_DENY_WRITE flag is useful in conjunction with AT_EXECVE_CHECK for systems that don't enforce a write-xor-execute policy. Extend related tests to also use them as examples. 
Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Robert Waite <rowait@microsoft.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Mickaël Salaün <mic@digikod.net> Link: https://lore.kernel.org/r/20250822170800.2116980-3-mic@digikod.net --- tools/testing/selftests/exec/check-exec.c | 219 ++++++++++++++++++++++ 1 file changed, 219 insertions(+) diff --git a/tools/testing/selftests/exec/check-exec.c b/tools/testing/selftests/exec/check-exec.c index 55bce47e56b7..9db1d7b9aa97 100644 --- a/tools/testing/selftests/exec/check-exec.c +++ b/tools/testing/selftests/exec/check-exec.c @@ -30,6 +30,10 @@ #define _ASM_GENERIC_FCNTL_H #include <linux/fcntl.h> +#ifndef O_DENY_WRITE +#define O_DENY_WRITE 040000000 +#endif + #include "../kselftest_harness.h" static int sys_execveat(int dirfd, const char *pathname, char *const argv[], @@ -319,6 +323,221 @@ TEST_F(access, non_regular_files) test_exec_fd(_metadata, self->pipefd, EACCES); } +TEST_F(access, deny_write_check_open) +{ + int fd_deny, fd_read, fd_write; + + fd_deny = open(reg_file_path, O_DENY_WRITE | O_RDONLY | O_CLOEXEC); + ASSERT_LE(0, fd_deny); + + /* Concurrent reads are always allowed. */ + fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC); + EXPECT_LE(0, fd_read); + EXPECT_EQ(0, close(fd_read)); + + /* Concurrent writes are denied. */ + fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC); + EXPECT_EQ(-1, fd_write); + EXPECT_EQ(ETXTBSY, errno); + + /* Drops O_DENY_WRITE. */ + EXPECT_EQ(0, close(fd_deny)); + + /* The restriction is now gone. 
*/ + fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC); + EXPECT_LE(0, fd_write); + EXPECT_EQ(0, close(fd_write)); +} + +TEST_F(access, deny_write_check_open_and_fcntl) +{ + int fd_deny, fd_read, fd_write, flags; + + fd_deny = open(reg_file_path, O_DENY_WRITE | O_RDONLY | O_CLOEXEC); + ASSERT_LE(0, fd_deny); + + /* Sets O_DENY_WRITE a "second" time. */ + flags = fcntl(fd_deny, F_GETFL); + ASSERT_NE(-1, flags); + EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE)); + + /* Concurrent reads are always allowed. */ + fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC); + EXPECT_LE(0, fd_read); + EXPECT_EQ(0, close(fd_read)); + + /* Concurrent writes are denied. */ + fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC); + EXPECT_EQ(-1, fd_write); + EXPECT_EQ(ETXTBSY, errno); + + /* Drops O_DENY_WRITE. */ + EXPECT_EQ(0, close(fd_deny)); + + /* The restriction is now gone. */ + fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC); + EXPECT_LE(0, fd_write); + EXPECT_EQ(0, close(fd_write)); +} + +TEST_F(access, deny_write_check_fcntl) +{ + int fd_deny, fd_read, fd_write, flags; + + fd_deny = open(reg_file_path, O_RDONLY | O_CLOEXEC); + ASSERT_LE(0, fd_deny); + + /* Sets O_DENY_WRITE a first time. */ + flags = fcntl(fd_deny, F_GETFL); + ASSERT_NE(-1, flags); + EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE)); + + /* Sets O_DENY_WRITE a "second" time. */ + EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE)); + + /* Concurrent reads are always allowed. */ + fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC); + EXPECT_LE(0, fd_read); + EXPECT_EQ(0, close(fd_read)); + + /* Concurrent writes are denied. */ + fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC); + EXPECT_EQ(-1, fd_write); + EXPECT_EQ(ETXTBSY, errno); + + /* Drops O_DENY_WRITE. */ + EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags & ~O_DENY_WRITE)); + + /* The restriction is now gone. 
*/ + fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC); + EXPECT_LE(0, fd_write); + EXPECT_EQ(0, close(fd_write)); + + EXPECT_EQ(0, close(fd_deny)); +} + +static void test_deny_write_open(struct __test_metadata *_metadata, + const char *const path, int flags, + const int err_code) +{ + int fd; + + flags |= O_CLOEXEC; + + /* Do not block on pipes. */ + if (path == fifo_path) + flags |= O_NONBLOCK; + + fd = open(path, flags | O_RDONLY); + if (err_code) { + ASSERT_EQ(-1, fd) + { + TH_LOG("Successfully opened %s", path); + } + EXPECT_EQ(errno, err_code) + { + TH_LOG("Wrong error code for %s: %s", path, + strerror(errno)); + } + } else { + ASSERT_LE(0, fd) + { + TH_LOG("Failed to open %s: %s", path, strerror(errno)); + } + EXPECT_EQ(0, close(fd)); + } +} + +TEST_F(access, deny_write_type_open) +{ + test_deny_write_open(_metadata, reg_file_path, O_DENY_WRITE, 0); + test_deny_write_open(_metadata, dir_path, O_DENY_WRITE, EINVAL); + test_deny_write_open(_metadata, block_dev_path, O_DENY_WRITE, EINVAL); + test_deny_write_open(_metadata, char_dev_path, O_DENY_WRITE, EINVAL); + test_deny_write_open(_metadata, fifo_path, O_DENY_WRITE, EINVAL); +} + +static void test_deny_write_fcntl(struct __test_metadata *_metadata, + const char *const path, int setfl, + const int err_code) +{ + int fd, ret; + int getfl, flags = O_CLOEXEC; + + /* Do not block on pipes. 
*/ + if (path == fifo_path) + flags |= O_NONBLOCK; + + fd = open(path, flags | O_RDONLY); + ASSERT_LE(0, fd) + { + TH_LOG("Failed to open %s: %s", path, strerror(errno)); + } + getfl = fcntl(fd, F_GETFL); + ASSERT_NE(-1, getfl); + ret = fcntl(fd, F_SETFL, getfl | setfl); + if (err_code) { + ASSERT_EQ(-1, ret) + { + TH_LOG("Successfully updated flags for %s", path); + } + EXPECT_EQ(errno, err_code) + { + TH_LOG("Wrong error code for %s: %s", path, + strerror(errno)); + } + } else { + ASSERT_LE(0, ret) + { + TH_LOG("Failed to update flags for %s: %s", path, + strerror(errno)); + } + EXPECT_EQ(0, close(fd)); + } +} + +TEST_F(access, deny_write_type_fcntl) +{ + int flags; + + test_deny_write_fcntl(_metadata, reg_file_path, O_DENY_WRITE, 0); + test_deny_write_fcntl(_metadata, dir_path, O_DENY_WRITE, EINVAL); + test_deny_write_fcntl(_metadata, block_dev_path, O_DENY_WRITE, EINVAL); + test_deny_write_fcntl(_metadata, char_dev_path, O_DENY_WRITE, EINVAL); + test_deny_write_fcntl(_metadata, fifo_path, O_DENY_WRITE, EINVAL); + + flags = fcntl(self->socket_fds[0], F_GETFL); + ASSERT_NE(-1, flags); + EXPECT_EQ(-1, + fcntl(self->socket_fds[0], F_SETFL, flags | O_DENY_WRITE)); + EXPECT_EQ(EINVAL, errno); + + flags = fcntl(self->pipefd, F_GETFL); + ASSERT_NE(-1, flags); + EXPECT_EQ(-1, fcntl(self->pipefd, F_SETFL, flags | O_DENY_WRITE)); + EXPECT_EQ(EINVAL, errno); +} + +TEST_F(access, allow_write_type_fcntl) +{ + int flags; + + test_deny_write_fcntl(_metadata, reg_file_path, 0, 0); + test_deny_write_fcntl(_metadata, dir_path, 0, 0); + test_deny_write_fcntl(_metadata, block_dev_path, 0, 0); + test_deny_write_fcntl(_metadata, char_dev_path, 0, 0); + test_deny_write_fcntl(_metadata, fifo_path, 0, 0); + + flags = fcntl(self->socket_fds[0], F_GETFL); + ASSERT_NE(-1, flags); + EXPECT_EQ(0, + fcntl(self->socket_fds[0], F_SETFL, flags & ~O_DENY_WRITE)); + + flags = fcntl(self->pipefd, F_GETFL); + ASSERT_NE(-1, flags); + EXPECT_EQ(0, fcntl(self->pipefd, F_SETFL, flags & ~O_DENY_WRITE)); 
+} + /* clang-format off */ FIXTURE(secbits) {}; /* clang-format on */ -- 2.50.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
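As an illustration related to the fcntl(2) tests above (this is not code from the patch), note that F_SETFL ignores flag bits it does not handle, so a successful return alone does not prove a flag took effect. The sketch below sets a flag and reads it back with F_GETFL. The helper name setfl_verified() is hypothetical, and whether a kernel carrying this RFC would report O_DENY_WRITE through F_GETFL is an assumption that was not verified here.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: try to set @flag with F_SETFL, then read the
 * flags back with F_GETFL to confirm that the kernel recorded it.
 *
 * Returns 0 if the flag is set afterwards, -1 otherwise.
 */
static int setfl_verified(int fd, int flag)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	if (fcntl(fd, F_SETFL, flags | flag) == -1)
		return -1;
	/* Read back: bits that F_SETFL ignored will not show up here. */
	flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;
	return (flags & flag) == flag ? 0 : -1;
}
```

With a well-known flag such as O_NONBLOCK the read-back succeeds; with a bit the kernel does not implement, the helper reports failure even though F_SETFL itself returned 0.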
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-22 17:07 [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) Mickaël Salaün 2025-08-22 17:07 ` [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE Mickaël Salaün 2025-08-22 17:08 ` [RFC PATCH v1 2/2] selftests/exec: Add O_DENY_WRITE tests Mickaël Salaün @ 2025-08-26 9:07 ` Christian Brauner 2025-08-26 11:23 ` Mickaël Salaün 2 siblings, 1 reply; 40+ messages in thread From: Christian Brauner @ 2025-08-26 9:07 UTC (permalink / raw) To: Mickaël Salaün Cc: Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Fri, Aug 22, 2025 at 07:07:58PM +0200, Mickaël Salaün wrote: > Hi, > > Script interpreters can check if a file would be allowed to be executed > by the kernel using the new AT_EXECVE_CHECK flag. This approach works > well on systems with write-xor-execute policies, where scripts cannot > be modified by malicious processes. However, this protection may not be > available on more generic distributions. > > The key difference between `./script.sh` and `sh script.sh` (when using > AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened > for writing while it's being executed. To achieve parity, the kernel > should provide a mechanism for script interpreters to deny write access > during script interpretation. While interpreters can copy script content > into a buffer, a race condition remains possible after AT_EXECVE_CHECK. > > This patch series introduces a new O_DENY_WRITE flag for use with > open*(2) and fcntl(2). 
Both interfaces are necessary since script > interpreters may receive either a file path or file descriptor. For > backward compatibility, open(2) with O_DENY_WRITE will not fail on > unsupported systems, while users requiring explicit support guarantees > can use openat2(2). We've said no to abusing the O_* flag space for that AT_EXECVE_* stuff before and you've been told by Linus as well that this is a nogo. Nothing has changed in that regard and I'm not interested in stuffing the VFS APIs full of special-purpose behavior to work around the fact that this is work that needs to be done in userspace. Change the apps, stop pushing more and more cruft into the VFS that has no business there. That's before we get into all the issues that are introduced by this mechanism that magically makes arbitrary files unwritable. It's not just a DoS it's likely to cause breakage in userspace as well. I removed the deny-write from execve because it already breaks various use-cases or leads to spurious failures in e.g., go. We're not spreading this disease as a first-class VFS API. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-26 9:07 ` [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) Christian Brauner @ 2025-08-26 11:23 ` Mickaël Salaün 2025-08-26 12:30 ` Theodore Ts'o 2025-08-28 0:14 ` Aleksa Sarai 0 siblings, 2 replies; 40+ messages in thread From: Mickaël Salaün @ 2025-08-26 11:23 UTC (permalink / raw) To: Christian Brauner Cc: Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > On Fri, Aug 22, 2025 at 07:07:58PM +0200, Mickaël Salaün wrote: > > Hi, > > > > Script interpreters can check if a file would be allowed to be executed > > by the kernel using the new AT_EXECVE_CHECK flag. This approach works > > well on systems with write-xor-execute policies, where scripts cannot > > be modified by malicious processes. However, this protection may not be > > available on more generic distributions. > > > > The key difference between `./script.sh` and `sh script.sh` (when using > > AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened > > for writing while it's being executed. To achieve parity, the kernel > > should provide a mechanism for script interpreters to deny write access > > during script interpretation. While interpreters can copy script content > > into a buffer, a race condition remains possible after AT_EXECVE_CHECK. > > > > This patch series introduces a new O_DENY_WRITE flag for use with > > open*(2) and fcntl(2). 
Both interfaces are necessary since script > > interpreters may receive either a file path or file descriptor. For > > backward compatibility, open(2) with O_DENY_WRITE will not fail on > > unsupported systems, while users requiring explicit support guarantees > > can use openat2(2). > > We've said no to abusing the O_* flag space for that AT_EXECVE_* stuff > before and you've been told by Linus as well that this is a nogo. Oh, please, don't mix up everything. First, this is an RFC, and as I explained, the goal is to start a discussion with something concrete. Second, doing a one-time check on a file and providing guarantees for the whole lifetime of an opened file requires different approaches, hence this O_ *proposal*. > > Nothing has changed in that regard and I'm not interested in stuffing > the VFS APIs full of special-purpose behavior to work around the fact > that this is work that needs to be done in userspace. Change the apps, > stop pushing more and more cruft into the VFS that has no business > there. It would be interesting to know how to patch user space to get the same guarantees... Do you think I would propose a kernel patch otherwise? > > That's before we get into all the issues that are introduced by this > mechanism that magically makes arbitrary files unwritable. It's not just > a DoS it's likely to cause breakage in userspace as well. I removed the > deny-write from execve because it already breaks various use-cases or > leads to spurious failures in e.g., go. We're not spreading this disease > as a first-class VFS API. Jann explained it very well, and the deny-write for execve is still there, but let's keep it civil. I already agreed that this is not a good approach, but we could get interesting proposals. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-26 11:23 ` Mickaël Salaün @ 2025-08-26 12:30 ` Theodore Ts'o 2025-08-26 17:47 ` Mickaël Salaün 2025-08-28 0:14 ` Aleksa Sarai 1 sibling, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2025-08-26 12:30 UTC (permalink / raw) To: Mickaël Salaün Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module Is there a single, unified design and requirements document that describes the threat model, and what you are trying to achieve with AT_EXECVE_CHECK and O_DENY_WRITE? I've been looking at the cover letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation that has landed for AT_EXECVE_CHECK and it really doesn't describe what *are* the checks that AT_EXECVE_CHECK is trying to achieve: "The AT_EXECVE_CHECK execveat(2) flag, and the SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE securebits are intended for script interpreters and dynamic linkers to enforce a consistent execution security policy handled by the kernel." Um, what security policy? What checks? What is a sample exploit which is blocked by AT_EXECVE_CHECK? And then on top of it, why can't you do these checks by modifying the script interpreters? Confused, - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-26 12:30 ` Theodore Ts'o @ 2025-08-26 17:47 ` Mickaël Salaün 2025-08-26 20:50 ` Theodore Ts'o 2025-08-27 17:35 ` Andy Lutomirski 0 siblings, 2 replies; 40+ messages in thread From: Mickaël Salaün @ 2025-08-26 17:47 UTC (permalink / raw) To: Theodore Ts'o Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote: > Is there a single, unified design and requirements document that > describes the threat model, and what you are trying to achieve with > AT_EXECVE_CHECK and O_DENY_WRITE? I've been looking at the cover > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation > that has landed for AT_EXECVE_CHECK and it really doesn't describe > what *are* the checks that AT_EXECVE_CHECK is trying to achieve: > > "The AT_EXECVE_CHECK execveat(2) flag, and the > SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE > securebits are intended for script interpreters and dynamic linkers > to enforce a consistent execution security policy handled by the > kernel." From the documentation: Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check on a regular file and returns 0 if execution of this file would be allowed, ignoring the file format and then the related interpreter dependencies (e.g. ELF libraries, script’s shebang). > > Um, what security policy? Whether the file is allowed to be executed. This includes file permission, mount point option, ACL, LSM policies... 
> What checks? Executability checks? > What is a sample exploit > which is blocked by AT_EXECVE_CHECK? Executing/interpreting any data: sh script.txt > > And then on top of it, why can't you do these checks by modifying the > script interpreters? The script interpreter requires modification to use AT_EXECVE_CHECK. There is no other way for user space to reliably check executability of files (taking into account all enforced security policies/configurations). ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-26 17:47 ` Mickaël Salaün @ 2025-08-26 20:50 ` Theodore Ts'o 2025-08-27 8:19 ` Mickaël Salaün 2025-08-27 17:35 ` Andy Lutomirski 1 sibling, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2025-08-26 20:50 UTC (permalink / raw) To: Mickaël Salaün Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote: > > Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check > on a regular file and returns 0 if execution of this file would be > allowed, ignoring the file format and then the related interpreter > dependencies (e.g. ELF libraries, script’s shebang). But if that's it, why can't the script interpreter (python, bash, etc.), before executing the script, check for executability via faccessat(2) or fstat(2)? The whole O_DENY_WRITE discussion seemed to imply that AT_EXECVE_CHECK was doing more than just the executability check? > There is no other way for user space to reliably check executability of > files (taking into account all enforced security > policies/configurations). Why doesn't faccessat(2) or fstat(2) suffice? This is why having a more substantive requirements and design doc might be helpful. It appears you have some assumptions that perhaps other kernel developers are not aware of. I certainly seem to be missing something..... - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-26 20:50 ` Theodore Ts'o @ 2025-08-27 8:19 ` Mickaël Salaün 0 siblings, 0 replies; 40+ messages in thread From: Mickaël Salaün @ 2025-08-27 8:19 UTC (permalink / raw) To: Theodore Ts'o Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Tue, Aug 26, 2025 at 04:50:57PM -0400, Theodore Ts'o wrote: > On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote: > > > > Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check > > on a regular file and returns 0 if execution of this file would be > > allowed, ignoring the file format and then the related interpreter > > dependencies (e.g. ELF libraries, script’s shebang). > > But if that's it, why can't the script interpreter (python, bash, > etc.) before executing the script, checks for executability via > faccessat(2) or fstat(2)? From commit a5874fde3c08 ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)"): This is different from faccessat(2) + X_OK which only checks a subset of access rights (i.e. inode permission and mount options for regular files), but not the full context (e.g. all LSM access checks). The main use case for access(2) is for SUID processes to (partially) check access on behalf of their caller. The main use case for execveat(2) + AT_EXECVE_CHECK is to check if a script execution would be allowed, according to all the different restrictions in place. 
Because the use of AT_EXECVE_CHECK follows the exact kernel semantic as for a real execution, user space gets the same error codes. > > The whole O_DENY_WRITE discussion seemed to imply that AT_EXECVE_CHECK > was doing more than just the executability check? I would say that AT_EXECVE_CHECK does a full executability check (with the full caller's credentials checked against the currently enforced security policy). The rationale to add O_DENY_WRITE (which is now abandoned) was to avoid a race condition between the check and the full read. Indeed, with a full execveat(2), the kernel write-locks the file to avoid such an issue (which can lead to other issues). > > > There is no other way for user space to reliably check executability of > > files (taking into account all enforced security > > policies/configurations). > > Why doesn't faccessat(2) or fstat(2) suffice? This is why having a > more substantive requirements and design doc might be helpful. It > appears you have some assumptions that perhaps other kernel developers > are not aware of. I certainly seem to be missing something..... My reasoning was to explain the rationale for a kernel feature in the commit message, and the user doc (why and how to use it) in the user-facing documentation. Documentation improvements are welcome! ^ permalink raw reply [flat|nested] 40+ messages in thread
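As an illustration of the two checks contrasted in this reply (this is not code from the thread), the sketch below puts an access(2) X_OK check next to an execveat(2) call with AT_EXECVE_CHECK. The fallback value 0x10000 for AT_EXECVE_CHECK is taken from the Linux 6.14 UAPI headers and is only defined when the installed headers lack it. Both helper names are hypothetical. The NULL argv and envp mirror how the kernel selftest invokes the check.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EXECVE_CHECK
#define AT_EXECVE_CHECK 0x10000	/* value from the Linux 6.14 UAPI headers */
#endif

/*
 * Ask the kernel whether executing the file behind @fd would be
 * allowed, without executing it.
 *
 * Returns 0 if allowed, or -1 with errno set: typically EACCES when
 * denied, or EINVAL on kernels that do not know AT_EXECVE_CHECK.
 */
static int exec_check_fd(int fd)
{
	return (int)syscall(SYS_execveat, fd, "", NULL, NULL,
			    AT_EMPTY_PATH | AT_EXECVE_CHECK);
}

/* The narrower check discussed in the thread: X_OK through access(2). */
static int access_check_path(const char *path)
{
	return access(path, X_OK);
}
```

An interpreter would typically open the script once, call something like exec_check_fd() on that descriptor, and then read the script from the same descriptor, so that the check and the read refer to the same file.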
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-26 17:47 ` Mickaël Salaün 2025-08-26 20:50 ` Theodore Ts'o @ 2025-08-27 17:35 ` Andy Lutomirski 2025-08-27 19:07 ` Mickaël Salaün 1 sibling, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2025-08-27 17:35 UTC (permalink / raw) To: Mickaël Salaün Cc: Theodore Ts'o, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote: > > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote: > > Is there a single, unified design and requirements document that > > describes the threat model, and what you are trying to achieve with > > AT_EXECVE_CHECK and O_DENY_WRITE? I've been looking at the cover > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation > > that has landed for AT_EXECVE_CHECK and it really doesn't describe > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve: > > > > "The AT_EXECVE_CHECK execveat(2) flag, and the > > SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE > > securebits are intended for script interpreters and dynamic linkers > > to enforce a consistent execution security policy handled by the > > kernel." > > From the documentation: > > Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check > on a regular file and returns 0 if execution of this file would be > allowed, ignoring the file format and then the related interpreter > dependencies (e.g. ELF libraries, script’s shebang). 
> > > > > Um, what security policy? > > Whether the file is allowed to be executed. This includes file > permission, mount point option, ACL, LSM policies... This needs *waaaaay* more detail for any sort of useful evaluation. Is an actual credible security policy rolling dice? Asking ChatGPT? Looking at security labels? Does it care who can write to the file, or who owns the file, or what the file's hash is, or what filesystem it's on, or where it came from? Does it dynamically inspect the contents? Is it controlled by an unprivileged process? I can easily come up with security policies for which DENYWRITE is completely useless. I can come up with convoluted and not-really-credible policies where DENYWRITE is important, but I'm honestly not sure that those policies are actually useful. I'm honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted because it should have been parametrized by *what format is expected* -- it might be possible to bypass a policy by executing a perfectly fine Python script using bash, for example. I genuinely have not come up with a security policy that I believe makes sense that needs AT_EXECVE_CHECK and DENYWRITE. I'm not saying that such a policy does not exist -- I'm saying that I have not thought of such a thing after a few minutes of thought and reading these threads. > > And then on top of it, why can't you do these checks by modifying the > > script interpreters? > > The script interpreter requires modification to use AT_EXECVE_CHECK. > > There is no other way for user space to reliably check executability of > files (taking into account all enforced security > policies/configurations). > As mentioned above, even AT_EXECVE_CHECK does not obviously accomplish this goal. 
If it were genuinely useful, I would much, much prefer a totally different API: a *syscall* that takes, as input, a file descriptor of something that an interpreter wants to execute and a whole lot of context as to what that interpreter wants to do with it. And I admit I'm *still* not convinced. Seriously, consider all the unending recent attacks on LLMs an inspiration. The implications of viewing an image, downscaling the image, possibly interpreting the image as something containing text, possibly following instructions in a given language contained in the image, etc. are all wildly different. A mechanism for asking for general permission to "consume this image" is COMPLETELY MISSING THE POINT. (Never mind that the current crop of LLMs seem entirely incapable of constraining their own use of some piece of input, but that's a different issue and is beside the point here.) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-27 17:35 ` Andy Lutomirski @ 2025-08-27 19:07 ` Mickaël Salaün 2025-08-27 20:35 ` Andy Lutomirski 0 siblings, 1 reply; 40+ messages in thread From: Mickaël Salaün @ 2025-08-27 19:07 UTC (permalink / raw) To: Andy Lutomirski Cc: Theodore Ts'o, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Wed, Aug 27, 2025 at 10:35:28AM -0700, Andy Lutomirski wrote: > On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote: > > > Is there a single, unified design and requirements document that > > > describes the threat model, and what you are trying to achieve with > > > AT_EXECVE_CHECK and O_DENY_WRITE? I've been looking at the cover > > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation > > > that has landed for AT_EXECVE_CHECK and it really doesn't describe > > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve: > > > > > > "The AT_EXECVE_CHECK execveat(2) flag, and the > > > SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE > > > securebits are intended for script interpreters and dynamic linkers > > > to enforce a consistent execution security policy handled by the > > > kernel." 
> > > > From the documentation: > > > > Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check > > on a regular file and returns 0 if execution of this file would be > > allowed, ignoring the file format and then the related interpreter > > dependencies (e.g. ELF libraries, script’s shebang). > > > > > > > > Um, what security policy? > > > > Whether the file is allowed to be executed. This includes file > > permission, mount point option, ACL, LSM policies... > > This needs *waaaaay* more detail for any sort of useful evaluation. > Is an actual credible security policy rolling dice? Asking ChatGPT? > Looking at security labels? Does it care who can write to the file, > or who owns the file, or what the file's hash is, or what filesystem > it's on, or where it came from? Does it dynamically inspect the > contents? Is it controlled by an unprivileged process? AT_EXECVE_CHECK only does the same checks as done by other execveat(2) calls, but without actually executing the file/fd. > > I can easily come up with security policies for which DENYWRITE is > completely useless. I can come up with convoluted and > not-really-credible policies where DENYWRITE is important, but I'm > honestly not sure that those policies are actually useful. I'm > honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted > because it should have been parametrized by *what format is expected* > -- it might be possible to bypass a policy by executing a perfectly > fine Python script using bash, for example. There has been a lot of bikeshedding for the AT_EXECVE_CHECK patch series, and a lot of discussions too (you were part of them). We ended up with this design, which is simple and follows the kernel semantics (requested by Linus). > > I genuinely have not come up with a security policy that I believe > makes sense that needs AT_EXECVE_CHECK and DENYWRITE. 
I'm not saying > that such a policy does not exist -- I'm saying that I have not > thought of such a thing after a few minutes of thought and reading > these threads.

A simple use case is a system that wants to enforce a write-xor-execute policy, e.g. through mount point options.

> > > > > And then on top of it, why can't you do these checks by modifying the > > > script interpreters? > > > > The script interpreter requires modification to use AT_EXECVE_CHECK. > > > > There is no other way for user space to reliably check executability of > > files (taking into account all enforced security > > policies/configurations). > > > > As mentioned above, even AT_EXECVE_CHECK does not obviously accomplish > this goal. If it were genuinely useful, I would much, much prefer a > totally different API: a *syscall* that takes, as input, a file > descriptor of something that an interpreter wants to execute and a > whole lot of context as to what that interpreter wants to do with it. > And I admit I'm *still* not convinced.

As mentioned above, AT_EXECVE_CHECK follows the kernel semantics. Nothing fancy.

> > Seriously, consider all the unending recent attacks on LLMs an > inspiration. The implications of viewing an image, downscaling the > image, possibly interpreting the image as something containing text, > possibly following instructions in a given language contained in the > image, etc are all wildly different. A mechanism for asking for > general permission to "consume this image" is COMPLETELY MISSING THE > POINT. (Never mind that the current crop of LLMs seem entirely > incapable of constraining their own use of some piece of input, but > that's a different issue and is besides the point here.)

You're asking what we should consider executable. This is a good question, but AT_EXECVE_CHECK is there to answer another question: would the kernel execute it or not?

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-27 19:07 ` Mickaël Salaün @ 2025-08-27 20:35 ` Andy Lutomirski 0 siblings, 0 replies; 40+ messages in thread From: Andy Lutomirski @ 2025-08-27 20:35 UTC (permalink / raw) To: Mickaël Salaün Cc: Andy Lutomirski, Theodore Ts'o, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Wed, Aug 27, 2025 at 12:07 PM Mickaël Salaün <mic@digikod.net> wrote: > > On Wed, Aug 27, 2025 at 10:35:28AM -0700, Andy Lutomirski wrote: > > On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote: > > > > > > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote: > > > > Is there a single, unified design and requirements document that > > > > describes the threat model, and what you are trying to achieve with > > > > AT_EXECVE_CHECK and O_DENY_WRITE? I've been looking at the cover > > > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation > > > > that has landed for AT_EXECVE_CHECK and it really doesn't describe > > > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve: > > > > > > > > "The AT_EXECVE_CHECK execveat(2) flag, and the > > > > SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE > > > > securebits are intended for script interpreters and dynamic linkers > > > > to enforce a consistent execution security policy handled by the > > > > kernel." 
> > > > > > From the documentation: > > > > > > Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check > > > on a regular file and returns 0 if execution of this file would be > > > allowed, ignoring the file format and then the related interpreter > > > dependencies (e.g. ELF libraries, script’s shebang). > > > > > > > > > > > Um, what security policy? > > > > > > Whether the file is allowed to be executed. This includes file > > > permission, mount point option, ACL, LSM policies... > > > > This needs *waaaaay* more detail for any sort of useful evaluation. > > Is an actual credible security policy rolling dice? Asking ChatGPT? > > Looking at security labels? Does it care who can write to the file, > > or who owns the file, or what the file's hash is, or what filesystem > > it's on, or where it came from? Does it dynamically inspect the > > contents? Is it controlled by an unprivileged process? > > AT_EXECVE_CHECK only does the same checks as done by other execveat(2) > calls, but without actually executing the file/fd. > okay... but see below. > > > > I can easily come up with security policies for which DENYWRITE is > > completely useless. I can come up with convoluted and > > not-really-credible policies where DENYWRITE is important, but I'm > > honestly not sure that those policies are actually useful. I'm > > honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted > > because it should have been parametrized by *what format is expected* > > -- it might be possible to bypass a policy by executing a perfectly > > fine Python script using bash, for example. > > There have been a lot of bikesheding for the AT_EXECVE_CHECK patch > series, and a lot of discussions too (you where part of them). We ended > up with this design, which is simple and follows the kernel semantic > (requested by Linus). I recall this. That doesn't mean I totally love AT_EXECVE_CHECK. 
And it especially doesn't mean that I believe that it usefully does something that justifies anything like DENYWRITE. > > > > > I genuinely have not come up with a security policy that I believe > > makes sense that needs AT_EXECVE_CHECK and DENYWRITE. I'm not saying > > that such a policy does not exist -- I'm saying that I have not > > thought of such a thing after a few minutes of thought and reading > > these threads. > > A simple use case is for systems that wants to enforce a > write-xor-execute policy e.g., thanks to mount point options. Sure, but I'm contemplating DENYWRITE, and this thread is about DENYWRITE. If the kernel is enforcing W^X, then there are really two almost unrelated things going on: 1. LSM policy that enforces W^X for memory mappings. This is to enforce that applications don't do nasty things like having executable stacks, and it's a mess because no one has really figured out how JITs are supposed to work in this world. It has almost nothing to do with execve except incidentally. 2. LSM policy that enforces that someone doesn't execve (or similar) something that *that user* can write. Or that non-root can write. Or that anyone at all can write, etc. I think, but I'm not sure, that you're talking about #2. So maybe there's a policy that says that one may only exec things that are on an fs with the 'exec' mount option. Or maybe there's a policy that says that one may only exec things that are on a readonly fs. In these specific cases, I believe in AT_EXECVE_CHECK. *But* I don't believe in DENYWRITE: in the 'exec' case, if an fs has the exec option set, that doesn't change if the file is subsequently modified. And if an fs is readonly, then the file is quite unlikely to be modified at all and will certainly not be modified via the mount through which it's being executed. And you don't need DENYWRITE. So I think my question still stands: is there a credible security policy *that actually benefits from DENYWRITE*? 
If so, can you give an example? > > > > Seriously, consider all the unending recent attacks on LLMs an > > inspiration. The implications of viewing an image, downscaling the > > image, possibly interpreting the image as something containing text, > > possibly following instructions in a given language contained in the > > image, etc are all wildly different. A mechanism for asking for > > general permission to "consume this image" is COMPLETELY MISSING THE > > POINT. (Never mind that the current crop of LLMs seem entirely > > incapable of constraining their own use of some piece of input, but > > that's a different issue and is besides the point here.) > > You're asking about what should we consider executable. This is a good > question, but AT_EXECVE_CHECK is there to answer another question: would > the kernel execute it or not? > That's a sort of odd way of putting it. The kernel won't execute it because the kernel doesn't know how to :) But I think I understand what you're saying. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-26 11:23 ` Mickaël Salaün 2025-08-26 12:30 ` Theodore Ts'o @ 2025-08-28 0:14 ` Aleksa Sarai 2025-08-28 0:32 ` Andy Lutomirski 2025-09-01 9:24 ` Roberto Sassu 1 sibling, 2 replies; 40+ messages in thread From: Aleksa Sarai @ 2025-08-28 0:14 UTC (permalink / raw) To: Mickaël Salaün Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module [-- Attachment #1: Type: text/plain, Size: 2981 bytes --] On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote: > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > > Nothing has changed in that regard and I'm not interested in stuffing > > the VFS APIs full of special-purpose behavior to work around the fact > > that this is work that needs to be done in userspace. Change the apps, > > stop pushing more and more cruft into the VFS that has no business > > there. > > It would be interesting to know how to patch user space to get the same > guarantees... Do you think I would propose a kernel patch otherwise? You could mmap the script file with MAP_PRIVATE. This is the *actual* protection the kernel uses against overwriting binaries (yes, ETXTBSY is nice but IIRC there are ways to get around it anyway). Of course, most interpreters don't mmap their scripts, but this is a potential solution. If the security policy is based on validating the script text in some way, this avoids the TOCTOU. 
Now, in cases where you have IMA or something and you only permit signed binaries to execute, you could argue there is a different race here (an attacker creates a malicious script, runs it, and then replaces it with a valid script's contents and metadata after the fact to get AT_EXECVE_CHECK to permit the execution). However, I'm not sure that this is even possible with IMA (can an unprivileged user even set security.ima?). But even then, I would expect users that really need this would also probably use fs-verity or dm-verity that would block this kind of attack since it would render the files read-only anyway. This is why a more detailed threat model of what kinds of attacks are relevant is useful. I was there for the talk you gave and subsequent discussion at last year's LPC, but I felt that your threat model was not really fleshed out at all. I am still not sure what capabilities you expect the attacker to have nor what is being used to authenticate binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above assumptions, but I can't know without knowing what threat model you have in mind, *in detail*. For example, if you are dealing with an attacker that has CAP_SYS_ADMIN, there are plenty of ways for an attacker to execute their own code without using interpreters (create a new tmpfs with fsopen(2) for instance). Executable memfds are even easier and don't require privileges on most systems (yes, you can block them with vm.memfd_noexec but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or mount(2)). (As an aside, it's a shame that AT_EXECVE_CHECK burned one of the top-level AT_* bits for a per-syscall flag -- the block comment I added in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated") was meant to avoid this happening but it seems you and the reviewers missed that...) 
-- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 265 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-28 0:14 ` Aleksa Sarai @ 2025-08-28 0:32 ` Andy Lutomirski 2025-08-28 0:52 ` Aleksa Sarai 2025-08-28 21:01 ` Serge E. Hallyn 2025-09-01 9:24 ` Roberto Sassu 1 sibling, 2 replies; 40+ messages in thread From: Andy Lutomirski @ 2025-08-28 0:32 UTC (permalink / raw) To: Aleksa Sarai Cc: Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote: > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > > > Nothing has changed in that regard and I'm not interested in stuffing > > > the VFS APIs full of special-purpose behavior to work around the fact > > > that this is work that needs to be done in userspace. Change the apps, > > > stop pushing more and more cruft into the VFS that has no business > > > there. > > > > It would be interesting to know how to patch user space to get the same > > guarantees... Do you think I would propose a kernel patch otherwise? > > You could mmap the script file with MAP_PRIVATE. This is the *actual* > protection the kernel uses against overwriting binaries (yes, ETXTBSY is > nice but IIRC there are ways to get around it anyway). Wait, really? MAP_PRIVATE prevents writes to the mapping from affecting the file, but I don't think that writes to the file will break the MAP_PRIVATE CoW if it's not already broken. 
IPython says:

In [1]: import mmap, tempfile

In [2]: f = tempfile.TemporaryFile()

In [3]: f.write(b'initial contents')
Out[3]: 16

In [4]: f.flush()

In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)

In [6]: map[:]
Out[6]: b'initial contents'

In [7]: f.seek(0)
Out[7]: 0

In [8]: f.write(b'changed')
Out[8]: 7

In [9]: f.flush()

In [10]: map[:]
Out[10]: b'changed contents'

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-28 0:32 ` Andy Lutomirski @ 2025-08-28 0:52 ` Aleksa Sarai 2025-08-28 21:01 ` Serge E. Hallyn 1 sibling, 0 replies; 40+ messages in thread From: Aleksa Sarai @ 2025-08-28 0:52 UTC (permalink / raw) To: Andy Lutomirski Cc: Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module [-- Attachment #1: Type: text/plain, Size: 2052 bytes --] On 2025-08-27, Andy Lutomirski <luto@kernel.org> wrote: > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote: > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > > > > Nothing has changed in that regard and I'm not interested in stuffing > > > > the VFS APIs full of special-purpose behavior to work around the fact > > > > that this is work that needs to be done in userspace. Change the apps, > > > > stop pushing more and more cruft into the VFS that has no business > > > > there. > > > > > > It would be interesting to know how to patch user space to get the same > > > guarantees... Do you think I would propose a kernel patch otherwise? > > > > You could mmap the script file with MAP_PRIVATE. This is the *actual* > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is > > nice but IIRC there are ways to get around it anyway). > > Wait, really? 
MAP_PRIVATE prevents writes to the mapping from > affecting the file, but I don't think that writes to the file will > break the MAP_PRIVATE CoW if it's not already broken. Oh I guess you're right -- that's news to me. And from mmap(2): > MAP_PRIVATE > [...] It is unspecified whether changes made to the file after the > mmap() call are visible in the mapped region. But then what is the protection mechanism (in the absence of -ETXTBSY) that stops you from overwriting the live text of a binary by just writing to it? I would need to go trawling through my old scripts to find the reproducer that let you get around -ETXTBSY (I think it involved executable memfds) but I distinctly remember that even if you overwrote the binary you would not see the live process's mapped mm change value. (Ditto for the few kernels when we removed -ETXTBSY.) I found this surprising, but assumed that it was because of MAP_PRIVATE. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 265 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-28 0:32 ` Andy Lutomirski 2025-08-28 0:52 ` Aleksa Sarai @ 2025-08-28 21:01 ` Serge E. Hallyn 2025-09-01 11:05 ` Jann Horn 1 sibling, 1 reply; 40+ messages in thread From: Serge E. Hallyn @ 2025-08-28 21:01 UTC (permalink / raw) To: Andy Lutomirski Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote: > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote: > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > > > > Nothing has changed in that regard and I'm not interested in stuffing > > > > the VFS APIs full of special-purpose behavior to work around the fact > > > > that this is work that needs to be done in userspace. Change the apps, > > > > stop pushing more and more cruft into the VFS that has no business > > > > there. > > > > > > It would be interesting to know how to patch user space to get the same > > > guarantees... Do you think I would propose a kernel patch otherwise? > > > > You could mmap the script file with MAP_PRIVATE. This is the *actual* > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is > > nice but IIRC there are ways to get around it anyway). > > Wait, really? 
MAP_PRIVATE prevents writes to the mapping from > affecting the file, but I don't think that writes to the file will > break the MAP_PRIVATE CoW if it's not already broken. > > IPython says: > > In [1]: import mmap, tempfile > > In [2]: f = tempfile.TemporaryFile() > > In [3]: f.write(b'initial contents') > Out[3]: 16 > > In [4]: f.flush() > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, > prot=mmap.PROT_READ) > > In [6]: map[:] > Out[6]: b'initial contents' > > In [7]: f.seek(0) > Out[7]: 0 > > In [8]: f.write(b'changed') > Out[8]: 7 > > In [9]: f.flush() > > In [10]: map[:] > Out[10]: b'changed contents' That was surprising to me, however, if I split the reader and writer into different processes, so P1: f = open("/tmp/3", "w") f.write('initial contents') f.flush() P2: import mmap f = open("/tmp/3", "r") map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ) Back to P1: f.seek(0) f.write('changed') Back to P2: map[:] Then P2 gives me: b'initial contents' -serge ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-28 21:01 ` Serge E. Hallyn @ 2025-09-01 11:05 ` Jann Horn 2025-09-01 13:18 ` Serge E. Hallyn 2025-09-01 16:01 ` Andy Lutomirski 0 siblings, 2 replies; 40+ messages in thread From: Jann Horn @ 2025-09-01 11:05 UTC (permalink / raw) To: Serge E. Hallyn Cc: Andy Lutomirski, Aleksa Sarai, Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote: > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote: > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote: > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > > > > > Nothing has changed in that regard and I'm not interested in stuffing > > > > > the VFS APIs full of special-purpose behavior to work around the fact > > > > > that this is work that needs to be done in userspace. Change the apps, > > > > > stop pushing more and more cruft into the VFS that has no business > > > > > there. > > > > > > > > It would be interesting to know how to patch user space to get the same > > > > guarantees... Do you think I would propose a kernel patch otherwise? > > > > > > You could mmap the script file with MAP_PRIVATE. This is the *actual* > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is > > > nice but IIRC there are ways to get around it anyway). > > > > Wait, really? 
MAP_PRIVATE prevents writes to the mapping from > > affecting the file, but I don't think that writes to the file will > > break the MAP_PRIVATE CoW if it's not already broken. > > > > IPython says: > > > > In [1]: import mmap, tempfile > > > > In [2]: f = tempfile.TemporaryFile() > > > > In [3]: f.write(b'initial contents') > > Out[3]: 16 > > > > In [4]: f.flush() > > > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, > > prot=mmap.PROT_READ) > > > > In [6]: map[:] > > Out[6]: b'initial contents' > > > > In [7]: f.seek(0) > > Out[7]: 0 > > > > In [8]: f.write(b'changed') > > Out[8]: 7 > > > > In [9]: f.flush() > > > > In [10]: map[:] > > Out[10]: b'changed contents' > > That was surprising to me, however, if I split the reader > and writer into different processes, so Testing this in python is a terrible idea because it obfuscates the actual syscalls from you. > P1: > f = open("/tmp/3", "w") > f.write('initial contents') > f.flush() > > P2: > import mmap > f = open("/tmp/3", "r") > map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ) > > Back to P1: > f.seek(0) > f.write('changed') > > Back to P2: > map[:] > > Then P2 gives me: > > b'initial contents' Because when you executed `f.write('changed')`, Python internally buffered the write. "changed" is never actually written into the file in your example. If you add a `f.flush()` in P1 after this, running `map[:]` in P2 again will show you the new data. ^ permalink raw reply [flat|nested] 40+ messages in thread
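The buffering effect described above can be reproduced in a single process. This is a minimal sketch, not code from the thread: the helper name buffered_write_demo is hypothetical, and it assumes CPython's default buffering for regular files, where a small write() stays in a userspace buffer until flush() or close().

```python
import os
import tempfile


def buffered_write_demo():
    """Return what an independent reader sees before and after flush()."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        f = open(path, "w")  # buffered text-mode writer, as in the P1 example
        f.write("initial contents")
        f.flush()  # the data reaches the kernel via write(2) here

        f.seek(0)
        f.write("changed")  # held in Python's userspace buffer for now

        # A separate file object stands in for the second process.
        with open(path, "rb") as reader:
            before_flush = reader.read()

        f.flush()  # only now is write(2) issued for "changed"
        with open(path, "rb") as reader:
            after_flush = reader.read()

        f.close()
        return before_flush, after_flush
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(buffered_write_demo())
```

The reader never involves mmap at all, which suggests the stale result in the two-process test says nothing about MAP_PRIVATE: the new bytes had simply not been written to the file yet.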
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-09-01 11:05 ` Jann Horn @ 2025-09-01 13:18 ` Serge E. Hallyn 2025-09-01 16:01 ` Andy Lutomirski 1 sibling, 0 replies; 40+ messages in thread From: Serge E. Hallyn @ 2025-09-01 13:18 UTC (permalink / raw) To: Jann Horn Cc: Serge E. Hallyn, Andy Lutomirski, Aleksa Sarai, Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Mon, Sep 01, 2025 at 01:05:16PM +0200, Jann Horn wrote: > On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote: > > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote: > > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote: > > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > > > > > > Nothing has changed in that regard and I'm not interested in stuffing > > > > > > the VFS APIs full of special-purpose behavior to work around the fact > > > > > > that this is work that needs to be done in userspace. Change the apps, > > > > > > stop pushing more and more cruft into the VFS that has no business > > > > > > there. > > > > > > > > > > It would be interesting to know how to patch user space to get the same > > > > > guarantees... Do you think I would propose a kernel patch otherwise? > > > > > > > > You could mmap the script file with MAP_PRIVATE. This is the *actual* > > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is > > > > nice but IIRC there are ways to get around it anyway). 
> > > > > > Wait, really? MAP_PRIVATE prevents writes to the mapping from > > > affecting the file, but I don't think that writes to the file will > > > break the MAP_PRIVATE CoW if it's not already broken. > > > > > > IPython says: > > > > > > In [1]: import mmap, tempfile > > > > > > In [2]: f = tempfile.TemporaryFile() > > > > > > In [3]: f.write(b'initial contents') > > > Out[3]: 16 > > > > > > In [4]: f.flush() > > > > > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, > > > prot=mmap.PROT_READ) > > > > > > In [6]: map[:] > > > Out[6]: b'initial contents' > > > > > > In [7]: f.seek(0) > > > Out[7]: 0 > > > > > > In [8]: f.write(b'changed') > > > Out[8]: 7 > > > > > > In [9]: f.flush() > > > > > > In [10]: map[:] > > > Out[10]: b'changed contents' > > > > That was surprising to me, however, if I split the reader > > and writer into different processes, so > > Testing this in python is a terrible idea because it obfuscates the > actual syscalls from you. Hah, I was just trying to fit in :), but of course you're right. Redoing it in straight c, I'm getting the updates. 
-serge

// mmap-w.c (creates and overwrites)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define FIRST "Initial contents"
#define SECOND "Updated contents"

int main()
{
	int fd, rc;
	char c;

	fd = open("/tmp/m", O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		printf("failed to open /tmp/m: %m\n");
		_exit(1);
	}
	rc = write(fd, FIRST, sizeof(FIRST));
	if (rc < 0) {
		printf("write failed: %m\n");
		_exit(1);
	}
	rc = fsync(fd);
	if (rc < 0) {
		printf("flush failed: %m\n");
		_exit(1);
	}

	/* wait for a keypress so the reader can map the file first */
	read(STDIN_FILENO, &c, 1);
	printf("updating the contents\n");
	rc = lseek(fd, 0, SEEK_SET);
	if (rc < 0) {
		printf("seek failed: %m\n");
		_exit(1);
	}
	rc = write(fd, SECOND, sizeof(SECOND));
	if (rc < 0) {
		printf("write failed: %m\n");
		_exit(1);
	}
	rc = close(fd);
	if (rc < 0) {
		printf("close failed: %m\n");
		_exit(1);
	}
	printf("done\n");
}

// mmap-r.c (checks and re-checks contents)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#define FIRST "Initial contents"
#define SECOND "Updated contents"

int main()
{
	int fd;
	char *m;
	char c;

	fd = open("/tmp/m", O_RDONLY);
	if (fd < 0) {
		printf("failed to open /tmp/m: %m\n");
		_exit(1);
	}
	m = mmap(NULL, 40, PROT_READ, MAP_PRIVATE, fd, 0);
	if (m == MAP_FAILED) {
		printf("mmap failed: %m\n");
		_exit(1);
	}
	if (strncmp(m, FIRST, 7) != 0) {
		printf("m is %c%c%c%c%c%c%c\n", m[0], m[1], m[2], m[3], m[4], m[5], m[6]);
		_exit(1);
	}

	/* wait for a keypress, to be given after the writer has updated the file */
	read(STDIN_FILENO, &c, 1);
	if (strncmp(m, SECOND, 7) != 0) {
		printf("m is %c%c%c%c%c%c%c\n", m[0], m[1], m[2], m[3], m[4], m[5], m[6]);
		_exit(1);
	}
	printf("done\n");
}

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-09-01 11:05 ` Jann Horn 2025-09-01 13:18 ` Serge E. Hallyn @ 2025-09-01 16:01 ` Andy Lutomirski 1 sibling, 0 replies; 40+ messages in thread From: Andy Lutomirski @ 2025-09-01 16:01 UTC (permalink / raw) To: Jann Horn Cc: Serge E. Hallyn, Andy Lutomirski, Aleksa Sarai, Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Mon, Sep 1, 2025 at 4:06 AM Jann Horn <jannh@google.com> wrote: > > On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote: > > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote: > > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote: > > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > > > > > > Nothing has changed in that regard and I'm not interested in stuffing > > > > > > the VFS APIs full of special-purpose behavior to work around the fact > > > > > > that this is work that needs to be done in userspace. Change the apps, > > > > > > stop pushing more and more cruft into the VFS that has no business > > > > > > there. > > > > > > > > > > It would be interesting to know how to patch user space to get the same > > > > > guarantees... Do you think I would propose a kernel patch otherwise? > > > > > > > > You could mmap the script file with MAP_PRIVATE. 
This is the *actual* > > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is > > > > nice but IIRC there are ways to get around it anyway). > > > > > > Wait, really? MAP_PRIVATE prevents writes to the mapping from > > > affecting the file, but I don't think that writes to the file will > > > break the MAP_PRIVATE CoW if it's not already broken. > > > > > > IPython says: > > > > > > In [1]: import mmap, tempfile > > > > > > In [2]: f = tempfile.TemporaryFile() > > > > > > In [3]: f.write(b'initial contents') > > > Out[3]: 16 > > > > > > In [4]: f.flush() > > > > > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, > > > prot=mmap.PROT_READ) > > > > > > In [6]: map[:] > > > Out[6]: b'initial contents' > > > > > > In [7]: f.seek(0) > > > Out[7]: 0 > > > > > > In [8]: f.write(b'changed') > > > Out[8]: 7 > > > > > > In [9]: f.flush() > > > > > > In [10]: map[:] > > > Out[10]: b'changed contents' > > > > That was surprising to me, however, if I split the reader > > and writer into different processes, so > > Testing this in python is a terrible idea because it obfuscates the > actual syscalls from you. > > > P1: > > f = open("/tmp/3", "w") > > f.write('initial contents') > > f.flush() > > > > P2: > > import mmap > > f = open("/tmp/3", "r") > > map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ) > > > > Back to P1: > > f.seek(0) > > f.write('changed') > > > > Back to P2: > > map[:] > > > > Then P2 gives me: > > > > b'initial contents' > > Because when you executed `f.write('changed')`, Python internally > buffered the write. "changed" is never actually written into the file > in your example. If you add a `f.flush()` in P1 after this, running > `map[:]` in P2 again will show you the new data. > These days, one can type in Python, ask an LLM to translate to C, and get almost-correct output :) Or one can use os.write(), which is exactly what I should have done. 
--Andy ^ permalink raw reply [flat|nested] 40+ messages in thread
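The experiment discussed above can be repeated with unbuffered system calls, as suggested. The following is a minimal sketch, assuming Linux and a temporary directory on an ordinary filesystem; the helper names `private_mapping_view` and `buffered_write_sizes` are made up for this illustration. The first helper maps a file with MAP_PRIVATE and then overwrites it in place with os.pwrite(); the second shows why a buffered f.write() without f.flush() never reaches the file.

```python
import mmap
import os
import tempfile


def private_mapping_view():
    """Return (before, after): the MAP_PRIVATE view around an in-place write."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"initial contents")  # unbuffered write(2)
        size = os.fstat(fd).st_size
        with mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ) as mapping:
            before = mapping[:]
            os.pwrite(fd, b"changed", 0)   # overwrite in place, no Python buffering
            after = mapping[:]
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)


def buffered_write_sizes():
    """Return the on-disk size before and after flushing a buffered write."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write("changed")             # stays in Python's userspace buffer
            size_before_flush = os.stat(path).st_size
            f.flush()                      # only now is write(2) issued
            size_after_flush = os.stat(path).st_size
        return size_before_flush, size_after_flush
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(private_mapping_view())
    print(buffered_write_sizes())
```

On the setups described in this thread, the second element of the first result shows the modified bytes, which is the point being made: MAP_PRIVATE does not snapshot pages that have not been copied yet. POSIX leaves this case unspecified, so the sketch reports what it observes rather than assuming it.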
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-08-28 0:14 ` Aleksa Sarai 2025-08-28 0:32 ` Andy Lutomirski @ 2025-09-01 9:24 ` Roberto Sassu 2025-09-01 16:25 ` Andy Lutomirski 1 sibling, 1 reply; 40+ messages in thread From: Roberto Sassu @ 2025-09-01 9:24 UTC (permalink / raw) To: Aleksa Sarai, Mickaël Salaün Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Thu, 2025-08-28 at 10:14 +1000, Aleksa Sarai wrote: > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote: > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote: > > > Nothing has changed in that regard and I'm not interested in stuffing > > > the VFS APIs full of special-purpose behavior to work around the fact > > > that this is work that needs to be done in userspace. Change the apps, > > > stop pushing more and more cruft into the VFS that has no business > > > there. > > > > It would be interesting to know how to patch user space to get the same > > guarantees... Do you think I would propose a kernel patch otherwise? > > You could mmap the script file with MAP_PRIVATE. This is the *actual* > protection the kernel uses against overwriting binaries (yes, ETXTBSY is > nice but IIRC there are ways to get around it anyway). Of course, most > interpreters don't mmap their scripts, but this is a potential solution. > If the security policy is based on validating the script text in some > way, this avoids the TOCTOU. 
> > Now, in cases where you have IMA or something and you only permit signed > binaries to execute, you could argue there is a different race here (an > attacker creates a malicious script, runs it, and then replaces it with > a valid script's contents and metadata after the fact to get > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that Uhm, let's consider measurement, which I'm more familiar with. I think the race you wanted to express was that the attacker replaces the good script, verified with AT_EXECVE_CHECK, with the bad script after the IMA verification but before the interpreter reads it. Fortunately, IMA is able to cope with this situation, since this race can happen for any file open, where of course a file might not be read-locked. If the attacker tries to concurrently open the script for write in this race window, IMA will report this event (called a violation) in the measurement list, and during remote attestation it will be clear that the interpreter did not read what was measured. We just need to run the violation check for the BPRM_CHECK hook too (then, for measurement, the O_DENY_WRITE flag or an alternative solution would probably not be needed for us). Please let us know when you apply patches like 2a010c412853 ("fs: don't block i_writecount during exec"). We had a discussion [1], but I probably missed when it was decided to apply it (I see now that it was in the same thread, but I didn't get that at the time). We would have needed to update our code accordingly. In the future, we will try to better clarify our expectations of the VFS. Thanks Roberto [1]: https://lore.kernel.org/linux-fsdevel/88d5a92379755413e1ec3c981d9a04e6796da110.camel@huaweicloud.com/#t > this is even possible with IMA (can an unprivileged user even set > security.ima?). But even then, I would expect users that really need > this would also probably use fs-verity or dm-verity that would block > this kind of attack since it would render the files read-only anyway.
> > This is why a more detailed threat model of what kinds of attacks are > relevant is useful. I was there for the talk you gave and subsequent > discussion at last year's LPC, but I felt that your threat model was > not really fleshed out at all. I am still not sure what capabilities you > expect the attacker to have nor what is being used to authenticate > binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above > assumptions, but I can't know without knowing what threat model you have > in mind, *in detail*. > > For example, if you are dealing with an attacker that has CAP_SYS_ADMIN, > there are plenty of ways for an attacker to execute their own code > without using interpreters (create a new tmpfs with fsopen(2) for > instance). Executable memfds are even easier and don't require > privileges on most systems (yes, you can block them with vm.memfd_noexec > but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or > mount(2)). > > (As an aside, it's a shame that AT_EXECVE_CHECK burned one of the > top-level AT_* bits for a per-syscall flag -- the block comment I added > in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be > allocated") was meant to avoid this happening but it seems you and the > reviewers missed that...) > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-09-01 9:24 ` Roberto Sassu @ 2025-09-01 16:25 ` Andy Lutomirski 2025-09-01 17:01 ` Roberto Sassu 0 siblings, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2025-09-01 16:25 UTC (permalink / raw) To: Roberto Sassu Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module Can you clarify this a bit for those of us who are not well-versed in exactly what "measurement" does? On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu <roberto.sassu@huaweicloud.com> wrote: > > Now, in cases where you have IMA or something and you only permit signed > > binaries to execute, you could argue there is a different race here (an > > attacker creates a malicious script, runs it, and then replaces it with > > a valid script's contents and metadata after the fact to get > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that > > Uhm, let's consider measurement, I'm more familiar with. > > I think the race you wanted to express was that the attacker replaces > the good script, verified with AT_EXECVE_CHECK, with the bad script > after the IMA verification but before the interpreter reads it. > > Fortunately, IMA is able to cope with this situation, since this race > can happen for any file open, where of course a file can be not read- > locked. I assume you mean that this has nothing specifically to do with scripts, as IMA tries to protect ordinary (non-"execute" file access) as well. Am I right? 
> > If the attacker tries to concurrently open the script for write in this > race window, IMA will report this event (called violation) in the > measurement list, and during remote attestation it will be clear that > the interpreter did not read what was measured. > > We just need to run the violation check for the BPRM_CHECK hook too > (then, probably for us the O_DENY_WRITE flag or alternative solution > would not be needed, for measurement). This seems consistent with my interpretation above, but ... > > Please, let us know when you apply patches like 2a010c412853 ("fs: > don't block i_writecount during exec"). We had a discussion [1], but > probably I missed when it was decided to be applied (I saw now it was > in the same thread, but didn't get that at the time). We would have > needed to update our code accordingly. In the future, we will try to > clarify better our expectations from the VFS. ... I didn't follow this. Suppose there's some valid contents of /bin/sleep. I execute /bin/sleep 1m. While it's running, I modify /bin/sleep (by opening it for write, not by replacing it), and the kernel in question doesn't do ETXTBSY. Then the sleep process reads (and executes) the modified contents. Wouldn't a subsequent attestation fail? Why is ETXTBSY needed? ^ permalink raw reply [flat|nested] 40+ messages in thread
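Whether a given kernel "does ETXTBSY" can be probed from user space. The following is a rough sketch, assuming a `sleep` binary is on PATH and that the temporary directory allows execution; the function name `write_open_while_running` and the file name `victim` are invented for this illustration. It runs a private copy of the binary and then tries to open that copy for writing.

```python
import errno
import os
import shutil
import subprocess
import tempfile


def write_open_while_running(binary="sleep"):
    """Open a running executable for writing; return the errno name or 'allowed'."""
    src = shutil.which(binary)
    if src is None:
        return "skipped"                   # no such binary on this system
    workdir = tempfile.mkdtemp()
    copy = os.path.join(workdir, "victim")
    shutil.copy(src, copy)                 # private copy; system files stay untouched
    try:
        proc = subprocess.Popen([copy, "5"])
    except OSError:
        shutil.rmtree(workdir)
        return "skipped"                   # e.g. temporary directory mounted noexec
    try:
        try:
            fd = os.open(copy, os.O_WRONLY)
        except OSError as exc:
            # ETXTBSY is what execve(2)'s deny-write behaviour produces.
            return errno.errorcode.get(exc.errno, str(exc.errno))
        os.close(fd)
        return "allowed"                   # kernel (or setup) did not deny the write
    finally:
        proc.kill()
        proc.wait()
        shutil.rmtree(workdir)


if __name__ == "__main__":
    print(write_open_while_running())
```

A result of "ETXTBSY" corresponds to the behaviour that commit 2a010c412853 tried to remove; "allowed" corresponds to the hypothetical kernel in the /bin/sleep example, or to a copy that exited before the open was attempted.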
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-09-01 16:25 ` Andy Lutomirski @ 2025-09-01 17:01 ` Roberto Sassu 2025-09-02 8:57 ` Roberto Sassu 0 siblings, 1 reply; 40+ messages in thread From: Roberto Sassu @ 2025-09-01 17:01 UTC (permalink / raw) To: Andy Lutomirski Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Mon, 2025-09-01 at 09:25 -0700, Andy Lutomirski wrote: > Can you clarify this a bit for those of us who are not well-versed in > exactly what "measurement" does? > > On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu > <roberto.sassu@huaweicloud.com> wrote: > > > Now, in cases where you have IMA or something and you only permit signed > > > binaries to execute, you could argue there is a different race here (an > > > attacker creates a malicious script, runs it, and then replaces it with > > > a valid script's contents and metadata after the fact to get > > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that > > > > Uhm, let's consider measurement, I'm more familiar with. > > > > I think the race you wanted to express was that the attacker replaces > > the good script, verified with AT_EXECVE_CHECK, with the bad script > > after the IMA verification but before the interpreter reads it. > > > > Fortunately, IMA is able to cope with this situation, since this race > > can happen for any file open, where of course a file can be not read- > > locked. 
> > I assume you mean that this has nothing specifically to do with > scripts, as IMA tries to protect ordinary (non-"execute" file access) > as well. Am I right? Yes, correct: violations are checked for all open() and mmap() calls involving regular files. It would not be special to do it for scripts. > > If the attacker tries to concurrently open the script for write in this > > race window, IMA will report this event (called violation) in the > > measurement list, and during remote attestation it will be clear that > > the interpreter did not read what was measured. > > > > We just need to run the violation check for the BPRM_CHECK hook too > > (then, probably for us the O_DENY_WRITE flag or alternative solution > > would not be needed, for measurement). > > This seems consistent with my interpretation above, but ... The comment here [1] seems to be clear on why the violation check is not done for execution (BPRM_CHECK hook). Since the OS read-locks the files during execution, this implicitly guarantees that there will not be concurrent writes, and thus no IMA violations. However, recently, we took advantage of AT_EXECVE_CHECK to also evaluate the integrity of scripts (when not executed via ./). Since we are using the same hook for both executed files (read-locked) and scripts (I guess non-read-locked), we need to do a violation check for BPRM_CHECK too, although it will be redundant for the first category. > > Please, let us know when you apply patches like 2a010c412853 ("fs: > > don't block i_writecount during exec"). We had a discussion [1], but > > probably I missed when it was decided to be applied (I saw now it was > > in the same thread, but didn't get that at the time). We would have > > needed to update our code accordingly. In the future, we will try to > > clarify better our expectations from the VFS. > > ... I didn't follow this. > > Suppose there's some valid contents of /bin/sleep. I execute > /bin/sleep 1m.
While it's running, I modify /bin/sleep (by opening it > for write, not by replacing it), and the kernel in question doesn't do > ETXTBSY. Then the sleep process reads (and executes) the modified > contents. Wouldn't a subsequent attestation fail? Why is ETXTBSY > needed? Ok, this is actually a good opportunity to explain what will be missing. If you do the operations in the order you proposed, a violation will actually be emitted, because the violating operation is an open() and the check is done for this system call. However, if you do the opposite, first open for write and then execute, IMA will not be aware of that, since it trusts the OS not to let it happen and will not check for violations. So yes, in your case the remote attestation will fail (actually it is up to the remote verifier to decide...). But in the opposite case, the writer could wait for IMA to measure the genuine content and then modify the content at its convenience. The remote attestation will succeed. Adding the violation check on BPRM_CHECK should be sufficient to avoid such a situation, but I will try to think about whether there are other implications for IMA of not read-locking the files on execution. Roberto [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/integrity/ima/ima_main.c?h=v6.17-rc4#n565 ^ permalink raw reply [flat|nested] 40+ messages in thread
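The ordering argument above can be restated as a toy model. This is only a sketch of the reasoning in this message, not IMA code: the class `ToyMeasuredFile`, its fields, and the `check_on_exec` switch are all invented for illustration, and the model assumes a kernel that does not enforce deny-write on execution.

```python
class ToyMeasuredFile:
    """Toy model of the violation-ordering argument; this is NOT IMA code."""

    def __init__(self, check_on_exec=False):
        self.writers = 0          # handles currently open for write
        self.executing = False    # file was measured for execution and is in use
        self.violations = []      # what a remote verifier would get to see
        self.check_on_exec = check_on_exec

    def open_for_write(self):
        # Per the message above, the violation check runs on open().
        if self.executing:
            self.violations.append("write-open while measured file is in use")
        self.writers += 1

    def execute(self):
        # Without a check here, an already-present writer goes unnoticed.
        if self.check_on_exec and self.writers:
            self.violations.append("execution while file is open for write")
        self.executing = True


def scenario(order, check_on_exec=False):
    """Apply the named steps in order and return the recorded violations."""
    f = ToyMeasuredFile(check_on_exec)
    for step in order:
        getattr(f, step)()
    return f.violations


if __name__ == "__main__":
    print(scenario(["execute", "open_for_write"]))         # one violation
    print(scenario(["open_for_write", "execute"]))         # none: the gap
    print(scenario(["open_for_write", "execute"], True))   # gap closed
```

The third call models the proposed violation check on BPRM_CHECK: with it, the writer-first ordering is reported as well.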
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) 2025-09-01 17:01 ` Roberto Sassu @ 2025-09-02 8:57 ` Roberto Sassu 0 siblings, 0 replies; 40+ messages in thread From: Roberto Sassu @ 2025-09-02 8:57 UTC (permalink / raw) To: Andy Lutomirski Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module On Mon, 2025-09-01 at 19:01 +0200, Roberto Sassu wrote: > On Mon, 2025-09-01 at 09:25 -0700, Andy Lutomirski wrote: > > Can you clarify this a bit for those of us who are not well-versed in > > exactly what "measurement" does? Ah, sorry, I missed that. Measurement refers to the process of collecting the file digest and storing it in the measurement list, as opposed to appraisal which instead compares the collected file digest with a reference value (assumed to be good), and denies access in case of a mismatch. Integrity violations are detected and reported only for measurement. Roberto > > On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu > > <roberto.sassu@huaweicloud.com> wrote: > > > > Now, in cases where you have IMA or something and you only permit signed > > > > binaries to execute, you could argue there is a different race here (an > > > > attacker creates a malicious script, runs it, and then replaces it with > > > > a valid script's contents and metadata after the fact to get > > > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that > > > > > > Uhm, let's consider measurement, I'm more familiar with. 
> > > > > > I think the race you wanted to express was that the attacker replaces > > > the good script, verified with AT_EXECVE_CHECK, with the bad script > > > after the IMA verification but before the interpreter reads it. > > > > > > Fortunately, IMA is able to cope with this situation, since this race > > > can happen for any file open, where of course a file can be not read- > > > locked. > > > > I assume you mean that this has nothing specifically to do with > > scripts, as IMA tries to protect ordinary (non-"execute" file access) > > as well. Am I right? > > Yes, correct, violations are checked for all open() and mmap() > involving regular files. It would not be special to do it for scripts. > > > > If the attacker tries to concurrently open the script for write in this > > > race window, IMA will report this event (called violation) in the > > > measurement list, and during remote attestation it will be clear that > > > the interpreter did not read what was measured. > > > > > > We just need to run the violation check for the BPRM_CHECK hook too > > > (then, probably for us the O_DENY_WRITE flag or alternative solution > > > would not be needed, for measurement). > > > > This seems consistent with my interpretation above, but ... > > The comment here [1] seems to be clear on why the violation check it is > not done for execution (BPRM_CHECK hook). Since the OS read-locks the > files during execution, this implicitly guarantees that there will not > be concurrent writes, and thus no IMA violations. > > However, recently, we took advantage of AT_EXECVE_CHECK to also > evaluate the integrity of scripts (when not executed via ./). Since we > are using the same hook for both executed files (read-locked) and > scripts (I guess non-read-locked), then we need to do a violation check > for BPRM_CHECK too, although it will be redundant for the first > category. 
> > > > Please, let us know when you apply patches like 2a010c412853 ("fs: > > > don't block i_writecount during exec"). We had a discussion [1], but > > > probably I missed when it was decided to be applied (I saw now it was > > > in the same thread, but didn't get that at the time). We would have > > > needed to update our code accordingly. In the future, we will try to > > > clarify better our expectations from the VFS. > > > > ... I didn't follow this. > > > > Suppose there's some valid contents of /bin/sleep. I execute > > /bin/sleep 1m. While it's running, I modify /bin/sleep (by opening it > > for write, not by replacing it), and the kernel in question doesn't do > > ETXTBSY. Then the sleep process reads (and executes) the modified > > contents. Wouldn't a subsequent attestation fail? Why is ETXTBSY > > needed? > > Ok, this is actually a good opportunity to explain what it will be > missing. If you do the operations in the order you proposed, actually a > violation will be emitted, because the violating operation is an open() > and the check is done for this system call. > > However, if you do the opposite, first open for write and then > execution, IMA will not be aware of that since it trusts the OS to not > make it happen and will not check for violations. > > So yes, in your case the remote attestation will fail (actually it is > up to the remote verifier to decide...). But in the opposite case, the > writer could wait for IMA to measure the genuine content and then > modify the content conveniently. The remote attestation will succeed. > > Adding the violation check on BPRM_CHECK should be sufficient to avoid > such situation, but I would try to think if there are other > implications for IMA of not read-locking the files on execution. > > Roberto > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/integrity/ima/ima_main.c?h=v6.17-rc4#n565 > ^ permalink raw reply [flat|nested] 40+ messages in thread
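The distinction between measurement and appraisal described at the top of this message can be sketched in a few lines. This is a user-space illustration only, assuming SHA-256 as the digest; the names `measurement_list`, `measure`, and `appraise` are made up here and do not correspond to kernel functions.

```python
import hashlib

measurement_list = []  # stand-in for the measurement list (illustrative only)


def digest(data: bytes) -> str:
    """Hex digest of the content; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(data).hexdigest()


def measure(name: str, data: bytes) -> str:
    """Measurement: record the digest for later attestation, never deny access."""
    value = digest(data)
    measurement_list.append((name, value))
    return value


def appraise(name: str, data: bytes, reference_digest: str) -> None:
    """Appraisal: compare with a known-good digest and deny access on mismatch."""
    if digest(data) != reference_digest:
        raise PermissionError(f"{name}: digest does not match reference value")


if __name__ == "__main__":
    good = b"echo ok\n"
    reference = measure("script.sh", good)
    appraise("script.sh", good, reference)           # passes silently
    try:
        appraise("script.sh", b"echo evil\n", reference)
    except PermissionError as exc:
        print("denied:", exc)
```

Measurement leaves the decision to a remote verifier that inspects the list, which is why violations only need to be reported there, whereas appraisal decides locally at access time.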
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE @ 2025-08-25 21:56 Andy Lutomirski 2025-08-25 23:06 ` Jeff Xu 0 siblings, 1 reply; 40+ messages in thread From: Andy Lutomirski @ 2025-08-25 21:56 UTC (permalink / raw) To: Jeff Xu Cc: Mickaël Salaün, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu > On Aug 25, 2025, at 11:10 AM, Jeff Xu <jeffxu@google.com> wrote: > > On Mon, Aug 25, 2025 at 9:43 AM Andy Lutomirski <luto@amacapital.net> wrote: >>> On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: >>> On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: >>>> On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: >>>>> On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: >>>>>> On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: >>>>>>> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. >>>>>>> passed file descriptors). This changes the state of the opened file by >>>>>>> making it read-only until it is closed. The main use case is for script >>>>>>> interpreters to get the guarantee that script' content cannot be altered >>>>>>> while being read and interpreted. This is useful for generic distros >>>>>>> that may not have a write-xor-execute policy. See commit a5874fde3c08 >>>>>>> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") >>>>>>> Both execve(2) and the IOCTL to enable fsverity can already set this >>>>>>> property on files with deny_write_access(). 
This new O_DENY_WRITE make >>>>>> The kernel actually tried to get rid of this behavior on execve() in >>>>>> commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had >>>>>> to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d >>>>>> because it broke userspace assumptions. >>>>> Oh, good to know. >>>>>>> it widely available. This is similar to what other OSs may provide >>>>>>> e.g., opening a file with only FILE_SHARE_READ on Windows. >>>>>> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was >>>>>> removed for security reasons; as >>>>>> https://man7.org/linux/man-pages/man2/mmap.2.html says: >>>>>> | MAP_DENYWRITE >>>>>> | This flag is ignored. (Long ago—Linux 2.0 and earlier—it >>>>>> | signaled that attempts to write to the underlying file >>>>>> | should fail with ETXTBSY. But this was a source of denial- >>>>>> | of-service attacks.)" >>>>>> It seems to me that the same issue applies to your patch - it would >>>>>> allow unprivileged processes to essentially lock files such that other >>>>>> processes can't write to them anymore. This might allow unprivileged >>>>>> users to prevent root from updating config files or stuff like that if >>>>>> they're updated in-place. >>>>> Yes, I agree, but since it is the case for executed files I though it >>>>> was worth starting a discussion on this topic. This new flag could be >>>>> restricted to executable files, but we should avoid system-wide locks >>>>> like this. I'm not sure how Windows handle these issues though. >>>>> Anyway, we should rely on the access control policy to control write and >>>>> execute access in a consistent way (e.g. write-xor-execute). Thanks for >>>>> the references and the background! >>>> I'm confused. I understand that there are many contexts in which one >>>> would want to prevent execution of unapproved content, which might >>>> include preventing a given process from modifying some code and then >>>> executing it. 
>>>> I don't understand what these deny-write features have to do with it. >>>> These features merely prevent someone from modifying code *that is >>>> currently in use*, which is not at all the same thing as preventing >>>> modifying code that might get executed -- one can often modify >>>> contents *before* executing those contents. >>> The order of checks would be: >>> 1. open script with O_DENY_WRITE >>> 2. check executability with AT_EXECVE_CHECK >>> 3. read the content and interpret it >> Hmm. Common LSM configurations should be able to handle this without >> deny write, I think. If you don't want a program to be able to make >> their own scripts, then don't allow AT_EXECVE_CHECK to succeed on a >> script that the program can write. > Yes, Common LSM could handle this, however, due to historic and app > backward compatibility reason, sometimes it is impossible to enforce > that kind of policy in practice, therefore as an alternative, a > mechanism such as AT_EXECVE_CHECK is really useful. Can you clarify? I suspect that we’re talking past each other. AT_EXECVE_CHECK solves the problem that there are actions that effectively “execute” a file without executing literal CPU instructions from it. Sometimes open+read has the effect of interpreting the contents of the file as something code-like. But, as I see it, deny-write is almost entirely orthogonal. If you open a file with the intent of executing it (mmap-execute or interpret; it makes little practical difference here), then the kernel can enforce some policy. If the file is writable by a process that ought not have permission to execute code in the context of the opening-for-execute process, then LSMs need deny-write to be enforced so that they can verify the contents at the time of opening. But let’s step back a moment: is there any actual sensible security policy that does this?
If I want to *enforce* that a process only executes approved code, then wouldn’t I do it by only allowing execution of files that the process can’t write? The problem with the removal of deny-write wasn’t security but functionality: a linker accidentally modified an in-use binary. If you have permission to use gcc, lld, etc. to create binaries, and you have permission to run them, then you pretty much have permission to run whatever code you like. So, if there’s a real security use case for deny-write, I’m still not seeing it. >> Keep in mind that trying to lock this down too hard is pointless for >> users who are allowed to ptrace-write to their own processes. Or >> for users who can do JIT, or for users who can run a REPL, etc. > The ptrace-write and /proc/pid/mem writing are on my radar, at least > for ChromeOS and Android. > AT_EXECVE_CHECK is orthogonal to those IMO, I hope eventually all > those paths will be hardened. > > Thanks and regards, > -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE 2025-08-25 21:56 [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE Andy Lutomirski @ 2025-08-25 23:06 ` Jeff Xu 0 siblings, 0 replies; 40+ messages in thread From: Jeff Xu @ 2025-08-25 23:06 UTC (permalink / raw) To: Andy Lutomirski Cc: Mickaël Salaün, Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb, kernel-hardening, linux-api, linux-fsdevel, linux-integrity, linux-kernel, linux-security-module, Jeff Xu On Mon, Aug 25, 2025 at 2:56 PM Andy Lutomirski <luto@amacapital.net> wrote: > > > > On Aug 25, 2025, at 11:10 AM, Jeff Xu <jeffxu@google.com> wrote: > > > > On Mon, Aug 25, 2025 at 9:43 AM Andy Lutomirski <luto@amacapital.net> wrote: > >>> On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote: > >>> On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote: > >>>> On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote: > >>>>> On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote: > >>>>>> On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote: > >>>>>>> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g. > >>>>>>> passed file descriptors). This changes the state of the opened file by > >>>>>>> making it read-only until it is closed. The main use case is for script > >>>>>>> interpreters to get the guarantee that script' content cannot be altered > >>>>>>> while being read and interpreted. This is useful for generic distros > >>>>>>> that may not have a write-xor-execute policy. 
See commit a5874fde3c08 > >>>>>>> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)") > >>>>>>> Both execve(2) and the IOCTL to enable fsverity can already set this > >>>>>>> property on files with deny_write_access(). This new O_DENY_WRITE make > >>>>>> The kernel actually tried to get rid of this behavior on execve() in > >>>>>> commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had > >>>>>> to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d > >>>>>> because it broke userspace assumptions. > >>>>> Oh, good to know. > >>>>>>> it widely available. This is similar to what other OSs may provide > >>>>>>> e.g., opening a file with only FILE_SHARE_READ on Windows. > >>>>>> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was > >>>>>> removed for security reasons; as > >>>>>> https://man7.org/linux/man-pages/man2/mmap.2.html says: > >>>>>> | MAP_DENYWRITE > >>>>>> | This flag is ignored. (Long ago—Linux 2.0 and earlier—it > >>>>>> | signaled that attempts to write to the underlying file > >>>>>> | should fail with ETXTBSY. But this was a source of denial- > >>>>>> | of-service attacks.)" > >>>>>> It seems to me that the same issue applies to your patch - it would > >>>>>> allow unprivileged processes to essentially lock files such that other > >>>>>> processes can't write to them anymore. This might allow unprivileged > >>>>>> users to prevent root from updating config files or stuff like that if > >>>>>> they're updated in-place. > >>>>> Yes, I agree, but since it is the case for executed files I though it > >>>>> was worth starting a discussion on this topic. This new flag could be > >>>>> restricted to executable files, but we should avoid system-wide locks > >>>>> like this. I'm not sure how Windows handle these issues though. > >>>>> Anyway, we should rely on the access control policy to control write and > >>>>> execute access in a consistent way (e.g. write-xor-execute). 
Thanks for > >>>>> the references and the background! > >>>> I'm confused. I understand that there are many contexts in which one > >>>> would want to prevent execution of unapproved content, which might > >>>> include preventing a given process from modifying some code and then > >>>> executing it. > >>>> I don't understand what these deny-write features have to do with it. > >>>> These features merely prevent someone from modifying code *that is > >>>> currently in use*, which is not at all the same thing as preventing > >>>> modifying code that might get executed -- one can often modify > >>>> contents *before* executing those contents. > >>> The order of checks would be: > >>> 1. open script with O_DENY_WRITE > >>> 2. check executability with AT_EXECVE_CHECK > >>> 3. read the content and interpret it > >> Hmm. Common LSM configurations should be able to handle this without > >> deny write, I think. If you don't want a program to be able to make > >> their own scripts, then don't allow AT_EXECVE_CHECK to succeed on a > >> script that the program can write. > > Yes, Common LSM could handle this, however, due to historic and app > > backward compatibility reason, sometimes it is impossible to enforce > > that kind of policy in practice, therefore as an alternative, a > > mechanism such as AT_EXECVE_CHECK is really useful. > > Can you clarify? I suspect that we’re talking past each other. > Apologies, my response wasn't clear. > AT_EXECVE_CHECK solves the problem that there are actions that effectively “execute” a file without executing literal CPU instructions from it. Sometimes open+read has the effect of interpreting the contents of the file as something code-like. > Yes. We have the same understanding of this. For example, shell scripts or Java bytecode files can have rw permissions with no x bit set; the interpreter reads and executes them. > But, as I see it, deny-write is almost entirely orthogonal.
If you open a file with the intent of executing it (mmap-execute or interpret — makes little practical difference here), then the kernel can enforce some policy. If the file is writable by a process that ought not have permission to execute code in the context of the opening-for-execute process, then LSMs need deny-write to be enforced so that they can verify the contents at the time of opening. > > But let’s step back a moment: is there any actual sensible security policy that does this? If I want to *enforce* that a process only execute approved code, then wouldn’t I do it be only allowing executing files that the process can’t write? > I imagine the following situation: an app has both "rw" access to the file that holds the script code, the "w" is needed because the app updates the script sometimes. What is a reasonable sandbox solution for such an app? There are maybe two options: 1> split the app as two processes: processA has "w" access to the script for updating when needed. Process B has "r" access but no "w", for executing. ProcessA and ProcessB will coordinate to avoid racing on the script update. 2> The process will use AT_EXECVE_CHECK (added by interpreter) to validate the file before opening , and the file content held by the process should be immutable while being validated and executed later by interpreter. option 1 is the ideal, and IIUC, you promote this too. However, that requires refactoring the app as two processes. option 2 is an alternative. Because it doesn't require the change from the apps, therefore a solution worth considering. > The reason that the removal of deny-write wasn’t security — it was a functionality issue: a linker accidentally modified an in-use binary. If you have permission to use gcc or lld, etc to create binaries, and you have permission to run them, then you pretty much have permission to run whatever code you like. > > So, if there’s a real security use case for deny-write, I’m still not seeing it. 
> Although the current patch might not be ideal due to the potential DOS attack, it does offer a starting point to address the needs. Let's continue the discussion based on this patch and explore different ideas. Thanks and regards, -Jeff > >> Keep in mind that trying to lock this down too hard is pointless for > >> users who are allowed to to ptrace-write to their own processes. Or > >> for users who can do JIT, or for users who can run a REPL, etc. > > The ptrace-write and /proc/pid/mem writing are on my radar, at least > > for ChomeOS and Android. > > AT_EXECVE_CHECK is orthogonal to those IMO, I hope eventually all > > those paths will be hardened. > > > > Thanks and regards, > > -Jeff ^ permalink raw reply [flat|nested] 40+ messages in thread
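For readers following the thread, the existing execve(2) deny-write behavior that the messages above refer to can be observed from userspace. The following is a minimal Python sketch and not part of the patch series: the helper name errno_on_write_while_running and the choice of a private copy of sleep(1) are illustrative assumptions. It does not use O_DENY_WRITE, which is only proposed in this RFC. On kernels that keep the execve() deny-write behavior, the open for writing is expected to fail with ETXTBSY while the copy is running; on other systems it may simply succeed.

```python
import errno
import os
import shutil
import subprocess
import tempfile


def errno_on_write_while_running(binary="sleep"):
    """Run a private copy of `binary`, then try to open that copy for writing.

    Returns the errno of the failed open(2), or None if the open succeeded.
    The helper name and the use of sleep(1) are assumptions for illustration.
    """
    source = shutil.which(binary)
    if source is None:
        raise FileNotFoundError(binary)
    with tempfile.TemporaryDirectory() as tmp:
        copy = os.path.join(tmp, "running-copy")
        shutil.copy(source, copy)  # private copy that we own and may write to
        os.chmod(copy, 0o755)
        # By the time Popen returns, the child has normally already called
        # execve(2), so the copy is "in use" from here on.
        proc = subprocess.Popen([copy, "30"])
        try:
            try:
                fd = os.open(copy, os.O_WRONLY)
            except OSError as err:
                return err.errno
            os.close(fd)
            return None
        finally:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    result = errno_on_write_while_running()
    if result == errno.ETXTBSY:
        print("open for write refused with ETXTBSY while the file is executing")
    else:
        print("open for write was not refused with ETXTBSY; result:", result)
```

This only shows the point made in the thread that the protection covers a file while it is in use: the same file can be opened for writing before the process starts or after it exits.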
end of thread, other threads:[~2025-09-02 9:16 UTC | newest]

Thread overview: 40+ messages
2025-08-22 17:07 [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) Mickaël Salaün
2025-08-22 17:07 ` [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE Mickaël Salaün
2025-08-22 19:45 ` Jann Horn
2025-08-24 11:03 ` Mickaël Salaün
2025-08-24 18:04 ` Andy Lutomirski
2025-08-25 9:31 ` Mickaël Salaün
2025-08-25 9:39 ` Florian Weimer
2025-08-26 12:35 ` Mickaël Salaün
2025-08-25 16:43 ` Andy Lutomirski
2025-08-25 18:10 ` Jeff Xu
2025-08-25 17:57 ` Jeff Xu
2025-08-26 12:39 ` Mickaël Salaün
2025-08-26 20:29 ` Jeff Xu
2025-08-27 8:19 ` Mickaël Salaün
2025-08-28 20:17 ` Jeff Xu
2025-08-27 10:18 ` Aleksa Sarai
2025-08-27 10:29 ` Aleksa Sarai
2025-08-22 17:08 ` [RFC PATCH v1 2/2] selftests/exec: Add O_DENY_WRITE tests Mickaël Salaün
2025-08-26 9:07 ` [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK) Christian Brauner
2025-08-26 11:23 ` Mickaël Salaün
2025-08-26 12:30 ` Theodore Ts'o
2025-08-26 17:47 ` Mickaël Salaün
2025-08-26 20:50 ` Theodore Ts'o
2025-08-27 8:19 ` Mickaël Salaün
2025-08-27 17:35 ` Andy Lutomirski
2025-08-27 19:07 ` Mickaël Salaün
2025-08-27 20:35 ` Andy Lutomirski
2025-08-28 0:14 ` Aleksa Sarai
2025-08-28 0:32 ` Andy Lutomirski
2025-08-28 0:52 ` Aleksa Sarai
2025-08-28 21:01 ` Serge E. Hallyn
2025-09-01 11:05 ` Jann Horn
2025-09-01 13:18 ` Serge E. Hallyn
2025-09-01 16:01 ` Andy Lutomirski
2025-09-01 9:24 ` Roberto Sassu
2025-09-01 16:25 ` Andy Lutomirski
2025-09-01 17:01 ` Roberto Sassu
2025-09-02 8:57 ` Roberto Sassu
-- strict thread matches above, loose matches on Subject: below --
2025-08-25 21:56 [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE Andy Lutomirski
2025-08-25 23:06 ` Jeff Xu