Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Andy Lutomirski @ 2025-08-25 16:43 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore,
	Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Jeff Xu
In-Reply-To: <20250825.mahNeel0dohz@digikod.net>

On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > >
> > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > making it read-only until it is closed.  The main use case is for script
> > > > > interpreters to get the guarantee that script' content cannot be altered
> > > > > while being read and interpreted.  This is useful for generic distros
> > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > >
> > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > property on files with deny_write_access().  This new O_DENY_WRITE make
> > > >
> > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > because it broke userspace assumptions.
> > >
> > > Oh, good to know.
> > >
> > > >
> > > > > it widely available.  This is similar to what other OSs may provide
> > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > >
> > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > removed for security reasons; as
> > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > >
> > > > |        MAP_DENYWRITE
> > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > |               signaled that attempts to write to the underlying file
> > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > |               of-service attacks.)"
> > > >
> > > > It seems to me that the same issue applies to your patch - it would
> > > > allow unprivileged processes to essentially lock files such that other
> > > > processes can't write to them anymore. This might allow unprivileged
> > > > users to prevent root from updating config files or stuff like that if
> > > > they're updated in-place.
> > >
> > > Yes, I agree, but since it is the case for executed files I though it
> > > was worth starting a discussion on this topic.  This new flag could be
> > > restricted to executable files, but we should avoid system-wide locks
> > > like this.  I'm not sure how Windows handle these issues though.
> > >
> > > Anyway, we should rely on the access control policy to control write and
> > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > the references and the background!
> >
> > I'm confused.  I understand that there are many contexts in which one
> > would want to prevent execution of unapproved content, which might
> > include preventing a given process from modifying some code and then
> > executing it.
> >
> > I don't understand what these deny-write features have to do with it.
> > These features merely prevent someone from modifying code *that is
> > currently in use*, which is not at all the same thing as preventing
> > modifying code that might get executed -- one can often modify
> > contents *before* executing those contents.
>
> The order of checks would be:
> 1. open script with O_DENY_WRITE
> 2. check executability with AT_EXECVE_CHECK
> 3. read the content and interpret it

Hmm.  Common LSM configurations should be able to handle this without
deny write, I think.  If you don't want a program to be able to make
their own scripts, then don't allow AT_EXECVE_CHECK to succeed on a
script that the program can write.

Keep in mind that trying to lock this down too hard is pointless for
users who are allowed to to ptrace-write to their own processes.  Or
for users who can do JIT, or for users who can run a REPL, etc.

> > But maybe a less kludgy version could be used for real.  What if there
> > was a syscall that would take an fd and make a snapshot of the file?
>
> Yes, that would be a clean solution.  I don't think this is achievable
> in an efficient way without involving filesystem implementations though.

It wouldn't be so terrible to involve filesystem implementations.
Most of the filesystems that people who care at all about security run
their binaries from either support reflinks or are immutable.  Things
like OCI implementations may already fit meet those criteria, and it
would be pretty nifty if the kernel was actually aware that OCI layers
are intended to be immutable.  We could even have an API to
generically query the hash of an immutable file and to ask the kernel
if it's validating the hash on reads.

^ permalink raw reply

* Re: [PATCH v3 05/12] man/man2/fspick.2: document "new" mount API
From: Askar Safin @ 2025-08-25 16:22 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-22.1755869779-quirky-demur-grunts-mace-Hoxz0h@cyphar.com>

 ---- On Fri, 22 Aug 2025 17:40:18 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > On 2025-08-22, Askar Safin <safinaskar@zohomail.com> wrote:
 > >  ---- On Sat, 09 Aug 2025 00:39:49 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > >  > +The above procedure is functionally equivalent to
 > >  > +the following mount operation using
 > >  > +.BR mount (2):
 > > 
 > > This is not true.
 > > 
 > > fspick adds options to superblock. It doesn't remove existing ones.
 > 
 > fspick "copies the existing parameters" would be more accurate. I can
 > reword this, but it's an example and I don't think it makes sense to add
 > a large amount of clarifying text for each example.

I suggest adding "but mount(2) clears existing parameters here, and fspick/fsconfig doesn't".

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-25 16:12 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <2025-08-22.1755866245-crummy-slate-scale-borough-gEcqKg@cyphar.com>

 ---- On Fri, 22 Aug 2025 17:06:00 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > > +The
 > > +.I mountflags
 > > +should specify existing per-mount-point flags,
 > > +except for those parameters
 > > +that are deliberately changed.
 > 
 > I would phrase this more like a note to make the advice a bit clearer:

I merely copied here text used for normal remounting (i. e. MS_REMOUNT without MS_BIND).
But okay, I changed anyway.

 >   (which can be retrieved using
 >   .BR statfs (2)),

I don't like reference to statfs here.

statfs returns both superblock and per-mount
flags. And you cannot know which is which. See
https://elixir.bootlin.com/linux/v6.17-rc2/source/fs/statfs.c#L51
.

So, I removed statfs reference and kept everything else.

 > The current docs only mention locked mounts once and the description is
 > kind of insufficient (it implies that only MS_REMOUNT affects this, and

I don't know how locked mounts work. So, if you have something to
add, then you can send separate patch.

I addressed all your points except for mentioned above.
--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* [PATCH v2 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-25 15:48 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250825154839.2422856-1-safinaskar@zohomail.com>

My edit is based on experiments and reading Linux code

Signed-off-by: Askar Safin <safinaskar@zohomail.com>
---
 man/man2/mount.2 | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/man/man2/mount.2 b/man/man2/mount.2
index 5d83231f9..47fc2d21f 100644
--- a/man/man2/mount.2
+++ b/man/man2/mount.2
@@ -405,7 +405,30 @@ flag can be used with
 to modify only the per-mount-point flags.
 .\" See https://lwn.net/Articles/281157/
 This is particularly useful for setting or clearing the "read-only"
-flag on a mount without changing the underlying filesystem.
+flag on a mount without changing the underlying filesystem parameters.
+The
+.I data
+argument is ignored if
+.B MS_REMOUNT
+and
+.B MS_BIND
+are specified.
+Note that the mountpoint will
+have its existing per-mount-point flags
+cleared and replaced with those in
+.I mountflags
+when
+.B MS_REMOUNT
+and
+.B MS_BIND
+are specified.
+This means that if
+you wish to preserve
+any existing per-mount-point flags,
+you need to include them in
+.IR mountflags ,
+along with the per-mount-point flags you wish to set
+(or with the flags you wish to clear missing).
 Specifying
 .I mountflags
 as:
@@ -416,8 +439,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
 .EE
 .in
 .P
-will make access through this mountpoint read-only, without affecting
-other mounts.
+will make access through this mountpoint read-only
+(clearing all other per-mount-point flags),
+without affecting
+other mounts
+of this filesystem.
 .\"
 .SS Creating a bind mount
 If
-- 
2.47.2


^ permalink raw reply related

* [PATCH v2 0/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-25 15:48 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man

My edit is based on experiments and reading Linux code

You will find C code I used for experiments below

v1: https://lore.kernel.org/linux-man/20250822114315.1571537-1-safinaskar@zohomail.com/

Askar Safin (1):
  man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND

 man/man2/mount.2 | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

-- 
2.47.2

// You need to be root in initial user namespace

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/sysmacros.h>
#include <linux/openat2.h>

#define MY_ASSERT(cond) do { \
    if (!(cond)) { \
        fprintf (stderr, "%d: %s: assertion failed\n", __LINE__, #cond); \
        exit (1); \
    } \
} while (0)

int
main (void)
{
    // Init
    {
        MY_ASSERT (chdir ("/") == 0);
        MY_ASSERT (unshare (CLONE_NEWNS) == 0);
        MY_ASSERT (mount (NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp", "tmpfs", 0, NULL) == 0);
    }

    MY_ASSERT (mkdir ("/tmp/a", 0777) == 0);
    MY_ASSERT (mkdir ("/tmp/b", 0777) == 0);

    // MS_REMOUNT sets options for superblock
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // MS_REMOUNT | MS_BIND sets options for vfsmount
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // fspick sets options for superblock
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        {
            int fsfd = fspick (AT_FDCWD, "/tmp/a", 0);
            MY_ASSERT (fsfd >= 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) == 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0) == 0);
            MY_ASSERT (close (fsfd) == 0);
        }
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // mount_setattr sets options for vfsmount
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        {
            struct mount_attr attr;
            memset (&attr, 0, sizeof attr);
            attr.attr_set = MOUNT_ATTR_RDONLY;
            MY_ASSERT (mount_setattr (AT_FDCWD, "/tmp/a", 0, &attr, sizeof attr) == 0);
        }
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // "ro" as a string works for MS_REMOUNT
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, "ro") == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // "ro" as a string doesn't work for MS_REMOUNT | MS_BIND
    // Option string is ignored
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND, "ro") == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // Removing MS_RDONLY makes mount writable again (in case of MS_REMOUNT | MS_BIND)
    // Same for other options (not tested, but I did read code)
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    // Removing "ro" from option string makes mount writable again (in case of MS_REMOUNT)
    // I. e. mount(2) works exactly as documented
    // This works even if option string is NULL, i. e. NULL works as default option string
    {
        typedef const char *c_string;
        c_string opts[3] = {NULL, "", "rw"};
        for (int i = 0; i != 3; ++i)
            {
                for (int j = 0; j != 3; ++j)
                    {
                        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, opts[i]) == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
                        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
                        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, "ro") == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
                        MY_ASSERT (errno == EROFS);
                        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, opts[j]) == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
                        MY_ASSERT (umount ("/tmp/a") == 0);
                    }
            }
    }

    // Removing MS_RDONLY makes mount writable again (in case of MS_REMOUNT)
    // I. e. mount(2) works exactly as documented
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    // Setting MS_RDONLY (without other flags) removes all other flags, such as MS_NODEV (in case of MS_REMOUNT | MS_BIND)
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mknod ("/tmp/a/mynull", S_IFCHR | 0666, makedev (1, 3)) == 0);

        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        {
            int fd = open ("/tmp/a/mynull", O_WRONLY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (write (fd, "a", 1) == 1);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_NODEV, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (open ("/tmp/a/mynull", O_WRONLY) == -1);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        {
            int fd = open ("/tmp/a/mynull", O_WRONLY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (write (fd, "a", 1) == 1);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (umount ("/tmp/a") == 0);
    }
    printf ("All tests passed\n");
    exit (0);
}

^ permalink raw reply

* Re: [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Matthew Wilcox @ 2025-08-25 13:03 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, linux-fsdevel, linux-api
In-Reply-To: <9b2c8fe2-cf17-445b-abd7-a1ed44812a73@infradead.org>

On Sun, Aug 24, 2025 at 04:54:50PM -0700, Randy Dunlap wrote:
> In file included from ../samples/vfs/test-statx.c:23:
> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>   159 | #define AT_RENAME_NOREPLACE     0x0001
> In file included from ../samples/vfs/test-statx.c:13:
> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE

Oh dear.  This is going to be libc-version-dependent.

$ grep -r AT_RENAME_NOREPLACE /usr/include
/usr/include/linux/fcntl.h:#define AT_RENAME_NOREPLACE	0x0001

It's not in stdio.h at all.  This is with libc6 2.41-10

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Florian Weimer @ 2025-08-25  9:39 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Jeff Xu
In-Reply-To: <20250825.mahNeel0dohz@digikod.net>

* Mickaël Salaün:

> The order of checks would be:
> 1. open script with O_DENY_WRITE
> 2. check executability with AT_EXECVE_CHECK
> 3. read the content and interpret it
>
> The deny-write feature was to guarantee that there is no race condition
> between step 2 and 3.  All these checks are supposed to be done by a
> trusted interpreter (which is allowed to be executed).  The
> AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> associated security policies) allowed the *current* content of the file
> to be executed.  Whatever happen before or after that (wrt.
> O_DENY_WRITE) should be covered by the security policy.

Why isn't it an improper system configuration if the script file is
writable?

In the past, the argument was that making a file (writable and)
executable was an auditable even, and that provided enough coverage for
those people who are interested in this.

Thanks,
Florian


^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-25  9:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore,
	Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Jeff Xu
In-Reply-To: <CALCETrWwd90qQ3U2nZg9Fhye6CMQ6ZF20oQ4ME6BoyrFd0t88Q@mail.gmail.com>

On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> >
> > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > passed file descriptors).  This changes the state of the opened file by
> > > > making it read-only until it is closed.  The main use case is for script
> > > > interpreters to get the guarantee that script' content cannot be altered
> > > > while being read and interpreted.  This is useful for generic distros
> > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > >
> > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > property on files with deny_write_access().  This new O_DENY_WRITE make
> > >
> > > The kernel actually tried to get rid of this behavior on execve() in
> > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > because it broke userspace assumptions.
> >
> > Oh, good to know.
> >
> > >
> > > > it widely available.  This is similar to what other OSs may provide
> > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > >
> > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > removed for security reasons; as
> > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > >
> > > |        MAP_DENYWRITE
> > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > |               signaled that attempts to write to the underlying file
> > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > |               of-service attacks.)"
> > >
> > > It seems to me that the same issue applies to your patch - it would
> > > allow unprivileged processes to essentially lock files such that other
> > > processes can't write to them anymore. This might allow unprivileged
> > > users to prevent root from updating config files or stuff like that if
> > > they're updated in-place.
> >
> > Yes, I agree, but since it is the case for executed files I though it
> > was worth starting a discussion on this topic.  This new flag could be
> > restricted to executable files, but we should avoid system-wide locks
> > like this.  I'm not sure how Windows handle these issues though.
> >
> > Anyway, we should rely on the access control policy to control write and
> > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > the references and the background!
> 
> I'm confused.  I understand that there are many contexts in which one
> would want to prevent execution of unapproved content, which might
> include preventing a given process from modifying some code and then
> executing it.
> 
> I don't understand what these deny-write features have to do with it.
> These features merely prevent someone from modifying code *that is
> currently in use*, which is not at all the same thing as preventing
> modifying code that might get executed -- one can often modify
> contents *before* executing those contents.

The order of checks would be:
1. open script with O_DENY_WRITE
2. check executability with AT_EXECVE_CHECK
3. read the content and interpret it

The deny-write feature was to guarantee that there is no race condition
between step 2 and 3.  All these checks are supposed to be done by a
trusted interpreter (which is allowed to be executed).  The
AT_EXECVE_CHECK call enables the caller to know if the kernel (and
associated security policies) allowed the *current* content of the file
to be executed.  Whatever happen before or after that (wrt.
O_DENY_WRITE) should be covered by the security policy.

> 
> In any case, IMO it's rather sad that the elimination of ETXTBSY had
> to be reverted -- it's really quite a nasty feature.  But it occurs to
> me that Linux can more or less do what is IMO the actually desired
> thing: snapshot the contents of a file and execute the snapshot.  The
> hack at the end of the email works!  (Well, it works if the chosen
> filesystem supports it.)
> 
> $ ./silly_tmp /tmp/test /tmp vim /proc/self/fd/3
> 
> emacs is apparently far, far too clever and can't save if you do:
> 
> $ ./silly_tmp /tmp/test /tmp emacs /proc/self/fd/3
> 
> 
> I'm not seriously suggesting that anyone should execute binaries or
> scripts on Linux exactly like this, for a whole bunch of reasons:
> 
> - It needs filesystem support (but maybe this isn't so bad)
> 
> - It needs write access to a directory on the correct filesystem (a
> showstopper for serious use)
> 
> - It is wildly incompatible with write-xor-execute, so this would be a
> case of one step forward, ten steps back.
> 
> - It would defeat a lot of tools that inspect /proc, which would be
> quite annoying to say the least.
> 
> 
> But maybe a less kludgy version could be used for real.  What if there
> was a syscall that would take an fd and make a snapshot of the file?

Yes, that would be a clean solution.  I don't think this is achievable
in an efficient way without involving filesystem implementations though.

> It would, at least by default, produce a *read-only* snapshot (fully
> sealed a la F_SEAL_*), inherit any integrity data that came with the
> source (e.g. LSMs could understand it), would not require a writable
> directory on the filesystem, and would maybe even come with an extra
> seal-like thing that prevents it from being linkat-ed.  (I'm not sure
> that linkat would actually be a problem, but I'm also not immediately
> sure that LSMs would be as comfortable with it if linkat were
> allowed.)  And there could probably be an extremely efficient
> implementation that might even reuse the existing deny-write mechanism
> to optimize the common case where the file is never written.
> 
> For that matter, the actual common case would be to execute stuff in
> /usr or similar, and those files really ought never to be modified.
> So there could be a file attribute or something that means "this file
> CANNOT be modified, but it can still be unlinked or replaced as
> usual", and snapshotting such a file would be a no-op.  Distributions
> and container tools could set that attribute.  Overlayfs could also
> provide an efficient implementation if the file currently comes from
> an immutable source.
> 
> Hmm, maybe it's not strictly necessary that it be immutable -- maybe
> it's sometimes okay if reads start to fail if the contents change.
> Let's call this a "weak snapshot" -- reads of a weak snapshot either
> return the original contents or fail.  fsverity would give weak
> snapshots for at no additional cost.
> 
> 
> It's worth noting that the common case doesn't actually need an fd.
> We have mmap(..., MAP_PRIVATE, ...).  What we would actually want for
> mmap use cases is mmap(..., MAP_SNAPSHOT, ...), with the semantics
> that the kernel promises that future writes to the source would either
> not be reflected in the mapping or would cause SIGBUS.  One might
> reasonably debate what forced-writes would do (I think forced-writes
> should be allowed just like they currently are, since anyone who can
> force-write to process memory is already assumed to be permitted to
> bypass write-xor-execute).
> 
> 
> ---
> 
> /* Written by Claude Sonnet 4 with a surprisingly small amount of help
> from Andy */
> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/ioctl.h>
> #include <linux/fs.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <string.h>
> 
> int main(int argc, char *argv[]) {
>     if (argc < 4) {
>         fprintf(stderr, "Usage: %s <source_file> <temp_dir>
> [exec_args...]\n", argv[0]);
>         exit(1);
>     }
> 
>     const char *source_file = argv[1];
>     const char *temp_dir = argv[2];
> 
>     // Open source file
>     int source_fd = open(source_file, O_RDONLY);
>     if (source_fd == -1) {
>         perror("Failed to open source file");
>         exit(1);
>     }
> 
>     // Create temporary file
>     int temp_fd = open(temp_dir, O_TMPFILE | O_RDWR, 0600);
>     if (temp_fd == -1) {
>         perror("Failed to create temporary file");
>         close(source_fd);
>         exit(1);
>     }
> 
>     // Clone the file contents using FICLONE
>     if (ioctl(temp_fd, FICLONE, source_fd) == -1) {
>         perror("Failed to clone file");
>         close(source_fd);
>         close(temp_fd);
>         exit(1);
>     }
> 
>     // Close source file
>     close(source_fd);
> 
>     // Make sure temp file is on fd 3
>     if (temp_fd != 3) {
>         if (dup2(temp_fd, 3) == -1) {
>             perror("Failed to move temp file to fd 3");
>             close(temp_fd);
>             exit(1);
>         }
>         close(temp_fd);
>     }
> 
>     // Execute the remaining arguments
>     if (argc >= 3) {
>         execvp(argv[3], &argv[3]);
>         perror("Failed to execute command");
>         exit(1);
>     }
> 
>     return 0;
> }

As you said, this doesn't work if temp_dir is not allowed for execution,
and it doesn't allow the kernel to check/track the content of the
script, which is the purpose of AT_EXECVE_CHECK.

^ permalink raw reply

* Re: [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Randy Dunlap @ 2025-08-25  6:49 UTC (permalink / raw)
  To: Amir Goldstein, Aleksa Sarai
  Cc: Matthew Wilcox, linux-kernel, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Jan Kara, Christian Brauner,
	linux-fsdevel, linux-api
In-Reply-To: <CAOQ4uxiShq5gPCsRh5ZDNXbG4AGH5XpfHx0HXDWTS+5Y95hieQ@mail.gmail.com>

Hi Amir,


On 8/24/25 10:58 PM, Amir Goldstein wrote:
> On Mon, Aug 25, 2025 at 1:54 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>>
>>
>>
>> On 8/24/25 4:21 PM, Matthew Wilcox wrote:
>>> On Sun, Aug 24, 2025 at 03:10:55PM -0700, Randy Dunlap wrote:
>>>> Don't define the AT_RENAME_* macros when __USE_GNU is defined since
>>>> /usr/include/stdio.h defines them in that case (i.e. when _GNU_SOURCE
>>>> is defined, which causes __USE_GNU to be defined).
>>>>
>>>> Having them defined in 2 places causes build warnings (duplicate
>>>> definitions) in both samples/watch_queue/watch_test.c and
>>>> samples/vfs/test-statx.c.
>>>
>>> It does?  What flags?
>>>
>>
>> for samples/vfs/test-statx.c:
>>
>> In file included from ../samples/vfs/test-statx.c:23:
>> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>>   159 | #define AT_RENAME_NOREPLACE     0x0001
>> In file included from ../samples/vfs/test-statx.c:13:
>> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>>   160 | #define AT_RENAME_EXCHANGE      0x0002
>> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
>> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>>   161 | #define AT_RENAME_WHITEOUT      0x0004
>> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>>
>> for samples/watch_queue/watch_test.c:
>>
>> In file included from usr/include/linux/watch_queue.h:6,
>>                  from ../samples/watch_queue/watch_test.c:19:
>> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>>   159 | #define AT_RENAME_NOREPLACE     0x0001
>> In file included from ../samples/watch_queue/watch_test.c:11:
>> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>>   160 | #define AT_RENAME_EXCHANGE      0x0002
>> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
>> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>>   161 | #define AT_RENAME_WHITEOUT      0x0004
>> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>>

>>>
>>> I'm pretty sure C says that duplicate definitions are fine as long
>>> as they're identical.
>> The vales are identical but the strings are not identical.
>>
>> We can't fix stdio.h, but we could just change uapi/linux/fcntl.h
>> to match stdio.h. I suppose.
> 
> I do not specifically object to a patch like this (assuming that is works?):
> 
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -156,9 +156,9 @@
>   */
> 
>  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> -#define AT_RENAME_NOREPLACE    0x0001
> -#define AT_RENAME_EXCHANGE     0x0002
> -#define AT_RENAME_WHITEOUT     0x0004
> +#define AT_RENAME_NOREPLACE    RENAME_NOREPLACE
> +#define AT_RENAME_EXCHANGE     RENAME_EXCHANGE
> +#define AT_RENAME_WHITEOUT     RENAME_WHITEOUT
> 

I'll test that.

> 
> But to be clear, this is a regression introduced by glibc that is likely
> to break many other builds, not only the kernel samples
> and even if we fix linux uapi to conform to its downstream
> copy of definitions, it won't help those users whose programs
> build was broken until they install kernel headers, so feels like you
> should report this regression to glibc and they'd better not "fix" the
> regression by copying the current definition string as that may change as per
> the patch above.
> 

I'll look into that also.

> Why would a library copy definitions from kernel uapi without
> wrapping them with #ifndef or #undef?

To me it looks like they stuck them into the wrong file - stdio.h
instead of fcntl.h.

thanks.
-- 
~Randy


^ permalink raw reply

* Re: [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Amir Goldstein @ 2025-08-25  5:58 UTC (permalink / raw)
  To: Randy Dunlap, Aleksa Sarai
  Cc: Matthew Wilcox, linux-kernel, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Jan Kara, Christian Brauner,
	linux-fsdevel, linux-api
In-Reply-To: <9b2c8fe2-cf17-445b-abd7-a1ed44812a73@infradead.org>

On Mon, Aug 25, 2025 at 1:54 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
>
>
> On 8/24/25 4:21 PM, Matthew Wilcox wrote:
> > On Sun, Aug 24, 2025 at 03:10:55PM -0700, Randy Dunlap wrote:
> >> Don't define the AT_RENAME_* macros when __USE_GNU is defined since
> >> /usr/include/stdio.h defines them in that case (i.e. when _GNU_SOURCE
> >> is defined, which causes __USE_GNU to be defined).
> >>
> >> Having them defined in 2 places causes build warnings (duplicate
> >> definitions) in both samples/watch_queue/watch_test.c and
> >> samples/vfs/test-statx.c.
> >
> > It does?  What flags?
> >
>
> for samples/vfs/test-statx.c:
>
> In file included from ../samples/vfs/test-statx.c:23:
> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>   159 | #define AT_RENAME_NOREPLACE     0x0001
> In file included from ../samples/vfs/test-statx.c:13:
> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>   160 | #define AT_RENAME_EXCHANGE      0x0002
> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>   161 | #define AT_RENAME_WHITEOUT      0x0004
> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>
> for samples/watch_queue/watch_test.c:
>
> In file included from usr/include/linux/watch_queue.h:6,
>                  from ../samples/watch_queue/watch_test.c:19:
> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>   159 | #define AT_RENAME_NOREPLACE     0x0001
> In file included from ../samples/watch_queue/watch_test.c:11:
> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>   160 | #define AT_RENAME_EXCHANGE      0x0002
> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>   161 | #define AT_RENAME_WHITEOUT      0x0004
> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>
>
> > #define AT_RENAME_NOREPLACE     0x0001
> > #define AT_RENAME_NOREPLACE     0x0001
> >
> > int main(void)
> > {
> >       return AT_RENAME_NOREPLACE;
> > }
> >
> > gcc -W -Wall testA.c -o testA
> >
> > (no warnings)
> >
> > I'm pretty sure C says that duplicate definitions are fine as long
> > as they're identical.
> The vales are identical but the strings are not identical.
>
> We can't fix stdio.h, but we could just change uapi/linux/fcntl.h
> to match stdio.h. I suppose.

I do not specifically object to a patch like this (assuming that is works?):

--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -156,9 +156,9 @@
  */

 /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
-#define AT_RENAME_NOREPLACE    0x0001
-#define AT_RENAME_EXCHANGE     0x0002
-#define AT_RENAME_WHITEOUT     0x0004
+#define AT_RENAME_NOREPLACE    RENAME_NOREPLACE
+#define AT_RENAME_EXCHANGE     RENAME_EXCHANGE
+#define AT_RENAME_WHITEOUT     RENAME_WHITEOUT


But to be clear, this is a regression introduced by glibc that is likely
to break many other builds, not only the kernel samples
and even if we fix linux uapi to conform to its downstream
copy of definitions, it won't help those users whose programs
build was broken until they install kernel headers, so feels like you
should report this regression to glibc and they'd better not "fix" the
regression by copying the current definition string as that may change as per
the patch above.

Why would a library copy definitions from kernel uapi without
wrapping them with #ifndef or #undef?

Thanks,
Amir.

^ permalink raw reply

* Re: [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Randy Dunlap @ 2025-08-24 23:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, linux-fsdevel, linux-api
In-Reply-To: <aKuedOXEIapocQ8l@casper.infradead.org>



On 8/24/25 4:21 PM, Matthew Wilcox wrote:
> On Sun, Aug 24, 2025 at 03:10:55PM -0700, Randy Dunlap wrote:
>> Don't define the AT_RENAME_* macros when __USE_GNU is defined since
>> /usr/include/stdio.h defines them in that case (i.e. when _GNU_SOURCE
>> is defined, which causes __USE_GNU to be defined).
>>
>> Having them defined in 2 places causes build warnings (duplicate
>> definitions) in both samples/watch_queue/watch_test.c and
>> samples/vfs/test-statx.c.
> 
> It does?  What flags?
> 

for samples/vfs/test-statx.c:

In file included from ../samples/vfs/test-statx.c:23:
usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
  159 | #define AT_RENAME_NOREPLACE     0x0001
In file included from ../samples/vfs/test-statx.c:13:
/usr/include/stdio.h:171:10: note: this is the location of the previous definition
  171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
  160 | #define AT_RENAME_EXCHANGE      0x0002
/usr/include/stdio.h:173:10: note: this is the location of the previous definition
  173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
  161 | #define AT_RENAME_WHITEOUT      0x0004
/usr/include/stdio.h:175:10: note: this is the location of the previous definition
  175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT

for samples/watch_queue/watch_test.c:

In file included from usr/include/linux/watch_queue.h:6,
                 from ../samples/watch_queue/watch_test.c:19:
usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
  159 | #define AT_RENAME_NOREPLACE     0x0001
In file included from ../samples/watch_queue/watch_test.c:11:
/usr/include/stdio.h:171:10: note: this is the location of the previous definition
  171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
  160 | #define AT_RENAME_EXCHANGE      0x0002
/usr/include/stdio.h:173:10: note: this is the location of the previous definition
  173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
  161 | #define AT_RENAME_WHITEOUT      0x0004
/usr/include/stdio.h:175:10: note: this is the location of the previous definition
  175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT


> #define AT_RENAME_NOREPLACE     0x0001
> #define AT_RENAME_NOREPLACE     0x0001
> 
> int main(void)
> {
> 	return AT_RENAME_NOREPLACE;
> }
> 
> gcc -W -Wall testA.c -o testA
> 
> (no warnings)
> 
> I'm pretty sure C says that duplicate definitions are fine as long
> as they're identical.
The vales are identical but the strings are not identical.

We can't fix stdio.h, but we could just change uapi/linux/fcntl.h
to match stdio.h. I suppose.

-- 
~Randy


^ permalink raw reply

* Re: [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Matthew Wilcox @ 2025-08-24 23:21 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, linux-fsdevel, linux-api
In-Reply-To: <20250824221055.86110-1-rdunlap@infradead.org>

On Sun, Aug 24, 2025 at 03:10:55PM -0700, Randy Dunlap wrote:
> Don't define the AT_RENAME_* macros when __USE_GNU is defined since
> /usr/include/stdio.h defines them in that case (i.e. when _GNU_SOURCE
> is defined, which causes __USE_GNU to be defined).
> 
> Having them defined in 2 places causes build warnings (duplicate
> definitions) in both samples/watch_queue/watch_test.c and
> samples/vfs/test-statx.c.

It does?  What flags?

#define AT_RENAME_NOREPLACE     0x0001
#define AT_RENAME_NOREPLACE     0x0001

int main(void)
{
	return AT_RENAME_NOREPLACE;
}

gcc -W -Wall testA.c -o testA

(no warnings)

I'm pretty sure C says that duplicate definitions are fine as long
as they're identical.

^ permalink raw reply

* [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Randy Dunlap @ 2025-08-24 22:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Randy Dunlap, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, linux-fsdevel, linux-api

Don't define the AT_RENAME_* macros when __USE_GNU is defined since
/usr/include/stdio.h defines them in that case (i.e. when _GNU_SOURCE
is defined, which causes __USE_GNU to be defined).

Having them defined in 2 places causes build warnings (duplicate
definitions) in both samples/watch_queue/watch_test.c and
samples/vfs/test-statx.c.

Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
CC: linux-api@vger.kernel.org

 include/uapi/linux/fcntl.h |    2 ++
 1 file changed, 2 insertions(+)

--- linux-next-20250819.orig/include/uapi/linux/fcntl.h
+++ linux-next-20250819/include/uapi/linux/fcntl.h
@@ -155,10 +155,12 @@
  * as possible, so we can use them for generic bits in the future if necessary.
  */
 
+#ifndef __USE_GNU
 /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
 #define AT_RENAME_NOREPLACE	0x0001
 #define AT_RENAME_EXCHANGE	0x0002
 #define AT_RENAME_WHITEOUT	0x0004
+#endif
 
 /* Flag for faccessat(2). */
 #define AT_EACCESS		0x200	/* Test access permitted for

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Andy Lutomirski @ 2025-08-24 18:04 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore,
	Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Jeff Xu
In-Reply-To: <20250824.Ujoh8unahy5a@digikod.net>

On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > passed file descriptors).  This changes the state of the opened file by
> > > making it read-only until it is closed.  The main use case is for script
> > > interpreters to get the guarantee that script' content cannot be altered
> > > while being read and interpreted.  This is useful for generic distros
> > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > >
> > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > property on files with deny_write_access().  This new O_DENY_WRITE make
> >
> > The kernel actually tried to get rid of this behavior on execve() in
> > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > because it broke userspace assumptions.
>
> Oh, good to know.
>
> >
> > > it widely available.  This is similar to what other OSs may provide
> > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> >
> > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > removed for security reasons; as
> > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> >
> > |        MAP_DENYWRITE
> > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > |               signaled that attempts to write to the underlying file
> > |               should fail with ETXTBSY.  But this was a source of denial-
> > |               of-service attacks.)"
> >
> > It seems to me that the same issue applies to your patch - it would
> > allow unprivileged processes to essentially lock files such that other
> > processes can't write to them anymore. This might allow unprivileged
> > users to prevent root from updating config files or stuff like that if
> > they're updated in-place.
>
> Yes, I agree, but since it is the case for executed files I though it
> was worth starting a discussion on this topic.  This new flag could be
> restricted to executable files, but we should avoid system-wide locks
> like this.  I'm not sure how Windows handle these issues though.
>
> Anyway, we should rely on the access control policy to control write and
> execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> the references and the background!

I'm confused.  I understand that there are many contexts in which one
would want to prevent execution of unapproved content, which might
include preventing a given process from modifying some code and then
executing it.

I don't understand what these deny-write features have to do with it.
These features merely prevent someone from modifying code *that is
currently in use*, which is not at all the same thing as preventing
modifying code that might get executed -- one can often modify
contents *before* executing those contents.

In any case, IMO it's rather sad that the elimination of ETXTBSY had
to be reverted -- it's really quite a nasty feature.  But it occurs to
me that Linux can more or less do what is IMO the actually desired
thing: snapshot the contents of a file and execute the snapshot.  The
hack at the end of the email works!  (Well, it works if the chosen
filesystem supports it.)

$ ./silly_tmp /tmp/test /tmp vim /proc/self/fd/3

emacs is apparently far, far too clever and can't save if you do:

$ ./silly_tmp /tmp/test /tmp emacs /proc/self/fd/3


I'm not seriously suggesting that anyone should execute binaries or
scripts on Linux exactly like this, for a whole bunch of reasons:

- It needs filesystem support (but maybe this isn't so bad)

- It needs write access to a directory on the correct filesystem (a
showstopper for serious use)

- It is wildly incompatible with write-xor-execute, so this would be a
case of one step forward, ten steps back.

- It would defeat a lot of tools that inspect /proc, which would be
quite annoying to say the least.


But maybe a less kludgy version could be used for real.  What if there
was a syscall that would take an fd and make a snapshot of the file?
It would, at least by default, produce a *read-only* snapshot (fully
sealed a la F_SEAL_*), inherit any integrity data that came with the
source (e.g. LSMs could understand it), would not require a writable
directory on the filesystem, and would maybe even come with an extra
seal-like thing that prevents it from being linkat-ed.  (I'm not sure
that linkat would actually be a problem, but I'm also not immediately
sure that LSMs would be as comfortable with it if linkat were
allowed.)  And there could probably be an extremely efficient
implementation that might even reuse the existing deny-write mechanism
to optimize the common case where the file is never written.

For that matter, the actual common case would be to execute stuff in
/usr or similar, and those files really ought never to be modified.
So there could be a file attribute or something that means "this file
CANNOT be modified, but it can still be unlinked or replaced as
usual", and snapshotting such a file would be a no-op.  Distributions
and container tools could set that attribute.  Overlayfs could also
provide an efficient implementation if the file currently comes from
an immutable source.

Hmm, maybe it's not strictly necessary that it be immutable -- maybe
it's sometimes okay if reads start to fail if the contents change.
Let's call this a "weak snapshot" -- reads of a weak snapshot either
return the original contents or fail.  fsverity would give weak
snapshots for at no additional cost.


It's worth noting that the common case doesn't actually need an fd.
We have mmap(..., MAP_PRIVATE, ...).  What we would actually want for
mmap use cases is mmap(..., MAP_SNAPSHOT, ...), with the semantics
that the kernel promises that future writes to the source would either
not be reflected in the mapping or would cause SIGBUS.  One might
reasonably debate what forced-writes would do (I think forced-writes
should be allowed just like they currently are, since anyone who can
force-write to process memory is already assumed to be permitted to
bypass write-xor-execute).


---

/* Written by Claude Sonnet 4 with a surprisingly small amount of help
from Andy */

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc < 4) {
        fprintf(stderr, "Usage: %s <source_file> <temp_dir>
[exec_args...]\n", argv[0]);
        exit(1);
    }

    const char *source_file = argv[1];
    const char *temp_dir = argv[2];

    // Open source file
    int source_fd = open(source_file, O_RDONLY);
    if (source_fd == -1) {
        perror("Failed to open source file");
        exit(1);
    }

    // Create temporary file
    int temp_fd = open(temp_dir, O_TMPFILE | O_RDWR, 0600);
    if (temp_fd == -1) {
        perror("Failed to create temporary file");
        close(source_fd);
        exit(1);
    }

    // Clone the file contents using FICLONE
    if (ioctl(temp_fd, FICLONE, source_fd) == -1) {
        perror("Failed to clone file");
        close(source_fd);
        close(temp_fd);
        exit(1);
    }

    // Close source file
    close(source_fd);

    // Make sure temp file is on fd 3
    if (temp_fd != 3) {
        if (dup2(temp_fd, 3) == -1) {
            perror("Failed to move temp file to fd 3");
            close(temp_fd);
            exit(1);
        }
        close(temp_fd);
    }

    // Execute the remaining arguments
    if (argc >= 3) {
        execvp(argv[3], &argv[3]);
        perror("Failed to execute command");
        exit(1);
    }

    return 0;
}

^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Askar Safin @ 2025-08-24 13:07 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-0-f61405c80f34@cyphar.com>

 ---- On Sat, 09 Aug 2025 00:39:44 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > Back in 2019, the new mount API was merged[1]. David Howells then set

I finished my experiments!

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-24 11:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu
In-Reply-To: <CAG48ez1XjUdcFztc_pF2qcoLi7xvfpJ224Ypc=FoGi-Px-qyZw@mail.gmail.com>

On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > passed file descriptors).  This changes the state of the opened file by
> > making it read-only until it is closed.  The main use case is for script
> > interpreters to get the guarantee that script' content cannot be altered
> > while being read and interpreted.  This is useful for generic distros
> > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> >
> > Both execve(2) and the IOCTL to enable fsverity can already set this
> > property on files with deny_write_access().  This new O_DENY_WRITE make
> 
> The kernel actually tried to get rid of this behavior on execve() in
> commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> because it broke userspace assumptions.

Oh, good to know.

> 
> > it widely available.  This is similar to what other OSs may provide
> > e.g., opening a file with only FILE_SHARE_READ on Windows.
> 
> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> removed for security reasons; as
> https://man7.org/linux/man-pages/man2/mmap.2.html says:
> 
> |        MAP_DENYWRITE
> |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> |               signaled that attempts to write to the underlying file
> |               should fail with ETXTBSY.  But this was a source of denial-
> |               of-service attacks.)"
> 
> It seems to me that the same issue applies to your patch - it would
> allow unprivileged processes to essentially lock files such that other
> processes can't write to them anymore. This might allow unprivileged
> users to prevent root from updating config files or stuff like that if
> they're updated in-place.

Yes, I agree, but since it is the case for executed files I though it
was worth starting a discussion on this topic.  This new flag could be
restricted to executable files, but we should avoid system-wide locks
like this.  I'm not sure how Windows handle these issues though.

Anyway, we should rely on the access control policy to control write and
execute access in a consistent way (e.g. write-xor-execute).  Thanks for
the references and the background!

^ permalink raw reply

* Re: [PATCH v3 09/12] man/man2/open_tree.2: document "new" mount API
From: Askar Safin @ 2025-08-24  6:53 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-9-f61405c80f34@cyphar.com>

 ---- On Sat, 09 Aug 2025 00:39:53 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > +If
 > +.I flags
 > +does not contain
 > +.BR \%OPEN_TREE_CLONE ,
 > +.BR open_tree ()
 > +returns a file descriptor
 > +that is exactly equivalent to
 > +one produced by
 > +.BR openat (2)

This is not true. They differ in handling of automounts.
open_tree follows them in final component (by default),
and openat - not.

See reproducer in the end of this letter.

I suggest merely adding this:
> that is exactly equivalent to one produced by openat(2) (modulo automounts)

--
Askar Safin
https://types.pl/@safinaskar


// Root in initial user namespace

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <linux/openat2.h>

#define MY_ASSERT(cond) do { \
    if (!(cond)) { \
        fprintf (stderr, "%s: assertion failed\n", #cond); \
        exit (1); \
    } \
} while (0)

bool
tracing_mounted (void)
{
    struct statx tracing;
    if (statx (AT_FDCWD, "/tmp/debugfs/tracing", AT_NO_AUTOMOUNT, 0, &tracing) != 0)
        {
            perror ("statx tracing");
            exit (1);
        }
    if (!(tracing.stx_attributes_mask & STATX_ATTR_MOUNT_ROOT))
        {
            fprintf (stderr, "???\n");
            exit (1);
        }
    return tracing.stx_attributes & STATX_ATTR_MOUNT_ROOT;
}

void
mount_debugfs (void)
{
    if (mount (NULL, "/tmp/debugfs", "debugfs", 0, NULL) != 0)
        {
            perror ("mount debugfs");
            exit (1);
        }
    MY_ASSERT (!tracing_mounted ());
}

void
umount_debugfs (void)
{
    umount ("/tmp/debugfs/tracing"); // Ignore errors
    if (umount ("/tmp/debugfs") != 0)
        {
            perror ("umount debugfs");
            exit (1);
        }
}

int
main (void)
{
    // Init
    {
        if (chdir ("/") != 0)
            {
                perror ("chdir /");
                exit (1);
            }
        if (unshare (CLONE_NEWNS) != 0)
            {
                perror ("unshare");
                exit (1);
            }
        if (mount (NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
            {
                perror ("mount(NULL, /, NULL, MS_REC | MS_PRIVATE, NULL)");
                exit (1);
            }
        if (mount (NULL, "/tmp", "tmpfs", 0, NULL) != 0)
            {
                perror ("mount tmpfs");
                exit (1);
            }
    }
    if (mkdir ("/tmp/debugfs", 0777) != 0)
        {
            perror ("mkdir(/tmp/debugfs)");
            exit (1);
        }

    // open(O_PATH) doesn't follow automounts
    {
        mount_debugfs ();
        {
            int fd = open ("/tmp/debugfs/tracing", O_PATH);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (!tracing_mounted ());
        umount_debugfs ();
    }

    // open_tree does follow automounts (by default)
    {
        mount_debugfs ();
        {
            int fd = open_tree (AT_FDCWD, "/tmp/debugfs/tracing", 0);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (tracing_mounted ());
        umount_debugfs ();
    }

    // open (O_PATH | O_DIRECTORY)
    {
        mount_debugfs ();
        {
            int fd = open ("/tmp/debugfs/tracing", O_PATH | O_DIRECTORY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (tracing_mounted ());
        umount_debugfs ();
    }

    // AT_NO_AUTOMOUNT
    {
        mount_debugfs ();
        {
            int fd = open_tree (AT_FDCWD, "/tmp/debugfs/tracing", AT_NO_AUTOMOUNT);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (!tracing_mounted ());
        umount_debugfs ();
    }

    printf ("All tests passed\n");
    exit (0);
}


^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Jann Horn @ 2025-08-22 19:45 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu
In-Reply-To: <20250822170800.2116980-2-mic@digikod.net>

On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> passed file descriptors).  This changes the state of the opened file by
> making it read-only until it is closed.  The main use case is for script
> interpreters to get the guarantee that script' content cannot be altered
> while being read and interpreted.  This is useful for generic distros
> that may not have a write-xor-execute policy.  See commit a5874fde3c08
> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
>
> Both execve(2) and the IOCTL to enable fsverity can already set this
> property on files with deny_write_access().  This new O_DENY_WRITE make

The kernel actually tried to get rid of this behavior on execve() in
commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
because it broke userspace assumptions.

> it widely available.  This is similar to what other OSs may provide
> e.g., opening a file with only FILE_SHARE_READ on Windows.

We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
removed for security reasons; as
https://man7.org/linux/man-pages/man2/mmap.2.html says:

|        MAP_DENYWRITE
|               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
|               signaled that attempts to write to the underlying file
|               should fail with ETXTBSY.  But this was a source of denial-
|               of-service attacks.)"

It seems to me that the same issue applies to your patch - it would
allow unprivileged processes to essentially lock files such that other
processes can't write to them anymore. This might allow unprivileged
users to prevent root from updating config files or stuff like that if
they're updated in-place.

^ permalink raw reply

* [RFC PATCH v1 2/2] selftests/exec: Add O_DENY_WRITE tests
From: Mickaël Salaün @ 2025-08-22 17:08 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn
  Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu
In-Reply-To: <20250822170800.2116980-1-mic@digikod.net>

Add 6 test suites to check O_DENY_WRITE used through open(2) and
fcntl(2).  Check that it fills its purpose, that it only applies to
regular files, and that setting this flag several times is not an issue.

The O_DENY_WRITE flag is useful in conjunction with AT_EXECVE_CHECK for
systems that don't enforce a write-xor-execute policy.  Extend related
tests to also use them as examples.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Robert Waite <rowait@microsoft.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20250822170800.2116980-3-mic@digikod.net
---
 tools/testing/selftests/exec/check-exec.c | 219 ++++++++++++++++++++++
 1 file changed, 219 insertions(+)

diff --git a/tools/testing/selftests/exec/check-exec.c b/tools/testing/selftests/exec/check-exec.c
index 55bce47e56b7..9db1d7b9aa97 100644
--- a/tools/testing/selftests/exec/check-exec.c
+++ b/tools/testing/selftests/exec/check-exec.c
@@ -30,6 +30,10 @@
 #define _ASM_GENERIC_FCNTL_H
 #include <linux/fcntl.h>
 
+#ifndef O_DENY_WRITE
+#define O_DENY_WRITE 040000000
+#endif
+
 #include "../kselftest_harness.h"
 
 static int sys_execveat(int dirfd, const char *pathname, char *const argv[],
@@ -319,6 +323,221 @@ TEST_F(access, non_regular_files)
 	test_exec_fd(_metadata, self->pipefd, EACCES);
 }
 
+TEST_F(access, deny_write_check_open)
+{
+	int fd_deny, fd_read, fd_write;
+
+	fd_deny = open(reg_file_path, O_DENY_WRITE | O_RDONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd_deny);
+
+	/* Concurrent reads are always allowed. */
+	fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_read);
+	EXPECT_EQ(0, close(fd_read));
+
+	/* Concurrent writes are denied. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_EQ(-1, fd_write);
+	EXPECT_EQ(ETXTBSY, errno);
+
+	/* Drops O_DENY_WRITE. */
+	EXPECT_EQ(0, close(fd_deny));
+
+	/* The restriction is now gone. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_write);
+	EXPECT_EQ(0, close(fd_write));
+}
+
+TEST_F(access, deny_write_check_open_and_fcntl)
+{
+	int fd_deny, fd_read, fd_write, flags;
+
+	fd_deny = open(reg_file_path, O_DENY_WRITE | O_RDONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd_deny);
+
+	/* Sets O_DENY_WRITE a "second" time. */
+	flags = fcntl(fd_deny, F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE));
+
+	/* Concurrent reads are always allowed. */
+	fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_read);
+	EXPECT_EQ(0, close(fd_read));
+
+	/* Concurrent writes are denied. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_EQ(-1, fd_write);
+	EXPECT_EQ(ETXTBSY, errno);
+
+	/* Drops O_DENY_WRITE. */
+	EXPECT_EQ(0, close(fd_deny));
+
+	/* The restriction is now gone. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_write);
+	EXPECT_EQ(0, close(fd_write));
+}
+
+TEST_F(access, deny_write_check_fcntl)
+{
+	int fd_deny, fd_read, fd_write, flags;
+
+	fd_deny = open(reg_file_path, O_RDONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd_deny);
+
+	/* Sets O_DENY_WRITE a first time. */
+	flags = fcntl(fd_deny, F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE));
+
+	/* Sets O_DENY_WRITE a "second" time. */
+	EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE));
+
+	/* Concurrent reads are always allowed. */
+	fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_read);
+	EXPECT_EQ(0, close(fd_read));
+
+	/* Concurrent writes are denied. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_EQ(-1, fd_write);
+	EXPECT_EQ(ETXTBSY, errno);
+
+	/* Drops O_DENY_WRITE. */
+	EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags & ~O_DENY_WRITE));
+
+	/* The restriction is now gone. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_write);
+	EXPECT_EQ(0, close(fd_write));
+
+	EXPECT_EQ(0, close(fd_deny));
+}
+
+static void test_deny_write_open(struct __test_metadata *_metadata,
+				 const char *const path, int flags,
+				 const int err_code)
+{
+	int fd;
+
+	flags |= O_CLOEXEC;
+
+	/* Do not block on pipes. */
+	if (path == fifo_path)
+		flags |= O_NONBLOCK;
+
+	fd = open(path, flags | O_RDONLY);
+	if (err_code) {
+		ASSERT_EQ(-1, fd)
+		{
+			TH_LOG("Successfully opened %s", path);
+		}
+		EXPECT_EQ(errno, err_code)
+		{
+			TH_LOG("Wrong error code for %s: %s", path,
+			       strerror(errno));
+		}
+	} else {
+		ASSERT_LE(0, fd)
+		{
+			TH_LOG("Failed to open %s: %s", path, strerror(errno));
+		}
+		EXPECT_EQ(0, close(fd));
+	}
+}
+
+TEST_F(access, deny_write_type_open)
+{
+	test_deny_write_open(_metadata, reg_file_path, O_DENY_WRITE, 0);
+	test_deny_write_open(_metadata, dir_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_open(_metadata, block_dev_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_open(_metadata, char_dev_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_open(_metadata, fifo_path, O_DENY_WRITE, EINVAL);
+}
+
+static void test_deny_write_fcntl(struct __test_metadata *_metadata,
+				  const char *const path, int setfl,
+				  const int err_code)
+{
+	int fd, ret;
+	int getfl, flags = O_CLOEXEC;
+
+	/* Do not block on pipes. */
+	if (path == fifo_path)
+		flags |= O_NONBLOCK;
+
+	fd = open(path, flags | O_RDONLY);
+	ASSERT_LE(0, fd)
+	{
+		TH_LOG("Failed to open %s: %s", path, strerror(errno));
+	}
+	getfl = fcntl(fd, F_GETFL);
+	ASSERT_NE(-1, getfl);
+	ret = fcntl(fd, F_SETFL, getfl | setfl);
+	if (err_code) {
+		ASSERT_EQ(-1, ret)
+		{
+			TH_LOG("Successfully updated flags for %s", path);
+		}
+		EXPECT_EQ(errno, err_code)
+		{
+			TH_LOG("Wrong error code for %s: %s", path,
+			       strerror(errno));
+		}
+	} else {
+		ASSERT_LE(0, ret)
+		{
+			TH_LOG("Failed to update flags for %s: %s", path,
+			       strerror(errno));
+		}
+		EXPECT_EQ(0, close(fd));
+	}
+}
+
+TEST_F(access, deny_write_type_fcntl)
+{
+	int flags;
+
+	test_deny_write_fcntl(_metadata, reg_file_path, O_DENY_WRITE, 0);
+	test_deny_write_fcntl(_metadata, dir_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_fcntl(_metadata, block_dev_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_fcntl(_metadata, char_dev_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_fcntl(_metadata, fifo_path, O_DENY_WRITE, EINVAL);
+
+	flags = fcntl(self->socket_fds[0], F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(-1,
+		  fcntl(self->socket_fds[0], F_SETFL, flags | O_DENY_WRITE));
+	EXPECT_EQ(EINVAL, errno);
+
+	flags = fcntl(self->pipefd, F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(-1, fcntl(self->pipefd, F_SETFL, flags | O_DENY_WRITE));
+	EXPECT_EQ(EINVAL, errno);
+}
+
+TEST_F(access, allow_write_type_fcntl)
+{
+	int flags;
+
+	test_deny_write_fcntl(_metadata, reg_file_path, 0, 0);
+	test_deny_write_fcntl(_metadata, dir_path, 0, 0);
+	test_deny_write_fcntl(_metadata, block_dev_path, 0, 0);
+	test_deny_write_fcntl(_metadata, char_dev_path, 0, 0);
+	test_deny_write_fcntl(_metadata, fifo_path, 0, 0);
+
+	flags = fcntl(self->socket_fds[0], F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(0,
+		  fcntl(self->socket_fds[0], F_SETFL, flags & ~O_DENY_WRITE));
+
+	flags = fcntl(self->pipefd, F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(0, fcntl(self->pipefd, F_SETFL, flags & ~O_DENY_WRITE));
+}
+
 /* clang-format off */
 FIXTURE(secbits) {};
 /* clang-format on */
-- 
2.50.1


^ permalink raw reply related

* [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-22 17:07 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn
  Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu
In-Reply-To: <20250822170800.2116980-1-mic@digikod.net>

Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
passed file descriptors).  This changes the state of the opened file by
making it read-only until it is closed.  The main use case is for script
interpreters to get the guarantee that script' content cannot be altered
while being read and interpreted.  This is useful for generic distros
that may not have a write-xor-execute policy.  See commit a5874fde3c08
("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")

Both execve(2) and the IOCTL to enable fsverity can already set this
property on files with deny_write_access().  This new O_DENY_WRITE make
it widely available.  This is similar to what other OSs may provide
e.g., opening a file with only FILE_SHARE_READ on Windows.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge Hallyn <serge@hallyn.com>
Reported-by: Robert Waite <rowait@microsoft.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20250822170800.2116980-2-mic@digikod.net
---
 fs/fcntl.c                       | 26 ++++++++++++++++++++++++--
 fs/file_table.c                  |  2 ++
 fs/namei.c                       |  6 ++++++
 include/linux/fcntl.h            |  2 +-
 include/uapi/asm-generic/fcntl.h |  4 ++++
 5 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 5598e4d57422..0c80c0fbc706 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -34,7 +34,8 @@
 
 #include "internal.h"
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
+	O_DENY_WRITE)
 
 static int setfl(int fd, struct file * filp, unsigned int arg)
 {
@@ -80,8 +81,29 @@ static int setfl(int fd, struct file * filp, unsigned int arg)
 			error = 0;
 	}
 	spin_lock(&filp->f_lock);
+
+	if (arg & O_DENY_WRITE) {
+		/* Only regular files. */
+		if (!S_ISREG(inode->i_mode)) {
+			error = -EINVAL;
+			goto unlock;
+		}
+
+		/* Only sets once. */
+		if (!(filp->f_flags & O_DENY_WRITE)) {
+			error = exe_file_deny_write_access(filp);
+			if (error)
+				goto unlock;
+		}
+	} else {
+		if (filp->f_flags & O_DENY_WRITE)
+			exe_file_allow_write_access(filp);
+	}
+
 	filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
 	filp->f_iocb_flags = iocb_flags(filp);
+
+unlock:
 	spin_unlock(&filp->f_lock);
 
  out:
@@ -1158,7 +1180,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC));
diff --git a/fs/file_table.c b/fs/file_table.c
index 81c72576e548..6ba896b6a53f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -460,6 +460,8 @@ static void __fput(struct file *file)
 	locks_remove_file(file);
 
 	security_file_release(file);
+	if (unlikely(file->f_flags & O_DENY_WRITE))
+		exe_file_allow_write_access(file);
 	if (unlikely(file->f_flags & FASYNC)) {
 		if (file->f_op->fasync)
 			file->f_op->fasync(-1, file, 0);
diff --git a/fs/namei.c b/fs/namei.c
index cd43ff89fbaa..366530bf937d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3885,6 +3885,12 @@ static int do_open(struct nameidata *nd,
 	error = may_open(idmap, &nd->path, acc_mode, open_flag);
 	if (!error && !(file->f_mode & FMODE_OPENED))
 		error = vfs_open(&nd->path, file);
+	if (!error && (open_flag & O_DENY_WRITE)) {
+		if (S_ISREG(file_inode(file)->i_mode))
+			error = exe_file_deny_write_access(file);
+		else
+			error = -EINVAL;
+	}
 	if (!error)
 		error = security_file_post_open(file, op->acc_mode);
 	if (!error && do_truncate)
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index a332e79b3207..dad14101686f 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_DENY_WRITE)
 
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..facd9136f5af 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -91,6 +91,10 @@
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 
+#ifndef O_DENY_WRITE
+#define O_DENY_WRITE	040000000
+#endif
+
 #ifndef O_NDELAY
 #define O_NDELAY	O_NONBLOCK
 #endif
-- 
2.50.1


^ permalink raw reply related

* [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-22 17:07 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn
  Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module

Hi,

Script interpreters can check if a file would be allowed to be executed
by the kernel using the new AT_EXECVE_CHECK flag. This approach works
well on systems with write-xor-execute policies, where scripts cannot
be modified by malicious processes. However, this protection may not be
available on more generic distributions.

The key difference between `./script.sh` and `sh script.sh` (when using
AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
for writing while it's being executed. To achieve parity, the kernel
should provide a mechanism for script interpreters to deny write access
during script interpretation. While interpreters can copy script content
into a buffer, a race condition remains possible after AT_EXECVE_CHECK.

This patch series introduces a new O_DENY_WRITE flag for use with
open*(2) and fcntl(2). Both interfaces are necessary since script
interpreters may receive either a file path or file descriptor. For
backward compatibility, open(2) with O_DENY_WRITE will not fail on
unsupported systems, while users requiring explicit support guarantees
can use openat2(2).

The check_exec.rst documentation and related examples do not mention this new
feature yet.

Regards,

Mickaël Salaün (2):
  fs: Add O_DENY_WRITE
  selftests/exec: Add O_DENY_WRITE tests

 fs/fcntl.c                                |  26 ++-
 fs/file_table.c                           |   2 +
 fs/namei.c                                |   6 +
 include/linux/fcntl.h                     |   2 +-
 include/uapi/asm-generic/fcntl.h          |   4 +
 tools/testing/selftests/exec/check-exec.c | 219 ++++++++++++++++++++++
 6 files changed, 256 insertions(+), 3 deletions(-)


base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
-- 
2.50.1


^ permalink raw reply

* Re: [PATCH v3 05/12] man/man2/fspick.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-22 13:40 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198d1f2e189.11dbac16b2998.3847935512688537521@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1055 bytes --]

On 2025-08-22, Askar Safin <safinaskar@zohomail.com> wrote:
>  ---- On Sat, 09 Aug 2025 00:39:49 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
>  > +The above procedure is functionally equivalent to
>  > +the following mount operation using
>  > +.BR mount (2):
> 
> This is not true.
> 
> fspick adds options to superblock. It doesn't remove existing ones.

fspick "copies the existing parameters" would be more accurate. I can
reword this, but it's an example and I don't think it makes sense to add
a large amount of clarifying text for each example.

The comparisons to mount(2) are meant to be indicative, but if you I can
also just remove them (David's versions didn't include them).

> mount(MS_REMOUNT) replaces options. I. e. mount(2) call provided in
> example will unset all other options.
> 
> In the end of this message you will find C code, which proves this.

Yes, I am already keenly aware of this behaviour.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Aleksa Sarai @ 2025-08-22 13:25 UTC (permalink / raw)
  To: Rob Landley
  Cc: Christian Brauner, Lichen Liu, linux-fsdevel, linux-kernel,
	safinaskar, kexec, weilongchen, linux-api, zohar, stefanb,
	initramfs, corbet, linux-doc, viro, jack
In-Reply-To: <da1b1926-ba18-4a81-93e0-56cb2f85e4dd@landley.net>

[-- Attachment #1: Type: text/plain, Size: 2158 bytes --]

On 2025-08-21, Rob Landley <rob@landley.net> wrote:
> P.S. It's a pity lkml.iu.edu and spinics.net are both down right now, but
> after vger.kernel.org deleted all reference to them I can't say I'm
> surprised. Neither lkml.org nor lore.kernel.org have an obvious threaded
> interface allowing you to find stuff without a keyword search, and

I'm not sure what issue you're gesturing to exactly, but if you have the
Message-ID you can link to it directly with
<https://lore.kernel.org/lkml/$MESSAGE_ID>. For instance, this email
will be available at
<https://lore.kernel.org/lkml/2025-08-22-witty-worthy-wink-sitcom-T5L8wA@cyphar.com>.

To be honest, I much prefer that to lkml.org's completely opaque
mappings based on arrival order and date (and in my experience it seems
to miss messages). The same goes for lkml.iu.edu, spinics and gmane.

One of the biggest losses when gmane disappeared was that all of the
URLs that referenced it were rendered unusable because the mapping from
their numbering to Message-IDs was not maintained. If lkml.org goes down
10 years from now, every reference to it will also be unusable, but
lore.kernel.org addresses will still be usable even if it goes down (you
can even search your local archives for the mails).

(It would be nice if more people spent a bit of time configuring their
Message-ID generation to be more friendly for this usecase -- mutt
changed their default Message-ID generation to be completely random
characters a few years ago, which made Message-IDs less recognisable
until folks adjusted their configs. Gmail is even worse, obviously.)

Also, lore.kernel.org has threading on the main page and on individual
thread pages? Maybe I don't understand what you're referring to?

> lore.kernel.org somehow manages not to list "linux-kernel" in its top level
> list of "inboxes" at all. The wagons are circled pretty tightly...

No it's definitely there, it's just labeled as "lkml" (they're sorted by
latest message timestamp so LKML will usually be near the top).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 05/12] man/man2/fspick.2: document "new" mount API
From: Askar Safin @ 2025-08-22 13:23 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-5-f61405c80f34@cyphar.com>

 ---- On Sat, 09 Aug 2025 00:39:49 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > +The above procedure is functionally equivalent to
 > +the following mount operation using
 > +.BR mount (2):

This is not true.

fspick adds options to superblock. It doesn't remove existing ones.

mount(MS_REMOUNT) replaces options. I. e. mount(2) call provided in
example will unset all other options.

In the end of this message you will find C code, which proves this.

Also, recently I sent patch to mount(2) manpage,
which clarifies MS_REMOUNT | MS_BIND behavior.
This is somewhat related.
The patch comes with another reproducer.
Here is a link: https://lore.kernel.org/linux-man/20250822114315.1571537-1-safinaskar@zohomail.com/

--
Askar Safin
https://types.pl/@safinaskar

// You need to be root (non-initial user namespace will go)

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/vfs.h>
#include <sys/sysmacros.h>
#include <sys/statvfs.h>
#include <linux/openat2.h>

#define MY_ASSERT(cond) do { \
    if (!(cond)) { \
        fprintf (stderr, "%d: %s: assertion failed\n", __LINE__, #cond); \
        exit (1); \
    } \
} while (0)

int
main (void)
{
    // Init
    {
        MY_ASSERT (chdir ("/") == 0);
        MY_ASSERT (unshare (CLONE_NEWNS) == 0);
        MY_ASSERT (mount (NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp", "tmpfs", 0, NULL) == 0);
    }

    MY_ASSERT (mkdir ("/tmp/a", 0777) == 0);
    MY_ASSERT (mkdir ("/tmp/b", 0777) == 0);

    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", MS_SYNCHRONOUS, NULL) == 0);
        {
            struct statfs buf;
            memset (&buf, 0, sizeof buf);
            MY_ASSERT (statfs ("/tmp/a", &buf) == 0);
            MY_ASSERT (buf.f_flags & ST_SYNCHRONOUS);
            MY_ASSERT (!(buf.f_flags & ST_RDONLY));
        }
        {
            int fsfd = fspick (AT_FDCWD, "/tmp/a", FSPICK_CLOEXEC);
            MY_ASSERT (fsfd >= 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) == 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0) == 0);
            MY_ASSERT (close (fsfd) == 0);
        }
        {
            struct statfs buf;
            memset (&buf, 0, sizeof buf);
            MY_ASSERT (statfs ("/tmp/a", &buf) == 0);
            MY_ASSERT (buf.f_flags & ST_SYNCHRONOUS);
            MY_ASSERT (buf.f_flags & ST_RDONLY);
        }
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", MS_SYNCHRONOUS, NULL) == 0);
        {
            struct statfs buf;
            memset (&buf, 0, sizeof buf);
            MY_ASSERT (statfs ("/tmp/a", &buf) == 0);
            MY_ASSERT (buf.f_flags & ST_SYNCHRONOUS);
            MY_ASSERT (!(buf.f_flags & ST_RDONLY));
        }
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        {
            struct statfs buf;
            memset (&buf, 0, sizeof buf);
            MY_ASSERT (statfs ("/tmp/a", &buf) == 0);
            MY_ASSERT (!(buf.f_flags & ST_SYNCHRONOUS));
            MY_ASSERT (buf.f_flags & ST_RDONLY);
        }
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    printf ("All tests passed\n");
    exit (0);
}


^ permalink raw reply

* Re: [PATCH 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Aleksa Sarai @ 2025-08-22 13:06 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250822114315.1571537-2-safinaskar@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 2925 bytes --]

On 2025-08-22, Askar Safin <safinaskar@zohomail.com> wrote:
> My edit is based on experiments and reading Linux code
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>
> ---
>  man/man2/mount.2 | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 5d83231f9..909b82e88 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -405,7 +405,19 @@ flag can be used with
>  to modify only the per-mount-point flags.
>  .\" See https://lwn.net/Articles/281157/
>  This is particularly useful for setting or clearing the "read-only"
> -flag on a mount without changing the underlying filesystem.
> +flag on a mount without changing flags of the underlying filesystem.

For obvious reasons, I would prefer the term "filesystem parameters"
here but mount(2) is kind of loose with its terminology...

> +The
> +.I data
> +argument is ignored if
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.
> +The
> +.I mountflags
> +should specify existing per-mount-point flags,
> +except for those parameters
> +that are deliberately changed.

I would phrase this more like a note to make the advice a bit clearer:

  Note that the mountpoint will
  have its existing per-mount-point flags
  cleared and replaced with those in
  .I mountflags
  when
  .B MS_REMOUNT
  and
  .B MS_BIND
  are specified.
  This means that if
  you wish to preserve
  any existing per-mount-point flags
  (which can be retrieved using
  .BR statfs (2)),
  you need to include them in
  .IR mountflags ,
  along with the per-mount-point flags you wish to set
  (or with the flags you wish to clear missing).

(Still a bit too wordy, there's probably a nicer way of writing it...)

It might also be a good idea to reference locked mount flags (which are
explained in more detail in mount_namespaces(7)) since they are very
relevant to the text you're adding about MS_REMOUNT|MS_BIND.

The current docs only mention locked mounts once and the description is
kind of insufficient (it implies that only MS_REMOUNT affects this, and
that it's related to the mount being locked -- neither is really true).
When dealing with a mount with locked flags, remembering to include
existing mount attributes is very important.

>  Specifying
>  .I mountflags
>  as:
> @@ -416,8 +428,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
>  .EE
>  .in
>  .P
> -will make access through this mountpoint read-only, without affecting
> -other mounts.
> +will make access through this mountpoint read-only
> +(and clear all other per-mount-point flags),

   (clearing all other per-mount-point flags)

> +without affecting
> +other mounts
> +of this filesystem.
>  .\"
>  .SS Creating a bind mount
>  If

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox