Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH] uapi/fcntl: conditionally define AT_RENAME* macros
From: Randy Dunlap @ 2025-08-24 22:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Randy Dunlap, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, linux-fsdevel, linux-api

Don't define the AT_RENAME_* macros when __USE_GNU is defined since
/usr/include/stdio.h defines them in that case (i.e. when _GNU_SOURCE
is defined, which causes __USE_GNU to be defined).

Having them defined in 2 places causes build warnings (duplicate
definitions) in both samples/watch_queue/watch_test.c and
samples/vfs/test-statx.c.

Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
CC: linux-api@vger.kernel.org

 include/uapi/linux/fcntl.h |    2 ++
 1 file changed, 2 insertions(+)

--- linux-next-20250819.orig/include/uapi/linux/fcntl.h
+++ linux-next-20250819/include/uapi/linux/fcntl.h
@@ -155,10 +155,12 @@
  * as possible, so we can use them for generic bits in the future if necessary.
  */
 
+#ifndef __USE_GNU
 /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
 #define AT_RENAME_NOREPLACE	0x0001
 #define AT_RENAME_EXCHANGE	0x0002
 #define AT_RENAME_WHITEOUT	0x0004
+#endif
 
 /* Flag for faccessat(2). */
 #define AT_EACCESS		0x200	/* Test access permitted for

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Andy Lutomirski @ 2025-08-24 18:04 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Jann Horn, Al Viro, Christian Brauner, Kees Cook, Paul Moore,
	Serge Hallyn, Andy Lutomirski, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Jeff Xu
In-Reply-To: <20250824.Ujoh8unahy5a@digikod.net>

On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > passed file descriptors).  This changes the state of the opened file by
> > > making it read-only until it is closed.  The main use case is for script
> > > interpreters to get the guarantee that script' content cannot be altered
> > > while being read and interpreted.  This is useful for generic distros
> > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > >
> > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > property on files with deny_write_access().  This new O_DENY_WRITE make
> >
> > The kernel actually tried to get rid of this behavior on execve() in
> > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > because it broke userspace assumptions.
>
> Oh, good to know.
>
> >
> > > it widely available.  This is similar to what other OSs may provide
> > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> >
> > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > removed for security reasons; as
> > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> >
> > |        MAP_DENYWRITE
> > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > |               signaled that attempts to write to the underlying file
> > |               should fail with ETXTBSY.  But this was a source of denial-
> > |               of-service attacks.)"
> >
> > It seems to me that the same issue applies to your patch - it would
> > allow unprivileged processes to essentially lock files such that other
> > processes can't write to them anymore. This might allow unprivileged
> > users to prevent root from updating config files or stuff like that if
> > they're updated in-place.
>
> Yes, I agree, but since it is the case for executed files I though it
> was worth starting a discussion on this topic.  This new flag could be
> restricted to executable files, but we should avoid system-wide locks
> like this.  I'm not sure how Windows handle these issues though.
>
> Anyway, we should rely on the access control policy to control write and
> execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> the references and the background!

I'm confused.  I understand that there are many contexts in which one
would want to prevent execution of unapproved content, which might
include preventing a given process from modifying some code and then
executing it.

I don't understand what these deny-write features have to do with it.
These features merely prevent someone from modifying code *that is
currently in use*, which is not at all the same thing as preventing
modifying code that might get executed -- one can often modify
contents *before* executing those contents.

In any case, IMO it's rather sad that the elimination of ETXTBSY had
to be reverted -- it's really quite a nasty feature.  But it occurs to
me that Linux can more or less do what is IMO the actually desired
thing: snapshot the contents of a file and execute the snapshot.  The
hack at the end of the email works!  (Well, it works if the chosen
filesystem supports it.)

$ ./silly_tmp /tmp/test /tmp vim /proc/self/fd/3

emacs is apparently far, far too clever and can't save if you do:

$ ./silly_tmp /tmp/test /tmp emacs /proc/self/fd/3

I'm not seriously suggesting that anyone should execute binaries or
scripts on Linux exactly like this, for a whole bunch of reasons:

- It needs filesystem support (but maybe this isn't so bad)

- It needs write access to a directory on the correct filesystem (a
showstopper for serious use)

- It is wildly incompatible with write-xor-execute, so this would be a
case of one step forward, ten steps back.

- It would defeat a lot of tools that inspect /proc, which would be
quite annoying to say the least.

But maybe a less kludgy version could be used for real.  What if there
was a syscall that would take an fd and make a snapshot of the file?
It would, at least by default, produce a *read-only* snapshot (fully
sealed a la F_SEAL_*), inherit any integrity data that came with the
source (e.g. LSMs could understand it), would not require a writable
directory on the filesystem, and would maybe even come with an extra
seal-like thing that prevents it from being linkat-ed.  (I'm not sure
that linkat would actually be a problem, but I'm also not immediately
sure that LSMs would be as comfortable with it if linkat were
allowed.)  And there could probably be an extremely efficient
implementation that might even reuse the existing deny-write mechanism
to optimize the common case where the file is never written.

For that matter, the actual common case would be to execute stuff in
/usr or similar, and those files really ought never to be modified.
So there could be a file attribute or something that means "this file
CANNOT be modified, but it can still be unlinked or replaced as
usual", and snapshotting such a file would be a no-op.  Distributions
and container tools could set that attribute.  Overlayfs could also
provide an efficient implementation if the file currently comes from
an immutable source.

Hmm, maybe it's not strictly necessary that it be immutable -- maybe
it's sometimes okay if reads start to fail if the contents change.
Let's call this a "weak snapshot" -- reads of a weak snapshot either
return the original contents or fail.  fsverity would give weak
snapshots for at no additional cost.

It's worth noting that the common case doesn't actually need an fd.
We have mmap(..., MAP_PRIVATE, ...).  What we would actually want for
mmap use cases is mmap(..., MAP_SNAPSHOT, ...), with the semantics
that the kernel promises that future writes to the source would either
not be reflected in the mapping or would cause SIGBUS.  One might
reasonably debate what forced-writes would do (I think forced-writes
should be allowed just like they currently are, since anyone who can
force-write to process memory is already assumed to be permitted to
bypass write-xor-execute).

---

/* Written by Claude Sonnet 4 with a surprisingly small amount of help
from Andy */

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc < 4) {
        fprintf(stderr, "Usage: %s <source_file> <temp_dir>
[exec_args...]\n", argv[0]);
        exit(1);
    }

    const char *source_file = argv[1];
    const char *temp_dir = argv[2];

    // Open source file
    int source_fd = open(source_file, O_RDONLY);
    if (source_fd == -1) {
        perror("Failed to open source file");
        exit(1);
    }

    // Create temporary file
    int temp_fd = open(temp_dir, O_TMPFILE | O_RDWR, 0600);
    if (temp_fd == -1) {
        perror("Failed to create temporary file");
        close(source_fd);
        exit(1);
    }

    // Clone the file contents using FICLONE
    if (ioctl(temp_fd, FICLONE, source_fd) == -1) {
        perror("Failed to clone file");
        close(source_fd);
        close(temp_fd);
        exit(1);
    }

    // Close source file
    close(source_fd);

    // Make sure temp file is on fd 3
    if (temp_fd != 3) {
        if (dup2(temp_fd, 3) == -1) {
            perror("Failed to move temp file to fd 3");
            close(temp_fd);
            exit(1);
        }
        close(temp_fd);
    }

    // Execute the remaining arguments
    if (argc >= 3) {
        execvp(argv[3], &argv[3]);
        perror("Failed to execute command");
        exit(1);
    }

    return 0;
}

^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Askar Safin @ 2025-08-24 13:07 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-0-f61405c80f34@cyphar.com>

 ---- On Sat, 09 Aug 2025 00:39:44 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > Back in 2019, the new mount API was merged[1]. David Howells then set

I finished my experiments!

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-24 11:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu
In-Reply-To: <CAG48ez1XjUdcFztc_pF2qcoLi7xvfpJ224Ypc=FoGi-Px-qyZw@mail.gmail.com>

On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > passed file descriptors).  This changes the state of the opened file by
> > making it read-only until it is closed.  The main use case is for script
> > interpreters to get the guarantee that script' content cannot be altered
> > while being read and interpreted.  This is useful for generic distros
> > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> >
> > Both execve(2) and the IOCTL to enable fsverity can already set this
> > property on files with deny_write_access().  This new O_DENY_WRITE make
> 
> The kernel actually tried to get rid of this behavior on execve() in
> commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> because it broke userspace assumptions.

Oh, good to know.

> 
> > it widely available.  This is similar to what other OSs may provide
> > e.g., opening a file with only FILE_SHARE_READ on Windows.
> 
> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> removed for security reasons; as
> https://man7.org/linux/man-pages/man2/mmap.2.html says:
> 
> |        MAP_DENYWRITE
> |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> |               signaled that attempts to write to the underlying file
> |               should fail with ETXTBSY.  But this was a source of denial-
> |               of-service attacks.)"
> 
> It seems to me that the same issue applies to your patch - it would
> allow unprivileged processes to essentially lock files such that other
> processes can't write to them anymore. This might allow unprivileged
> users to prevent root from updating config files or stuff like that if
> they're updated in-place.

Yes, I agree, but since it is the case for executed files I though it
was worth starting a discussion on this topic.  This new flag could be
restricted to executable files, but we should avoid system-wide locks
like this.  I'm not sure how Windows handle these issues though.

Anyway, we should rely on the access control policy to control write and
execute access in a consistent way (e.g. write-xor-execute).  Thanks for
the references and the background!

^ permalink raw reply

* Re: [PATCH v3 09/12] man/man2/open_tree.2: document "new" mount API
From: Askar Safin @ 2025-08-24  6:53 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-9-f61405c80f34@cyphar.com>

 ---- On Sat, 09 Aug 2025 00:39:53 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > +If
 > +.I flags
 > +does not contain
 > +.BR \%OPEN_TREE_CLONE ,
 > +.BR open_tree ()
 > +returns a file descriptor
 > +that is exactly equivalent to
 > +one produced by
 > +.BR openat (2)

This is not true. They differ in handling of automounts.
open_tree follows them in final component (by default),
and openat - not.

See reproducer in the end of this letter.

I suggest merely adding this:
> that is exactly equivalent to one produced by openat(2) (modulo automounts)

--
Askar Safin
https://types.pl/@safinaskar


// Root in initial user namespace

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <linux/openat2.h>

#define MY_ASSERT(cond) do { \
    if (!(cond)) { \
        fprintf (stderr, "%s: assertion failed\n", #cond); \
        exit (1); \
    } \
} while (0)

bool
tracing_mounted (void)
{
    struct statx tracing;
    if (statx (AT_FDCWD, "/tmp/debugfs/tracing", AT_NO_AUTOMOUNT, 0, &tracing) != 0)
        {
            perror ("statx tracing");
            exit (1);
        }
    if (!(tracing.stx_attributes_mask & STATX_ATTR_MOUNT_ROOT))
        {
            fprintf (stderr, "???\n");
            exit (1);
        }
    return tracing.stx_attributes & STATX_ATTR_MOUNT_ROOT;
}

void
mount_debugfs (void)
{
    if (mount (NULL, "/tmp/debugfs", "debugfs", 0, NULL) != 0)
        {
            perror ("mount debugfs");
            exit (1);
        }
    MY_ASSERT (!tracing_mounted ());
}

void
umount_debugfs (void)
{
    umount ("/tmp/debugfs/tracing"); // Ignore errors
    if (umount ("/tmp/debugfs") != 0)
        {
            perror ("umount debugfs");
            exit (1);
        }
}

int
main (void)
{
    // Init
    {
        if (chdir ("/") != 0)
            {
                perror ("chdir /");
                exit (1);
            }
        if (unshare (CLONE_NEWNS) != 0)
            {
                perror ("unshare");
                exit (1);
            }
        if (mount (NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
            {
                perror ("mount(NULL, /, NULL, MS_REC | MS_PRIVATE, NULL)");
                exit (1);
            }
        if (mount (NULL, "/tmp", "tmpfs", 0, NULL) != 0)
            {
                perror ("mount tmpfs");
                exit (1);
            }
    }
    if (mkdir ("/tmp/debugfs", 0777) != 0)
        {
            perror ("mkdir(/tmp/debugfs)");
            exit (1);
        }

    // open(O_PATH) doesn't follow automounts
    {
        mount_debugfs ();
        {
            int fd = open ("/tmp/debugfs/tracing", O_PATH);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (!tracing_mounted ());
        umount_debugfs ();
    }

    // open_tree does follow automounts (by default)
    {
        mount_debugfs ();
        {
            int fd = open_tree (AT_FDCWD, "/tmp/debugfs/tracing", 0);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (tracing_mounted ());
        umount_debugfs ();
    }

    // open (O_PATH | O_DIRECTORY)
    {
        mount_debugfs ();
        {
            int fd = open ("/tmp/debugfs/tracing", O_PATH | O_DIRECTORY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (tracing_mounted ());
        umount_debugfs ();
    }

    // AT_NO_AUTOMOUNT
    {
        mount_debugfs ();
        {
            int fd = open_tree (AT_FDCWD, "/tmp/debugfs/tracing", AT_NO_AUTOMOUNT);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (!tracing_mounted ());
        umount_debugfs ();
    }

    printf ("All tests passed\n");
    exit (0);
}


^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Jann Horn @ 2025-08-22 19:45 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu
In-Reply-To: <20250822170800.2116980-2-mic@digikod.net>

On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> passed file descriptors).  This changes the state of the opened file by
> making it read-only until it is closed.  The main use case is for script
> interpreters to get the guarantee that script' content cannot be altered
> while being read and interpreted.  This is useful for generic distros
> that may not have a write-xor-execute policy.  See commit a5874fde3c08
> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
>
> Both execve(2) and the IOCTL to enable fsverity can already set this
> property on files with deny_write_access().  This new O_DENY_WRITE make

The kernel actually tried to get rid of this behavior on execve() in
commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
because it broke userspace assumptions.

> it widely available.  This is similar to what other OSs may provide
> e.g., opening a file with only FILE_SHARE_READ on Windows.

We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
removed for security reasons; as
https://man7.org/linux/man-pages/man2/mmap.2.html says:

|        MAP_DENYWRITE
|               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
|               signaled that attempts to write to the underlying file
|               should fail with ETXTBSY.  But this was a source of denial-
|               of-service attacks.)"

It seems to me that the same issue applies to your patch - it would
allow unprivileged processes to essentially lock files such that other
processes can't write to them anymore. This might allow unprivileged
users to prevent root from updating config files or stuff like that if
they're updated in-place.

^ permalink raw reply

* [RFC PATCH v1 2/2] selftests/exec: Add O_DENY_WRITE tests
From: Mickaël Salaün @ 2025-08-22 17:08 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn
  Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu
In-Reply-To: <20250822170800.2116980-1-mic@digikod.net>

Add 6 test suites to check O_DENY_WRITE used through open(2) and
fcntl(2).  Check that it fills its purpose, that it only applies to
regular files, and that setting this flag several times is not an issue.

The O_DENY_WRITE flag is useful in conjunction with AT_EXECVE_CHECK for
systems that don't enforce a write-xor-execute policy.  Extend related
tests to also use them as examples.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Robert Waite <rowait@microsoft.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20250822170800.2116980-3-mic@digikod.net
---
 tools/testing/selftests/exec/check-exec.c | 219 ++++++++++++++++++++++
 1 file changed, 219 insertions(+)

diff --git a/tools/testing/selftests/exec/check-exec.c b/tools/testing/selftests/exec/check-exec.c
index 55bce47e56b7..9db1d7b9aa97 100644
--- a/tools/testing/selftests/exec/check-exec.c
+++ b/tools/testing/selftests/exec/check-exec.c
@@ -30,6 +30,10 @@
 #define _ASM_GENERIC_FCNTL_H
 #include <linux/fcntl.h>
 
+#ifndef O_DENY_WRITE
+#define O_DENY_WRITE 040000000
+#endif
+
 #include "../kselftest_harness.h"
 
 static int sys_execveat(int dirfd, const char *pathname, char *const argv[],
@@ -319,6 +323,221 @@ TEST_F(access, non_regular_files)
 	test_exec_fd(_metadata, self->pipefd, EACCES);
 }
 
+TEST_F(access, deny_write_check_open)
+{
+	int fd_deny, fd_read, fd_write;
+
+	fd_deny = open(reg_file_path, O_DENY_WRITE | O_RDONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd_deny);
+
+	/* Concurrent reads are always allowed. */
+	fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_read);
+	EXPECT_EQ(0, close(fd_read));
+
+	/* Concurrent writes are denied. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_EQ(-1, fd_write);
+	EXPECT_EQ(ETXTBSY, errno);
+
+	/* Drops O_DENY_WRITE. */
+	EXPECT_EQ(0, close(fd_deny));
+
+	/* The restriction is now gone. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_write);
+	EXPECT_EQ(0, close(fd_write));
+}
+
+TEST_F(access, deny_write_check_open_and_fcntl)
+{
+	int fd_deny, fd_read, fd_write, flags;
+
+	fd_deny = open(reg_file_path, O_DENY_WRITE | O_RDONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd_deny);
+
+	/* Sets O_DENY_WRITE a "second" time. */
+	flags = fcntl(fd_deny, F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE));
+
+	/* Concurrent reads are always allowed. */
+	fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_read);
+	EXPECT_EQ(0, close(fd_read));
+
+	/* Concurrent writes are denied. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_EQ(-1, fd_write);
+	EXPECT_EQ(ETXTBSY, errno);
+
+	/* Drops O_DENY_WRITE. */
+	EXPECT_EQ(0, close(fd_deny));
+
+	/* The restriction is now gone. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_write);
+	EXPECT_EQ(0, close(fd_write));
+}
+
+TEST_F(access, deny_write_check_fcntl)
+{
+	int fd_deny, fd_read, fd_write, flags;
+
+	fd_deny = open(reg_file_path, O_RDONLY | O_CLOEXEC);
+	ASSERT_LE(0, fd_deny);
+
+	/* Sets O_DENY_WRITE a first time. */
+	flags = fcntl(fd_deny, F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE));
+
+	/* Sets O_DENY_WRITE a "second" time. */
+	EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags | O_DENY_WRITE));
+
+	/* Concurrent reads are always allowed. */
+	fd_read = open(reg_file_path, O_RDONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_read);
+	EXPECT_EQ(0, close(fd_read));
+
+	/* Concurrent writes are denied. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_EQ(-1, fd_write);
+	EXPECT_EQ(ETXTBSY, errno);
+
+	/* Drops O_DENY_WRITE. */
+	EXPECT_EQ(0, fcntl(fd_deny, F_SETFL, flags & ~O_DENY_WRITE));
+
+	/* The restriction is now gone. */
+	fd_write = open(reg_file_path, O_WRONLY | O_CLOEXEC);
+	EXPECT_LE(0, fd_write);
+	EXPECT_EQ(0, close(fd_write));
+
+	EXPECT_EQ(0, close(fd_deny));
+}
+
+static void test_deny_write_open(struct __test_metadata *_metadata,
+				 const char *const path, int flags,
+				 const int err_code)
+{
+	int fd;
+
+	flags |= O_CLOEXEC;
+
+	/* Do not block on pipes. */
+	if (path == fifo_path)
+		flags |= O_NONBLOCK;
+
+	fd = open(path, flags | O_RDONLY);
+	if (err_code) {
+		ASSERT_EQ(-1, fd)
+		{
+			TH_LOG("Successfully opened %s", path);
+		}
+		EXPECT_EQ(errno, err_code)
+		{
+			TH_LOG("Wrong error code for %s: %s", path,
+			       strerror(errno));
+		}
+	} else {
+		ASSERT_LE(0, fd)
+		{
+			TH_LOG("Failed to open %s: %s", path, strerror(errno));
+		}
+		EXPECT_EQ(0, close(fd));
+	}
+}
+
+TEST_F(access, deny_write_type_open)
+{
+	test_deny_write_open(_metadata, reg_file_path, O_DENY_WRITE, 0);
+	test_deny_write_open(_metadata, dir_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_open(_metadata, block_dev_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_open(_metadata, char_dev_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_open(_metadata, fifo_path, O_DENY_WRITE, EINVAL);
+}
+
+static void test_deny_write_fcntl(struct __test_metadata *_metadata,
+				  const char *const path, int setfl,
+				  const int err_code)
+{
+	int fd, ret;
+	int getfl, flags = O_CLOEXEC;
+
+	/* Do not block on pipes. */
+	if (path == fifo_path)
+		flags |= O_NONBLOCK;
+
+	fd = open(path, flags | O_RDONLY);
+	ASSERT_LE(0, fd)
+	{
+		TH_LOG("Failed to open %s: %s", path, strerror(errno));
+	}
+	getfl = fcntl(fd, F_GETFL);
+	ASSERT_NE(-1, getfl);
+	ret = fcntl(fd, F_SETFL, getfl | setfl);
+	if (err_code) {
+		ASSERT_EQ(-1, ret)
+		{
+			TH_LOG("Successfully updated flags for %s", path);
+		}
+		EXPECT_EQ(errno, err_code)
+		{
+			TH_LOG("Wrong error code for %s: %s", path,
+			       strerror(errno));
+		}
+	} else {
+		ASSERT_LE(0, ret)
+		{
+			TH_LOG("Failed to update flags for %s: %s", path,
+			       strerror(errno));
+		}
+		EXPECT_EQ(0, close(fd));
+	}
+}
+
+TEST_F(access, deny_write_type_fcntl)
+{
+	int flags;
+
+	test_deny_write_fcntl(_metadata, reg_file_path, O_DENY_WRITE, 0);
+	test_deny_write_fcntl(_metadata, dir_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_fcntl(_metadata, block_dev_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_fcntl(_metadata, char_dev_path, O_DENY_WRITE, EINVAL);
+	test_deny_write_fcntl(_metadata, fifo_path, O_DENY_WRITE, EINVAL);
+
+	flags = fcntl(self->socket_fds[0], F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(-1,
+		  fcntl(self->socket_fds[0], F_SETFL, flags | O_DENY_WRITE));
+	EXPECT_EQ(EINVAL, errno);
+
+	flags = fcntl(self->pipefd, F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(-1, fcntl(self->pipefd, F_SETFL, flags | O_DENY_WRITE));
+	EXPECT_EQ(EINVAL, errno);
+}
+
+TEST_F(access, allow_write_type_fcntl)
+{
+	int flags;
+
+	test_deny_write_fcntl(_metadata, reg_file_path, 0, 0);
+	test_deny_write_fcntl(_metadata, dir_path, 0, 0);
+	test_deny_write_fcntl(_metadata, block_dev_path, 0, 0);
+	test_deny_write_fcntl(_metadata, char_dev_path, 0, 0);
+	test_deny_write_fcntl(_metadata, fifo_path, 0, 0);
+
+	flags = fcntl(self->socket_fds[0], F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(0,
+		  fcntl(self->socket_fds[0], F_SETFL, flags & ~O_DENY_WRITE));
+
+	flags = fcntl(self->pipefd, F_GETFL);
+	ASSERT_NE(-1, flags);
+	EXPECT_EQ(0, fcntl(self->pipefd, F_SETFL, flags & ~O_DENY_WRITE));
+}
+
 /* clang-format off */
 FIXTURE(secbits) {};
 /* clang-format on */
-- 
2.50.1


^ permalink raw reply related

* [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-22 17:07 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn
  Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module,
	Andy Lutomirski, Jeff Xu
In-Reply-To: <20250822170800.2116980-1-mic@digikod.net>

Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
passed file descriptors).  This changes the state of the opened file by
making it read-only until it is closed.  The main use case is for script
interpreters to get the guarantee that script' content cannot be altered
while being read and interpreted.  This is useful for generic distros
that may not have a write-xor-execute policy.  See commit a5874fde3c08
("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")

Both execve(2) and the IOCTL to enable fsverity can already set this
property on files with deny_write_access().  This new O_DENY_WRITE make
it widely available.  This is similar to what other OSs may provide
e.g., opening a file with only FILE_SHARE_READ on Windows.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge Hallyn <serge@hallyn.com>
Reported-by: Robert Waite <rowait@microsoft.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20250822170800.2116980-2-mic@digikod.net
---
 fs/fcntl.c                       | 26 ++++++++++++++++++++++++--
 fs/file_table.c                  |  2 ++
 fs/namei.c                       |  6 ++++++
 include/linux/fcntl.h            |  2 +-
 include/uapi/asm-generic/fcntl.h |  4 ++++
 5 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 5598e4d57422..0c80c0fbc706 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -34,7 +34,8 @@
 
 #include "internal.h"
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
+	O_DENY_WRITE)
 
 static int setfl(int fd, struct file * filp, unsigned int arg)
 {
@@ -80,8 +81,29 @@ static int setfl(int fd, struct file * filp, unsigned int arg)
 			error = 0;
 	}
 	spin_lock(&filp->f_lock);
+
+	if (arg & O_DENY_WRITE) {
+		/* Only regular files. */
+		if (!S_ISREG(inode->i_mode)) {
+			error = -EINVAL;
+			goto unlock;
+		}
+
+		/* Only sets once. */
+		if (!(filp->f_flags & O_DENY_WRITE)) {
+			error = exe_file_deny_write_access(filp);
+			if (error)
+				goto unlock;
+		}
+	} else {
+		if (filp->f_flags & O_DENY_WRITE)
+			exe_file_allow_write_access(filp);
+	}
+
 	filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
 	filp->f_iocb_flags = iocb_flags(filp);
+
+unlock:
 	spin_unlock(&filp->f_lock);
 
  out:
@@ -1158,7 +1180,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC));
diff --git a/fs/file_table.c b/fs/file_table.c
index 81c72576e548..6ba896b6a53f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -460,6 +460,8 @@ static void __fput(struct file *file)
 	locks_remove_file(file);
 
 	security_file_release(file);
+	if (unlikely(file->f_flags & O_DENY_WRITE))
+		exe_file_allow_write_access(file);
 	if (unlikely(file->f_flags & FASYNC)) {
 		if (file->f_op->fasync)
 			file->f_op->fasync(-1, file, 0);
diff --git a/fs/namei.c b/fs/namei.c
index cd43ff89fbaa..366530bf937d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3885,6 +3885,12 @@ static int do_open(struct nameidata *nd,
 	error = may_open(idmap, &nd->path, acc_mode, open_flag);
 	if (!error && !(file->f_mode & FMODE_OPENED))
 		error = vfs_open(&nd->path, file);
+	if (!error && (open_flag & O_DENY_WRITE)) {
+		if (S_ISREG(file_inode(file)->i_mode))
+			error = exe_file_deny_write_access(file);
+		else
+			error = -EINVAL;
+	}
 	if (!error)
 		error = security_file_post_open(file, op->acc_mode);
 	if (!error && do_truncate)
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index a332e79b3207..dad14101686f 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_DENY_WRITE)
 
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 613475285643..facd9136f5af 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -91,6 +91,10 @@
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 
+#ifndef O_DENY_WRITE
+#define O_DENY_WRITE	040000000
+#endif
+
 #ifndef O_NDELAY
 #define O_NDELAY	O_NONBLOCK
 #endif
-- 
2.50.1


^ permalink raw reply related

* [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-22 17:07 UTC (permalink / raw)
  To: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn
  Cc: Mickaël Salaün, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module

Hi,

Script interpreters can check if a file would be allowed to be executed
by the kernel using the new AT_EXECVE_CHECK flag. This approach works
well on systems with write-xor-execute policies, where scripts cannot
be modified by malicious processes. However, this protection may not be
available on more generic distributions.

The key difference between `./script.sh` and `sh script.sh` (when using
AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
for writing while it's being executed. To achieve parity, the kernel
should provide a mechanism for script interpreters to deny write access
during script interpretation. While interpreters can copy script content
into a buffer, a race condition remains possible after AT_EXECVE_CHECK.

This patch series introduces a new O_DENY_WRITE flag for use with
open*(2) and fcntl(2). Both interfaces are necessary since script
interpreters may receive either a file path or file descriptor. For
backward compatibility, open(2) with O_DENY_WRITE will not fail on
unsupported systems, while users requiring explicit support guarantees
can use openat2(2).

The check_exec.rst documentation and related examples do not mention this new
feature yet.

Regards,

Mickaël Salaün (2):
  fs: Add O_DENY_WRITE
  selftests/exec: Add O_DENY_WRITE tests

 fs/fcntl.c                                |  26 ++-
 fs/file_table.c                           |   2 +
 fs/namei.c                                |   6 +
 include/linux/fcntl.h                     |   2 +-
 include/uapi/asm-generic/fcntl.h          |   4 +
 tools/testing/selftests/exec/check-exec.c | 219 ++++++++++++++++++++++
 6 files changed, 256 insertions(+), 3 deletions(-)

base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
-- 
2.50.1

^ permalink raw reply

* Re: [PATCH v3 05/12] man/man2/fspick.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-22 13:40 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198d1f2e189.11dbac16b2998.3847935512688537521@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1055 bytes --]

On 2025-08-22, Askar Safin <safinaskar@zohomail.com> wrote:
>  ---- On Sat, 09 Aug 2025 00:39:49 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
>  > +The above procedure is functionally equivalent to
>  > +the following mount operation using
>  > +.BR mount (2):
> 
> This is not true.
> 
> fspick adds options to superblock. It doesn't remove existing ones.

fspick "copies the existing parameters" would be more accurate. I can
reword this, but it's an example and I don't think it makes sense to add
a large amount of clarifying text for each example.

The comparisons to mount(2) are meant to be indicative, but if you I can
also just remove them (David's versions didn't include them).

> mount(MS_REMOUNT) replaces options. I. e. mount(2) call provided in
> example will unset all other options.
> 
> In the end of this message you will find C code, which proves this.

Yes, I am already keenly aware of this behaviour.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Aleksa Sarai @ 2025-08-22 13:25 UTC (permalink / raw)
  To: Rob Landley
  Cc: Christian Brauner, Lichen Liu, linux-fsdevel, linux-kernel,
	safinaskar, kexec, weilongchen, linux-api, zohar, stefanb,
	initramfs, corbet, linux-doc, viro, jack
In-Reply-To: <da1b1926-ba18-4a81-93e0-56cb2f85e4dd@landley.net>

[-- Attachment #1: Type: text/plain, Size: 2158 bytes --]

On 2025-08-21, Rob Landley <rob@landley.net> wrote:
> P.S. It's a pity lkml.iu.edu and spinics.net are both down right now, but
> after vger.kernel.org deleted all reference to them I can't say I'm
> surprised. Neither lkml.org nor lore.kernel.org have an obvious threaded
> interface allowing you to find stuff without a keyword search, and

I'm not sure what issue you're gesturing to exactly, but if you have the
Message-ID you can link to it directly with
<https://lore.kernel.org/lkml/$MESSAGE_ID>. For instance, this email
will be available at
<https://lore.kernel.org/lkml/2025-08-22-witty-worthy-wink-sitcom-T5L8wA@cyphar.com>.

To be honest, I much prefer that to lkml.org's completely opaque
mappings based on arrival order and date (and in my experience it seems
to miss messages). The same goes for lkml.iu.edu, spinics and gmane.

One of the biggest losses when gmane disappeared was that all of the
URLs that referenced it were rendered unusable because the mapping from
their numbering to Message-IDs was not maintained. If lkml.org goes down
10 years from now, every reference to it will also be unusable, but
lore.kernel.org addresses will still be usable even if it goes down (you
can even search your local archives for the mails).

(It would be nice if more people spent a bit of time configuring their
Message-ID generation to be more friendly for this usecase -- mutt
changed their default Message-ID generation to be completely random
characters a few years ago, which made Message-IDs less recognisable
until folks adjusted their configs. Gmail is even worse, obviously.)

Also, lore.kernel.org has threading on the main page and on individual
thread pages? Maybe I don't understand what you're referring to?

> lore.kernel.org somehow manages not to list "linux-kernel" in its top level
> list of "inboxes" at all. The wagons are circled pretty tightly...

No it's definitely there, it's just labeled as "lkml" (they're sorted by
latest message timestamp so LKML will usually be near the top).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 05/12] man/man2/fspick.2: document "new" mount API
From: Askar Safin @ 2025-08-22 13:23 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-5-f61405c80f34@cyphar.com>

 ---- On Sat, 09 Aug 2025 00:39:49 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > +The above procedure is functionally equivalent to
 > +the following mount operation using
 > +.BR mount (2):

This is not true.

fspick adds options to superblock. It doesn't remove existing ones.

mount(MS_REMOUNT) replaces options. I. e. mount(2) call provided in
example will unset all other options.

In the end of this message you will find C code, which proves this.

Also, recently I sent patch to mount(2) manpage,
which clarifies MS_REMOUNT | MS_BIND behavior.
This is somewhat related.
The patch comes with another reproducer.
Here is a link: https://lore.kernel.org/linux-man/20250822114315.1571537-1-safinaskar@zohomail.com/

--
Askar Safin
https://types.pl/@safinaskar

// You need to be root (non-initial user namespace will go)

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/vfs.h>
#include <sys/sysmacros.h>
#include <sys/statvfs.h>
#include <linux/openat2.h>

#define MY_ASSERT(cond) do { \
    if (!(cond)) { \
        fprintf (stderr, "%d: %s: assertion failed\n", __LINE__, #cond); \
        exit (1); \
    } \
} while (0)

int
main (void)
{
    // Init
    {
        MY_ASSERT (chdir ("/") == 0);
        MY_ASSERT (unshare (CLONE_NEWNS) == 0);
        MY_ASSERT (mount (NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp", "tmpfs", 0, NULL) == 0);
    }

    MY_ASSERT (mkdir ("/tmp/a", 0777) == 0);
    MY_ASSERT (mkdir ("/tmp/b", 0777) == 0);

    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", MS_SYNCHRONOUS, NULL) == 0);
        {
            struct statfs buf;
            memset (&buf, 0, sizeof buf);
            MY_ASSERT (statfs ("/tmp/a", &buf) == 0);
            MY_ASSERT (buf.f_flags & ST_SYNCHRONOUS);
            MY_ASSERT (!(buf.f_flags & ST_RDONLY));
        }
        {
            int fsfd = fspick (AT_FDCWD, "/tmp/a", FSPICK_CLOEXEC);
            MY_ASSERT (fsfd >= 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) == 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0) == 0);
            MY_ASSERT (close (fsfd) == 0);
        }
        {
            struct statfs buf;
            memset (&buf, 0, sizeof buf);
            MY_ASSERT (statfs ("/tmp/a", &buf) == 0);
            MY_ASSERT (buf.f_flags & ST_SYNCHRONOUS);
            MY_ASSERT (buf.f_flags & ST_RDONLY);
        }
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", MS_SYNCHRONOUS, NULL) == 0);
        {
            struct statfs buf;
            memset (&buf, 0, sizeof buf);
            MY_ASSERT (statfs ("/tmp/a", &buf) == 0);
            MY_ASSERT (buf.f_flags & ST_SYNCHRONOUS);
            MY_ASSERT (!(buf.f_flags & ST_RDONLY));
        }
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        {
            struct statfs buf;
            memset (&buf, 0, sizeof buf);
            MY_ASSERT (statfs ("/tmp/a", &buf) == 0);
            MY_ASSERT (!(buf.f_flags & ST_SYNCHRONOUS));
            MY_ASSERT (buf.f_flags & ST_RDONLY);
        }
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    printf ("All tests passed\n");
    exit (0);
}


^ permalink raw reply

* Re: [PATCH 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Aleksa Sarai @ 2025-08-22 13:06 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250822114315.1571537-2-safinaskar@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 2925 bytes --]

On 2025-08-22, Askar Safin <safinaskar@zohomail.com> wrote:
> My edit is based on experiments and reading Linux code
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>
> ---
>  man/man2/mount.2 | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 5d83231f9..909b82e88 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -405,7 +405,19 @@ flag can be used with
>  to modify only the per-mount-point flags.
>  .\" See https://lwn.net/Articles/281157/
>  This is particularly useful for setting or clearing the "read-only"
> -flag on a mount without changing the underlying filesystem.
> +flag on a mount without changing flags of the underlying filesystem.

For obvious reasons, I would prefer the term "filesystem parameters"
here but mount(2) is kind of loose with its terminology...

> +The
> +.I data
> +argument is ignored if
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.
> +The
> +.I mountflags
> +should specify existing per-mount-point flags,
> +except for those parameters
> +that are deliberately changed.

I would phrase this more like a note to make the advice a bit clearer:

  Note that the mountpoint will
  have its existing per-mount-point flags
  cleared and replaced with those in
  .I mountflags
  when
  .B MS_REMOUNT
  and
  .B MS_BIND
  are specified.
  This means that if
  you wish to preserve
  any existing per-mount-point flags
  (which can be retrieved using
  .BR statfs (2)),
  you need to include them in
  .IR mountflags ,
  along with the per-mount-point flags you wish to set
  (or with the flags you wish to clear missing).

(Still a bit too wordy, there's probably a nicer way of writing it...)

It might also be a good idea to reference locked mount flags (which are
explained in more detail in mount_namespaces(7)) since they are very
relevant to the text you're adding about MS_REMOUNT|MS_BIND.

The current docs only mention locked mounts once and the description is
kind of insufficient (it implies that only MS_REMOUNT affects this, and
that it's related to the mount being locked -- neither is really true).
When dealing with a mount with locked flags, remembering to include
existing mount attributes is very important.

>  Specifying
>  .I mountflags
>  as:
> @@ -416,8 +428,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
>  .EE
>  .in
>  .P
> -will make access through this mountpoint read-only, without affecting
> -other mounts.
> +will make access through this mountpoint read-only
> +(and clear all other per-mount-point flags),

   (clearing all other per-mount-point flags)

> +without affecting
> +other mounts
> +of this filesystem.
>  .\"
>  .SS Creating a bind mount
>  If

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Alejandro Colomar @ 2025-08-22 12:49 UTC (permalink / raw)
  To: Askar Safin
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250822114315.1571537-2-safinaskar@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1750 bytes --]

Hi Askar,

On Fri, Aug 22, 2025 at 11:43:15AM +0000, Askar Safin wrote:
> My edit is based on experiments and reading Linux code
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>

You could add Cc: tags there for people you CC'd in the patch.
(For next time.)

I'll wait before applying the patch, to allow anyone to review it, in
case they want to comment.


Have a lovely day!
Alex

> ---
>  man/man2/mount.2 | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 5d83231f9..909b82e88 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -405,7 +405,19 @@ flag can be used with
>  to modify only the per-mount-point flags.
>  .\" See https://lwn.net/Articles/281157/
>  This is particularly useful for setting or clearing the "read-only"
> -flag on a mount without changing the underlying filesystem.
> +flag on a mount without changing flags of the underlying filesystem.
> +The
> +.I data
> +argument is ignored if
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.
> +The
> +.I mountflags
> +should specify existing per-mount-point flags,
> +except for those parameters
> +that are deliberately changed.
>  Specifying
>  .I mountflags
>  as:
> @@ -416,8 +428,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
>  .EE
>  .in
>  .P
> -will make access through this mountpoint read-only, without affecting
> -other mounts.
> +will make access through this mountpoint read-only
> +(and clear all other per-mount-point flags),
> +without affecting
> +other mounts
> +of this filesystem.
>  .\"
>  .SS Creating a bind mount
>  If
> -- 
> 2.47.2
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-22 12:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	viro, jack
In-Reply-To: <20250821-zirkel-leitkultur-2653cba2cd5b@brauner>

Thanks for reviewing and merging the code!

I used "rootfsflags" here because it is shown as "rootfs" in the mountinfo.

My opinion on naming is similar to Rob’s. However, for me, the function’s
implementation is more important than the variable names, so I don’t have a
strong opinion on this.

(Christian may see this message twice, sorry for that because I clicked reply
button instead of reply-all)

Thanks,
Lichen

On Thu, Aug 21, 2025 at 4:26 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, 15 Aug 2025 20:14:59 +0800, Lichen Liu wrote:
> > When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> > By default, a tmpfs mount is limited to using 50% of the available RAM
> > for its content. This can be problematic in memory-constrained
> > environments, particularly during a kdump capture.
> >
> > In a kdump scenario, the capture kernel boots with a limited amount of
> > memory specified by the 'crashkernel' parameter. If the initramfs is
> > large, it may fail to unpack into the tmpfs rootfs due to insufficient
> > space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> > memory must be available for the mount. This leads to an OOM failure
> > during the early boot process, preventing a successful crash dump.
> >
> > [...]
>
> This seems rather useful but I've renamed "rootfsflags" to
> "initramfs_options" because "rootfsflags" is ambiguous and it's not
> really just about flags.
>
> Other than that I think it would make sense to just raise the limit to
> 90% for the root_fs_type mount. I'm not sure why this super privileged
> code would only be allowed 50% by default.
>
> ---
>
> Applied to the vfs-6.18.misc branch of the vfs/vfs.git tree.
> Patches in the vfs-6.18.misc branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs-6.18.misc
>
> [1/1] fs: Add 'rootfsflags' to set rootfs mount options
>       https://git.kernel.org/vfs/vfs/c/278033a225e1
>


^ permalink raw reply

* [PATCH 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-22 11:43 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250822114315.1571537-1-safinaskar@zohomail.com>

My edit is based on experiments and reading Linux code

Signed-off-by: Askar Safin <safinaskar@zohomail.com>
---
 man/man2/mount.2 | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/man/man2/mount.2 b/man/man2/mount.2
index 5d83231f9..909b82e88 100644
--- a/man/man2/mount.2
+++ b/man/man2/mount.2
@@ -405,7 +405,19 @@ flag can be used with
 to modify only the per-mount-point flags.
 .\" See https://lwn.net/Articles/281157/
 This is particularly useful for setting or clearing the "read-only"
-flag on a mount without changing the underlying filesystem.
+flag on a mount without changing flags of the underlying filesystem.
+The
+.I data
+argument is ignored if
+.B MS_REMOUNT
+and
+.B MS_BIND
+are specified.
+The
+.I mountflags
+should specify existing per-mount-point flags,
+except for those parameters
+that are deliberately changed.
 Specifying
 .I mountflags
 as:
@@ -416,8 +428,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
 .EE
 .in
 .P
-will make access through this mountpoint read-only, without affecting
-other mounts.
+will make access through this mountpoint read-only
+(and clear all other per-mount-point flags),
+without affecting
+other mounts
+of this filesystem.
 .\"
 .SS Creating a bind mount
 If
-- 
2.47.2


^ permalink raw reply related

* [PATCH 0/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-22 11:43 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man

My edit is based on experiments and reading Linux code

You will find C code I used for experiments below

Askar Safin (1):
  man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND

 man/man2/mount.2 | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

-- 
2.47.2

// You need to be root in initial user namespace

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/sysmacros.h>
#include <linux/openat2.h>

#define MY_ASSERT(cond) do { \
    if (!(cond)) { \
        fprintf (stderr, "%d: %s: assertion failed\n", __LINE__, #cond); \
        exit (1); \
    } \
} while (0)

int
main (void)
{
    // Init
    {
        MY_ASSERT (chdir ("/") == 0);
        MY_ASSERT (unshare (CLONE_NEWNS) == 0);
        MY_ASSERT (mount (NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp", "tmpfs", 0, NULL) == 0);
    }

    MY_ASSERT (mkdir ("/tmp/a", 0777) == 0);
    MY_ASSERT (mkdir ("/tmp/b", 0777) == 0);

    // MS_REMOUNT sets options for superblock
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // MS_REMOUNT | MS_BIND sets options for vfsmount
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // fspick sets options for superblock
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        {
            int fsfd = fspick (AT_FDCWD, "/tmp/a", 0);
            MY_ASSERT (fsfd >= 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) == 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0) == 0);
            MY_ASSERT (close (fsfd) == 0);
        }
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // mount_setattr sets options for vfsmount
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        {
            struct mount_attr attr;
            memset (&attr, 0, sizeof attr);
            attr.attr_set = MOUNT_ATTR_RDONLY;
            MY_ASSERT (mount_setattr (AT_FDCWD, "/tmp/a", 0, &attr, sizeof attr) == 0);
        }
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // "ro" as a string works for MS_REMOUNT
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, "ro") == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // "ro" as a string doesn't work for MS_REMOUNT | MS_BIND
    // Option string is ignored
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND, "ro") == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // Removing MS_RDONLY makes mount writable again (in case of MS_REMOUNT | MS_BIND)
    // Same for other options (not tested, but I did read code)
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    // Removing "ro" from option string makes mount writable again (in case of MS_REMOUNT)
    // I. e. mount(2) works exactly as documented
    // This works even if option string is NULL, i. e. NULL works as default option string
    {
        typedef const char *c_string;
        c_string opts[3] = {NULL, "", "rw"};
        for (int i = 0; i != 3; ++i)
            {
                for (int j = 0; j != 3; ++j)
                    {
                        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, opts[i]) == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
                        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
                        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, "ro") == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
                        MY_ASSERT (errno == EROFS);
                        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, opts[j]) == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
                        MY_ASSERT (umount ("/tmp/a") == 0);
                    }
            }
    }

    // Removing MS_RDONLY makes mount writable again (in case of MS_REMOUNT)
    // I. e. mount(2) works exactly as documented
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    // Setting MS_RDONLY (without other flags) removes all other flags, such as MS_NODEV (in case of MS_REMOUNT | MS_BIND)
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mknod ("/tmp/a/mynull", S_IFCHR | 0666, makedev (1, 3)) == 0);

        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        {
            int fd = open ("/tmp/a/mynull", O_WRONLY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (write (fd, "a", 1) == 1);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_NODEV, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (open ("/tmp/a/mynull", O_WRONLY) == -1);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        {
            int fd = open ("/tmp/a/mynull", O_WRONLY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (write (fd, "a", 1) == 1);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (umount ("/tmp/a") == 0);
    }
    printf ("All tests passed\n");
    exit (0);
}

^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Randy Dunlap @ 2025-08-22  0:44 UTC (permalink / raw)
  To: Rob Landley, Christian Brauner, Lichen Liu
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	viro, jack
In-Reply-To: <da1b1926-ba18-4a81-93e0-56cb2f85e4dd@landley.net>

Hi Rob,


On 8/21/25 12:02 PM, Rob Landley wrote:
> On 8/21/25 03:24, Christian Brauner wrote:
>> This seems rather useful but I've renamed "rootfsflags" to
> 
> I remember when bikeshedding came in the form of a question.
> 
>> "initramfs_options" because "rootfsflags" is ambiguous and it's not
>> really just about flags.
> 
> The existing config option (applying to the fallback root=/dev/blah filesystem overmounting rootfs) is called "rootflags", the new name differs for the same reason init= and rdinit= differ.
> 
> The name "rootfs" has been around for over 20 years, as evidenced in https://kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt and so on. Over the past decade least three independently authored patches have come up with the same name for this option. Nobody ever suggested a name where people have to remember whether it has _ or - in it.

Either is accepted. From Documentation/admin-guide/kernel-parameters.rst:

Special handling
----------------

Hyphens (dashes) and underscores are equivalent in parameter names, so::

	log_buf_len=1M print-fatal-signals=1

can also be entered as::

	log-buf-len=1M print_fatal_signals=1


> Technically initramfs is the name of the cpio extractor and related plumbing, the filesystem instance identifies itself as "rootfs" in
> /proc/mounts:
> 
> $ head -n 1 /proc/mounts
> rootfs / rootfs rw,size=29444k,nr_inodes=7361 0 0
> 
> I.E. rootfs is an instance of ramfs (or tmpfs) populated by initramfs.
> 
> Given that rdinit= is two letters added to init= it made sense for rootfsflags= to be two letters added to rootflags= to distinguish them.
> 
> (The "rd" was because it's legacy shared infrastructure with the old 1990s initial ramdisk mechanism ala /dev/ram0. The same reason bootloaders like grub have an "initrd" command to load the external cpio.gz for initramfs when it's not statically linked into the kernel image: the delivery mechanism is the same, the kernel inspects the file type to determine how to handle it. This new option _isn't_ legacy, and "rootfs" is already common parlance, so it seemed obvious to everyone with even moderate domain familiarity what to call it.)
> 
>> Other than that I think it would make sense to just raise the limit to
>> 90% for the root_fs_type mount. I'm not sure why this super privileged
>> code would only be allowed 50% by default.
> 
> Because when a ram based filesystem pins all available memory the kernel deadlocks (ramfs always doing this was one of the motivations to use tmpfs, but tmpfs doesn't mean you have swap), because the existing use cases for this come from low memory systems that already micromanage this sort of thing so a different default wouldn't help, because it isn't a domain-specific decision but was inheriting the tmpfs default value so you'd need extra code _to_ specify a different default, because you didn't read the answer to the previous guy who asked this question earlier in this patch's discussion...
> 
> https://lkml.org/lkml/2025/8/8/1050
> 
> Rob

Thanks for the explanations.

> P.S. It's a pity lkml.iu.edu and spinics.net are both down right now, but after vger.kernel.org deleted all reference to them I can't say I'm surprised. Neither lkml.org nor lore.kernel.org have an obvious threaded interface allowing you to find stuff without a keyword search, and lore.kernel.org somehow manages not to list "linux-kernel" in its top level list of "inboxes" at all. The wagons are circled pretty tightly...
Yep, they down for me also. :(
linux-kernel is called lkml of lore. It would be nice if they were synonyms.
If you go to https://lore.kernel.org/lkml/, you can use the search box to look for
"s:rootfsflags" or just use a browser's Search (usually Ctrl-F) to search for
"rootflags". Then the email thread is visible.
Or just do a huge $search_engine search for something close to
the $Subject -- or some text from the body of the message. But you probably
know all of this.


If you go to lkml.org and click on "Last 100 messages", then scroll down to
	Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options	Rob Landley
you can read the email thread for this message (see left side panel).
Or you can find it by date (if you have any idea what the date was).

cheers.

-- 
~Randy


^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Rob Landley @ 2025-08-21 19:02 UTC (permalink / raw)
  To: Christian Brauner, Lichen Liu
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	viro, jack
In-Reply-To: <20250821-zirkel-leitkultur-2653cba2cd5b@brauner>

On 8/21/25 03:24, Christian Brauner wrote:
> This seems rather useful but I've renamed "rootfsflags" to

I remember when bikeshedding came in the form of a question.

> "initramfs_options" because "rootfsflags" is ambiguous and it's not
> really just about flags.

The existing config option (applying to the fallback root=/dev/blah 
filesystem overmounting rootfs) is called "rootflags", the new name 
differs for the same reason init= and rdinit= differ.

The name "rootfs" has been around for over 20 years, as evidenced in 
https://kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt 
and so on. Over the past decade least three independently authored 
patches have come up with the same name for this option. Nobody ever 
suggested a name where people have to remember whether it has _ or - in it.

Technically initramfs is the name of the cpio extractor and related 
plumbing, the filesystem instance identifies itself as "rootfs" in
/proc/mounts:

$ head -n 1 /proc/mounts
rootfs / rootfs rw,size=29444k,nr_inodes=7361 0 0

I.E. rootfs is an instance of ramfs (or tmpfs) populated by initramfs.

Given that rdinit= is two letters added to init= it made sense for 
rootfsflags= to be two letters added to rootflags= to distinguish them.

(The "rd" was because it's legacy shared infrastructure with the old 
1990s initial ramdisk mechanism ala /dev/ram0. The same reason 
bootloaders like grub have an "initrd" command to load the external 
cpio.gz for initramfs when it's not statically linked into the kernel 
image: the delivery mechanism is the same, the kernel inspects the file 
type to determine how to handle it. This new option _isn't_ legacy, and 
"rootfs" is already common parlance, so it seemed obvious to everyone 
with even moderate domain familiarity what to call it.)

> Other than that I think it would make sense to just raise the limit to
> 90% for the root_fs_type mount. I'm not sure why this super privileged
> code would only be allowed 50% by default.

Because when a ram based filesystem pins all available memory the kernel 
deadlocks (ramfs always doing this was one of the motivations to use 
tmpfs, but tmpfs doesn't mean you have swap), because the existing use 
cases for this come from low memory systems that already micromanage 
this sort of thing so a different default wouldn't help, because it 
isn't a domain-specific decision but was inheriting the tmpfs default 
value so you'd need extra code _to_ specify a different default, because 
you didn't read the answer to the previous guy who asked this question 
earlier in this patch's discussion...

https://lkml.org/lkml/2025/8/8/1050

Rob

P.S. It's a pity lkml.iu.edu and spinics.net are both down right now, 
but after vger.kernel.org deleted all reference to them I can't say I'm 
surprised. Neither lkml.org nor lore.kernel.org have an obvious threaded 
interface allowing you to find stuff without a keyword search, and 
lore.kernel.org somehow manages not to list "linux-kernel" in its top 
level list of "inboxes" at all. The wagons are circled pretty tightly...

^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 14:21 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198cc8d3da6.124bd761f86893.6196757670555212232@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 2380 bytes --]

On 2025-08-21, Askar Safin <safinaskar@zohomail.com> wrote:
> There is one particular case when open_tree is more powerful than openat with O_PATH. open_tree supports AT_EMPTY_PATH, and openat supports nothing similar.
> This means that we can convert normal O_RDONLY file descriptor to O_PATH descriptor using open_tree! I. e.:
>   rd = openat(AT_FDCWD, "/tmp/a", O_RDONLY, 0); // Regular file
>   open_tree(rd, "", AT_EMPTY_PATH);
> You can achieve same effect using /proc:
>   rd = openat(AT_FDCWD, "/tmp/a", O_RDONLY, 0); // Regular file
>   snprintf(buf, sizeof(buf), "/proc/self/fd/%d", rd);
>   openat(AT_FDCWD, buf, O_PATH, 0);
> But still I think this has security implications. This means that even if we deny access to /proc for container, it still is able to convert O_RDONLY
> descriptors to O_PATH descriptors using open_tree. I. e. this is yet another thing to think about when creating sandboxes.
> I know you delivered a talk about similar things a lot of time ago: https://lwn.net/Articles/934460/ . (I tested this.)

O_RDONLY -> O_PATH is less of an issue than the other way around. There
isn't much you can do with O_PATH that you can't do with a properly open
file (by design you actually should have strictly less privileges but
some operations are only really possible with O_PATH, but they're not
security-critical in that way).

I was working on a new patchset for resolving this issue (and adding
O_EMPTYPATH support) late last year but other things fell on my plate
and the design was quite difficult to get to a place where everyone
agreed to it.

The core issue is that we would need to block not just re-opening but
also any operation that is a write (or read) in disguise, which kind of
implies you need to have capabilities attached to file descriptors. This
is already slightly shaky ground if you look at the history of projects
like capsicum -- but also my impression was that just adding it to
"file_permission" was not sufficient, you need to put it in
"path_permission" which means we have to either bloat "struct path" or
come up with some extended structure that you need to plumb through
everywhere.

But yes, this is a thing that is still on my list of things to do, but
not in the immediate future.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Askar Safin @ 2025-08-21 13:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Lichen Liu, linux-fsdevel, linux-kernel, kexec, rob, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	viro, jack
In-Reply-To: <20250821-zirkel-leitkultur-2653cba2cd5b@brauner>

 ---- On Thu, 21 Aug 2025 12:24:11 +0400  Christian Brauner <brauner@kernel.org> wrote --- 
 > Applied to the vfs-6.18.misc branch of the vfs/vfs.git tree.
 > Patches in the vfs-6.18.misc branch should appear in linux-next soon.

Applied version contains this:
> Specify mount options for for the initramfs mount

I. e. "for" two times.

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Askar Safin @ 2025-08-21 12:14 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-0-f61405c80f34@cyphar.com>

There is one particular case when open_tree is more powerful than openat with O_PATH. open_tree supports AT_EMPTY_PATH, and openat supports nothing similar.
This means that we can convert normal O_RDONLY file descriptor to O_PATH descriptor using open_tree! I. e.:
  rd = openat(AT_FDCWD, "/tmp/a", O_RDONLY, 0); // Regular file
  open_tree(rd, "", AT_EMPTY_PATH);
You can achieve same effect using /proc:
  rd = openat(AT_FDCWD, "/tmp/a", O_RDONLY, 0); // Regular file
  snprintf(buf, sizeof(buf), "/proc/self/fd/%d", rd);
  openat(AT_FDCWD, buf, O_PATH, 0);
But still I think this has security implications. This means that even if we deny access to /proc for container, it still is able to convert O_RDONLY
descriptors to O_PATH descriptors using open_tree. I. e. this is yet another thing to think about when creating sandboxes.
I know you delivered a talk about similar things a lot of time ago: https://lwn.net/Articles/934460/ . (I tested this.)

--
Askar Safin
https://types.pl/@safinaskar

^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Askar Safin @ 2025-08-21 11:57 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-21.1755776485-stony-another-giggle-rodent-9HLjPO@cyphar.com>

 ---- On Thu, 21 Aug 2025 15:44:42 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > I'm of two minds whether I should fix the behaviour and then re-send
 > man-pages with updated text (delaying the next round of man-page reviews
 > by a month) or just reduce the specificity of this text and then add
 > more details after it has been fixed.

Do what you want.
I'm not in hurry.

CC me if you write any patches, please.
--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 11:49 UTC (permalink / raw)
  To: Askar Safin
  Cc: alx, brauner, dhowells, g.branden.robinson, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-man, mtk.manpages, viro,
	Ian Kent, autofs mailing list
In-Reply-To: <198c74541c8.c835b65275081.1338200284666207736@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1181 bytes --]

On 2025-08-20, Askar Safin <safinaskar@zohomail.com> wrote:
>  ---- On Sun, 17 Aug 2025 20:16:04 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
>  > They are not tested by fstests AFAICS, but that's more of a flaw in
>  > fstests (automount requires you to have a running autofs daemon, which
>  > probably makes testing it in fstests or selftests impractical) not the
>  > feature itself.
> 
> I suggest testing automounts in fstests/selftests using "tracing" automount.
> This is what I do in my reproducers.
> 
>  > The automount behaviour of tracefs is different to the general automount
>  > mechanism which is managed by userspace with the autofs daemon.
> 
> Yes. But I still was able to write reproducers using "tracing", so this
> automount point is totally okay for tests. (At least for some tests,
> such as RESOLVE_NO_XDEV.)

Sure, but I don't think people use allyesconfig when running selftests.
I wonder if the automated test runners even enable deprecated features
like that.

In any case, you can definitely write some tests for it. :D

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 11:47 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198cc299cd9.eec1817f85794.4679093070969175955@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1361 bytes --]

On 2025-08-21, Askar Safin <safinaskar@zohomail.com> wrote:
> There is a convention: you can pass invalid fd (such as -1) as dfd to *at-syscalls to enforce that the path is absolute.
> This is documented. "man openat" says: "Specifying an invalid file descriptor number in dirfd can be used as a means to ensure that pathname is absolute".
> But fsconfig with FSCONFIG_SET_PATH breaks this convention due to this line: https://elixir.bootlin.com/linux/v6.16/source/fs/fsopen.c#L377 .
> I think this is a bug, and it should be fixed in kernel. Also, it is possible there are a lot of similarly buggy syscalls. All of them should be fixed,
> and moreover a warning should be added to https://docs.kernel.org/process/adding-syscalls.html . And then new fsconfig behavior should be documented.
> (Of course, I'm not saying that *you* should do all these. I'm just saying that this bug exists.) (I tested this.)

Indeed, good catch! I think we discussed this before --
FSCONFIG_SET_PATH actually doesn't work with any parameters today so
it's not very surprising nobody has noticed this until now. I'll include
it in the set of fixes I have for fscontext.

(FWIW, the convention I see more commonly is -EBADF but that's just a
stylistic I suppose.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox