Linux userland API discussions


* Re: [PATCH 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Aleksa Sarai @ 2025-08-22 13:06 UTC
  To: Askar Safin
  Cc: Alejandro Colomar, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250822114315.1571537-2-safinaskar@zohomail.com>


On 2025-08-22, Askar Safin <safinaskar@zohomail.com> wrote:
> My edit is based on experiments and reading Linux code
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>
> ---
>  man/man2/mount.2 | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 5d83231f9..909b82e88 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -405,7 +405,19 @@ flag can be used with
>  to modify only the per-mount-point flags.
>  .\" See https://lwn.net/Articles/281157/
>  This is particularly useful for setting or clearing the "read-only"
> -flag on a mount without changing the underlying filesystem.
> +flag on a mount without changing flags of the underlying filesystem.

For obvious reasons, I would prefer the term "filesystem parameters"
here but mount(2) is kind of loose with its terminology...

> +The
> +.I data
> +argument is ignored if
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.
> +The
> +.I mountflags
> +should specify existing per-mount-point flags,
> +except for those parameters
> +that are deliberately changed.

I would phrase this more like a note to make the advice a bit clearer:

  Note that the mountpoint will
  have its existing per-mount-point flags
  cleared and replaced with those in
  .I mountflags
  when
  .B MS_REMOUNT
  and
  .B MS_BIND
  are specified.
  This means that if
  you wish to preserve
  any existing per-mount-point flags
  (which can be retrieved using
  .BR statfs (2)),
  you need to include them in
  .IR mountflags ,
  along with the per-mount-point flags you wish to set
  (or with the flags you wish to clear missing).

(Still a bit too wordy, there's probably a nicer way of writing it...)

It might also be a good idea to reference locked mount flags (which are
explained in more detail in mount_namespaces(7)) since they are very
relevant to the text you're adding about MS_REMOUNT|MS_BIND.

The current docs only mention locked mounts once and the description is
kind of insufficient (it implies that only MS_REMOUNT affects this, and
that it's related to the mount being locked -- neither is really true).
When dealing with a mount with locked flags, remembering to include
existing mount attributes is very important.

>  Specifying
>  .I mountflags
>  as:
> @@ -416,8 +428,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
>  .EE
>  .in
>  .P
> -will make access through this mountpoint read-only, without affecting
> -other mounts.
> +will make access through this mountpoint read-only
> +(and clear all other per-mount-point flags),

   (clearing all other per-mount-point flags)

> +without affecting
> +other mounts
> +of this filesystem.
>  .\"
>  .SS Creating a bind mount
>  If

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/



* Re: [PATCH 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Alejandro Colomar @ 2025-08-22 12:49 UTC
  To: Askar Safin
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250822114315.1571537-2-safinaskar@zohomail.com>


Hi Askar,

On Fri, Aug 22, 2025 at 11:43:15AM +0000, Askar Safin wrote:
> My edit is based on experiments and reading Linux code
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>

You could add Cc: tags there for people you CC'd in the patch.
(For next time.)

I'll wait before applying the patch, to allow anyone to review it, in
case they want to comment.


Have a lovely day!
Alex

> ---
>  man/man2/mount.2 | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 5d83231f9..909b82e88 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -405,7 +405,19 @@ flag can be used with
>  to modify only the per-mount-point flags.
>  .\" See https://lwn.net/Articles/281157/
>  This is particularly useful for setting or clearing the "read-only"
> -flag on a mount without changing the underlying filesystem.
> +flag on a mount without changing flags of the underlying filesystem.
> +The
> +.I data
> +argument is ignored if
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.
> +The
> +.I mountflags
> +should specify existing per-mount-point flags,
> +except for those parameters
> +that are deliberately changed.
>  Specifying
>  .I mountflags
>  as:
> @@ -416,8 +428,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
>  .EE
>  .in
>  .P
> -will make access through this mountpoint read-only, without affecting
> -other mounts.
> +will make access through this mountpoint read-only
> +(and clear all other per-mount-point flags),
> +without affecting
> +other mounts
> +of this filesystem.
>  .\"
>  .SS Creating a bind mount
>  If
> -- 
> 2.47.2
> 

-- 
<https://www.alejandro-colomar.es/>



* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-22 12:48 UTC
  To: Christian Brauner
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	viro, jack
In-Reply-To: <20250821-zirkel-leitkultur-2653cba2cd5b@brauner>

Thanks for reviewing and merging the code!

I used "rootfsflags" here because it is shown as "rootfs" in the mountinfo.

My opinion on naming is similar to Rob’s. However, for me, the function’s
implementation is more important than the variable names, so I don’t have a
strong opinion on this.

(Christian may see this message twice; sorry about that, I clicked the reply
button instead of reply-all.)

Thanks,
Lichen

On Thu, Aug 21, 2025 at 4:26 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, 15 Aug 2025 20:14:59 +0800, Lichen Liu wrote:
> > When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> > By default, a tmpfs mount is limited to using 50% of the available RAM
> > for its content. This can be problematic in memory-constrained
> > environments, particularly during a kdump capture.
> >
> > In a kdump scenario, the capture kernel boots with a limited amount of
> > memory specified by the 'crashkernel' parameter. If the initramfs is
> > large, it may fail to unpack into the tmpfs rootfs due to insufficient
> > space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> > memory must be available for the mount. This leads to an OOM failure
> > during the early boot process, preventing a successful crash dump.
> >
> > [...]
>
> This seems rather useful but I've renamed "rootfsflags" to
> "initramfs_options" because "rootfsflags" is ambiguous and it's not
> really just about flags.
>
> Other than that I think it would make sense to just raise the limit to
> 90% for the root_fs_type mount. I'm not sure why this super privileged
> code would only be allowed 50% by default.
>
> ---
>
> Applied to the vfs-6.18.misc branch of the vfs/vfs.git tree.
> Patches in the vfs-6.18.misc branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs-6.18.misc
>
> [1/1] fs: Add 'rootfsflags' to set rootfs mount options
>       https://git.kernel.org/vfs/vfs/c/278033a225e1
>



* [PATCH 1/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-22 11:43 UTC
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250822114315.1571537-1-safinaskar@zohomail.com>

My edit is based on experiments and reading Linux code

Signed-off-by: Askar Safin <safinaskar@zohomail.com>
---
 man/man2/mount.2 | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/man/man2/mount.2 b/man/man2/mount.2
index 5d83231f9..909b82e88 100644
--- a/man/man2/mount.2
+++ b/man/man2/mount.2
@@ -405,7 +405,19 @@ flag can be used with
 to modify only the per-mount-point flags.
 .\" See https://lwn.net/Articles/281157/
 This is particularly useful for setting or clearing the "read-only"
-flag on a mount without changing the underlying filesystem.
+flag on a mount without changing flags of the underlying filesystem.
+The
+.I data
+argument is ignored if
+.B MS_REMOUNT
+and
+.B MS_BIND
+are specified.
+The
+.I mountflags
+should specify existing per-mount-point flags,
+except for those parameters
+that are deliberately changed.
 Specifying
 .I mountflags
 as:
@@ -416,8 +428,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
 .EE
 .in
 .P
-will make access through this mountpoint read-only, without affecting
-other mounts.
+will make access through this mountpoint read-only
+(and clear all other per-mount-point flags),
+without affecting
+other mounts
+of this filesystem.
 .\"
 .SS Creating a bind mount
 If
-- 
2.47.2



* [PATCH 0/1] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Askar Safin @ 2025-08-22 11:43 UTC
  To: Alejandro Colomar
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man

My edit is based on experiments and on reading the Linux code.

The C code I used for the experiments is below.

Askar Safin (1):
  man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND

 man/man2/mount.2 | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

-- 
2.47.2

// You need to be root in initial user namespace

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <sys/sysmacros.h>
#include <linux/openat2.h>

#define MY_ASSERT(cond) do { \
    if (!(cond)) { \
        fprintf (stderr, "%d: %s: assertion failed\n", __LINE__, #cond); \
        exit (1); \
    } \
} while (0)

int
main (void)
{
    // Init
    {
        MY_ASSERT (chdir ("/") == 0);
        MY_ASSERT (unshare (CLONE_NEWNS) == 0);
        MY_ASSERT (mount (NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp", "tmpfs", 0, NULL) == 0);
    }

    MY_ASSERT (mkdir ("/tmp/a", 0777) == 0);
    MY_ASSERT (mkdir ("/tmp/b", 0777) == 0);

    // MS_REMOUNT sets options for superblock
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // MS_REMOUNT | MS_BIND sets options for vfsmount
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // fspick sets options for superblock
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        {
            int fsfd = fspick (AT_FDCWD, "/tmp/a", 0);
            MY_ASSERT (fsfd >= 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) == 0);
            MY_ASSERT (fsconfig (fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0) == 0);
            MY_ASSERT (close (fsfd) == 0);
        }
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // mount_setattr sets options for vfsmount
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        {
            struct mount_attr attr;
            memset (&attr, 0, sizeof attr);
            attr.attr_set = MOUNT_ATTR_RDONLY;
            MY_ASSERT (mount_setattr (AT_FDCWD, "/tmp/a", 0, &attr, sizeof attr) == 0);
        }
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // "ro" as a string works for MS_REMOUNT
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, "ro") == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // "ro" as a string doesn't work for MS_REMOUNT | MS_BIND
    // Option string is ignored
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount ("/tmp/a", "/tmp/b", NULL, MS_BIND, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND, "ro") == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (mkdir ("/tmp/b/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/b/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
        MY_ASSERT (umount ("/tmp/b") == 0);
    }

    // Removing MS_RDONLY makes mount writable again (in case of MS_REMOUNT | MS_BIND)
    // Same for other options (not tested, but I did read code)
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    // Removing "ro" from option string makes mount writable again (in case of MS_REMOUNT)
    // I. e. mount(2) works exactly as documented
    // This works even if option string is NULL, i. e. NULL works as default option string
    {
        typedef const char *c_string;
        c_string opts[3] = {NULL, "", "rw"};
        for (int i = 0; i != 3; ++i)
            {
                for (int j = 0; j != 3; ++j)
                    {
                        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, opts[i]) == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
                        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
                        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, "ro") == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
                        MY_ASSERT (errno == EROFS);
                        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, opts[j]) == 0);
                        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
                        MY_ASSERT (umount ("/tmp/a") == 0);
                    }
            }
    }

    // Removing MS_RDONLY makes mount writable again (in case of MS_REMOUNT)
    // I. e. mount(2) works exactly as documented
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        MY_ASSERT (errno == EROFS);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (umount ("/tmp/a") == 0);
    }

    // Setting MS_RDONLY (without other flags) removes all other flags, such as MS_NODEV (in case of MS_REMOUNT | MS_BIND)
    {
        MY_ASSERT (mount (NULL, "/tmp/a", "tmpfs", 0, NULL) == 0);
        MY_ASSERT (mknod ("/tmp/a/mynull", S_IFCHR | 0666, makedev (1, 3)) == 0);

        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        {
            int fd = open ("/tmp/a/mynull", O_WRONLY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (write (fd, "a", 1) == 1);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_NODEV, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == 0);
        MY_ASSERT (rmdir ("/tmp/a/c") == 0);
        MY_ASSERT (open ("/tmp/a/mynull", O_WRONLY) == -1);
        MY_ASSERT (mount (NULL, "/tmp/a", NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0);
        MY_ASSERT (mkdir ("/tmp/a/c", 0777) == -1);
        {
            int fd = open ("/tmp/a/mynull", O_WRONLY);
            MY_ASSERT (fd >= 0);
            MY_ASSERT (write (fd, "a", 1) == 1);
            MY_ASSERT (close (fd) == 0);
        }
        MY_ASSERT (umount ("/tmp/a") == 0);
    }
    printf ("All tests passed\n");
    exit (0);
}


* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Randy Dunlap @ 2025-08-22  0:44 UTC
  To: Rob Landley, Christian Brauner, Lichen Liu
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	viro, jack
In-Reply-To: <da1b1926-ba18-4a81-93e0-56cb2f85e4dd@landley.net>

Hi Rob,


On 8/21/25 12:02 PM, Rob Landley wrote:
> On 8/21/25 03:24, Christian Brauner wrote:
>> This seems rather useful but I've renamed "rootfsflags" to
> 
> I remember when bikeshedding came in the form of a question.
> 
>> "initramfs_options" because "rootfsflags" is ambiguous and it's not
>> really just about flags.
> 
> The existing config option (applying to the fallback root=/dev/blah filesystem overmounting rootfs) is called "rootflags", the new name differs for the same reason init= and rdinit= differ.
> 
> The name "rootfs" has been around for over 20 years, as evidenced in https://kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt and so on. Over the past decade at least three independently authored patches have come up with the same name for this option. Nobody ever suggested a name where people have to remember whether it has _ or - in it.

Either is accepted. From Documentation/admin-guide/kernel-parameters.rst:

Special handling
----------------

Hyphens (dashes) and underscores are equivalent in parameter names, so::

	log_buf_len=1M print-fatal-signals=1

can also be entered as::

	log-buf-len=1M print_fatal_signals=1
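
(To illustrate the rule quoted above: a minimal sketch of a name
comparison that treats '-' and '_' as the same character. The function
name param_name_eq is hypothetical and this is not the kernel's own
implementation; it only covers parameter names, not their values.)

```c
#include <stdbool.h>

/* Map '-' to '_' so that both spellings compare equal. */
static char
fold_dash (char c)
{
    return c == '-' ? '_' : c;
}

/* Return true if two parameter names are equal when hyphens and
   underscores are treated as equivalent. */
bool
param_name_eq (const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (fold_dash (*a) != fold_dash (*b))
            return false;
        ++a;
        ++b;
    }
    return *a == *b;  /* equal only if both strings ended together */
}
```

(Under this rule "initramfs_options" and "initramfs-options" name the
same parameter.)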


> Technically initramfs is the name of the cpio extractor and related plumbing, the filesystem instance identifies itself as "rootfs" in
> /proc/mounts:
> 
> $ head -n 1 /proc/mounts
> rootfs / rootfs rw,size=29444k,nr_inodes=7361 0 0
> 
> I.E. rootfs is an instance of ramfs (or tmpfs) populated by initramfs.
> 
> Given that rdinit= is two letters added to init= it made sense for rootfsflags= to be two letters added to rootflags= to distinguish them.
> 
> (The "rd" was because it's legacy shared infrastructure with the old 1990s initial ramdisk mechanism ala /dev/ram0. The same reason bootloaders like grub have an "initrd" command to load the external cpio.gz for initramfs when it's not statically linked into the kernel image: the delivery mechanism is the same, the kernel inspects the file type to determine how to handle it. This new option _isn't_ legacy, and "rootfs" is already common parlance, so it seemed obvious to everyone with even moderate domain familiarity what to call it.)
> 
>> Other than that I think it would make sense to just raise the limit to
>> 90% for the root_fs_type mount. I'm not sure why this super privileged
>> code would only be allowed 50% by default.
> 
> Because when a ram based filesystem pins all available memory the kernel deadlocks (ramfs always doing this was one of the motivations to use tmpfs, but tmpfs doesn't mean you have swap), because the existing use cases for this come from low memory systems that already micromanage this sort of thing so a different default wouldn't help, because it isn't a domain-specific decision but was inheriting the tmpfs default value so you'd need extra code _to_ specify a different default, because you didn't read the answer to the previous guy who asked this question earlier in this patch's discussion...
> 
> https://lkml.org/lkml/2025/8/8/1050
> 
> Rob

Thanks for the explanations.

> P.S. It's a pity lkml.iu.edu and spinics.net are both down right now, but after vger.kernel.org deleted all reference to them I can't say I'm surprised. Neither lkml.org nor lore.kernel.org have an obvious threaded interface allowing you to find stuff without a keyword search, and lore.kernel.org somehow manages not to list "linux-kernel" in its top level list of "inboxes" at all. The wagons are circled pretty tightly...
Yep, they are down for me also. :(
linux-kernel is called lkml on lore. It would be nice if they were synonyms.
If you go to https://lore.kernel.org/lkml/, you can use the search box to look for
"s:rootfsflags" or just use a browser's Search (usually Ctrl-F) to search for
"rootflags". Then the email thread is visible.
Or just do a huge $search_engine search for something close to
the $Subject -- or some text from the body of the message. But you probably
know all of this.


If you go to lkml.org and click on "Last 100 messages", then scroll down to
	Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options	Rob Landley
you can read the email thread for this message (see left side panel).
Or you can find it by date (if you have any idea what the date was).

cheers.

-- 
~Randy



* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Rob Landley @ 2025-08-21 19:02 UTC
  To: Christian Brauner, Lichen Liu
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	viro, jack
In-Reply-To: <20250821-zirkel-leitkultur-2653cba2cd5b@brauner>

On 8/21/25 03:24, Christian Brauner wrote:
> This seems rather useful but I've renamed "rootfsflags" to

I remember when bikeshedding came in the form of a question.

> "initramfs_options" because "rootfsflags" is ambiguous and it's not
> really just about flags.

The existing config option (applying to the fallback root=/dev/blah 
filesystem overmounting rootfs) is called "rootflags", the new name 
differs for the same reason init= and rdinit= differ.

The name "rootfs" has been around for over 20 years, as evidenced in 
https://kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt 
and so on. Over the past decade at least three independently authored 
patches have come up with the same name for this option. Nobody ever 
suggested a name where people have to remember whether it has _ or - in it.

Technically initramfs is the name of the cpio extractor and related 
plumbing, the filesystem instance identifies itself as "rootfs" in
/proc/mounts:

$ head -n 1 /proc/mounts
rootfs / rootfs rw,size=29444k,nr_inodes=7361 0 0

I.E. rootfs is an instance of ramfs (or tmpfs) populated by initramfs.

Given that rdinit= is two letters added to init= it made sense for 
rootfsflags= to be two letters added to rootflags= to distinguish them.

(The "rd" was because it's legacy shared infrastructure with the old 
1990s initial ramdisk mechanism ala /dev/ram0. The same reason 
bootloaders like grub have an "initrd" command to load the external 
cpio.gz for initramfs when it's not statically linked into the kernel 
image: the delivery mechanism is the same, the kernel inspects the file 
type to determine how to handle it. This new option _isn't_ legacy, and 
"rootfs" is already common parlance, so it seemed obvious to everyone 
with even moderate domain familiarity what to call it.)

> Other than that I think it would make sense to just raise the limit to
> 90% for the root_fs_type mount. I'm not sure why this super privileged
> code would only be allowed 50% by default.

Because when a ram based filesystem pins all available memory the kernel 
deadlocks (ramfs always doing this was one of the motivations to use 
tmpfs, but tmpfs doesn't mean you have swap), because the existing use 
cases for this come from low memory systems that already micromanage 
this sort of thing so a different default wouldn't help, because it 
isn't a domain-specific decision but was inheriting the tmpfs default 
value so you'd need extra code _to_ specify a different default, because 
you didn't read the answer to the previous guy who asked this question 
earlier in this patch's discussion...

https://lkml.org/lkml/2025/8/8/1050

Rob

P.S. It's a pity lkml.iu.edu and spinics.net are both down right now, 
but after vger.kernel.org deleted all reference to them I can't say I'm 
surprised. Neither lkml.org nor lore.kernel.org have an obvious threaded 
interface allowing you to find stuff without a keyword search, and 
lore.kernel.org somehow manages not to list "linux-kernel" in its top 
level list of "inboxes" at all. The wagons are circled pretty tightly...


* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 14:21 UTC
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198cc8d3da6.124bd761f86893.6196757670555212232@zohomail.com>


On 2025-08-21, Askar Safin <safinaskar@zohomail.com> wrote:
> There is one particular case where open_tree is more powerful than openat with O_PATH: open_tree supports AT_EMPTY_PATH, and openat supports nothing similar.
> This means that we can convert a normal O_RDONLY file descriptor to an O_PATH descriptor using open_tree! I.e.:
>   rd = openat(AT_FDCWD, "/tmp/a", O_RDONLY, 0); // Regular file
>   open_tree(rd, "", AT_EMPTY_PATH);
> You can achieve the same effect using /proc:
>   rd = openat(AT_FDCWD, "/tmp/a", O_RDONLY, 0); // Regular file
>   snprintf(buf, sizeof(buf), "/proc/self/fd/%d", rd);
>   openat(AT_FDCWD, buf, O_PATH, 0);
> But I still think this has security implications. It means that even if we deny a container access to /proc, it is still able to convert O_RDONLY
> descriptors to O_PATH descriptors using open_tree. I.e. this is yet another thing to think about when creating sandboxes.
> I know you delivered a talk about similar things a long time ago: https://lwn.net/Articles/934460/ . (I tested this.)

O_RDONLY -> O_PATH is less of an issue than the other way around. There
isn't much you can do with O_PATH that you can't do with a properly open
file (by design you actually should have strictly less privileges but
some operations are only really possible with O_PATH, but they're not
security-critical in that way).

I was working on a new patchset for resolving this issue (and adding
O_EMPTYPATH support) late last year but other things fell on my plate
and the design was quite difficult to get to a place where everyone
agreed to it.

The core issue is that we would need to block not just re-opening but
also any operation that is a write (or read) in disguise, which kind of
implies you need to have capabilities attached to file descriptors. This
is already slightly shaky ground if you look at the history of projects
like capsicum -- but also my impression was that just adding it to
"file_permission" was not sufficient, you need to put it in
"path_permission" which means we have to either bloat "struct path" or
come up with some extended structure that you need to plumb through
everywhere.

But yes, this is a thing that is still on my list of things to do, but
not in the immediate future.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/



* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Askar Safin @ 2025-08-21 13:04 UTC
  To: Christian Brauner
  Cc: Lichen Liu, linux-fsdevel, linux-kernel, kexec, rob, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc,
	viro, jack
In-Reply-To: <20250821-zirkel-leitkultur-2653cba2cd5b@brauner>

 ---- On Thu, 21 Aug 2025 12:24:11 +0400  Christian Brauner <brauner@kernel.org> wrote --- 
 > Applied to the vfs-6.18.misc branch of the vfs/vfs.git tree.
 > Patches in the vfs-6.18.misc branch should appear in linux-next soon.

The applied version contains this:
> Specify mount options for for the initramfs mount

I.e. "for" appears twice.

--
Askar Safin
https://types.pl/@safinaskar



* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Askar Safin @ 2025-08-21 12:14 UTC
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-0-f61405c80f34@cyphar.com>

There is one particular case in which open_tree is more powerful than openat with O_PATH: open_tree supports AT_EMPTY_PATH, and openat supports nothing similar.
This means that we can convert a normal O_RDONLY file descriptor to an O_PATH descriptor using open_tree! I.e.:
  rd = openat(AT_FDCWD, "/tmp/a", O_RDONLY, 0); // Regular file
  open_tree(rd, "", AT_EMPTY_PATH);
You can achieve the same effect using /proc:
  rd = openat(AT_FDCWD, "/tmp/a", O_RDONLY, 0); // Regular file
  snprintf(buf, sizeof(buf), "/proc/self/fd/%d", rd);
  openat(AT_FDCWD, buf, O_PATH, 0);
But I still think this has security implications: even if we deny a container access to /proc, it is still able to convert O_RDONLY
descriptors to O_PATH descriptors using open_tree. I.e., this is yet another thing to think about when creating sandboxes.
I know you gave a talk about similar things a long time ago: https://lwn.net/Articles/934460/ . (I tested this.)
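
A minimal C sketch of the open_tree conversion described above. The helper names reopen_via_open_tree() and is_path_fd() are made up for illustration, the raw syscall is used because a libc wrapper may not be available, and the fallback constants are assumptions for older headers. It also assumes that F_GETFL reports O_PATH on the resulting descriptor.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; these are the generic Linux values
 * (an assumption, some architectures use a different O_PATH). */
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif
#ifndef O_PATH
#define O_PATH 010000000
#endif

/* Hypothetical helper: re-open an already open descriptor through
 * open_tree(2) with an empty path and AT_EMPTY_PATH.
 * Returns the new descriptor, or -1 with errno set. */
int reopen_via_open_tree(int fd)
{
	assert(fd >= 0);
#ifdef SYS_open_tree
	return (int)syscall(SYS_open_tree, fd, "", AT_EMPTY_PATH);
#else
	errno = ENOSYS; /* headers too old to know the syscall number */
	return -1;
#endif
}

/* Hypothetical helper: report whether fd carries the O_PATH status flag. */
int is_path_fd(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	return flags != -1 && (flags & O_PATH) != 0;
}
```

Calling reopen_via_open_tree() on a descriptor opened with O_RDONLY and then is_path_fd() on the result is one way to check the behaviour on a given kernel. In a sandbox that filters open_tree the call is expected to fail rather than return a descriptor.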

--
Askar Safin
https://types.pl/@safinaskar

^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Askar Safin @ 2025-08-21 11:57 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-21.1755776485-stony-another-giggle-rodent-9HLjPO@cyphar.com>

 ---- On Thu, 21 Aug 2025 15:44:42 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > I'm of two minds whether I should fix the behaviour and then re-send
 > man-pages with updated text (delaying the next round of man-page reviews
 > by a month) or just reduce the specificity of this text and then add
 > more details after it has been fixed.

Do what you want.
I'm not in a hurry.

CC me if you write any patches, please.
--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 11:49 UTC (permalink / raw)
  To: Askar Safin
  Cc: alx, brauner, dhowells, g.branden.robinson, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-man, mtk.manpages, viro,
	Ian Kent, autofs mailing list
In-Reply-To: <198c74541c8.c835b65275081.1338200284666207736@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1181 bytes --]

On 2025-08-20, Askar Safin <safinaskar@zohomail.com> wrote:
>  ---- On Sun, 17 Aug 2025 20:16:04 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
>  > They are not tested by fstests AFAICS, but that's more of a flaw in
>  > fstests (automount requires you to have a running autofs daemon, which
>  > probably makes testing it in fstests or selftests impractical) not the
>  > feature itself.
> 
> I suggest testing automounts in fstests/selftests using "tracing" automount.
> This is what I do in my reproducers.
> 
>  > The automount behaviour of tracefs is different to the general automount
>  > mechanism which is managed by userspace with the autofs daemon.
> 
> Yes. But I still was able to write reproducers using "tracing", so this
> automount point is totally okay for tests. (At least for some tests,
> such as RESOLVE_NO_XDEV.)

Sure, but I don't think people use allyesconfig when running selftests.
I wonder if the automated test runners even enable deprecated features
like that.

In any case, you can definitely write some tests for it. :D

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 11:47 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198cc299cd9.eec1817f85794.4679093070969175955@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1361 bytes --]

On 2025-08-21, Askar Safin <safinaskar@zohomail.com> wrote:
> There is a convention: you can pass invalid fd (such as -1) as dfd to *at-syscalls to enforce that the path is absolute.
> This is documented. "man openat" says: "Specifying an invalid file descriptor number in dirfd can be used as a means to ensure that pathname is absolute".
> But fsconfig with FSCONFIG_SET_PATH breaks this convention due to this line: https://elixir.bootlin.com/linux/v6.16/source/fs/fsopen.c#L377 .
> I think this is a bug, and it should be fixed in kernel. Also, it is possible there are a lot of similarly buggy syscalls. All of them should be fixed,
> and moreover a warning should be added to https://docs.kernel.org/process/adding-syscalls.html . And then new fsconfig behavior should be documented.
> (Of course, I'm not saying that *you* should do all these. I'm just saying that this bug exists.) (I tested this.)

Indeed, good catch! I think we discussed this before --
FSCONFIG_SET_PATH actually doesn't work with any parameters today so
it's not very surprising nobody has noticed this until now. I'll include
it in the set of fixes I have for fscontext.

(FWIW, the convention I see more commonly is -EBADF but that's just a
stylistic choice I suppose.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 11:44 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198cc025823.ea44e3f585444.6907980660506284461@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1750 bytes --]

On 2025-08-21, Askar Safin <safinaskar@zohomail.com> wrote:
>  ---- On Tue, 12 Aug 2025 22:25:40 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
>  > On 2025-08-09, Aleksa Sarai <cyphar@cyphar.com> wrote:
>  > > +Note that the Linux kernel reuses filesystem instances
>  > > +for many filesystems,
>  > > +so (depending on the filesystem being configured and parameters used)
>  > > +it is possible for the filesystem instance "created" by
>  > > +.B \%FSCONFIG_CMD_CREATE
>  > > +to, in fact, be a reference
>  > > +to an existing filesystem instance in the kernel.
>  > > +The kernel will attempt to merge the specified parameters
>  > > +of this filesystem configuration context
>  > > +with those of the filesystem instance being reused,
>  > > +but some parameters may be
>  > > +.IR "silently ignored" .
>  > 
>  > While looking at this again, I realised this explanation is almost
>  > certainly incorrect in a few places (and was based on a misunderstanding
>  > of how sget_fc() works and how it interacts with vfs_get_tree()).
>  > 
>  > I'll rewrite this in the next version.
> 
> This recent patch seems to be relevant:
> https://lore.kernel.org/all/20250816-debugfs-mount-opts-v3-1-d271dad57b5b@posteo.net/

I'm aware of that, I was in one of the previous threads. There are some
deeper consistency issues that I'm writing patches for at the moment.

I'm of two minds whether I should fix the behaviour and then re-send
man-pages with updated text (delaying the next round of man-page reviews
by a month) or just reduce the specificity of this text and then add
more details after it has been fixed.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply

* Re: [PATCH v3 09/12] man/man2/open_tree.2: document "new" mount API
From: Christian Brauner @ 2025-08-21 11:33 UTC (permalink / raw)
  To: Askar Safin
  Cc: Aleksa Sarai, Alejandro Colomar, Michael T. Kerrisk,
	Alexander Viro, Jan Kara, G. Branden Robinson, linux-man,
	linux-api, linux-fsdevel, linux-kernel, David Howells
In-Reply-To: <198cc623944.11ea2eb5d86377.2604785241030508275@zohomail.com>

On Thu, Aug 21, 2025 at 03:27:26PM +0400, Askar Safin wrote:
> man open_tree says:
> > mount propagation
> > (as described in
> > .BR mount_namespaces (7))
> > will not be applied to bind-mounts created by
> > .BR open_tree ()
> > until the bind-mount is attached with
> > .BR move_mount (2),
> 
> It seems this is wrong, because this commit exists: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=06b1ce966e3f8bfef261c111feb3d4b33ede0cd8 .
> I'm not sure about this. (I didn't test this.)

No, it's correct. I reverted this because it broke userspace that relies
on this behavior.

^ permalink raw reply

* Re: [PATCH v3 09/12] man/man2/open_tree.2: document "new" mount API
From: Askar Safin @ 2025-08-21 11:27 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-9-f61405c80f34@cyphar.com>

man open_tree says:
> mount propagation
> (as described in
> .BR mount_namespaces (7))
> will not be applied to bind-mounts created by
> .BR open_tree ()
> until the bind-mount is attached with
> .BR move_mount (2),

It seems this is wrong, because this commit exists: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=06b1ce966e3f8bfef261c111feb3d4b33ede0cd8 .
I'm not sure about this. (I didn't test this.)



--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Askar Safin @ 2025-08-21 10:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-6-f61405c80f34@cyphar.com>

There is a convention: you can pass an invalid fd (such as -1) as dfd to *at-syscalls to enforce that the path is absolute.
This is documented. "man openat" says: "Specifying an invalid file descriptor number in dirfd can be used as a means to ensure that pathname is absolute".
But fsconfig with FSCONFIG_SET_PATH breaks this convention due to this line: https://elixir.bootlin.com/linux/v6.16/source/fs/fsopen.c#L377 .
I think this is a bug, and it should be fixed in the kernel. Also, it is possible that there are a lot of similarly buggy syscalls. All of them should be fixed,
and moreover a warning should be added to https://docs.kernel.org/process/adding-syscalls.html . And then the new fsconfig behavior should be documented.
(Of course, I'm not saying that *you* should do all of this. I'm just saying that this bug exists.) (I tested this.)

--
Askar Safin
https://types.pl/@safinaskar

^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Askar Safin @ 2025-08-21  9:42 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-12.1755022847-yummy-native-bandage-dorm-8U46ME@cyphar.com>

 ---- On Tue, 12 Aug 2025 22:25:40 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > On 2025-08-09, Aleksa Sarai <cyphar@cyphar.com> wrote:
 > > +Note that the Linux kernel reuses filesystem instances
 > > +for many filesystems,
 > > +so (depending on the filesystem being configured and parameters used)
 > > +it is possible for the filesystem instance "created" by
 > > +.B \%FSCONFIG_CMD_CREATE
 > > +to, in fact, be a reference
 > > +to an existing filesystem instance in the kernel.
 > > +The kernel will attempt to merge the specified parameters
 > > +of this filesystem configuration context
 > > +with those of the filesystem instance being reused,
 > > +but some parameters may be
 > > +.IR "silently ignored" .
 > 
 > While looking at this again, I realised this explanation is almost
 > certainly incorrect in a few places (and was based on a misunderstanding
 > of how sget_fc() works and how it interacts with vfs_get_tree()).
 > 
 > I'll rewrite this in the next version.

This recent patch seems to be relevant:
https://lore.kernel.org/all/20250816-debugfs-mount-opts-v3-1-d271dad57b5b@posteo.net/

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Christian Brauner @ 2025-08-21  8:24 UTC (permalink / raw)
  To: Lichen Liu
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, safinaskar, kexec,
	rob, weilongchen, cyphar, linux-api, zohar, stefanb, initramfs,
	corbet, linux-doc, viro, jack
In-Reply-To: <20250815121459.3391223-1-lichliu@redhat.com>

On Fri, 15 Aug 2025 20:14:59 +0800, Lichen Liu wrote:
> When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> By default, a tmpfs mount is limited to using 50% of the available RAM
> for its content. This can be problematic in memory-constrained
> environments, particularly during a kdump capture.
> 
> In a kdump scenario, the capture kernel boots with a limited amount of
> memory specified by the 'crashkernel' parameter. If the initramfs is
> large, it may fail to unpack into the tmpfs rootfs due to insufficient
> space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> memory must be available for the mount. This leads to an OOM failure
> during the early boot process, preventing a successful crash dump.
> 
> [...]

This seems rather useful but I've renamed "rootfsflags" to
"initramfs_options" because "rootfsflags" is ambiguous and it's not
really just about flags.

Other than that I think it would make sense to just raise the limit to
90% for the root_fs_type mount. I'm not sure why this super privileged
code would only be allowed 50% by default.

---

Applied to the vfs-6.18.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.misc

[1/1] fs: Add 'rootfsflags' to set rootfs mount options
      https://git.kernel.org/vfs/vfs/c/278033a225e1

^ permalink raw reply

* Re: [PATCH v19 2/8] Documentation: userspace-api: Add shadow stack API documentation
From: Randy Dunlap @ 2025-08-20 23:15 UTC (permalink / raw)
  To: Mark Brown, Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy,
	H.J. Lu, Florian Weimer, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Christian Brauner,
	Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api,
	Kees Cook, Shuah Khan
In-Reply-To: <20250819-clone3-shadow-stack-v19-2-bc957075479b@kernel.org>



On 8/19/25 9:21 AM, Mark Brown wrote:
> There are a number of architectures with shadow stack features which we are
> presenting to userspace with as consistent an API as we can (though there
> are some architecture specifics). Especially given that there are some
> important considerations for userspace code interacting directly with the
> feature let's provide some documentation covering the common aspects.
> 

> ---
>  Documentation/userspace-api/index.rst        |  1 +
>  Documentation/userspace-api/shadow_stack.rst | 44 ++++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+)
> 

> diff --git a/Documentation/userspace-api/shadow_stack.rst b/Documentation/userspace-api/shadow_stack.rst
> new file mode 100644
> index 000000000000..65c665496624
> --- /dev/null
> +++ b/Documentation/userspace-api/shadow_stack.rst
> @@ -0,0 +1,44 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Shadow Stacks
> +=============
> +
> +Introduction
> +============
> +
> +Several architectures have features which provide backward edge
> +control flow protection through a hardware maintained stack, only
> +writeable by userspace through very limited operations.  This feature

$internet says "writable"

> +is referred to as shadow stacks on Linux, on x86 it is part of Intel

                                      Linux. On

> +Control Enforcement Technology (CET), on arm64 it is Guarded Control
> +Stacks feature (FEAT_GCS) and for RISC-V it is the Zicfiss extension.
> +It is expected that this feature will normally be managed by the
> +system dynamic linker and libc in ways broadly transparent to
> +application code, this document covers interfaces and considerations.

               code. This

> +
> +
> +Enabling
> +========
> +
> +Shadow stacks default to disabled when a userspace process is
> +executed, they can be enabled for the current thread with a syscall:

   executed. They

> +
> + - For x86 the ARCH_SHSTK_ENABLE arch_prctl()
> + - For other architectures the PR_SET_SHADOW_STACK_ENABLE prctl()
> +
> +It is expected that this will normally be done by the dynamic linker.
> +Any new threads created by a thread with shadow stacks enabled will
> +themselves have shadow stacks enabled.
> +
> +
> +Enablement considerations
> +=========================
> +
> +- Returning from the function that enables shadow stacks without first
> +  disabling them will cause a shadow stack exception.  This includes
> +  any syscall wrapper or other library functions, the syscall will need

                                          functions; the

> +  to be inlined.
> +- A lock feature allows userspace to prevent disabling of shadow stacks.
> +- Those that change the stack context like longjmp() or use of ucontext
> +  changes on signal return will need support from libc.
> 
-- 
~Randy


^ permalink raw reply

* Re: [PATCH v5 2/3] lsm: introduce security_lsm_config_*_policy hooks
From: Casey Schaufler @ 2025-08-20 15:30 UTC (permalink / raw)
  To: Mickaël Salaün, Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, kees,
	stephen.smalley.work, takedakn, penguin-kernel, song, rdunlap,
	linux-api, apparmor, linux-kernel, Casey Schaufler
In-Reply-To: <20250820.Ao3iquoshaiB@digikod.net>

On 8/20/2025 7:21 AM, Mickaël Salaün wrote:
> On Wed, Jul 09, 2025 at 10:00:55AM +0200, Maxime Bélair wrote:
>> Define two new LSM hooks: security_lsm_config_self_policy and
>> security_lsm_config_system_policy and wire them into the corresponding
>> lsm_config_*_policy() syscalls so that LSMs can register a unified
>> interface for policy management. This initial, minimal implementation
>> only supports the LSM_POLICY_LOAD operation to limit changes.
>>
>> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
>> ---
>>  include/linux/lsm_hook_defs.h |  4 +++
>>  include/linux/security.h      | 20 ++++++++++++
>>  include/uapi/linux/lsm.h      |  8 +++++
>>  security/lsm_syscalls.c       | 17 ++++++++--
>>  security/security.c           | 60 +++++++++++++++++++++++++++++++++++
>>  5 files changed, 107 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
>> index bf3bbac4e02a..fca490444643 100644
>> --- a/include/linux/lsm_hook_defs.h
>> +++ b/include/linux/lsm_hook_defs.h
>> @@ -464,3 +464,7 @@ LSM_HOOK(int, 0, bdev_alloc_security, struct block_device *bdev)
>>  LSM_HOOK(void, LSM_RET_VOID, bdev_free_security, struct block_device *bdev)
>>  LSM_HOOK(int, 0, bdev_setintegrity, struct block_device *bdev,
>>  	 enum lsm_integrity_type type, const void *value, size_t size)
>> +LSM_HOOK(int, -EINVAL, lsm_config_self_policy, u32 lsm_id, u32 op,
>> +	 void __user *buf, size_t size, u32 flags)
>> +LSM_HOOK(int, -EINVAL, lsm_config_system_policy, u32 lsm_id, u32 op,
>> +	 void __user *buf, size_t size, u32 flags)
>> diff --git a/include/linux/security.h b/include/linux/security.h
>> index cc9b54d95d22..54acaee4a994 100644
>> --- a/include/linux/security.h
>> +++ b/include/linux/security.h
>> @@ -581,6 +581,11 @@ void security_bdev_free(struct block_device *bdev);
>>  int security_bdev_setintegrity(struct block_device *bdev,
>>  			       enum lsm_integrity_type type, const void *value,
>>  			       size_t size);
>> +int security_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
>> +				    size_t size, u32 flags);
>> +int security_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
>> +				      size_t size, u32 flags);
>> +
>>  #else /* CONFIG_SECURITY */
>>  
>>  /**
>> @@ -1603,6 +1608,21 @@ static inline int security_bdev_setintegrity(struct block_device *bdev,
>>  	return 0;
>>  }
>>  
>> +static inline int security_lsm_config_self_policy(u32 lsm_id, u32 op,
>> +						  void __user *buf,
>> +						  size_t size, u32 flags)
>> +{
>> +
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline int security_lsm_config_system_policy(u32 lsm_id, u32 op,
>> +						    void __user *buf,
>> +						    size_t size, u32 flags)
>> +{
>> +
>> +	return -EOPNOTSUPP;
>> +}
>>  #endif	/* CONFIG_SECURITY */
>>  
>>  #if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
>> diff --git a/include/uapi/linux/lsm.h b/include/uapi/linux/lsm.h
>> index 938593dfd5da..2b9432a30cdc 100644
>> --- a/include/uapi/linux/lsm.h
>> +++ b/include/uapi/linux/lsm.h
>> @@ -90,4 +90,12 @@ struct lsm_ctx {
>>   */
>>  #define LSM_FLAG_SINGLE	0x0001
>>  
>> +/*
>> + * LSM_POLICY_XXX definitions identify the different operations
>> + * to configure LSM policies
>> + */
>> +
>> +#define LSM_POLICY_UNDEF	0
>> +#define LSM_POLICY_LOAD		100
> Why the gap between 0 and 100?

It's conventional in LSM syscalls to start identifiers at 100.
No compelling reason other than to appease the LSM maintainer.

>
>> +
>>  #endif /* _UAPI_LINUX_LSM_H */
>> diff --git a/security/lsm_syscalls.c b/security/lsm_syscalls.c
>> index a3cb6dab8102..dd016ba6976c 100644
>> --- a/security/lsm_syscalls.c
>> +++ b/security/lsm_syscalls.c
>> @@ -122,11 +122,24 @@ SYSCALL_DEFINE3(lsm_list_modules, u64 __user *, ids, u32 __user *, size,
>>  SYSCALL_DEFINE5(lsm_config_self_policy, u32, lsm_id, u32, op, void __user *,
>>  		buf, u32 __user *, size, u32, flags)
> Given these are a multiplexor syscalls, I'm wondering if they should not
> have common flags and LSM-specific flags.  Alternatively, the op
> argument could also contains some optional flags.  In either case, the
> documentation should guide LSM developers for flags that may be shared
> amongst LSMs.
>
> Examples of such flags could be to restrict the whole process instead of
> the calling thread.
>
>>  {
>> -	return 0;
>> +	size_t usize;
>> +
>> +	if (get_user(usize, size))
> Size should just be u32, not a pointer.
>
>> +		return -EFAULT;
>> +
>> +	return security_lsm_config_self_policy(lsm_id, op, buf, usize, flags);
>>  }
>>  
>>  SYSCALL_DEFINE5(lsm_config_system_policy, u32, lsm_id, u32, op, void __user *,
>>  		buf, u32 __user *, size, u32, flags)
>>  {
>> -	return 0;
>> +	size_t usize;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
> I like this mandatory capability check for this specific syscall.  This
> makes the semantic clearer.  However, to avoid the superpower of
> CAP_SYS_ADMIN, I'm wondering how we could use the CAP_MAC_ADMIN instead.
> This syscall could require CAP_MAC_ADMIN, and current LSMs (relying on a
> filesystem interface for policy configuration) could also enforce
> CAP_SYS_ADMIN for compatibility reasons.
>
> In fact, this "system" syscall could be a "namespace" syscall, which
> would take a security/LSM namespace file descriptor as argument.  If the
> namespace is not the initial namespace, any CAP_SYS_ADMIN implemented by
> current LSMs could be avoided.  See
> https://lore.kernel.org/r/CAHC9VhRGMmhxbajwQNfGFy+ZFF1uN=UEBjqQZQ4UBy7yds3eVQ@mail.gmail.com
>
>> +
>> +	if (get_user(usize, size))
> ditto
>
>> +		return -EFAULT;
>> +
>> +	return security_lsm_config_system_policy(lsm_id, op, buf, usize, flags);
>>  }
>> diff --git a/security/security.c b/security/security.c
>> index fb57e8fddd91..166d7d9936d0 100644
>> --- a/security/security.c
>> +++ b/security/security.c
>> @@ -5883,6 +5883,66 @@ int security_bdev_setintegrity(struct block_device *bdev,
>>  }
>>  EXPORT_SYMBOL(security_bdev_setintegrity);
>>  
>> +/**
>> + * security_lsm_config_self_policy() - Configure caller's LSM policies
>> + * @lsm_id: id of the LSM to target
>> + * @op: Operation to perform (one of the LSM_POLICY_XXX values)
>> + * @buf: userspace pointer to policy data
>> + * @size: size of @buf
>> + * @flags: lsm policy configuration flags
>> + *
>> + * Configure the policies of a LSM for the current domain/user. This notably
>> + * allows to update them even when the lsmfs is unavailable or restricted.
>> + * Currently, only LSM_POLICY_LOAD is supported.
>> + *
>> + * Return: Returns 0 on success, error on failure.
>> + */
>> +int security_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
>> +				 size_t size, u32 flags)
>> +{
>> +	int rc = LSM_RET_DEFAULT(lsm_config_self_policy);
>> +	struct lsm_static_call *scall;
>> +
>> +	lsm_for_each_hook(scall, lsm_config_self_policy) {
>> +		if ((scall->hl->lsmid->id) == lsm_id) {
>> +			rc = scall->hl->hook.lsm_config_self_policy(lsm_id, op, buf, size, flags);
> The lsm_id should not be passed to the hook.
>
> The LSM syscall should manage the argument copy and buffer allocation
> instead of duplicating this code in each LSM hook implementation (see
> other LSM syscalls).
>
>> +			break;
>> +		}
>> +	}
>> +
>> +	return rc;
>> +}
>> +
>> +/**
>> + * security_lsm_config_system_policy() - Configure system LSM policies
>> + * @lsm_id: id of the lsm to target
>> + * @op: Operation to perform (one of the LSM_POLICY_XXX values)
>> + * @buf: userspace pointer to policy data
>> + * @size: size of @buf
>> + * @flags: lsm policy configuration flags
>> + *
>> + * Configure the policies of a LSM for the whole system. This notably allows
>> + * to update them even when the lsmfs is unavailable or restricted. Currently,
>> + * only LSM_POLICY_LOAD is supported.
>> + *
>> + * Return: Returns 0 on success, error on failure.
>> + */
>> +int security_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
>> +				   size_t size, u32 flags)
>> +{
>> +	int rc = LSM_RET_DEFAULT(lsm_config_system_policy);
>> +	struct lsm_static_call *scall;
>> +
>> +	lsm_for_each_hook(scall, lsm_config_system_policy) {
>> +		if ((scall->hl->lsmid->id) == lsm_id) {
>> +			rc = scall->hl->hook.lsm_config_system_policy(lsm_id, op, buf, size, flags);
> ditto
>
>> +			break;
>> +		}
>> +	}
>> +
>> +	return rc;
>> +}
>> +
>>  #ifdef CONFIG_PERF_EVENTS
>>  /**
>>   * security_perf_event_open() - Check if a perf event open is allowed
>> -- 
>> 2.48.1
>>
>>

^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-20 15:17 UTC (permalink / raw)
  To: viro, brauner, jack
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc
In-Reply-To: <20250815121459.3391223-1-lichliu@redhat.com>

Hi all, do you have any comments on this v2 patch?

Thanks,
Lichen

On Fri, Aug 15, 2025 at 8:15 PM Lichen Liu <lichliu@redhat.com> wrote:
>
> When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> By default, a tmpfs mount is limited to using 50% of the available RAM
> for its content. This can be problematic in memory-constrained
> environments, particularly during a kdump capture.
>
> In a kdump scenario, the capture kernel boots with a limited amount of
> memory specified by the 'crashkernel' parameter. If the initramfs is
> large, it may fail to unpack into the tmpfs rootfs due to insufficient
> space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> memory must be available for the mount. This leads to an OOM failure
> during the early boot process, preventing a successful crash dump.
>
> This patch introduces a new kernel command-line parameter, rootfsflags,
> which allows passing specific mount options directly to the rootfs when
> it is first mounted. This gives users control over the rootfs behavior.
>
> For example, a user can now specify rootfsflags=size=75% to allow the
> tmpfs to use up to 75% of the available memory. This can significantly
> reduce the memory pressure for kdump.
>
> Consider a practical example:
>
> To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
> the default 50% limit, this requires a memory pool of 96MB to be
> available for the tmpfs mount. The total memory requirement is therefore
> approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
> kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
>
> By using rootfsflags=size=75%, the memory pool required for the 48MB
> tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
> requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
> smaller crashkernel size, such as 192MB.
>
> An alternative approach of reusing the existing rootflags parameter was
> considered. However, a new, dedicated rootfsflags parameter was chosen
> to avoid altering the current behavior of rootflags (which applies to
> the final root filesystem) and to prevent any potential regressions.
>
> Also add documentation for the new kernel parameter "rootfsflags"
>
> This approach is inspired by prior discussions and patches on the topic.
> Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
> Ref: https://landley.net/notes-2015.html#01-01-2015
> Ref: https://lkml.org/lkml/2021/6/29/783
> Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
>
> Signed-off-by: Lichen Liu <lichliu@redhat.com>
> Tested-by: Rob Landley <rob@landley.net>
> ---
> Changes in v2:
>   - Add documentation for the new kernel parameter.
>
>  Documentation/admin-guide/kernel-parameters.txt |  3 +++
>  fs/namespace.c                                  | 11 ++++++++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index fb8752b42ec8..0c00f651d431 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6220,6 +6220,9 @@
>
>         rootflags=      [KNL] Set root filesystem mount option string
>
> +       rootfsflags=    [KNL] Set initial root filesystem mount option string
> +                       (e.g. tmpfs for initramfs)
> +
>         rootfstype=     [KNL] Set root filesystem type
>
>         rootwait        [KNL] Wait (indefinitely) for root device to show up.
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 8f1000f9f3df..e484c26d5e3f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
>  }
>  __setup("mphash_entries=", set_mphash_entries);
>
> +static char * __initdata rootfs_flags;
> +static int __init rootfs_flags_setup(char *str)
> +{
> +       rootfs_flags = str;
> +       return 1;
> +}
> +
> +__setup("rootfsflags=", rootfs_flags_setup);
> +
>  static u64 event;
>  static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
>  static DEFINE_IDA(mnt_group_ida);
> @@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
>         struct mnt_namespace *ns;
>         struct path root;
>
> -       mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> +       mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
>         if (IS_ERR(mnt))
>                 panic("Can't create rootfs");
>
> --
> 2.47.0
>
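
The memory arithmetic in the quoted commit message can be reproduced with a small C sketch. The function names are made up for illustration, and the fixed 16MB and 12MB terms are simply the figures from the example above, not measured values.

```c
#include <assert.h>

/* Figures taken from the worked example in the commit message (in MB). */
#define VMLINUZ_MB          16UL
#define RUNTIME_OVERHEAD_MB 12UL

/* Hypothetical helper: memory pool (in MB, rounded up) that has to be
 * available so that a tmpfs limited to `percent` of that pool can hold
 * `usable_mb` of content. */
unsigned long tmpfs_pool_mb(unsigned long usable_mb, unsigned int percent)
{
	assert(percent > 0 && percent <= 100);
	return (usable_mb * 100 + percent - 1) / percent;
}

/* Hypothetical helper: total from the message's breakdown, i.e.
 * vmlinuz + loaded initramfs + the second initramfs-sized term
 * (listed as "unpacked kernel" in the message) + tmpfs pool +
 * runtime overhead. */
unsigned long kdump_total_mb(unsigned long initramfs_mb, unsigned int percent)
{
	return VMLINUZ_MB + initramfs_mb + initramfs_mb +
	       tmpfs_pool_mb(initramfs_mb, percent) + RUNTIME_OVERHEAD_MB;
}
```

For a 48MB initramfs this gives a pool of 96MB and a total of 220MB at the default 50%, and a pool of 64MB and a total of 188MB at size=75%, which fits in the 192MB crashkernel mentioned in the message.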


^ permalink raw reply

* Re: [PATCH v5 2/3] lsm: introduce security_lsm_config_*_policy hooks
From: Mickaël Salaün @ 2025-08-20 14:21 UTC (permalink / raw)
  To: Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, kees,
	stephen.smalley.work, casey, takedakn, penguin-kernel, song,
	rdunlap, linux-api, apparmor, linux-kernel
In-Reply-To: <20250709080220.110947-3-maxime.belair@canonical.com>

On Wed, Jul 09, 2025 at 10:00:55AM +0200, Maxime Bélair wrote:
> Define two new LSM hooks: security_lsm_config_self_policy and
> security_lsm_config_system_policy and wire them into the corresponding
> lsm_config_*_policy() syscalls so that LSMs can register a unified
> interface for policy management. This initial, minimal implementation
> only supports the LSM_POLICY_LOAD operation to limit changes.
> 
> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
> ---
>  include/linux/lsm_hook_defs.h |  4 +++
>  include/linux/security.h      | 20 ++++++++++++
>  include/uapi/linux/lsm.h      |  8 +++++
>  security/lsm_syscalls.c       | 17 ++++++++--
>  security/security.c           | 60 +++++++++++++++++++++++++++++++++++
>  5 files changed, 107 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> index bf3bbac4e02a..fca490444643 100644
> --- a/include/linux/lsm_hook_defs.h
> +++ b/include/linux/lsm_hook_defs.h
> @@ -464,3 +464,7 @@ LSM_HOOK(int, 0, bdev_alloc_security, struct block_device *bdev)
>  LSM_HOOK(void, LSM_RET_VOID, bdev_free_security, struct block_device *bdev)
>  LSM_HOOK(int, 0, bdev_setintegrity, struct block_device *bdev,
>  	 enum lsm_integrity_type type, const void *value, size_t size)
> +LSM_HOOK(int, -EINVAL, lsm_config_self_policy, u32 lsm_id, u32 op,
> +	 void __user *buf, size_t size, u32 flags)
> +LSM_HOOK(int, -EINVAL, lsm_config_system_policy, u32 lsm_id, u32 op,
> +	 void __user *buf, size_t size, u32 flags)
> diff --git a/include/linux/security.h b/include/linux/security.h
> index cc9b54d95d22..54acaee4a994 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -581,6 +581,11 @@ void security_bdev_free(struct block_device *bdev);
>  int security_bdev_setintegrity(struct block_device *bdev,
>  			       enum lsm_integrity_type type, const void *value,
>  			       size_t size);
> +int security_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
> +				    size_t size, u32 flags);
> +int security_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
> +				      size_t size, u32 flags);
> +
>  #else /* CONFIG_SECURITY */
>  
>  /**
> @@ -1603,6 +1608,21 @@ static inline int security_bdev_setintegrity(struct block_device *bdev,
>  	return 0;
>  }
>  
> +static inline int security_lsm_config_self_policy(u32 lsm_id, u32 op,
> +						  void __user *buf,
> +						  size_t size, u32 flags)
> +{
> +
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline int security_lsm_config_system_policy(u32 lsm_id, u32 op,
> +						    void __user *buf,
> +						    size_t size, u32 flags)
> +{
> +
> +	return -EOPNOTSUPP;
> +}
>  #endif	/* CONFIG_SECURITY */
>  
>  #if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
> diff --git a/include/uapi/linux/lsm.h b/include/uapi/linux/lsm.h
> index 938593dfd5da..2b9432a30cdc 100644
> --- a/include/uapi/linux/lsm.h
> +++ b/include/uapi/linux/lsm.h
> @@ -90,4 +90,12 @@ struct lsm_ctx {
>   */
>  #define LSM_FLAG_SINGLE	0x0001
>  
> +/*
> + * LSM_POLICY_XXX definitions identify the different operations
> + * to configure LSM policies
> + */
> +
> +#define LSM_POLICY_UNDEF	0
> +#define LSM_POLICY_LOAD		100

Why the gap between 0 and 100?

> +
>  #endif /* _UAPI_LINUX_LSM_H */
> diff --git a/security/lsm_syscalls.c b/security/lsm_syscalls.c
> index a3cb6dab8102..dd016ba6976c 100644
> --- a/security/lsm_syscalls.c
> +++ b/security/lsm_syscalls.c
> @@ -122,11 +122,24 @@ SYSCALL_DEFINE3(lsm_list_modules, u64 __user *, ids, u32 __user *, size,
>  SYSCALL_DEFINE5(lsm_config_self_policy, u32, lsm_id, u32, op, void __user *,
>  		buf, u32 __user *, size, u32, flags)

Given that these are multiplexer syscalls, I'm wondering whether they
should have both common flags and LSM-specific flags.  Alternatively, the
op argument could also contain some optional flags.  In either case, the
documentation should guide LSM developers on which flags may be shared
amongst LSMs.

An example of such a flag would be one that restricts the whole process
instead of only the calling thread.
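
A rough userspace sketch of how such a split might look, purely for
illustration.  Every macro name and the 16/16 bit layout below are
hypothetical; none of them exist in the patch or in the kernel.  The idea
is only that common code validates the shared bits and leaves the rest to
the selected LSM:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical layout: low 16 bits shared by all LSMs, high 16 bits LSM-specific. */
#define LSM_POLICY_FLAGS_COMMON_MASK   0x0000ffffu
#define LSM_POLICY_FLAGS_SPECIFIC_MASK 0xffff0000u

/* Hypothetical common flag: apply to the whole process, not only the calling thread. */
#define LSM_POLICY_FLAG_PROCESS        0x00000001u

/* Common flags that the shared code currently understands. */
#define LSM_POLICY_FLAGS_COMMON_KNOWN  (LSM_POLICY_FLAG_PROCESS)

static_assert((LSM_POLICY_FLAGS_COMMON_MASK & LSM_POLICY_FLAGS_SPECIFIC_MASK) == 0,
	      "common and LSM-specific flag ranges must not overlap");

/*
 * Split @flags into the common part (validated here) and the LSM-specific
 * part (left for the selected LSM to validate).
 * Returns 0 on success, -EINVAL if an unknown common flag is set.
 */
static int lsm_policy_split_flags(uint32_t flags, uint32_t *common,
				  uint32_t *specific)
{
	uint32_t c = flags & LSM_POLICY_FLAGS_COMMON_MASK;

	if (c & ~LSM_POLICY_FLAGS_COMMON_KNOWN)
		return -EINVAL;

	*common = c;
	*specific = flags & LSM_POLICY_FLAGS_SPECIFIC_MASK;
	return 0;
}
```

With a layout like this, unknown common bits are rejected in one place,
while each LSM remains free to define and reject its own bits.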

>  {
> -	return 0;
> +	size_t usize;
> +
> +	if (get_user(usize, size))

Size should just be u32, not a pointer.

> +		return -EFAULT;
> +
> +	return security_lsm_config_self_policy(lsm_id, op, buf, usize, flags);
>  }
>  
>  SYSCALL_DEFINE5(lsm_config_system_policy, u32, lsm_id, u32, op, void __user *,
>  		buf, u32 __user *, size, u32, flags)
>  {
> -	return 0;
> +	size_t usize;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;

I like this mandatory capability check for this specific syscall.  This
makes the semantics clearer.  However, to avoid the superpower of
CAP_SYS_ADMIN, I'm wondering how we could use CAP_MAC_ADMIN instead.
This syscall could require CAP_MAC_ADMIN, and current LSMs (relying on a
filesystem interface for policy configuration) could also enforce
CAP_SYS_ADMIN for compatibility reasons.

In fact, this "system" syscall could be a "namespace" syscall, which
would take a security/LSM namespace file descriptor as an argument.  If
the namespace is not the initial namespace, any CAP_SYS_ADMIN check
implemented by current LSMs could be avoided.  See
https://lore.kernel.org/r/CAHC9VhRGMmhxbajwQNfGFy+ZFF1uN=UEBjqQZQ4UBy7yds3eVQ@mail.gmail.com

> +
> +	if (get_user(usize, size))

Ditto: size should be a plain u32 here as well.

> +		return -EFAULT;
> +
> +	return security_lsm_config_system_policy(lsm_id, op, buf, usize, flags);
>  }
> diff --git a/security/security.c b/security/security.c
> index fb57e8fddd91..166d7d9936d0 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -5883,6 +5883,66 @@ int security_bdev_setintegrity(struct block_device *bdev,
>  }
>  EXPORT_SYMBOL(security_bdev_setintegrity);
>  
> +/**
> + * security_lsm_config_self_policy() - Configure caller's LSM policies
> + * @lsm_id: id of the LSM to target
> + * @op: Operation to perform (one of the LSM_POLICY_XXX values)
> + * @buf: userspace pointer to policy data
> + * @size: size of @buf
> + * @flags: lsm policy configuration flags
> + *
> + * Configure the policies of a LSM for the current domain/user. This notably
> + * allows to update them even when the lsmfs is unavailable or restricted.
> + * Currently, only LSM_POLICY_LOAD is supported.
> + *
> + * Return: Returns 0 on success, error on failure.
> + */
> +int security_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
> +				 size_t size, u32 flags)
> +{
> +	int rc = LSM_RET_DEFAULT(lsm_config_self_policy);
> +	struct lsm_static_call *scall;
> +
> +	lsm_for_each_hook(scall, lsm_config_self_policy) {
> +		if ((scall->hl->lsmid->id) == lsm_id) {
> +			rc = scall->hl->hook.lsm_config_self_policy(lsm_id, op, buf, size, flags);

The lsm_id should not be passed to the hook.

The LSM syscall should manage the argument copy and buffer allocation
instead of duplicating this code in each LSM hook implementation (see
other LSM syscalls).
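
A userspace sketch of that shape, under stated assumptions: malloc() and
memcpy() stand in for the kernel allocation and copy_from_user(), and
POLICY_MAX_SIZE, DEMO_LSM_ID, demo_hook() and config_policy() are
hypothetical names used only for illustration.  The common code checks the
size, copies the buffer once, and calls a hook that takes neither lsm_id
nor a user pointer:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LSM_POLICY_LOAD	100	/* value taken from the patch */
#define POLICY_MAX_SIZE	4096u	/* hypothetical common upper bound */
#define DEMO_LSM_ID	1u	/* hypothetical LSM id */

static_assert(POLICY_MAX_SIZE <= UINT32_MAX, "size is passed by value as u32");

/* Hook without lsm_id: it receives an already-copied buffer. */
typedef int (*policy_hook_t)(uint32_t op, const void *buf, size_t size,
			     uint32_t flags);

struct policy_lsm {
	uint32_t id;
	policy_hook_t hook;
};

static size_t demo_last_size;	/* records what the demo hook was given */

static int demo_hook(uint32_t op, const void *buf, size_t size, uint32_t flags)
{
	(void)buf;
	if (op != LSM_POLICY_LOAD || flags)
		return -EOPNOTSUPP;
	demo_last_size = size;
	return 0;
}

static const struct policy_lsm policy_lsms[] = {
	{ DEMO_LSM_ID, demo_hook },
};

/* Common entry point: validate, copy once, then dispatch by LSM id. */
static int config_policy(uint32_t lsm_id, uint32_t op, const void *ubuf,
			 uint32_t size, uint32_t flags)
{
	int rc = -EINVAL;	/* same default as the hooks in the patch */
	void *kbuf;
	size_t i;

	if (size == 0)
		return -EINVAL;
	if (size > POLICY_MAX_SIZE)
		return -E2BIG;

	kbuf = malloc(size);
	if (!kbuf)
		return -ENOMEM;
	memcpy(kbuf, ubuf, size);	/* stands in for copy_from_user() */

	for (i = 0; i < sizeof(policy_lsms) / sizeof(policy_lsms[0]); i++) {
		if (policy_lsms[i].id == lsm_id) {
			rc = policy_lsms[i].hook(op, kbuf, size, flags);
			break;
		}
	}

	free(kbuf);
	return rc;
}
```

Each hook then only has to interpret the data; the size checks, the
allocation, and the copy live in a single place.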

> +			break;
> +		}
> +	}
> +
> +	return rc;
> +}
> +
> +/**
> + * security_lsm_config_system_policy() - Configure system LSM policies
> + * @lsm_id: id of the lsm to target
> + * @op: Operation to perform (one of the LSM_POLICY_XXX values)
> + * @buf: userspace pointer to policy data
> + * @size: size of @buf
> + * @flags: lsm policy configuration flags
> + *
> + * Configure the policies of a LSM for the whole system. This notably allows
> + * to update them even when the lsmfs is unavailable or restricted. Currently,
> + * only LSM_POLICY_LOAD is supported.
> + *
> + * Return: Returns 0 on success, error on failure.
> + */
> +int security_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
> +				   size_t size, u32 flags)
> +{
> +	int rc = LSM_RET_DEFAULT(lsm_config_system_policy);
> +	struct lsm_static_call *scall;
> +
> +	lsm_for_each_hook(scall, lsm_config_system_policy) {
> +		if ((scall->hl->lsmid->id) == lsm_id) {
> +			rc = scall->hl->hook.lsm_config_system_policy(lsm_id, op, buf, size, flags);

Ditto: same comments as for security_lsm_config_self_policy() above.

> +			break;
> +		}
> +	}
> +
> +	return rc;
> +}
> +
>  #ifdef CONFIG_PERF_EVENTS
>  /**
>   * security_perf_event_open() - Check if a perf event open is allowed
> -- 
> 2.48.1
> 
> 

^ permalink raw reply

* Re: [PATCH v5 3/3] AppArmor: add support for lsm_config_self_policy and lsm_config_system_policy
From: Mickaël Salaün @ 2025-08-20 14:21 UTC (permalink / raw)
  To: Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, kees,
	stephen.smalley.work, casey, takedakn, penguin-kernel, song,
	rdunlap, linux-api, apparmor, linux-kernel
In-Reply-To: <20250709080220.110947-4-maxime.belair@canonical.com>

On Wed, Jul 09, 2025 at 10:00:56AM +0200, Maxime Bélair wrote:
> Enable users to manage AppArmor policies through the new hooks
> lsm_config_self_policy and lsm_config_system_policy.
> 
> lsm_config_self_policy allows stacking existing policies in the kernel.
> This ensures that it can only further restrict the caller and can never
> be used to gain new privileges.
> 
> lsm_config_system_policy allows loading or replacing AppArmor policies in
> any AppArmor namespace.
> 
> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
> ---
>  security/apparmor/apparmorfs.c         | 31 ++++++++++
>  security/apparmor/include/apparmor.h   |  4 ++
>  security/apparmor/include/apparmorfs.h |  3 +
>  security/apparmor/lsm.c                | 84 ++++++++++++++++++++++++++
>  4 files changed, 122 insertions(+)
> 

> diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
> index 9b6c2f157f83..0ce40290f44e 100644
> --- a/security/apparmor/lsm.c
> +++ b/security/apparmor/lsm.c
> @@ -1275,6 +1275,86 @@ static int apparmor_socket_shutdown(struct socket *sock, int how)
>  	return aa_sock_perm(OP_SHUTDOWN, AA_MAY_SHUTDOWN, sock);
>  }
>  
> +/**
> + * apparmor_lsm_config_self_policy - Stack a profile
> + * @lsm_id: AppArmor ID (LSM_ID_APPARMOR). Unused here
> + * @op: operation to perform. Currently, only LSM_POLICY_LOAD is supported
> + * @buf: buffer containing the user-provided name of the profile to stack
> + * @size: size of @buf
> + * @flags: reserved for future use; must be zero
> + *
> + * Returns: 0 on success, negative value on error
> + */
> +static int apparmor_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
> +				      size_t size, u32 flags)
> +{
> +	char *name;
> +	long name_size;
> +	int ret;
> +


> +	if (op != LSM_POLICY_LOAD || flags)
> +		return -EOPNOTSUPP;
> +	if (size == 0)
> +		return -EINVAL;
> +	if (size > AA_PROFILE_NAME_MAX_SIZE)
> +		return -E2BIG;
> +
> +	name = kmalloc(size, GFP_KERNEL);
> +	if (!name)
> +		return -ENOMEM;

This hunk should be part of the syscall code and shared amongst LSMs.

> +
> +
> +	name_size = strncpy_from_user(name, buf, size);
> +	if (name_size < 0) {
> +		kfree(name);
> +		return name_size;
> +	}
> +
> +	ret = aa_change_profile(name, AA_CHANGE_STACK);
> +
> +	kfree(name);
> +
> +	return ret;
> +}

^ permalink raw reply

* Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
From: Askar Safin @ 2025-08-20 11:53 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-20.1755686261-lurid-sleepy-lime-quarry-j42HLU@cyphar.com>

 ---- On Wed, 20 Aug 2025 14:38:48 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > The reason I wanted to include the comparison is that you can create
 > multiple mount objects from the same underlying object using
 > open_tree(2) but that's not possible with fsmount(2) (at least, not
 > without creating a new filesystem context each time).

Okay, you may write that.

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply
