Linux userland API discussions


* Re: [PATCH v2] vdso: Remove struct getcpu_cache
From: Dave Hansen @ 2025-10-13 14:06 UTC (permalink / raw)
  To: Thomas Weißschuh, Huacai Chen, WANG Xuerui, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Vincenzo Frascino, Shuah Khan
  Cc: loongarch, linux-kernel, linux-s390, linux-um, linux-api,
	linux-kselftest
In-Reply-To: <20251013-getcpu_cache-v2-1-880fbfa3b7cc@linutronix.de>

On 10/13/25 02:20, Thomas Weißschuh wrote:
> -int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
> -int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
> +int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
> +int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
>  {
>  	int cpu_id;

Ideally there would be a _bit_ more history here about how the
parameter became unused and why there is such high confidence that
userspace never tries to use it.

Let's say someone comes along in a few years and wants to use this
'unused' parameter. Could they?
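
For reference, here is a minimal sketch of how a userspace caller might
invoke the raw syscall today, passing NULL in the third slot. The helper
name current_cpu_and_node() is hypothetical and exists only for this
illustration; it is not taken from the patch.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel which CPU and NUMA node the
 * calling thread is running on. The third syscall argument is the
 * historical cache pointer discussed in this thread; NULL is passed.
 */
static long current_cpu_and_node(unsigned int *cpu, unsigned int *node)
{
	return syscall(SYS_getcpu, cpu, node, NULL);
}
```

A caller would check for a zero return value before reading *cpu and
*node. The answer can already be stale by the time it is used, because
the thread may migrate at any moment.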


* Re: [PATCH v2 2/3] initrd: remove deprecated code path (linuxrc)
From: Askar Safin @ 2025-10-13 10:29 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Alexander Graf, Rob Landley, Lennart Poettering, linux-arch,
	linux-block, initramfs, linux-api, linux-doc, Michal Simek,
	Luis Chamberlain, Kees Cook, Thorsten Blum, Heiko Carstens,
	Arnd Bergmann, Dave Young, Christophe Leroy, Krzysztof Kozlowski,
	Borislav Petkov, Jessica Clarke, Nicolas Schichan,
	David Disseldorp, patches
In-Reply-To: <07ae142e-4266-44a3-9aa1-4b2acbd72c1b@infradead.org>

On Fri, Oct 10, 2025 at 10:31 PM Randy Dunlap <rdunlap@infradead.org> wrote:
> There are more places in Documentation/ that refer to "linuxrc".
> Should those also be removed or fixed?
>
> accounting/delay-accounting.rst
> admin-guide/initrd.rst
> driver-api/early-userspace/early_userspace_support.rst
> power/swsusp-dmcrypt.rst
> translations/zh_CN/accounting/delay-accounting.rst

Yes, they should be removed.
I made this patchset minimal to be sure it is easy to revert.
I will remove these linuxrc mentions in a follow-up cleanup patchset.

-- 
Askar Safin


* Re: [PATCH v2 2/3] initrd: remove deprecated code path (linuxrc)
From: Askar Safin @ 2025-10-13  9:59 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Aleksa Sarai, Thomas Weißschuh, Julian Stecklina,
	Gao Xiang, Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <CAHp75VezkZ7A1VOP8cBH8h0DKVumP66jjUbepMCP87wGOrh+MQ@mail.gmail.com>

On Fri, Oct 10, 2025 at 6:05 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> > -       noinitrd        [RAM] Tells the kernel not to load any configured
> > +       noinitrd        [Deprecated,RAM] Tells the kernel not to load any configured
> >                         initial RAM disk.
>
> How one is supposed to run this when just having a kernel is enough?
> At least (ex)colleague of mine was a heavy user of this option for
> testing kernel builds on the real HW.

This option applies only to initrd, not to initramfs, except in EFI
mode, where it applies to both.

I will remove this option when I remove initrd.

In EFI mode it is easy simply not to pass an initramfs, so all is okay.

I will also clarify the docs in v3.

Also, please answer here:
https://lore.kernel.org/regressions/20250918183336.5633-1-safinaskar@gmail.com/

-- 
Askar Safin


* [PATCH v2] vdso: Remove struct getcpu_cache
From: Thomas Weißschuh @ 2025-10-13  9:20 UTC (permalink / raw)
  To: Huacai Chen, WANG Xuerui, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Vincenzo Frascino, Shuah Khan
  Cc: loongarch, linux-kernel, linux-s390, linux-um, linux-api,
	linux-kselftest, Thomas Weißschuh

The cache parameter of getcpu() is not used by the kernel and no user
ever passes it in anyways.

Remove the struct and its header.

As a side-effect we get rid of an unwanted inclusion of the linux/
header namespace from vDSO code.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
Changes in v2:
- Rebase on v6.18-rc1
- Link to v1: https://lore.kernel.org/r/20250826-getcpu_cache-v1-1-8748318f6141@linutronix.de
---
We could also completely remove the parameter, but I am not sure if
that is a good idea for syscalls and vDSO entrypoints.
---
 arch/loongarch/vdso/vgetcpu.c                   |  5 ++---
 arch/s390/kernel/vdso64/getcpu.c                |  3 +--
 arch/s390/kernel/vdso64/vdso.h                  |  4 +---
 arch/x86/entry/vdso/vgetcpu.c                   |  5 ++---
 arch/x86/include/asm/vdso/processor.h           |  4 +---
 arch/x86/um/vdso/um_vdso.c                      |  7 +++----
 include/linux/getcpu.h                          | 19 -------------------
 include/linux/syscalls.h                        |  3 +--
 kernel/sys.c                                    |  4 +---
 tools/testing/selftests/vDSO/vdso_test_getcpu.c |  4 +---
 10 files changed, 13 insertions(+), 45 deletions(-)

diff --git a/arch/loongarch/vdso/vgetcpu.c b/arch/loongarch/vdso/vgetcpu.c
index 5301cd9d0f839eb0fd7b73a1d36e80aaa75d5e76..aefba899873ed035d70766a95b0b6fea881e94df 100644
--- a/arch/loongarch/vdso/vgetcpu.c
+++ b/arch/loongarch/vdso/vgetcpu.c
@@ -4,7 +4,6 @@
  */
 
 #include <asm/vdso.h>
-#include <linux/getcpu.h>
 
 static __always_inline int read_cpu_id(void)
 {
@@ -20,8 +19,8 @@ static __always_inline int read_cpu_id(void)
 }
 
 extern
-int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
-int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
+int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
+int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
 {
 	int cpu_id;
 
diff --git a/arch/s390/kernel/vdso64/getcpu.c b/arch/s390/kernel/vdso64/getcpu.c
index 5c5d4a848b7669436e73df8e3b711e5b876eb3db..1e17665616c5fa766ca66c8de276b212528934bd 100644
--- a/arch/s390/kernel/vdso64/getcpu.c
+++ b/arch/s390/kernel/vdso64/getcpu.c
@@ -2,11 +2,10 @@
 /* Copyright IBM Corp. 2020 */
 
 #include <linux/compiler.h>
-#include <linux/getcpu.h>
 #include <asm/timex.h>
 #include "vdso.h"
 
-int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, void *unused)
 {
 	union tod_clock clk;
 
diff --git a/arch/s390/kernel/vdso64/vdso.h b/arch/s390/kernel/vdso64/vdso.h
index 9e5397e7b590a23c149ccc6043d0c0b0d5ea8457..cadd307d7a365cabf53f5c8d313be3718625533d 100644
--- a/arch/s390/kernel/vdso64/vdso.h
+++ b/arch/s390/kernel/vdso64/vdso.h
@@ -4,9 +4,7 @@
 
 #include <vdso/datapage.h>
 
-struct getcpu_cache;
-
-int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused);
+int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, void *unused);
 int __s390_vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 int __s390_vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
 int __s390_vdso_clock_getres(clockid_t clock, struct __kernel_timespec *ts);
diff --git a/arch/x86/entry/vdso/vgetcpu.c b/arch/x86/entry/vdso/vgetcpu.c
index e4640306b2e3c95d74d73037ab6b09294b8e1d6c..6381b472b7c52487bccf3cbf0664c3d7a0e59699 100644
--- a/arch/x86/entry/vdso/vgetcpu.c
+++ b/arch/x86/entry/vdso/vgetcpu.c
@@ -6,17 +6,16 @@
  */
 
 #include <linux/kernel.h>
-#include <linux/getcpu.h>
 #include <asm/segment.h>
 #include <vdso/processor.h>
 
 notrace long
-__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+__vdso_getcpu(unsigned *cpu, unsigned *node, void *unused)
 {
 	vdso_read_cpunode(cpu, node);
 
 	return 0;
 }
 
-long getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
+long getcpu(unsigned *cpu, unsigned *node, void *tcache)
 	__attribute__((weak, alias("__vdso_getcpu")));
diff --git a/arch/x86/include/asm/vdso/processor.h b/arch/x86/include/asm/vdso/processor.h
index 7000aeb59aa287e2119c3d43ab3eaf82befb59c4..93e0e24e5cb47f7b0056c13f2a7f2304ed4a0595 100644
--- a/arch/x86/include/asm/vdso/processor.h
+++ b/arch/x86/include/asm/vdso/processor.h
@@ -18,9 +18,7 @@ static __always_inline void cpu_relax(void)
 	native_pause();
 }
 
-struct getcpu_cache;
-
-notrace long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused);
+notrace long __vdso_getcpu(unsigned *cpu, unsigned *node, void *unused);
 
 #endif /* __ASSEMBLER__ */
 
diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
index cbae2584124fd0ff0f9d240c33fefb8d213c84cd..9aa2c62cce6b7a07bbaf8441014d347162d1950d 100644
--- a/arch/x86/um/vdso/um_vdso.c
+++ b/arch/x86/um/vdso/um_vdso.c
@@ -10,14 +10,13 @@
 #define DISABLE_BRANCH_PROFILING
 
 #include <linux/time.h>
-#include <linux/getcpu.h>
 #include <asm/unistd.h>
 
 /* workaround for -Wmissing-prototypes warnings */
 int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts);
 int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 __kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
-long __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
+long __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
 
 int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
 {
@@ -60,7 +59,7 @@ __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
 __kernel_old_time_t time(__kernel_old_time_t *t) __attribute__((weak, alias("__vdso_time")));
 
 long
-__vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
+__vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
 {
 	/*
 	 * UML does not support SMP, we can cheat here. :)
@@ -74,5 +73,5 @@ __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused
 	return 0;
 }
 
-long getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *tcache)
+long getcpu(unsigned int *cpu, unsigned int *node, void *tcache)
 	__attribute__((weak, alias("__vdso_getcpu")));
diff --git a/include/linux/getcpu.h b/include/linux/getcpu.h
deleted file mode 100644
index c304dcdb4eac2a9117080e6a14f4e3f28d07fd56..0000000000000000000000000000000000000000
--- a/include/linux/getcpu.h
+++ /dev/null
@@ -1,19 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_GETCPU_H
-#define _LINUX_GETCPU_H 1
-
-/* Cache for getcpu() to speed it up. Results might be a short time
-   out of date, but will be faster.
-
-   User programs should not refer to the contents of this structure.
-   I repeat they should not refer to it. If they do they will break
-   in future kernels.
-
-   It is only a private cache for vgetcpu(). It will change in future kernels.
-   The user program must store this information per thread (__thread)
-   If you want 100% accurate information pass NULL instead. */
-struct getcpu_cache {
-	unsigned long blob[128 / sizeof(long)];
-};
-
-#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 66c06fcdfe19e27b99eb9a187c22e022e260802f..403488e5eba906ecf40975fc3cb29ed0402491f2 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -59,7 +59,6 @@ struct compat_stat;
 struct old_timeval32;
 struct robust_list_head;
 struct futex_waitv;
-struct getcpu_cache;
 struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
@@ -714,7 +713,7 @@ asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
 asmlinkage long sys_umask(int mask);
 asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
-asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, void __user *cache);
 asmlinkage long sys_gettimeofday(struct __kernel_old_timeval __user *tv,
 				struct timezone __user *tz);
 asmlinkage long sys_settimeofday(struct __kernel_old_timeval __user *tv,
diff --git a/kernel/sys.c b/kernel/sys.c
index 8b58eece4e580b883d19bb1336aff627ae783a4d..f1780ab132a3fbce6aac937ade5b9a35d9837f13 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -31,7 +31,6 @@
 #include <linux/tty.h>
 #include <linux/signal.h>
 #include <linux/cn_proc.h>
-#include <linux/getcpu.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/seccomp.h>
 #include <linux/cpu.h>
@@ -2876,8 +2875,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	return error;
 }
 
-SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep,
-		struct getcpu_cache __user *, unused)
+SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep, void __user *, unused)
 {
 	int err = 0;
 	int cpu = raw_smp_processor_id();
diff --git a/tools/testing/selftests/vDSO/vdso_test_getcpu.c b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
index cdeaed45fb26c61f6314c58fe1b71fa0be3c0108..994ce569dc37c6689b1a3c79156e3dfc8bf27f22 100644
--- a/tools/testing/selftests/vDSO/vdso_test_getcpu.c
+++ b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
@@ -16,9 +16,7 @@
 #include "vdso_config.h"
 #include "vdso_call.h"
 
-struct getcpu_cache;
-typedef long (*getcpu_t)(unsigned int *, unsigned int *,
-			 struct getcpu_cache *);
+typedef long (*getcpu_t)(unsigned int *, unsigned int *, void *);
 
 int main(int argc, char **argv)
 {

---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20250825-getcpu_cache-3abcd2e65437

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh@linutronix.de>



* Re: [PATCH v2 1/3] init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command line parameters
From: Askar Safin @ 2025-10-13  6:05 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Aleksa Sarai, Thomas Weißschuh, Julian Stecklina,
	Gao Xiang, Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <CAHp75VeJM_OoCWDX20FhphRi6e7rG9Z4X6zkjx9vFF12n7Ef7A@mail.gmail.com>

On Fri, Oct 10, 2025 at 6:02 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> 1) often the last period is missing in the commit messages;
I will fix this in v3.

> 2) in this change it's unclear for how long (years) the feature was
> deprecated, i.e. the other patch states that 2020 for something else.
> I wonder if this one has the similar order of oldness.

These two commits were also made in 2020. I will fix this in v3.

--
Askar Safin


* Re: [PATCH v6 1/5] Wire up lsm_config_self_policy and lsm_config_system_policy syscalls
From: kernel test robot @ 2025-10-11 12:07 UTC (permalink / raw)
  To: Maxime Bélair, linux-security-module
  Cc: oe-kbuild-all, john.johansen, paul, jmorris, serge, mic, kees,
	stephen.smalley.work, casey, takedakn, penguin-kernel, song,
	rdunlap, linux-api, apparmor, linux-kernel, Maxime Bélair
In-Reply-To: <20251010132610.12001-2-maxime.belair@canonical.com>

Hi Maxime,

kernel test robot noticed the following build errors:

[auto build test ERROR on 9c32cda43eb78f78c73aee4aa344b777714e259b]

url:    https://github.com/intel-lab-lkp/linux/commits/Maxime-B-lair/Wire-up-lsm_config_self_policy-and-lsm_config_system_policy-syscalls/20251010-213606
base:   9c32cda43eb78f78c73aee4aa344b777714e259b
patch link:    https://lore.kernel.org/r/20251010132610.12001-2-maxime.belair%40canonical.com
patch subject: [PATCH v6 1/5] Wire up lsm_config_self_policy and lsm_config_system_policy syscalls
config: sh-randconfig-001-20251011 (https://download.01.org/0day-ci/archive/20251011/202510111947.0ObJ6YUH-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251011/202510111947.0ObJ6YUH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510111947.0ObJ6YUH-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/umh.c:9:0:
>> include/linux/syscalls.h:994:45: error: expected ';', ',' or ')' before 'u32'
              u32 __user size, u32 common_flags u32 flags);
                                                ^~~
--
   In file included from kernel/fork.c:56:0:
>> include/linux/syscalls.h:994:45: error: expected ';', ',' or ')' before 'u32'
              u32 __user size, u32 common_flags u32 flags);
                                                ^~~
   kernel/fork.c: In function '__do_sys_clone3':
   kernel/fork.c:3135:2: warning: #warning clone3() entry point is missing, please fix [-Wcpp]
    #warning clone3() entry point is missing, please fix
     ^~~~~~~


vim +994 include/linux/syscalls.h

   817	
   818	/* CONFIG_MMU only */
   819	asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags);
   820	asmlinkage long sys_swapoff(const char __user *specialfile);
   821	asmlinkage long sys_mprotect(unsigned long start, size_t len,
   822					unsigned long prot);
   823	asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
   824	asmlinkage long sys_mlock(unsigned long start, size_t len);
   825	asmlinkage long sys_munlock(unsigned long start, size_t len);
   826	asmlinkage long sys_mlockall(int flags);
   827	asmlinkage long sys_munlockall(void);
   828	asmlinkage long sys_mincore(unsigned long start, size_t len,
   829					unsigned char __user * vec);
   830	asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
   831	asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
   832				size_t vlen, int behavior, unsigned int flags);
   833	asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
   834	asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
   835				unsigned long prot, unsigned long pgoff,
   836				unsigned long flags);
   837	asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned long flags);
   838	asmlinkage long sys_mbind(unsigned long start, unsigned long len,
   839					unsigned long mode,
   840					const unsigned long __user *nmask,
   841					unsigned long maxnode,
   842					unsigned flags);
   843	asmlinkage long sys_get_mempolicy(int __user *policy,
   844					unsigned long __user *nmask,
   845					unsigned long maxnode,
   846					unsigned long addr, unsigned long flags);
   847	asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
   848					unsigned long maxnode);
   849	asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
   850					const unsigned long __user *from,
   851					const unsigned long __user *to);
   852	asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
   853					const void __user * __user *pages,
   854					const int __user *nodes,
   855					int __user *status,
   856					int flags);
   857	asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t  pid, int sig,
   858			siginfo_t __user *uinfo);
   859	asmlinkage long sys_perf_event_open(
   860			struct perf_event_attr __user *attr_uptr,
   861			pid_t pid, int cpu, int group_fd, unsigned long flags);
   862	asmlinkage long sys_accept4(int, struct sockaddr __user *, int __user *, int);
   863	asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
   864				     unsigned int vlen, unsigned flags,
   865				     struct __kernel_timespec __user *timeout);
   866	asmlinkage long sys_recvmmsg_time32(int fd, struct mmsghdr __user *msg,
   867				     unsigned int vlen, unsigned flags,
   868				     struct old_timespec32 __user *timeout);
   869	asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr,
   870					int options, struct rusage __user *ru);
   871	asmlinkage long sys_prlimit64(pid_t pid, unsigned int resource,
   872					const struct rlimit64 __user *new_rlim,
   873					struct rlimit64 __user *old_rlim);
   874	asmlinkage long sys_fanotify_init(unsigned int flags, unsigned int event_f_flags);
   875	#if defined(CONFIG_ARCH_SPLIT_ARG64)
   876	asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
   877	                                unsigned int mask_1, unsigned int mask_2,
   878					int dfd, const char  __user * pathname);
   879	#else
   880	asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
   881					  u64 mask, int fd,
   882					  const char  __user *pathname);
   883	#endif
   884	asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
   885					      struct file_handle __user *handle,
   886					      void __user *mnt_id, int flag);
   887	asmlinkage long sys_open_by_handle_at(int mountdirfd,
   888					      struct file_handle __user *handle,
   889					      int flags);
   890	asmlinkage long sys_clock_adjtime(clockid_t which_clock,
   891					struct __kernel_timex __user *tx);
   892	asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
   893					struct old_timex32 __user *tx);
   894	asmlinkage long sys_syncfs(int fd);
   895	asmlinkage long sys_setns(int fd, int nstype);
   896	asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
   897	asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
   898				     unsigned int vlen, unsigned flags);
   899	asmlinkage long sys_process_vm_readv(pid_t pid,
   900					     const struct iovec __user *lvec,
   901					     unsigned long liovcnt,
   902					     const struct iovec __user *rvec,
   903					     unsigned long riovcnt,
   904					     unsigned long flags);
   905	asmlinkage long sys_process_vm_writev(pid_t pid,
   906					      const struct iovec __user *lvec,
   907					      unsigned long liovcnt,
   908					      const struct iovec __user *rvec,
   909					      unsigned long riovcnt,
   910					      unsigned long flags);
   911	asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
   912				 unsigned long idx1, unsigned long idx2);
   913	asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
   914	asmlinkage long sys_sched_setattr(pid_t pid,
   915						struct sched_attr __user *attr,
   916						unsigned int flags);
   917	asmlinkage long sys_sched_getattr(pid_t pid,
   918						struct sched_attr __user *attr,
   919						unsigned int size,
   920						unsigned int flags);
   921	asmlinkage long sys_renameat2(int olddfd, const char __user *oldname,
   922				      int newdfd, const char __user *newname,
   923				      unsigned int flags);
   924	asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
   925				    void __user *uargs);
   926	asmlinkage long sys_getrandom(char __user *buf, size_t count,
   927				      unsigned int flags);
   928	asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
   929	asmlinkage long sys_bpf(int cmd, union bpf_attr __user *attr, unsigned int size);
   930	asmlinkage long sys_execveat(int dfd, const char __user *filename,
   931				const char __user *const __user *argv,
   932				const char __user *const __user *envp, int flags);
   933	asmlinkage long sys_userfaultfd(int flags);
   934	asmlinkage long sys_membarrier(int cmd, unsigned int flags, int cpu_id);
   935	asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
   936	asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
   937					    int fd_out, loff_t __user *off_out,
   938					    size_t len, unsigned int flags);
   939	asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
   940				    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
   941				    rwf_t flags);
   942	asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
   943				    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
   944				    rwf_t flags);
   945	asmlinkage long sys_pkey_mprotect(unsigned long start, size_t len,
   946					  unsigned long prot, int pkey);
   947	asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
   948	asmlinkage long sys_pkey_free(int pkey);
   949	asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
   950				  unsigned mask, struct statx __user *buffer);
   951	asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
   952				 int flags, uint32_t sig);
   953	asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
   954	asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
   955					   unsigned flags,
   956					   struct mount_attr __user *uattr,
   957					   size_t usize);
   958	asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
   959				       int to_dfd, const char __user *to_path,
   960				       unsigned int ms_flags);
   961	asmlinkage long sys_mount_setattr(int dfd, const char __user *path,
   962					  unsigned int flags,
   963					  struct mount_attr __user *uattr, size_t usize);
   964	asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
   965	asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
   966				     const void __user *value, int aux);
   967	asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
   968	asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
   969	asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
   970					       siginfo_t __user *info,
   971					       unsigned int flags);
   972	asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
   973	asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr __user *attr,
   974			size_t size, __u32 flags);
   975	asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
   976			const void __user *rule_attr, __u32 flags);
   977	asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
   978	asmlinkage long sys_memfd_secret(unsigned int flags);
   979	asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
   980						    unsigned long home_node,
   981						    unsigned long flags);
   982	asmlinkage long sys_cachestat(unsigned int fd,
   983			struct cachestat_range __user *cstat_range,
   984			struct cachestat __user *cstat, unsigned int flags);
   985	asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
   986	asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
   987					      u32 __user *size, u32 flags);
   988	asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
   989					      u32 size, u32 flags);
   990	asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
   991	asmlinkage long sys_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
   992						   u32 __user size, u32 common_flags, u32 flags);
   993	asmlinkage long sys_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
 > 994						     u32 __user size, u32 common_flags u32 flags);
   995	
   996	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Andy Lutomirski @ 2025-10-11  4:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <aOm0WCB_woFgnv0v@dread.disaster.area>

On Fri, Oct 10, 2025 at 6:35 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 08, 2025 at 02:51:14PM -0700, Andy Lutomirski wrote:
> > On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > > > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> >
> > >
> > > You are conflating "synchronous update" with "blocking".
> > >
> > > Avoiding the need for synchronous timestamp updates is exactly what
> > > the lazytime mount option provides. i.e. lazytime degrades immediate
> > > consistency requirements to eventual consistency similar to how the
> > > default relatime behaviour defers atime updates for eventual
> > > writeback.
> > >
> > > IOWs, we've already largely addressed the synchronous c/mtime update
> > > problem but what we haven't done is made timestamp updates
> > > fully support non-blocking caller semantics. That's a separate
> > > problem...
> >
> > I'm probably missing something, but is this really different?
>
> Yes, and yes.
>
> > Either the mtime update can block or it can't block.
>
> Sure, but that's not the issue we have to deal with.
>
> In many filesystems and fs operations, we have to know if an
> operation is going to block -before- we start the operation. e.g.
> transactional changes cannot be rolled back once we've started the
> modification if they need to block to make progress (e.g. read in
> on-disk metadata).
>
> This foresight, in many cases, is -unknowable-. Even though the
> operation /likely/ won't block, we cannot *guarantee* ahead of time
> that any given instance of the operation will /not/ block.  Hence
> the reliable non-blocking operation that users are asking for is not
> possible with unknowable implementation characteristics like this.
>
> IOWs, a timestamp update implementation can be synchronous and
> reliably non-blocking if it always knows when blocking will occur
> and can return -EAGAIN instead of blocking to complete the
> operation.
>
> If it can't know when/if blocking will occur, then lazytime allows
> us to defer the (potentially) blocking update operation to another
> context that can block. Queuing for async processing can easily be
> made non-blocking, and __mark_inode_dirty(I_DIRTY_TIME) does this
> for us.
>
> So, yeah, it should be pretty obvious at this point that non-blocking
> implementation is completely independent of whether the operation is
> performed synchronously or asynchronously. It's easier to make async
> operations non-blocking, but that doesn't mean "non_blocking" and
> "asynchronous execution" are interchangeable terms or behaviours.
>
> > I haven't dug all the
> > way into exactly what happens in __mark_inode_dirty(), but there is a
> > lot going on in there even in the I_DIRTY_TIME path.
>
> It's pretty simple, really.  __mark_inode_dirty(I_DIRTY_TIME) is
> non-blocking and queues the inode on the wb->i_dirty_time queue
> for later processing.
>

First, I apologize if I'm off base here.

Second, I don't think I'm entirely nuts, and I'm moderately confident
that, ten-ish years ago, I tested lazytime in the hopes that it would
solve my old problem, and IIRC it didn't help.  I was running a
production workload on ext4 on regrettably slow spinning rust backed
by a truly atrocious HPE controller.  And I was running latencytop to
generate little traces when my task got blocked, and there was no form
of AIO involved.  (And I don't really understand how AIO is wired up
internally...  And yes, in retrospect I should not have been using
shared-writable mmaps or even file-backed things at all for what I was
doing, but I had unrealistic expectations of how mmap worked when I
wrote that code more like 20 years ago, and I wasn't even using Linux
at the time I wrote it.)

I'm looking at the code now, and I see what you're talking about, and
__mark_inode_dirty(inode, I_DIRTY_TIME) looks fairly polite and like
it won't block.  But the relevant code seems to be:

int generic_update_time(struct inode *inode, int flags)
{
        int updated = inode_update_timestamps(inode, flags);
        int dirty_flags = 0;

        if (updated & (S_ATIME|S_MTIME|S_CTIME))
                dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ?
                        I_DIRTY_TIME : I_DIRTY_SYNC;
        if (updated & S_VERSION)
                dirty_flags |= I_DIRTY_SYNC;
        __mark_inode_dirty(inode, dirty_flags);
        ...

inode_update_timestamps does this, where updated != 0 if the timestamp
actually changed (which is subject to some complex coarse-graining
logic so it may only happen some of the time):

                if (IS_I_VERSION(inode) &&
                    inode_maybe_inc_iversion(inode, updated))
                        updated |= S_VERSION;

IS_I_VERSION seems to be unconditionally true on ext4.
inode_maybe_inc_iversion always returns true if updated is set, so
generic_update_time has a decent chance of doing
__mark_inode_dirty(inode, I_DIRTY_SYNC), which calls
s_op->dirty_inode, which calls ext4_journal_start, which, from my
recollection a decade ago, could easily block for a good second or so
on my delightful, now retired, HP/HPE system.
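
A small userspace model of that flag selection may make the concern
concrete. It only mirrors the snippet quoted above: the flag values are
placeholders picked for the sketch (the kernel's own definitions may
differ), and model_dirty_flags() and model_takes_sync_path() are
made-up names, not kernel functions.

```c
#include <stdbool.h>

/* Placeholder flag values for this sketch only. */
#define S_ATIME      (1u << 0)
#define S_MTIME      (1u << 1)
#define S_CTIME      (1u << 2)
#define S_VERSION    (1u << 3)

#define I_DIRTY_SYNC (1u << 0)
#define I_DIRTY_TIME (1u << 11)

/*
 * Model of the flag selection in the generic_update_time() snippet.
 * "updated" plays the role of the mask returned by
 * inode_update_timestamps(); "lazytime" stands in for the
 * SB_LAZYTIME test on the superblock.
 */
static unsigned int model_dirty_flags(unsigned int updated, bool lazytime)
{
	unsigned int dirty_flags = 0;

	if (updated & (S_ATIME | S_MTIME | S_CTIME))
		dirty_flags = lazytime ? I_DIRTY_TIME : I_DIRTY_SYNC;
	if (updated & S_VERSION)
		dirty_flags |= I_DIRTY_SYNC;
	return dirty_flags;
}

/* True if the update would be marked I_DIRTY_SYNC (the ->dirty_inode path). */
static bool model_takes_sync_path(unsigned int updated, bool lazytime)
{
	return (model_dirty_flags(updated, lazytime) & I_DIRTY_SYNC) != 0;
}
```

In this model, a pure timestamp change under lazytime yields only
I_DIRTY_TIME, but as soon as S_VERSION is also set the result contains
I_DIRTY_SYNC, which is the case described above.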

In my case, I think this is the path that was blocking for me in lots
of do_wp_page calls that would otherwise not have blocked.  I also
don't see any kiocb passed around or any mechanism by which this code
could know that it's supposed to be nonblocking, although I have
approximately no understanding of Linux AIO and I don't really know
what I should be looking for.

I could try to instrument the code a bit and test to see if I've
analyzed it right in a few days.

--Andy
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Dave Chinner @ 2025-10-11  1:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <CALCETrX-cs5MH3k369q2Fk5Q-pYQfEV6CW3va-4E9vD1CoCaGA@mail.gmail.com>

On Wed, Oct 08, 2025 at 02:51:14PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> 
> >
> > You are conflating "synchronous update" with "blocking".
> >
> > Avoiding the need for synchronous timestamp updates is exactly what
> > the lazytime mount option provides. i.e. lazytime degrades immediate
> > consistency requirements to eventual consistency similar to how the
> > default relatime behaviour defers atime updates for eventual
> > writeback.
> >
> > IOWs, we've already largely addressed the synchronous c/mtime update
> > problem but what we haven't done is made timestamp updates
> > fully support non-blocking caller semantics. That's a separate
> > problem...
> 
> I'm probably missing something, but is this really different?

Yes, and yes.

> Either the mtime update can block or it can't block.

Sure, but that's not the issue we have to deal with.

In many filesystems and fs operations, we have to know if an
operation is going to block -before- we start the operation. e.g.
transactional changes cannot be rolled back once we've started the
modification if they need to block to make progress (e.g. read in
on-disk metadata).

This foresight, in many cases, is -unknowable-. Even though the
operation /likely/ won't block, we cannot *guarantee* ahead of time
that any given instance of the operation will /not/ block.  Hence
the reliable non-blocking operation that users are asking for is not
possible with unknowable implementation characteristics like this.

IOWs, a timestamp update implementation can be synchronous and
reliably non-blocking if it always knows when blocking will occur
and can return -EAGAIN instead of blocking to complete the
operation.

If it can't know when/if blocking will occur, then lazytime allows
us to defer the (potentially) blocking update operation to another
context that can block. Queuing for async processing can easily be
made non-blocking, and __mark_inode_dirty(I_DIRTY_TIME) does this
for us.

So, yeah, it should be pretty obvious at this point that non-blocking
implementation is completely independent of whether the operation is
performed synchronously or asynchronously. It's easier to make async
operations non-blocking, but that doesn't mean "non-blocking" and
"asynchronous execution" are interchangeable terms or behaviours.

> I haven't dug all the
> way into exactly what happens in __mark_inode_dirty(), but there is a
> lot going on in there even in the I_DIRTY_TIME path.

It's pretty simple, really.  __mark_inode_dirty(I_DIRTY_TIME) is
non-blocking and queues the inode on the wb->i_dirty_time queue
for later processing.
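
The distinction between "non-blocking" and "asynchronous" can be
sketched in plain userspace C. This is only a toy model under stated
assumptions: struct model_inode, update_sync_nonblocking(),
update_deferred() and writeback() are hypothetical names, and the
metadata_cached field stands in for "we know ahead of time whether the
update would have to block".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified inode for this sketch. */
struct model_inode {
	long mtime;             /* in-memory timestamp */
	long disk_mtime;        /* timestamp that has been persisted */
	bool metadata_cached;   /* false: persisting would have to block */
	bool queued_dirty_time; /* queued for later processing */
};

/*
 * Synchronous and reliably non-blocking: only possible because the
 * implementation knows up front whether it would block, and refuses
 * before modifying anything.
 */
static int update_sync_nonblocking(struct model_inode *inode, long now)
{
	if (!inode->metadata_cached)
		return -EAGAIN;
	inode->mtime = now;
	inode->disk_mtime = now;
	return 0;
}

/*
 * Lazytime-style deferral: update the in-memory value and queue the
 * potentially blocking part for another context. The caller never waits.
 */
static int update_deferred(struct model_inode *inode, long now)
{
	inode->mtime = now;
	inode->queued_dirty_time = true;
	return 0;
}

/* Runs later, in a context that is allowed to block. */
static void writeback(struct model_inode *inode)
{
	if (!inode->queued_dirty_time)
		return;
	inode->metadata_cached = true; /* stands in for the blocking read */
	inode->disk_mtime = inode->mtime;
	inode->queued_dirty_time = false;
}
```

Both update functions return without blocking; what differs is where
the persistent update happens. The first does it in the caller or
fails with -EAGAIN, the second always succeeds and leaves the work to
writeback().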

> And Pavel is
> saying that AIO and mtime updates don't play along well.

Again: this is exactly why lazytime was added to XFS *ten years
ago*. From 2015 (issue #3):

https://lore.kernel.org/linux-xfs/CAD-J=zZh1dtJsfrW_Gwxjg+qvkZMu7ED-QOXrMMO6B-G0HY2-A@mail.gmail.com/

(Oh, look, a discussion that starts from a user suggestion of
exposing FMODE_NOCMTIME to userspace apps! Sound familiar?)

> > IOWs, with lazytime, writeback already persists timestamp updates
> > when appropriate for best performance.
> 
> I'm probably doing a bad job explaining myself.

No, I think Christoph and I both understand exactly what you
are trying to describe.

It seems to me that you haven't yet understood that lazytime already
does exactly what you are asking for. Hence you think we don't
understand the "lazytime" concept you are proposing and keep trying
to reinvent lazytime to convince us that we need "lazytime"
functionality in the kernel...

> > > Thinking out loud, to handle both write_iter and mmap, there might
> > > need to be two bits: one saying "the timestamp needs to be updated"
> > > and another saying "the timestamp has been updated in the in-memory
> > > inode, but the inode hasn't been dirtied yet".
> >
> > The flag that implements the latter is called I_DIRTY_TIME. We have
> > not implemented the former as that's a userspace visible change of
> > behaviour.
> 
> Maybe that change should be done?  Or not -- it wouldn't be terribly
> hard to have a pair of atomic timestamps in struct inode indicating
> what timestamps we want to write the next time we get around to it.

See, you just reinvented the lazytime mechanism. Again. :/

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply

* Re: [PATCH v6 1/5] Wire up lsm_config_self_policy and lsm_config_system_policy syscalls
From: Casey Schaufler @ 2025-10-10 21:13 UTC (permalink / raw)
  To: Song Liu, Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, mic,
	kees, stephen.smalley.work, takedakn, penguin-kernel, rdunlap,
	linux-api, apparmor, linux-kernel, Casey Schaufler
In-Reply-To: <CAHzjS_uBq8xGCSmHC_kBWi0j8DCdwsy4XtfkH2iH6NygCcChNw@mail.gmail.com>

On 10/10/2025 11:06 AM, Song Liu wrote:
> On Fri, Oct 10, 2025 at 6:27 AM Maxime Bélair
> <maxime.belair@canonical.com> wrote:
> [...]
>> --- a/security/lsm_syscalls.c
>> +++ b/security/lsm_syscalls.c
>> @@ -118,3 +118,15 @@ SYSCALL_DEFINE3(lsm_list_modules, u64 __user *, ids, u32 __user *, size,
>>
>>         return lsm_active_cnt;
>>  }
>> +
>> +SYSCALL_DEFINE6(lsm_config_self_policy, u32, lsm_id, u32, op, void __user *,
>> +               buf, u32 __user, size, u32, common_flags, u32, flags)
>> +{
>> +       return 0;
>> +}
>> +
>> +SYSCALL_DEFINE6(lsm_config_system_policy, u32, lsm_id, u32, op, void __user *,
>> +               buf, u32 __user, size, u32, common_flags, u32, flags)
>> +{
>> +       return 0;
>> +}
> These two APIs look the same. Why not just keep one API and use
> one bit in the flag to differentiate "self" vs. "system"?

I think that's a valid point.

>
> Thanks,
> Song
>

^ permalink raw reply

* Re: [PATCH v2 2/3] initrd: remove deprecated code path (linuxrc)
From: Randy Dunlap @ 2025-10-10 19:31 UTC (permalink / raw)
  To: Askar Safin, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251010094047.3111495-3-safinaskar@gmail.com>

Hi,

On 10/10/25 2:40 AM, Askar Safin wrote:
> Remove linuxrc initrd code path, which was deprecated in 2020.
> 
> Initramfs and (non-initial) RAM disks (i. e. brd) still work.
> 
> Both built-in and bootloader-supplied initramfs still work.
> 
> Non-linuxrc initrd code path (i. e. using /dev/ram as final root
> filesystem) still works, but I put deprecation message into it
> 
> Signed-off-by: Askar Safin <safinaskar@gmail.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |  4 +-
>  fs/init.c                                     | 14 ---
>  include/linux/init_syscalls.h                 |  1 -
>  include/linux/initrd.h                        |  2 -
>  init/do_mounts.c                              |  4 +-
>  init/do_mounts.h                              | 18 +---
>  init/do_mounts_initrd.c                       | 85 ++-----------------
>  init/do_mounts_rd.c                           | 17 +---
>  8 files changed, 17 insertions(+), 128 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 521ab3425504..24d8899d8a39 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4285,7 +4285,7 @@
>  			Note that this argument takes precedence over
>  			the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
>  
> -	noinitrd	[RAM] Tells the kernel not to load any configured
> +	noinitrd	[Deprecated,RAM] Tells the kernel not to load any configured
>  			initial RAM disk.
>  
>  	nointremap	[X86-64,Intel-IOMMU,EARLY] Do not enable interrupt
> @@ -5299,7 +5299,7 @@
>  	ramdisk_size=	[RAM] Sizes of RAM disks in kilobytes
>  			See Documentation/admin-guide/blockdev/ramdisk.rst.
>  
> -	ramdisk_start=	[RAM] RAM disk image start address
> +	ramdisk_start=	[Deprecated,RAM] RAM disk image start address
>  
>  	random.trust_cpu=off
>  			[KNL,EARLY] Disable trusting the use of the CPU's

There are more places in Documentation/ that refer to "linuxrc".
Should those also be removed or fixed?

accounting/delay-accounting.rst
admin-guide/initrd.rst
driver-api/early-userspace/early_userspace_support.rst
power/swsusp-dmcrypt.rst
translations/zh_CN/accounting/delay-accounting.rst


Thanks.



^ permalink raw reply

* Re: [PATCH v6 1/5] Wire up lsm_config_self_policy and lsm_config_system_policy syscalls
From: Song Liu @ 2025-10-10 18:06 UTC (permalink / raw)
  To: Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, mic,
	kees, stephen.smalley.work, casey, takedakn, penguin-kernel, song,
	rdunlap, linux-api, apparmor, linux-kernel
In-Reply-To: <20251010132610.12001-2-maxime.belair@canonical.com>

On Fri, Oct 10, 2025 at 6:27 AM Maxime Bélair
<maxime.belair@canonical.com> wrote:
[...]
> --- a/security/lsm_syscalls.c
> +++ b/security/lsm_syscalls.c
> @@ -118,3 +118,15 @@ SYSCALL_DEFINE3(lsm_list_modules, u64 __user *, ids, u32 __user *, size,
>
>         return lsm_active_cnt;
>  }
> +
> +SYSCALL_DEFINE6(lsm_config_self_policy, u32, lsm_id, u32, op, void __user *,
> +               buf, u32 __user, size, u32, common_flags, u32, flags)
> +{
> +       return 0;
> +}
> +
> +SYSCALL_DEFINE6(lsm_config_system_policy, u32, lsm_id, u32, op, void __user *,
> +               buf, u32 __user, size, u32, common_flags, u32, flags)
> +{
> +       return 0;
> +}

These two APIs look the same. Why not just keep one API and use
one bit in the flag to differentiate "self" vs. "system"?
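
A rough userspace sketch of what such a merged entry point might
decode, purely as an illustration: MODEL_LSM_CONFIG_SYSTEM, enum
model_scope and model_decode_scope() are hypothetical names that
appear neither in the posted patch nor in any UAPI header.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bit selecting the system-wide policy. */
#define MODEL_LSM_CONFIG_SYSTEM (1u << 0)

enum model_scope { MODEL_SCOPE_SELF, MODEL_SCOPE_SYSTEM };

/*
 * Decode the target scope from common_flags. Unknown bits are
 * rejected so that they remain available for later extensions.
 */
static int model_decode_scope(uint32_t common_flags, enum model_scope *scope)
{
	if (common_flags & ~MODEL_LSM_CONFIG_SYSTEM)
		return -EINVAL;
	*scope = (common_flags & MODEL_LSM_CONFIG_SYSTEM) ?
		 MODEL_SCOPE_SYSTEM : MODEL_SCOPE_SELF;
	return 0;
}
```

With something like this, a single syscall would dispatch to the
"self" or "system" hook after decoding the flag.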

Thanks,
Song

^ permalink raw reply

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Andy Lutomirski @ 2025-10-10 17:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
	linux-xfs
In-Reply-To: <aOiZX9iqZnf9jUdQ@infradead.org>

On Thu, Oct 9, 2025 at 10:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > > > > Well, we'll need to look into that, including maybe non-blocking
> > > > > timestamp updates.
> > > > >
> > > >
> > > > It's been 12 years (!), but maybe it's time to reconsider this:
> > > >
> > > > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
> > >
> > > I don't see how that is relevant here.  Also writes through shared
> > > mmaps are problematic for so many reasons that I'm not sure we want
> > > to encourage people to use that more.
> > >
> >
> > Because the same exact issue exists in the normal non-mmap write path,
> > and I can even quote you upthread :)
>
> The thread that started this is about io_uring nonblock writes, aka
> O_DIRECT.  So there isn't any writeback to defer to.

I haven't followed all the internal details, but RWF_DONTCACHE is
looking pretty good these days, and it does go through the writeback
path.  I wonder if it's getting good enough that most or all O_DIRECT
users could switch to using it.

--Andy

^ permalink raw reply

* Re: [PATCH v6 5/5] Smack: add support for lsm_config_self_policy and lsm_config_system_policy
From: Casey Schaufler @ 2025-10-10 15:15 UTC (permalink / raw)
  To: Maxime Bélair, linux-security-module
  Cc: john.johansen, paul, jmorris, serge, mic, kees,
	stephen.smalley.work, takedakn, penguin-kernel, song, rdunlap,
	linux-api, apparmor, linux-kernel, Casey Schaufler
In-Reply-To: <20251010132610.12001-6-maxime.belair@canonical.com>

On 10/10/2025 6:25 AM, Maxime Bélair wrote:
> Enable users to manage Smack policies through the new hooks
> lsm_config_self_policy and lsm_config_system_policy.
>
> lsm_config_self_policy allows adding Smack policies for the current cred.
> For now it remains restricted to CAP_MAC_ADMIN.
>
> lsm_config_system_policy allows adding global Smack policies. This is
> restricted to CAP_MAC_ADMIN.
>
> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>

I will be reviewing these patches, but will not be able to do so
until early November. I know how frustrating review delays can be,
but it really can't be helped this time around. Thank you for your
patience.

> ---
>  security/smack/smack.h     |  8 +++++
>  security/smack/smack_lsm.c | 73 ++++++++++++++++++++++++++++++++++++++
>  security/smack/smackfs.c   |  2 +-
>  3 files changed, 82 insertions(+), 1 deletion(-)
>
> diff --git a/security/smack/smack.h b/security/smack/smack.h
> index bf6a6ed3946c..3e3d30dfdcf7 100644
> --- a/security/smack/smack.h
> +++ b/security/smack/smack.h
> @@ -275,6 +275,14 @@ struct smk_audit_info {
>  #endif
>  };
>  
> +/*
> + * This function is in smackfs.c
> + */
> +ssize_t smk_write_rules_list(struct file *file, const char __user *buf,
> +			     size_t count, loff_t *ppos,
> +			     struct list_head *rule_list,
> +			     struct mutex *rule_lock, int format);
> +
>  /*
>   * These functions are in smack_access.c
>   */
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index 99833168604e..bf4bb2242768 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -5027,6 +5027,76 @@ static int smack_uring_cmd(struct io_uring_cmd *ioucmd)
>  
>  #endif /* CONFIG_IO_URING */
>  
> +/**
> + * smack_lsm_config_system_policy - Configure a system smack policy
> + * @op: operation to perform. Currently, only LSM_POLICY_LOAD is supported
> + * @buf: User-supplied buffer in the form "<fmt><policy>"
> + *        <fmt> is the 1-byte format of <policy>
> + *        <policy> is the policy to load
> + * @size: size of @buf
> + * @flags: reserved for future use; must be zero
> + *
> + * Returns: number of written rules on success, negative value on error
> + */
> +static int smack_lsm_config_system_policy(u32 op, void __user *buf, size_t size,
> +					  u32 flags)
> +{
> +	loff_t pos = 0;
> +	u8 fmt;
> +
> +	if (op != LSM_POLICY_LOAD || flags)
> +		return -EOPNOTSUPP;
> +
> +	if (size < 2)
> +		return -EINVAL;
> +
> +	if (get_user(fmt, (uint8_t *)buf))
> +		return -EFAULT;
> +
> +	return smk_write_rules_list(NULL, buf + 1, size - 1, &pos, NULL, NULL, fmt);
> +}
> +
> +/**
> + * smack_lsm_config_self_policy - Configure a smack policy for the current cred
> + * @op: operation to perform. Currently, only LSM_POLICY_LOAD is supported
> + * @buf: User-supplied buffer in the form "<fmt><policy>"
> + *        <fmt> is the 1-byte format of <policy>
> + *        <policy> is the policy to load
> + * @size: size of @buf
> + * @flags: reserved for future use; must be zero
> + *
> + * Returns: number of written rules on success, negative value on error
> + */
> +static int smack_lsm_config_self_policy(u32 op, void __user *buf, size_t size,
> +					u32 flags)
> +{
> +	loff_t pos = 0;
> +	u8 fmt;
> +	struct task_smack *tsp;
> +
> +	if (op != LSM_POLICY_LOAD || flags)
> +		return -EOPNOTSUPP;
> +
> +	if (size < 2)
> +		return -EINVAL;
> +
> +	if (get_user(fmt, (uint8_t *)buf))
> +		return -EFAULT;
> +	/**
> +	 * smk_write_rules_list could be used to gain privileges.
> +	 * This function is thus restricted to CAP_MAC_ADMIN.
> +	 * TODO: Ensure that the new rule does not give extra privileges
> +	 * before dropping this CAP_MAC_ADMIN check.
> +	 */
> +	if (!capable(CAP_MAC_ADMIN))
> +		return -EPERM;
> +
> +
> +	tsp = smack_cred(current_cred());
> +	return smk_write_rules_list(NULL, buf + 1, size - 1, &pos, &tsp->smk_rules,
> +				    &tsp->smk_rules_lock, fmt);
> +}
> +
>  struct lsm_blob_sizes smack_blob_sizes __ro_after_init = {
>  	.lbs_cred = sizeof(struct task_smack),
>  	.lbs_file = sizeof(struct smack_known *),
> @@ -5203,6 +5273,9 @@ static struct security_hook_list smack_hooks[] __ro_after_init = {
>  	LSM_HOOK_INIT(uring_sqpoll, smack_uring_sqpoll),
>  	LSM_HOOK_INIT(uring_cmd, smack_uring_cmd),
>  #endif
> +	LSM_HOOK_INIT(lsm_config_self_policy, smack_lsm_config_self_policy),
> +	LSM_HOOK_INIT(lsm_config_system_policy, smack_lsm_config_system_policy),
> +
>  };
>  
>  
> diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c
> index 90a67e410808..ed1814588d56 100644
> --- a/security/smack/smackfs.c
> +++ b/security/smack/smackfs.c
> @@ -441,7 +441,7 @@ static ssize_t smk_parse_long_rule(char *data, struct smack_parsed_rule *rule,
>   *	"subject<whitespace>object<whitespace>
>   *	 acc_enable<whitespace>acc_disable[<whitespace>...]"
>   */
> -static ssize_t smk_write_rules_list(struct file *file, const char __user *buf,
> +ssize_t smk_write_rules_list(struct file *file, const char __user *buf,
>  					size_t count, loff_t *ppos,
>  					struct list_head *rule_list,
>  					struct mutex *rule_lock, int format)

^ permalink raw reply

* Re: [PATCH v2 2/3] initrd: remove deprecated code path (linuxrc)
From: Andy Shevchenko @ 2025-10-10 15:04 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Aleksa Sarai, Thomas Weißschuh, Julian Stecklina,
	Gao Xiang, Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251010094047.3111495-3-safinaskar@gmail.com>

On Fri, Oct 10, 2025 at 12:42 PM Askar Safin <safinaskar@gmail.com> wrote:
>
> Remove linuxrc initrd code path, which was deprecated in 2020.
>
> Initramfs and (non-initial) RAM disks (i. e. brd) still work.
>
> Both built-in and bootloader-supplied initramfs still work.
>
> Non-linuxrc initrd code path (i. e. using /dev/ram as final root
> filesystem) still works, but I put deprecation message into it

...

> -       noinitrd        [RAM] Tells the kernel not to load any configured
> +       noinitrd        [Deprecated,RAM] Tells the kernel not to load any configured
>                         initial RAM disk.

How is one supposed to handle the case where having just a kernel is
enough? At least one (ex-)colleague of mine was a heavy user of this
option for testing kernel builds on real HW.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* Re: [PATCH v2 1/3] init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command line parameters
From: Andy Shevchenko @ 2025-10-10 15:02 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Aleksa Sarai, Thomas Weißschuh, Julian Stecklina,
	Gao Xiang, Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251010094047.3111495-2-safinaskar@gmail.com>

On Fri, Oct 10, 2025 at 12:42 PM Askar Safin <safinaskar@gmail.com> wrote:
>
> ...which do nothing. They were deprecated (in documentation) in
> 6b99e6e6aa62 ("Documentation/admin-guide: blockdev/ramdisk: remove use of
> "rdev"") and in kernel messages in c8376994c86c ("initrd: remove support
> for multiple floppies")

With all due respect to the work and the series, I have noted the following:
1) the final period is often missing in the commit messages;
2) in this change it's unclear for how long (in years) the feature has
been deprecated; the other patch says 2020, but that is for something else.
I wonder if this one is of a similar age.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-10-10 15:02 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Samiullah Khawaja, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bBxMpb=jXy3-i19PdBHqxLoLrMMg1sOnditOYwNe1Fr+w@mail.gmail.com>

On Fri, Oct 10, 2025 at 10:58:00AM -0400, Pasha Tatashin wrote:

> With that, I would assume KVM itself would drive the live update and
> would make LUO calls to preserve the resources in an orderly fashion
> and then restore them in the same order during boot.

I don't think so, it should always be sequenced by userspace, and KVM
is not the thing linked to VFIO or IOMMUFD, that's backwards.

Jason

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-10-10 15:01 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu, hughd, skhawaja,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bB6F634HCw_N5z9E5r_LpbGJrucuFb_5fL4da5_W99e4Q@mail.gmail.com>

On Thu, Oct 09, 2025 at 07:50:12PM -0400, Pasha Tatashin wrote:
> >   This can look something like:
> >
> >   hugetlb_luo_preserve_folio(folio, ...);
> >
> >   Nice and simple.
> >
> >   Compare this with the new proposed API:
> >
> >   liveupdate_fh_global_state_get(h, &hugetlb_data);
> >   // This will have update serialized state now.
> >   hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
> >   liveupdate_fh_global_state_put(h);
> >
> >   We do the same thing but in a very complicated way.
> >
> > - When the system-wide preserve happens, the hugetlb subsystem gets a
> >   callback to serialize. It converts its runtime global state to
> >   serialized state since now it knows no more FDs will be added.
> >
> >   With the new API, this doesn't need to be done since each FD prepare
> >   already updates serialized state.
> >
> > - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
> >   anything in LUO. This is the same as the new API.
> >
> > - If some hugetlb FDs are not restored after liveupdate and the finish
> >   event is triggered, the subsystem gets its finish() handler called and
> >   it can free things up.
> >
> >   I don't get how that would work with the new API.
> 
> The new API isn't more complicated; it codifies the common pattern of
> "create on first use, destroy on last use" into a reusable helper,
> saving each file handler from having to reinvent the same reference
> counting and locking scheme. But, as you point out, subsystems provide
> more control, specifically they handle full creation/free instead of
> relying on file-handlers for that.

I'd say hugetlb *should* be doing the more complicated thing. We
should not have global static data for luo floating around the kernel,
this is too easily abused in bad ways.

The above "complicated" sequence forces the caller to have a fd
session handle, and "hides" the global state inside luo so the
subsystem can't just randomly reach into it whenever it likes.

This is a deliberate and violent way to force clean coding practices
and good layering.
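
The "create on first use, destroy on last use" pattern behind such a
get/put pair can be sketched in userspace C. This is a single-threaded
toy under stated assumptions, not the LUO implementation: struct
model_holder, model_state_get() and model_state_put() are hypothetical
names, and the locking a real helper would need is left out.

```c
#include <stdlib.h>

/* Hypothetical per-subsystem state that exists only while in use. */
struct model_global_state {
	int nr_preserved;
};

/* Owner of the state; callers only reach the state through get/put. */
struct model_holder {
	struct model_global_state *state; /* NULL while unused */
	unsigned int refs;
};

/* Create on first use; later callers share the same object. */
static struct model_global_state *model_state_get(struct model_holder *h)
{
	if (h->refs == 0) {
		h->state = calloc(1, sizeof(*h->state));
		if (!h->state)
			return NULL;
	}
	h->refs++;
	return h->state;
}

/* Destroy on last use. */
static void model_state_put(struct model_holder *h)
{
	if (h->refs == 0)
		return;
	if (--h->refs == 0) {
		free(h->state);
		h->state = NULL;
	}
}
```

The state is only reachable between a get and the matching put, which
is the layering property argued for above.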

Not sure why hugetlb pools would need another xarray??

1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
   frozen, can't add/remove PFNs.
2) Require the users of hugetlb memory, like memfd, to
   preserve/restore the folios they are using (using their hugetlb order)
3) Just before kexec, run over the PFN list and mark a bit if the folio
   was preserved by KHO or not. Make sure everything gets KHO
   preserved.

Restore puts the PFNs that were not preserved directly into the free
pool; the end user of a folio, such as the memfd, restores the other
folios and eventually frees them normally.
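
The three steps and the restore rule can be modelled with a small
userspace sketch. Everything here is an assumption made for
illustration: the fixed-size array stands in for the vmalloc'ed PFN
list, the "preserved" flag for the per-folio bit, and the model_pool_*
names are made up. KHO itself is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_POOL_MAX 8

/* One entry per huge page in the pool. */
struct model_pool_entry {
	unsigned long pfn;
	bool preserved; /* set in step 3 if a user preserved this folio */
};

struct model_pool {
	struct model_pool_entry entries[MODEL_POOL_MAX];
	size_t nr;
	bool frozen;
};

/* Step 1: record a PFN; refused once the pool is frozen or full. */
static bool model_pool_add(struct model_pool *pool, unsigned long pfn)
{
	if (pool->frozen || pool->nr == MODEL_POOL_MAX)
		return false;
	pool->entries[pool->nr].pfn = pfn;
	pool->entries[pool->nr].preserved = false;
	pool->nr++;
	return true;
}

/* Step 3: mark the PFNs that a user (e.g. a memfd) preserved in step 2. */
static void model_pool_mark(struct model_pool *pool,
			    const unsigned long *user_pfns, size_t nr_user)
{
	for (size_t i = 0; i < pool->nr; i++)
		for (size_t j = 0; j < nr_user; j++)
			if (pool->entries[i].pfn == user_pfns[j])
				pool->entries[i].preserved = true;
}

/* Restore: collect the PFNs that go straight back to the free pool. */
static size_t model_pool_restore_free(const struct model_pool *pool,
				      unsigned long *free_pfns)
{
	size_t n = 0;

	for (size_t i = 0; i < pool->nr; i++)
		if (!pool->entries[i].preserved)
			free_pfns[n++] = pool->entries[i].pfn;
	return n;
}
```

PFNs marked as preserved are left for their user to restore and free;
only the unmarked ones are returned by model_pool_restore_free().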

It is simple and fits nicely into the infrastructure here, where the
first time you trigger a global state it does the pfn list and
freezing, and the lifecycle and locking for this operation is directly
managed by luo.

The memfd, when it knows it has hugetlb folios inside it, would
trigger this.

Jason

^ permalink raw reply

* Re: [PATCH v5 2/3] lsm: introduce security_lsm_config_*_policy hooks
From: Paul Moore @ 2025-10-10 14:59 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Mickaël Salaün, Maxime Bélair,
	linux-security-module, john.johansen, jmorris, serge, kees,
	stephen.smalley.work, takedakn, penguin-kernel, song, rdunlap,
	linux-api, apparmor, linux-kernel
In-Reply-To: <0c7a19cb-d270-403f-9f97-354405aba746@schaufler-ca.com>

On Wed, Aug 20, 2025 at 11:30 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 8/20/2025 7:21 AM, Mickaël Salaün wrote:
> > On Wed, Jul 09, 2025 at 10:00:55AM +0200, Maxime Bélair wrote:
> >> Define two new LSM hooks: security_lsm_config_self_policy and
> >> security_lsm_config_system_policy and wire them into the corresponding
> >> lsm_config_*_policy() syscalls so that LSMs can register a unified
> >> interface for policy management. This initial, minimal implementation
> >> only supports the LSM_POLICY_LOAD operation to limit changes.
> >>
> >> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
> >> ---
> >>  include/linux/lsm_hook_defs.h |  4 +++
> >>  include/linux/security.h      | 20 ++++++++++++
> >>  include/uapi/linux/lsm.h      |  8 +++++
> >>  security/lsm_syscalls.c       | 17 ++++++++--
> >>  security/security.c           | 60 +++++++++++++++++++++++++++++++++++
> >>  5 files changed, 107 insertions(+), 2 deletions(-)

...

> >> diff --git a/include/uapi/linux/lsm.h b/include/uapi/linux/lsm.h
> >> index 938593dfd5da..2b9432a30cdc 100644
> >> --- a/include/uapi/linux/lsm.h
> >> +++ b/include/uapi/linux/lsm.h
> >> @@ -90,4 +90,12 @@ struct lsm_ctx {
> >>   */
> >>  #define LSM_FLAG_SINGLE     0x0001
> >>
> >> +/*
> >> + * LSM_POLICY_XXX definitions identify the different operations
> >> + * to configure LSM policies
> >> + */
> >> +
> >> +#define LSM_POLICY_UNDEF    0
> >> +#define LSM_POLICY_LOAD             100
> > Why the gap between 0 and 100?
>
> It's conventional in LSM syscalls to start identifiers at 100.
> No compelling reason other than to appease the LSM maintainer.

If you guys make me repeat all the reasons why, I'm going to get even
crankier than usual :-P

-- 
paul-moore.com

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-10 14:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Samiullah Khawaja, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	chrisl, steven.sistare
In-Reply-To: <20251010144248.GB3901471@nvidia.com>

On Fri, Oct 10, 2025 at 10:42 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 09, 2025 at 06:42:09PM -0400, Pasha Tatashin wrote:
> >
> > It looks like the combination of an enforced ordering:
> > Preservation: A->B->C->D
> > Un-preservation: D->C->B->A
> > Retrieval: A->B->C->D
> >
> > and the FLB Global State (where data is automatically created and
> > destroyed when a particular file type participates in a live update)
> > solves the need for this query mechanism. For example, the IOMMU
> > driver/core can add its data only when an iommufd is preserved and add
> > more data as more iommufds are added. The preserved data is also
> > automatically removed once the live update is finished or canceled.
>
> IDK I think we should try to be flexible on the restoration order.

It is easier to be inflexible at first and then relax the requirement
than the other way around. I think it is alright to enforce the order
for now, as it is driven only by userspace.
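
For illustration only, the enforced ordering quoted above (preserve
A->B->C->D, un-preserve D->C->B->A, retrieve A->B->C->D) can be modelled
in a few lines of userspace C. This is a sketch under stated assumptions,
not LUO code: struct order_session, the order_*() helpers, MAX_ITEMS and
the integer tokens are all hypothetical names invented for this example.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_ITEMS 16 /* hypothetical limit, chosen for the example */

/* Hypothetical session state; not the LUO data structures. */
struct order_session {
	int tokens[MAX_ITEMS]; /* tokens in preservation order */
	size_t count;          /* number of currently preserved tokens */
	size_t next_retrieve;  /* index of the next token that may be retrieved */
};

/* Preservation appends: A, then B, then C, ... */
static int order_preserve(struct order_session *s, int token)
{
	if (s->count == MAX_ITEMS)
		return -ENOSPC;
	s->tokens[s->count++] = token;
	return 0;
}

/* Un-preservation is strictly reverse: only the newest token may be dropped. */
static int order_unpreserve(struct order_session *s, int token)
{
	if (s->count == 0 || s->tokens[s->count - 1] != token)
		return -EINVAL;
	s->count--;
	return 0;
}

/* Retrieval follows preservation order: only the oldest pending token. */
static int order_retrieve(struct order_session *s, int token)
{
	if (s->next_retrieve >= s->count ||
	    s->tokens[s->next_retrieve] != token)
		return -EINVAL;
	s->next_retrieve++;
	return 0;
}
```

Relaxing the rule later would only mean loosening the checks in
order_unpreserve() and order_retrieve(); callers that already follow the
strict order would keep working.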

> Eg, if we project ahead to when we might need to preserve kvm and
> iommufd FDs as well, the order would likely be:
>
> Preservation: memfd -> kvm -> iommufd -> vfio
> Retrieval: iommud_domain (early boot) kvm -> iommufd -> vfio -> memfd

At some point, we will implement orphaned VMs, where a VM can run
without a VMM during the live-update period. This would allow us to
reduce the blackout time and later enable vCPUs to keep running even
during kexec.

With that, I would assume KVM itself would drive the live update and
would make LUO calls to preserve the resources in an orderly fashion
and then restore them in the same order during boot.

Pasha

^ permalink raw reply

* Re: [PATCH v6 4/5] SELinux: add support for lsm_config_system_policy
From: Paul Moore @ 2025-10-10 14:57 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Maxime Bélair, linux-security-module, john.johansen, jmorris,
	serge, mic, kees, casey, takedakn, penguin-kernel, song, rdunlap,
	linux-api, apparmor, linux-kernel, SElinux list, Ondrej Mosnacek
In-Reply-To: <CAEjxPJ6Xcwsic_zyLTPdHHaY9r7-ZTySzyELQ76aVZCFbh8FMQ@mail.gmail.com>

On Fri, Oct 10, 2025 at 9:59 AM Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> 2. The SELinux namespaces support [1], [2] is based on instantiating a
> separate selinuxfs instance for each namespace; you load a policy for
> a namespace by mounting a new selinuxfs instance after unsharing your
> SELinux namespace and then write to its /sys/fs/selinux/load
> interface, only affecting policy for the new namespace. Your interface
> doesn't appear to support such an approach and IIUC will currently
> always load the init SELinux namespace's policy rather than the
> current process' SELinux namespace.

I'm distracted on other things at the moment, but my current thinking
is that while policy loading and namespace management APIs are largely
separate, there is some minor overlap when it comes to loading policy
as others have mentioned.  For that reason, I think we need to resolve
the namespace API first, keeping in mind the potential for a policy
load API, and then implement the policy loading API, if desired.

-- 
paul-moore.com

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-10-10 14:42 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Samiullah Khawaja, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bAe3yk4NocURmihcuTNPUcb2-K0JCaQQ5GJ4B58YLEwEw@mail.gmail.com>

On Thu, Oct 09, 2025 at 06:42:09PM -0400, Pasha Tatashin wrote:
> 
> It looks like the combination of an enforced ordering:
> Preservation: A->B->C->D
> Un-preservation: D->C->B->A
> Retrieval: A->B->C->D
> 
> and the FLB Global State (where data is automatically created and
> destroyed when a particular file type participates in a live update)
> solves the need for this query mechanism. For example, the IOMMU
> driver/core can add its data only when an iommufd is preserved and add
> more data as more iommufds are added. The preserved data is also
> automatically removed once the live update is finished or canceled.

IDK I think we should try to be flexible on the restoration order.

Eg, if we project ahead to when we might need to preserve kvm and
iommufd FDs as well, the order would likely be:

Preservation: memfd -> kvm -> iommufd -> vfio
Retrieval: iommud_domain (early boot) kvm -> iommufd -> vfio -> memfd

Just because of how the dependencies work, and the desire to push the
memfd as late as possible.

I don't see an issue with this, the kernel enforcing the ordering
should fall out naturally based on the sanity checks each step will
do.

ie I can't get back the KVM fd if luo says it is out of order.

Jason

^ permalink raw reply

* Re: [PATCH v6 4/5] SELinux: add support for lsm_config_system_policy
From: Stephen Smalley @ 2025-10-10 14:42 UTC (permalink / raw)
  To: Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, mic,
	kees, casey, takedakn, penguin-kernel, song, rdunlap, linux-api,
	apparmor, linux-kernel, SElinux list, Ondrej Mosnacek
In-Reply-To: <CAEjxPJ6Xcwsic_zyLTPdHHaY9r7-ZTySzyELQ76aVZCFbh8FMQ@mail.gmail.com>

On Fri, Oct 10, 2025 at 9:58 AM Stephen Smalley
<stephen.smalley.work@gmail.com> wrote:
>
> On Fri, Oct 10, 2025 at 9:27 AM Maxime Bélair
> <maxime.belair@canonical.com> wrote:
> >
> > Enable users to manage SELinux policies through the new hook
> > lsm_config_system_policy. This feature is restricted to CAP_MAC_ADMIN.
>
> (added selinux mailing list and Fedora/Red Hat SELinux kernel maintainer to cc)
>
> A couple of observations:
> 1. We do not currently require CAP_MAC_ADMIN for loading SELinux
> policy, since it was only added later for Smack and SELinux implements
> its own permission checks. When loading policy via selinuxfs, one
> requires uid-0 or CAP_DAC_OVERRIDE to write to /sys/fs/selinux/load
> plus the corresponding SELinux permissions, but this is just an
> artifact of the filesystem-based interface. I'm not opposed to using
> CAP_MAC_ADMIN for loading policy via the new system call but wanted to
> note it as a difference.
>
> 2. The SELinux namespaces support [1], [2] is based on instantiating a
> separate selinuxfs instance for each namespace; you load a policy for
> a namespace by mounting a new selinuxfs instance after unsharing your
> SELinux namespace and then write to its /sys/fs/selinux/load
> interface, only affecting policy for the new namespace. Your interface
> doesn't appear to support such an approach and IIUC will currently
> always load the init SELinux namespace's policy rather than the
> current process' SELinux namespace.

Actually, on second thought, checking CAP_MAC_ADMIN via capable() will
require the process to have that capability in the global/init
namespace, which IIUC would prevent systemd running in a non-init user
namespace from loading the SELinux policy at all. That's problematic
for a different reason: it would prevent us from using this system
call to load the namespace policy.
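
To make the capable() concern concrete, here is a toy userspace model of
the difference between a check against the init user namespace and a
check against a specific namespace. It is a simplified sketch, not the
kernel implementation: the toy_* types and helpers are hypothetical, and
the real kernel rule that also grants capabilities to the owner of a
child namespace is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_MAC_ADMIN 33 /* value from include/uapi/linux/capability.h */

/* Toy user namespace: only the parent link matters here. */
struct toy_user_ns {
	const struct toy_user_ns *parent; /* NULL for the init namespace */
};

/* Toy credentials: the namespace the task lives in and its effective caps. */
struct toy_cred {
	const struct toy_user_ns *user_ns;
	uint64_t cap_effective;
};

/*
 * Namespaced check: walk from the target namespace towards the root.
 * The capability counts only if the walk reaches the task's own
 * namespace and the bit is set there.
 */
static bool toy_ns_capable(const struct toy_cred *cred,
			   const struct toy_user_ns *target, int cap)
{
	const struct toy_user_ns *ns;

	for (ns = target; ns; ns = ns->parent) {
		if (ns == cred->user_ns)
			return (cred->cap_effective & (UINT64_C(1) << cap)) != 0;
	}
	return false;
}

/* capable()-style check: always evaluated against the init namespace. */
static bool toy_capable(const struct toy_cred *cred,
			const struct toy_user_ns *init_ns, int cap)
{
	return toy_ns_capable(cred, init_ns, cap);
}
```

In this model a task living in a child namespace with the CAP_MAC_ADMIN
bit set passes toy_ns_capable() for its own namespace but fails
toy_capable(), which is the situation described above for systemd in a
non-init user namespace.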
>
> [1] https://github.com/stephensmalley/selinuxns
> [2] https://lore.kernel.org/selinux/20250814132637.1659-1-stephen.smalley.work@gmail.com/
>
> >
> > Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
> > ---
> >  security/selinux/hooks.c            | 27 +++++++++++++++++++++++++++
> >  security/selinux/include/security.h |  7 +++++++
> >  security/selinux/selinuxfs.c        | 16 ++++++++++++----
> >  3 files changed, 46 insertions(+), 4 deletions(-)
> >
> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > index e7a7dcab81db..3d14d4e47937 100644
> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -7196,6 +7196,31 @@ static int selinux_uring_allowed(void)
> >  }
> >  #endif /* CONFIG_IO_URING */
> >
> > +/**
> > + * selinux_lsm_config_system_policy - Manage a LSM policy
> > + * @op: operation to perform. Currently, only LSM_POLICY_LOAD is supported
> > + * @buf: User-supplied buffer
> > + * @size: size of @buf
> > + * @flags: reserved for future use; must be zero
> > + *
> > + * Returns: number of written rules on success, negative value on error
> > + */
> > +static int selinux_lsm_config_system_policy(u32 op, void __user *buf,
> > +                                           size_t size, u32 flags)
> > +{
> > +       loff_t pos = 0;
> > +
> > +       if (op != LSM_POLICY_LOAD || flags)
> > +               return -EOPNOTSUPP;
> > +
> > +       if (!selinux_null.dentry || !selinux_null.dentry->d_sb ||
> > +           !selinux_null.dentry->d_sb->s_fs_info)
> > +               return -ENODEV;
> > +
> > +       return __sel_write_load(selinux_null.dentry->d_sb->s_fs_info, buf, size,
> > +                               &pos);
> > +}
> > +
> >  static const struct lsm_id selinux_lsmid = {
> >         .name = "selinux",
> >         .id = LSM_ID_SELINUX,
> > @@ -7499,6 +7524,8 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
> >  #ifdef CONFIG_PERF_EVENTS
> >         LSM_HOOK_INIT(perf_event_alloc, selinux_perf_event_alloc),
> >  #endif
> > +       LSM_HOOK_INIT(lsm_config_system_policy, selinux_lsm_config_system_policy),
> > +
> >  };
> >
> >  static __init int selinux_init(void)
> > diff --git a/security/selinux/include/security.h b/security/selinux/include/security.h
> > index e7827ed7be5f..7b779ea43cc3 100644
> > --- a/security/selinux/include/security.h
> > +++ b/security/selinux/include/security.h
> > @@ -389,7 +389,14 @@ struct selinux_kernel_status {
> >  extern void selinux_status_update_setenforce(bool enforcing);
> >  extern void selinux_status_update_policyload(u32 seqno);
> >  extern void selinux_complete_init(void);
> > +
> > +struct selinux_fs_info;
> > +
> >  extern struct path selinux_null;
> > +extern ssize_t __sel_write_load(struct selinux_fs_info *fsi,
> > +                               const char __user *buf, size_t count,
> > +                               loff_t *ppos);
> > +
> >  extern void selnl_notify_setenforce(int val);
> >  extern void selnl_notify_policyload(u32 seqno);
> >  extern int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm);
> > diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
> > index 47480eb2189b..1f7e611d8300 100644
> > --- a/security/selinux/selinuxfs.c
> > +++ b/security/selinux/selinuxfs.c
> > @@ -567,11 +567,11 @@ static int sel_make_policy_nodes(struct selinux_fs_info *fsi,
> >         return ret;
> >  }
> >
> > -static ssize_t sel_write_load(struct file *file, const char __user *buf,
> > -                             size_t count, loff_t *ppos)
> > +ssize_t __sel_write_load(struct selinux_fs_info *fsi,
> > +                        const char __user *buf, size_t count,
> > +                        loff_t *ppos)
> >
> >  {
> > -       struct selinux_fs_info *fsi;
> >         struct selinux_load_state load_state;
> >         ssize_t length;
> >         void *data = NULL;
> > @@ -605,7 +605,6 @@ static ssize_t sel_write_load(struct file *file, const char __user *buf,
> >                 pr_warn_ratelimited("SELinux: failed to load policy\n");
> >                 goto out;
> >         }
> > -       fsi = file_inode(file)->i_sb->s_fs_info;
> >         length = sel_make_policy_nodes(fsi, load_state.policy);
> >         if (length) {
> >                 pr_warn_ratelimited("SELinux: failed to initialize selinuxfs\n");
> > @@ -626,6 +625,15 @@ static ssize_t sel_write_load(struct file *file, const char __user *buf,
> >         return length;
> >  }
> >
> > +static ssize_t sel_write_load(struct file *file, const char __user *buf,
> > +                             size_t count, loff_t *ppos)
> > +{
> > +       struct selinux_fs_info *fsi = file_inode(file)->i_sb->s_fs_info;
> > +
> > +       return __sel_write_load(fsi, buf, count, ppos);
> > +}
> > +
> > +
> >  static const struct file_operations sel_load_ops = {
> >         .write          = sel_write_load,
> >         .llseek         = generic_file_llseek,
> > --
> > 2.48.1
> >

^ permalink raw reply

* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-10-10 14:35 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Samiullah Khawaja, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bBtrkdos6YmCatggS19rwWYBXXDLwiUWmUrs2+ye23cXA@mail.gmail.com>

On Thu, Oct 09, 2025 at 02:37:44PM -0400, Pasha Tatashin wrote:
> On Thu, Oct 9, 2025 at 1:39 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Oct 09, 2025 at 11:01:25AM -0400, Pasha Tatashin wrote:
> > > In this case we can enforce strict
> > > ordering during retrieval. If "struct file" can be retrieved by
> > > anything within the kernel, then that could be any kernel process
> > > during boot, meaning that charging is not going to be properly applied
> > > when kernel allocations are performed.
> >
> > Ugh, yeah, OK that's irritating and might burn us, but we did decide
> > on that strategy.
> >
> > > > I would argue it should always cause a preservation...
> > > >
> > > > But this is still backwards, what we need is something like
> > > >
> > > > liveupdate_preserve_file(session, file, &token);
> > > > my_preserve_blob.file_token = token
> > >
> > > We cannot do that, the user should have already preserved that file
> > > and provided us with a token to use, if that file was not preserved by
> > > the user it is a bug. With this proposal, we would have to generate a
> > > token, and it was argued that the kernel should not do that.
> >
> > The token is the label used as ABI across the kexec. Each entity doing
> > > a serialization can operate its labels however it needs.
> > >
> > > Here I am suggesting that when a kernel entity goes to record a struct
> > file in a kernel ABI structure it can get a kernel generated token for
> > it.
> 
> Sure, we can consider allowing the kernel to preserve dependent FDs
> automatically in the future, but is there a compelling use case that
> requires it right now?

Right now for the three prototype series.. Hmm, yes, I think we can
avoid implementing this.

In the future I suspect iommufd will need to restore the KVM fd since
stuff in the KVM sometimes becomes entangled with the iommu in some
cases on some arches.

The issue here is not order, it is straight up 'what value does
iommufd write to its kexec ABI struct to refer to the KVM fd'.

Jason

^ permalink raw reply

* Re: [PATCH v6 4/5] SELinux: add support for lsm_config_system_policy
From: Stephen Smalley @ 2025-10-10 13:58 UTC (permalink / raw)
  To: Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, mic,
	kees, casey, takedakn, penguin-kernel, song, rdunlap, linux-api,
	apparmor, linux-kernel, SElinux list, Ondrej Mosnacek
In-Reply-To: <20251010132610.12001-5-maxime.belair@canonical.com>

On Fri, Oct 10, 2025 at 9:27 AM Maxime Bélair
<maxime.belair@canonical.com> wrote:
>
> Enable users to manage SELinux policies through the new hook
> lsm_config_system_policy. This feature is restricted to CAP_MAC_ADMIN.

(added selinux mailing list and Fedora/Red Hat SELinux kernel maintainer to cc)

A couple of observations:
1. We do not currently require CAP_MAC_ADMIN for loading SELinux
policy, since it was only added later for Smack and SELinux implements
its own permission checks. When loading policy via selinuxfs, one
requires uid-0 or CAP_DAC_OVERRIDE to write to /sys/fs/selinux/load
plus the corresponding SELinux permissions, but this is just an
artifact of the filesystem-based interface. I'm not opposed to using
CAP_MAC_ADMIN for loading policy via the new system call but wanted to
note it as a difference.

2. The SELinux namespaces support [1], [2] is based on instantiating a
separate selinuxfs instance for each namespace; you load a policy for
a namespace by mounting a new selinuxfs instance after unsharing your
SELinux namespace and then write to its /sys/fs/selinux/load
interface, only affecting policy for the new namespace. Your interface
doesn't appear to support such an approach and IIUC will currently
always load the init SELinux namespace's policy rather than the
current process' SELinux namespace.
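
For reference, the filesystem-based path described in the two points
above boils down to writing a policy image to the load node of a
selinuxfs instance. A minimal userspace sketch follows, assuming the
whole image is handed over in one write(); load_policy_file() is a
hypothetical helper written for this example and is not taken from
libselinux.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the policy image at policy_path and hand it to load_path
 * (for example /sys/fs/selinux/load of the selinuxfs instance mounted
 * for the current namespace) in a single write().
 * Returns 0 on success, -1 on any failure.
 */
static int load_policy_file(const char *policy_path, const char *load_path)
{
	struct stat st;
	char *buf = NULL;
	int in_fd = -1, out_fd = -1, ret = -1;
	size_t got = 0;
	ssize_t n;

	in_fd = open(policy_path, O_RDONLY);
	if (in_fd < 0 || fstat(in_fd, &st) < 0 || st.st_size <= 0)
		goto done;

	buf = malloc((size_t)st.st_size);
	if (!buf)
		goto done;

	/* Read the complete image first so it can be written in one go. */
	while (got < (size_t)st.st_size) {
		n = read(in_fd, buf + got, (size_t)st.st_size - got);
		if (n <= 0)
			goto done;
		got += (size_t)n;
	}

	out_fd = open(load_path, O_WRONLY);
	if (out_fd < 0)
		goto done;

	/* A short write is treated as a failure rather than retried. */
	if (write(out_fd, buf, got) == (ssize_t)got)
		ret = 0;
done:
	free(buf);
	if (in_fd >= 0)
		close(in_fd);
	if (out_fd >= 0)
		close(out_fd);
	return ret;
}
```

Which policy gets replaced is decided purely by which selinuxfs instance
load_path points into; the proposed syscall has no equivalent of that
path argument, which is the gap noted in point 2.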

[1] https://github.com/stephensmalley/selinuxns
[2] https://lore.kernel.org/selinux/20250814132637.1659-1-stephen.smalley.work@gmail.com/

>
> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
> ---
>  security/selinux/hooks.c            | 27 +++++++++++++++++++++++++++
>  security/selinux/include/security.h |  7 +++++++
>  security/selinux/selinuxfs.c        | 16 ++++++++++++----
>  3 files changed, 46 insertions(+), 4 deletions(-)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index e7a7dcab81db..3d14d4e47937 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -7196,6 +7196,31 @@ static int selinux_uring_allowed(void)
>  }
>  #endif /* CONFIG_IO_URING */
>
> +/**
> + * selinux_lsm_config_system_policy - Manage a LSM policy
> + * @op: operation to perform. Currently, only LSM_POLICY_LOAD is supported
> + * @buf: User-supplied buffer
> + * @size: size of @buf
> + * @flags: reserved for future use; must be zero
> + *
> + * Returns: number of written rules on success, negative value on error
> + */
> +static int selinux_lsm_config_system_policy(u32 op, void __user *buf,
> +                                           size_t size, u32 flags)
> +{
> +       loff_t pos = 0;
> +
> +       if (op != LSM_POLICY_LOAD || flags)
> +               return -EOPNOTSUPP;
> +
> +       if (!selinux_null.dentry || !selinux_null.dentry->d_sb ||
> +           !selinux_null.dentry->d_sb->s_fs_info)
> +               return -ENODEV;
> +
> +       return __sel_write_load(selinux_null.dentry->d_sb->s_fs_info, buf, size,
> +                               &pos);
> +}
> +
>  static const struct lsm_id selinux_lsmid = {
>         .name = "selinux",
>         .id = LSM_ID_SELINUX,
> @@ -7499,6 +7524,8 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
>  #ifdef CONFIG_PERF_EVENTS
>         LSM_HOOK_INIT(perf_event_alloc, selinux_perf_event_alloc),
>  #endif
> +       LSM_HOOK_INIT(lsm_config_system_policy, selinux_lsm_config_system_policy),
> +
>  };
>
>  static __init int selinux_init(void)
> diff --git a/security/selinux/include/security.h b/security/selinux/include/security.h
> index e7827ed7be5f..7b779ea43cc3 100644
> --- a/security/selinux/include/security.h
> +++ b/security/selinux/include/security.h
> @@ -389,7 +389,14 @@ struct selinux_kernel_status {
>  extern void selinux_status_update_setenforce(bool enforcing);
>  extern void selinux_status_update_policyload(u32 seqno);
>  extern void selinux_complete_init(void);
> +
> +struct selinux_fs_info;
> +
>  extern struct path selinux_null;
> +extern ssize_t __sel_write_load(struct selinux_fs_info *fsi,
> +                               const char __user *buf, size_t count,
> +                               loff_t *ppos);
> +
>  extern void selnl_notify_setenforce(int val);
>  extern void selnl_notify_policyload(u32 seqno);
>  extern int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm);
> diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
> index 47480eb2189b..1f7e611d8300 100644
> --- a/security/selinux/selinuxfs.c
> +++ b/security/selinux/selinuxfs.c
> @@ -567,11 +567,11 @@ static int sel_make_policy_nodes(struct selinux_fs_info *fsi,
>         return ret;
>  }
>
> -static ssize_t sel_write_load(struct file *file, const char __user *buf,
> -                             size_t count, loff_t *ppos)
> +ssize_t __sel_write_load(struct selinux_fs_info *fsi,
> +                        const char __user *buf, size_t count,
> +                        loff_t *ppos)
>
>  {
> -       struct selinux_fs_info *fsi;
>         struct selinux_load_state load_state;
>         ssize_t length;
>         void *data = NULL;
> @@ -605,7 +605,6 @@ static ssize_t sel_write_load(struct file *file, const char __user *buf,
>                 pr_warn_ratelimited("SELinux: failed to load policy\n");
>                 goto out;
>         }
> -       fsi = file_inode(file)->i_sb->s_fs_info;
>         length = sel_make_policy_nodes(fsi, load_state.policy);
>         if (length) {
>                 pr_warn_ratelimited("SELinux: failed to initialize selinuxfs\n");
> @@ -626,6 +625,15 @@ static ssize_t sel_write_load(struct file *file, const char __user *buf,
>         return length;
>  }
>
> +static ssize_t sel_write_load(struct file *file, const char __user *buf,
> +                             size_t count, loff_t *ppos)
> +{
> +       struct selinux_fs_info *fsi = file_inode(file)->i_sb->s_fs_info;
> +
> +       return __sel_write_load(fsi, buf, count, ppos);
> +}
> +
> +
>  static const struct file_operations sel_load_ops = {
>         .write          = sel_write_load,
>         .llseek         = generic_file_llseek,
> --
> 2.48.1
>

^ permalink raw reply

* [PATCH v6 0/5] lsm: introduce lsm_config_self_policy() and lsm_config_system_policy() syscalls
From: Maxime Bélair @ 2025-10-10 13:25 UTC (permalink / raw)
  To: linux-security-module
  Cc: john.johansen, paul, jmorris, serge, mic, kees,
	stephen.smalley.work, casey, takedakn, penguin-kernel, song,
	rdunlap, linux-api, apparmor, linux-kernel, Maxime Bélair

This patchset introduces two new syscalls, lsm_config_self_policy() and
lsm_config_system_policy(), along with the associated Linux Security
Module hooks security_lsm_config_*_policy(). Together they provide a
unified interface for loading and managing LSM policies. These syscalls
complement the existing per‑LSM pseudo‑filesystem mechanism and work even
when those filesystems are not mounted or available.

With these new syscalls, users and administrators may lock down access to
the pseudo‑filesystem yet still manage LSM policies. Two tightly-scoped
entry points then replace the many file operations exposed by those
filesystems, significantly reducing the attack surface. This is
particularly useful in containers or processes already confined by
Landlock, where these pseudo‑filesystems are typically unavailable.

Because they provide a logical and unified interface, these syscalls are
simpler to use than several heterogeneous pseudo‑filesystems and avoid
edge cases such as partially loaded policies. They also eliminate VFS
overhead, yielding performance gains, notably when many policies are
loaded, for instance at boot time.

This initial implementation is intentionally minimal to limit the scope
of changes. Currently, only policy loading is supported. This new LSM
hook is currently registered by AppArmor, SELinux and Smack. However, any
LSM can adopt this interface, and future patches could extend this
syscall to support more operations, such as replacing, removing, or
querying loaded policies.
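
As a rough illustration of the "load only" restriction, the argument
check used by the SELinux hook in this series can be mirrored in
userspace as below. The two LSM_POLICY_* values are the ones quoted from
include/uapi/linux/lsm.h in this thread; check_policy_args() is a
hypothetical name for this sketch and is not part of the patchset.

```c
#include <errno.h>
#include <stdint.h>

/* Operation identifiers as defined by the series in include/uapi/linux/lsm.h */
#define LSM_POLICY_UNDEF 0
#define LSM_POLICY_LOAD  100

/*
 * Accept only LSM_POLICY_LOAD with no flags set; reject everything
 * else with -EOPNOTSUPP, as the SELinux hook in this series does.
 */
static int check_policy_args(uint32_t op, uint32_t flags)
{
	if (op != LSM_POLICY_LOAD || flags)
		return -EOPNOTSUPP;
	return 0;
}
```

Rejecting any non-zero flags value now keeps that field free to be given
a meaning later, once further operations are added.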

Landlock already provides three Landlock‑specific syscalls (e.g.
landlock_add_rule()) to restrict ambient rights for sets of processes
without touching any pseudo-filesystem. lsm_config_*_policy() generalizes
that approach to the entire LSM layer, so any module can choose to
support either or both of these syscalls, and expose its policy
operations through a uniform interface and reap the advantages outlined
above.

This patchset is available at [1], a minimal user space example
showing how to use lsm_config_system_policy with AppArmor is at [2] and a
performance benchmark of both syscalls is available at [3].

[1] https://github.com/emixam16/linux/tree/lsm_syscall_v6
[2] https://gitlab.com/emixam16/apparmor/tree/lsm_syscall_v6
[3] https://gitlab.com/-/snippets/4864908

---
Changes in v6
 - Add support for SELinux and Smack

Changes in v5
 - Improve syscall input verification
 - Do not export security_lsm_config_*_policy symbols

Changes in v4
 - Make the syscall's maximum buffer size defined per module
 - Fix a memory leak

Changes in v3
 - Fix typos

Changes in v2
 - Split lsm_manage_policy() into two distinct syscalls:
   lsm_config_self_policy() and lsm_config_system_policy()
 - The LSM hook now calls only the appropriate LSM (and not all LSMs)
 - Add a configuration variable to limit the buffer size of these
   syscalls
 - AppArmor now allows stacking policies through lsm_config_self_policy()
   and loading policies in any namespace through
   lsm_config_system_policy()
---

Maxime Bélair (5):
  Wire up lsm_config_self_policy and lsm_config_system_policy syscalls
  lsm: introduce security_lsm_config_*_policy hooks
  AppArmor: add support for lsm_config_self_policy and
    lsm_config_system_policy
  SELinux: add support for lsm_config_system_policy
  Smack: add support for lsm_config_self_policy and
    lsm_config_system_policy

 arch/alpha/kernel/syscalls/syscall.tbl        |  2 +
 arch/arm/tools/syscall.tbl                    |  2 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  2 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  2 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  2 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |  2 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  2 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  2 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  2 +
 arch/s390/kernel/syscalls/syscall.tbl         |  2 +
 arch/sh/kernel/syscalls/syscall.tbl           |  2 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  2 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  2 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  2 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  2 +
 include/linux/lsm_hook_defs.h                 |  4 +
 include/linux/security.h                      | 20 +++++
 include/linux/syscalls.h                      |  5 ++
 include/uapi/asm-generic/unistd.h             |  6 +-
 include/uapi/linux/lsm.h                      |  8 ++
 kernel/sys_ni.c                               |  2 +
 security/apparmor/apparmorfs.c                | 31 +++++++
 security/apparmor/include/apparmor.h          |  4 +
 security/apparmor/include/apparmorfs.h        |  3 +
 security/apparmor/lsm.c                       | 84 +++++++++++++++++++
 security/lsm_syscalls.c                       | 21 +++++
 security/security.c                           | 60 +++++++++++++
 security/selinux/hooks.c                      | 27 ++++++
 security/selinux/include/security.h           |  7 ++
 security/selinux/selinuxfs.c                  | 16 +++-
 security/smack/smack.h                        |  8 ++
 security/smack/smack_lsm.c                    | 73 ++++++++++++++++
 security/smack/smackfs.c                      |  2 +-
 tools/include/uapi/asm-generic/unistd.h       |  6 +-
 .../arch/x86/entry/syscalls/syscall_64.tbl    |  2 +
 35 files changed, 412 insertions(+), 7 deletions(-)

base-commit: 9c32cda43eb78f78c73aee4aa344b777714e259b
-- 
2.48.1

^ permalink raw reply
