Linux userland API discussions


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Mike Rapoport @ 2025-09-02 11:44 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Jason Gunthorpe, Pasha Tatashin, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0ldmyw1hp.fsf@kernel.org>

Hi Pratyush,

On Mon, Sep 01, 2025 at 07:01:38PM +0200, Pratyush Yadav wrote:
> Hi Mike,
> 
> On Mon, Sep 01 2025, Mike Rapoport wrote:
> 
> > On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> >> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> >> 
> >> > +	/*
> >> > +	 * Most of the space should be taken by preserved folios. So take its
> >> > +	 * size, plus a page for other properties.
> >> > +	 */
> >> > +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> >> > +	if (!fdt) {
> >> > +		err = -ENOMEM;
> >> > +		goto err_unpin;
> >> > +	}
> >> 
> >> This doesn't seem to have any versioning scheme, it really should..
> >> 
> >> > +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
> >> > +				       (void **)&preserved_folios);
> >> > +	if (err) {
> >> > +		pr_err("Failed to reserve folios property in FDT: %s\n",
> >> > +		       fdt_strerror(err));
> >> > +		err = -ENOMEM;
> >> > +		goto err_free_fdt;
> >> > +	}
> >> 
> >> Yuk.
> >> 
> >> This really wants some luo helper
> >> 
> >> 'luo alloc array'
> >> 'luo restore array'
> >> 'luo free array'
> >
> > We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
> >
> > Will wait for kbuild and then send proper patches.
> 
> I have been working on something similar, but in a more generic way.
> 
> I have implemented a sparse KHO-preservable array (called kho_array)
> with xarray like properties. It can take in 4-byte aligned pointers and
> supports saving non-pointer values similar to xa_mk_value(). For now it
> doesn't support multi-index entries, but if needed the data format can
> be extended to support it as well.
> 
> The structure is very similar to what you have implemented. It uses a
> linked list of pages with some metadata at the head of each page.
> 
> I have used it for memfd preservation, and I think it is quite
> versatile. For example, your kho_preserve_vmalloc() can be very easily
> built on top of this kho_array by simply saving each physical page
> address at consecutive indices in the array.

I've started to work on something similar to your kho_array for memfd case
and then I thought that since we know the size of the array we can simply
vmalloc it and preserve vmalloc, and that led me to implementing
preservation of vmalloc :)

I like the idea to have kho_array for cases when we don't know the amount
of data to preserve in advance, but for memfd as it's currently
implemented I think that allocating and preserving vmalloc is simpler.

As for porting kho_preserve_vmalloc() to kho_array, I also feel that it
would just make kho_preserve_vmalloc() more complex and I'd rather simplify
it even more, e.g. with preallocating all the pages that preserve indices
in advance.
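
To make the comparison concrete, here is a minimal userspace sketch of the
layout both approaches share: a linked list of page-sized chunks with some
metadata at the head of each chunk, holding one value (for example a physical
page address) at each consecutive index. Every name here (`chunk`,
`chunk_array`, `chunk_array_push()` and so on) is hypothetical, and this is
neither the kho_array code nor kho_preserve_vmalloc(); it only models the
dense case where the entries are appended in order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_BYTES 4096	/* stand-in for PAGE_SIZE (assumption) */

/* Metadata sits at the head of each chunk, entries fill the rest. */
struct chunk {
	struct chunk *next;	/* a preserved format would store a physical address */
	size_t count;		/* entries used in this chunk */
	uint64_t entries[];
};

#define ENTRIES_PER_CHUNK \
	((CHUNK_BYTES - sizeof(struct chunk)) / sizeof(uint64_t))

struct chunk_array {
	struct chunk *head;
	struct chunk *tail;
	size_t total;
};

/* Append one value, allocating a new chunk when the tail is full. */
static int chunk_array_push(struct chunk_array *arr, uint64_t value)
{
	struct chunk *c = arr->tail;

	if (!c || c->count == ENTRIES_PER_CHUNK) {
		c = calloc(1, CHUNK_BYTES);
		if (!c)
			return -1;
		if (arr->tail)
			arr->tail->next = c;
		else
			arr->head = c;
		arr->tail = c;
	}

	c->entries[c->count++] = value;
	arr->total++;
	return 0;
}

/* Look up the value at an index by walking the chain of chunks. */
static int chunk_array_get(const struct chunk_array *arr, size_t index,
			   uint64_t *value)
{
	const struct chunk *c;

	for (c = arr->head; c; c = c->next) {
		if (index < c->count) {
			*value = c->entries[index];
			return 0;
		}
		index -= c->count;
	}
	return -1;
}

/* Release every chunk and reset the array to empty. */
static void chunk_array_free(struct chunk_array *arr)
{
	struct chunk *c = arr->head;

	while (c) {
		struct chunk *next = c->next;

		free(c);
		c = next;
	}
	arr->head = arr->tail = NULL;
	arr->total = 0;
}
```

When the number of entries is known up front, as it is for memfd today, the
chain could be replaced by a single allocation of the right size, which is
the simplification argued for above.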
 
> The code is still WIP and currently a bit hacky, but I will clean it up
> in a couple days and I think it should be ready for posting. You can
> find the current version at [0][1]. Would be good to hear your thoughts,
> and if you agree with the approach, I can also port
> kho_preserve_vmalloc() to work on top of kho_array as well.
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=cf4c04c1e9ac854e3297018ad6dada17c54a59af
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=5eb0d7316274a9c87acaeedd86941979fc4baf96
> 
> -- 
> Regards,
> Pratyush Yadav

-- 
Sincerely yours,
Mike.


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-09-02 11:38 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Mike Rapoport, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bAb6s=gUTCNjMrOqptZ3a_nj3teuVSZs86AvVymvaURQA@mail.gmail.com>

On Mon, Sep 01, 2025 at 07:02:46PM +0000, Pasha Tatashin wrote:
> > >> > This really wants some luo helper
> > >> >
> > >> > 'luo alloc array'
> > >> > 'luo restore array'
> > >> > 'luo free array'
> > >>
> > >> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
> > >
> > > The patch looks okay to me, but it doesn't support holes in vmap
> > > areas. While that is likely acceptable for vmalloc, it could be a
> > > problem if we want to preserve memfd with holes and using vmap
> > > preservation as a method, which would require a different approach.
> > > Still, this would help with preserving memfd.
> >
> > I agree. I think we should do it the other way round. Build a sparse
> > array first, and then use that to build vmap preservation. Our emails
> 
> Yes, sparse array support would help both: vmalloc and memfd preservation.

Why? vmalloc is always fully populated, no sparseness..

And again in real systems we expect memfd to be fully populated too.

I wouldn't invest any time in something like this right now. Just be
inefficient if there is sparseness for some reason.

Jason


* [PATCH v20 8/8] selftests/clone3: Test shadow stack support
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

Add basic test coverage for specifying the shadow stack for a newly
created thread via clone3(), including coverage of the newly extended
argument structure.  We check that a user specified shadow stack can be
provided, and that invalid combinations of parameters are rejected.

In order to facilitate testing on systems without userspace shadow stack
support we manually enable shadow stacks on startup; this is architecture
specific due to the use of an arch_prctl() on x86. Due to interactions with
potential userspace locking of features we actually detect support for
shadow stacks on the running system by attempting to allocate a shadow
stack page during initialisation using map_shadow_stack(), warning if this
succeeds when the enable failed.
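
As a rough illustration of the detection step described above, the following
standalone sketch probes for support by trying to map one page of shadow
stack. The helper name `probe_shadow_stack()` is hypothetical, and the
fallback syscall number is an assumption for builds against older headers
that lack `__NR_map_shadow_stack`; check it against your kernel headers. It
is not the selftest code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 453	/* assumed fallback for older headers */
#endif

/*
 * Return 1 if the kernel let us map a page of shadow stack, 0 otherwise.
 * A failure can mean the syscall is missing, the hardware lacks the feature,
 * or a sandbox rejected the call; this sketch does not distinguish them.
 */
static int probe_shadow_stack(void)
{
	unsigned long size = (unsigned long)getpagesize();
	long ret;

	ret = syscall(__NR_map_shadow_stack, 0UL, size, 0U);
	if (ret == -1)
		return 0;

	/* The mapping was only a probe, so release it again. */
	munmap((void *)ret, size);
	return 1;
}
```

A caller would compare this result with whether the earlier enable succeeded
and print a warning when the mapping works but the enable did not.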

In order to allow testing of user configured shadow stacks on
architectures with that feature we need to ensure that we do not return
from the function where the clone3() syscall is called in the child
process; doing so would trigger a shadow stack underflow.  To do this we
use inline assembly rather than the standard syscall wrapper to call
clone3().  In order to avoid surprises we also use a syscall rather than
the libc exit() function; this is probably overly cautious.

Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c           | 143 +++++++++++++++++++++-
 tools/testing/selftests/clone3/clone3_selftests.h |  63 ++++++++++
 2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 5b8b7d640e70..6fd2b3238e2c 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -3,6 +3,7 @@
 /* Based on Christian Brauner's clone3() example */
 
 #define _GNU_SOURCE
+#include <asm/mman.h>
 #include <errno.h>
 #include <inttypes.h>
 #include <linux/types.h>
@@ -11,6 +12,7 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <sys/mman.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
 #include <sys/un.h>
@@ -19,8 +21,12 @@
 #include <sched.h>
 
 #include "../kselftest.h"
+#include "../ksft_shstk.h"
 #include "clone3_selftests.h"
 
+static bool shadow_stack_supported;
+static size_t max_supported_args_size;
+
 enum test_mode {
 	CLONE3_ARGS_NO_TEST,
 	CLONE3_ARGS_ALL_0,
@@ -28,6 +34,10 @@ enum test_mode {
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG,
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG,
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
+	CLONE3_ARGS_SHADOW_STACK,
+	CLONE3_ARGS_SHADOW_STACK_MISALIGNED,
+	CLONE3_ARGS_SHADOW_STACK_NO_TOKEN,
+	CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
 };
 
 typedef bool (*filter_function)(void);
@@ -44,6 +54,44 @@ struct test {
 	filter_function filter;
 };
 
+
+/*
+ * We check for shadow stack support by attempting to use
+ * map_shadow_stack() since features may have been locked by the
+ * dynamic linker resulting in spurious errors when we attempt to
+ * enable on startup.  We warn if the enable failed.
+ */
+static void test_shadow_stack_supported(void)
+{
+	long ret;
+
+	ret = syscall(__NR_map_shadow_stack, 0, getpagesize(), 0);
+	if (ret == -1) {
+		ksft_print_msg("map_shadow_stack() not supported\n");
+	} else if ((void *)ret == MAP_FAILED) {
+		ksft_print_msg("Failed to map shadow stack\n");
+	} else {
+		ksft_print_msg("Shadow stack supported\n");
+		shadow_stack_supported = true;
+
+		if (!shadow_stack_enabled)
+			ksft_print_msg("Mapped but did not enable shadow stack\n");
+	}
+}
+
+static void *get_shadow_stack_page(unsigned long flags)
+{
+	unsigned long long page;
+
+	page = syscall(__NR_map_shadow_stack, 0, getpagesize(), flags);
+	if ((void *)page == MAP_FAILED) {
+		ksft_print_msg("map_shadow_stack() failed: %d\n", errno);
+		return 0;
+	}
+
+	return (void *)page;
+}
+
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
 	struct __clone_args args = {
@@ -57,6 +105,7 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	} args_ext;
 
 	pid_t pid = -1;
+	void *p;
 	int status;
 
 	memset(&args_ext, 0, sizeof(args_ext));
@@ -89,6 +138,26 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	case CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG:
 		args.exit_signal = 0x00000000000000f0ULL;
 		break;
+	case CLONE3_ARGS_SHADOW_STACK:
+		p = get_shadow_stack_page(SHADOW_STACK_SET_TOKEN);
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_MISALIGNED:
+		p = get_shadow_stack_page(SHADOW_STACK_SET_TOKEN);
+		p += getpagesize() - sizeof(void *) - 1;
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY:
+		p = malloc(getpagesize());
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_NO_TOKEN:
+		p = get_shadow_stack_page(0);
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
 	}
 
 	memcpy(&args_ext.args, &args, sizeof(struct __clone_args));
@@ -102,7 +171,12 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 
 	if (pid == 0) {
 		ksft_print_msg("I am the child, my PID is %d\n", getpid());
-		_exit(EXIT_SUCCESS);
+		/*
+		 * Use a raw syscall to ensure we don't get issues
+		 * with manually specified shadow stack and exit handlers.
+		 */
+		syscall(__NR_exit, EXIT_SUCCESS);
+		ksft_print_msg("CHILD FAILED TO EXIT PID is %d\n", getpid());
 	}
 
 	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
@@ -184,6 +258,26 @@ static bool no_timenamespace(void)
 	return true;
 }
 
+static bool have_shadow_stack(void)
+{
+	if (shadow_stack_supported) {
+		ksft_print_msg("Shadow stack supported\n");
+		return true;
+	}
+
+	return false;
+}
+
+static bool no_shadow_stack(void)
+{
+	if (!shadow_stack_supported) {
+		ksft_print_msg("Shadow stack not supported\n");
+		return true;
+	}
+
+	return false;
+}
+
 static size_t page_size_plus_8(void)
 {
 	return getpagesize() + 8;
@@ -327,6 +421,50 @@ static const struct test tests[] = {
 		.expected = -EINVAL,
 		.test_mode = CLONE3_ARGS_NO_TEST,
 	},
+	{
+		.name = "Shadow stack on system with shadow stack",
+		.size = 0,
+		.expected = 0,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with misaligned address",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EINVAL,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_MISALIGNED,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with normal memory",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EFAULT,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with no token",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EINVAL,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NO_TOKEN,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack on system without shadow stack",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EFAULT,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
+		.filter = have_shadow_stack,
+	},
 };
 
 int main(int argc, char *argv[])
@@ -334,9 +472,12 @@ int main(int argc, char *argv[])
 	size_t size;
 	int i;
 
+	enable_shadow_stack();
+
 	ksft_print_header();
 	ksft_set_plan(ARRAY_SIZE(tests));
 	test_clone3_supported();
+	test_shadow_stack_supported();
 
 	for (i = 0; i < ARRAY_SIZE(tests); i++)
 		test_clone3(&tests[i]);
diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index 939b26c86d42..8151c4fc971a 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -31,12 +31,75 @@ struct __clone_args {
 	__aligned_u64 set_tid;
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
+#ifndef CLONE_ARGS_SIZE_VER2
+#define CLONE_ARGS_SIZE_VER2 88	/* sizeof third published struct */
+#endif
+	__aligned_u64 shadow_stack_token;
+#ifndef CLONE_ARGS_SIZE_VER3
+#define CLONE_ARGS_SIZE_VER3 96 /* sizeof fourth published struct */
+#endif
 };
 
+/*
+ * For architectures with shadow stack support we need to be
+ * absolutely sure that the clone3() syscall will be inline and not a
+ * function call so we open code.
+ */
+#ifdef __x86_64__
+static __always_inline pid_t sys_clone3(struct __clone_args *args, size_t size)
+{
+	register long _num  __asm__ ("rax") = __NR_clone3;
+	register long _args __asm__ ("rdi") = (long)(args);
+	register long _size __asm__ ("rsi") = (long)(size);
+	long ret;
+
+	__asm__ volatile (
+		"syscall\n"
+		: "=a"(ret)
+		: "r"(_args), "r"(_size),
+		  "0"(_num)
+		: "rcx", "r11", "memory", "cc"
+	);
+
+	if (ret < 0) {
+		errno = -ret;
+		return -1;
+	}
+
+	return ret;
+}
+#elif defined(__aarch64__)
+static __always_inline pid_t sys_clone3(struct __clone_args *args, size_t size)
+{
+	register long _num  __asm__ ("x8") = __NR_clone3;
+	register long _args __asm__ ("x0") = (long)(args);
+	register long _size __asm__ ("x1") = (long)(size);
+	register long arg2 __asm__ ("x2") = 0;
+	register long arg3 __asm__ ("x3") = 0;
+	register long arg4 __asm__ ("x4") = 0;
+
+	__asm__ volatile (
+		"svc #0\n"
+		: "=r"(_args)
+		: "r"(_args), "r"(_size),
+		  "r"(_num), "r"(arg2),
+		  "r"(arg3), "r"(arg4)
+		: "memory", "cc"
+	);
+
+	if ((int)_args < 0) {
+		errno = -((int)_args);
+		return -1;
+	}
+
+	return _args;
+}
+#else
 static pid_t sys_clone3(struct __clone_args *args, size_t size)
 {
 	return syscall(__NR_clone3, args, size);
 }
+#endif
 
 static inline void test_clone3_supported(void)
 {

-- 
2.39.5



* [PATCH v20 7/8] selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

The clone_args structure is extensible, with the syscall passing in the
length of the structure. Inside the kernel we use copy_struct_from_user()
to read the struct but this has the unfortunate side effect of silently
accepting some overrun in the structure size providing the extra data is
all zeros. This means that we can't discover the clone3() features that
the running kernel supports by simply probing with various struct sizes.
We need to check this for the benefit of test systems which run newer
kselftests on old kernels.
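
The size handling described above can be modelled in userspace roughly as
follows. This is a sketch only: `model_copy_struct()` is a hypothetical name,
it works on ordinary memory rather than user pointers, and it is not the
kernel's copy_struct_from_user() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-supplied struct of usize bytes into a destination of ksize
 * bytes, the size this "kernel" knows about.
 */
static int model_copy_struct(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *bytes = src;
	size_t i;

	/* Caller's struct is larger: the bytes we do not know must be zero. */
	for (i = size; i < usize; i++) {
		if (bytes[i])
			return -E2BIG;
	}

	/* Caller's struct is smaller: fields it did not supply read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this model an old kernel accepts a newer, larger struct as long as the
new field is zero and only returns -E2BIG once the field is actually used,
which is why a test that sets the new field may see -E2BIG on older kernels.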

Add a flag which can be set on a test to indicate that clone3() may return
-E2BIG due to the use of newer struct versions. Currently no tests need
this but it will become an issue for testing clone3() support for shadow
stacks; support for shadow stacks is already present on x86.

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e066b201fa64..5b8b7d640e70 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -39,6 +39,7 @@ struct test {
 	size_t size;
 	size_function size_function;
 	int expected;
+	bool e2big_valid;
 	enum test_mode test_mode;
 	filter_function filter;
 };
@@ -146,6 +147,11 @@ static void test_clone3(const struct test *test)
 	ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
 			getpid(), ret, test->expected);
 	if (ret != test->expected) {
+		if (test->e2big_valid && ret == -E2BIG) {
+			ksft_print_msg("Test reported -E2BIG\n");
+			ksft_test_result_skip("%s\n", test->name);
+			return;
+		}
 		ksft_print_msg(
 			"[%d] Result (%d) is different than expected (%d)\n",
 			getpid(), ret, test->expected);

-- 
2.39.5



* [PATCH v20 6/8] selftests/clone3: Factor more of main loop into test_clone3()
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

In order to make it easier to add more configuration for the tests and
more support for runtime detection of when tests can be run, pass the
structure describing the tests into test_clone3() rather than picking
the arguments out of it, and have that function do all the per-test work.

No functional change.

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c | 77 ++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e61f07973ce5..e066b201fa64 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -30,6 +30,19 @@ enum test_mode {
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
 };
 
+typedef bool (*filter_function)(void);
+typedef size_t (*size_function)(void);
+
+struct test {
+	const char *name;
+	uint64_t flags;
+	size_t size;
+	size_function size_function;
+	int expected;
+	enum test_mode test_mode;
+	filter_function filter;
+};
+
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
 	struct __clone_args args = {
@@ -109,30 +122,40 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	return 0;
 }
 
-static bool test_clone3(uint64_t flags, size_t size, int expected,
-			enum test_mode test_mode)
+static void test_clone3(const struct test *test)
 {
+	size_t size;
 	int ret;
 
+	if (test->filter && test->filter()) {
+		ksft_test_result_skip("%s\n", test->name);
+		return;
+	}
+
+	if (test->size_function)
+		size = test->size_function();
+	else
+		size = test->size;
+
+	ksft_print_msg("Running test '%s'\n", test->name);
+
 	ksft_print_msg(
 		"[%d] Trying clone3() with flags %#" PRIx64 " (size %zu)\n",
-		getpid(), flags, size);
-	ret = call_clone3(flags, size, test_mode);
+		getpid(), test->flags, size);
+	ret = call_clone3(test->flags, size, test->test_mode);
 	ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
-			getpid(), ret, expected);
-	if (ret != expected) {
+			getpid(), ret, test->expected);
+	if (ret != test->expected) {
 		ksft_print_msg(
 			"[%d] Result (%d) is different than expected (%d)\n",
-			getpid(), ret, expected);
-		return false;
+			getpid(), ret, test->expected);
+		ksft_test_result_fail("%s\n", test->name);
+		return;
 	}
 
-	return true;
+	ksft_test_result_pass("%s\n", test->name);
 }
 
-typedef bool (*filter_function)(void);
-typedef size_t (*size_function)(void);
-
 static bool not_root(void)
 {
 	if (getuid() != 0) {
@@ -160,16 +183,6 @@ static size_t page_size_plus_8(void)
 	return getpagesize() + 8;
 }
 
-struct test {
-	const char *name;
-	uint64_t flags;
-	size_t size;
-	size_function size_function;
-	int expected;
-	enum test_mode test_mode;
-	filter_function filter;
-};
-
 static const struct test tests[] = {
 	{
 		.name = "simple clone3()",
@@ -319,24 +332,8 @@ int main(int argc, char *argv[])
 	ksft_set_plan(ARRAY_SIZE(tests));
 	test_clone3_supported();
 
-	for (i = 0; i < ARRAY_SIZE(tests); i++) {
-		if (tests[i].filter && tests[i].filter()) {
-			ksft_test_result_skip("%s\n", tests[i].name);
-			continue;
-		}
-
-		if (tests[i].size_function)
-			size = tests[i].size_function();
-		else
-			size = tests[i].size;
-
-		ksft_print_msg("Running test '%s'\n", tests[i].name);
-
-		ksft_test_result(test_clone3(tests[i].flags, size,
-					     tests[i].expected,
-					     tests[i].test_mode),
-				 "%s\n", tests[i].name);
-	}
+	for (i = 0; i < ARRAY_SIZE(tests); i++)
+		test_clone3(&tests[i]);
 
 	ksft_finished();
 }

-- 
2.39.5



* [PATCH v20 5/8] selftests/clone3: Remove redundant flushes of output streams
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

Since there were widespread issues with output not being flushed the
kselftest framework was modified to explicitly set the output streams
unbuffered in commit 58e2847ad2e6 ("selftests: line buffer test
program's stdout") so there is no need to explicitly flush in the clone3
tests.

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3_selftests.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index eeca8005723f..939b26c86d42 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -35,8 +35,6 @@ struct __clone_args {
 
 static pid_t sys_clone3(struct __clone_args *args, size_t size)
 {
-	fflush(stdout);
-	fflush(stderr);
 	return syscall(__NR_clone3, args, size);
 }
 

-- 
2.39.5



* [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

Unlike with the normal stack there is no API for configuring the shadow
stack for a new thread; instead the kernel will dynamically allocate a
new shadow stack with the same size as the normal stack. This appears to
be due to the shadow stack series having been in development since
before the more extensible clone3() was added rather than anything more
deliberate.

Add a parameter to clone3() specifying a shadow stack pointer to use
for the new thread. This is inconsistent with the way we specify the
normal stack but during review concerns were expressed about having to
identify where the shadow stack pointer should be placed especially in
cases where the shadow stack has been previously active.  If no shadow
stack is specified then the existing implicit allocation behaviour is
maintained.

If a shadow stack pointer is specified then it is required to have an
architecture defined token placed on the stack, which will be consumed by
the new task; the shadow stack is specified by pointing to this token.  If
no valid token is present then this will be reported with -EINVAL.  This
token prevents new threads being created pointing at the shadow stack of
an existing running thread.  On architectures with support for userspace
pivoting of shadow stacks it is expected that the same format and placement
of tokens will be used; this is the case for arm64 and x86.
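
The one-shot nature of the token can be illustrated with a small userspace
model. Everything here is hypothetical: the token encoding (the slot's own
address with bit 0 set), the `model_*` names and the use of ordinary memory
in place of a real shadow stack. The actual token formats are architecture
defined and are not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_STACK_SLOTS 512	/* one 4 KiB page of 8-byte slots */

/* Hypothetical encoding: a token is only valid at its own address. */
static uint64_t model_token_for(const _Atomic uint64_t *slot)
{
	return (uint64_t)(uintptr_t)slot | 1;
}

/* Place a token in the top slot of the stack and return a pointer to it. */
static _Atomic uint64_t *model_place_token(_Atomic uint64_t *stack)
{
	_Atomic uint64_t *slot = &stack[MODEL_STACK_SLOTS - 1];

	atomic_store(slot, model_token_for(slot));
	return slot;
}

/*
 * Atomically swap the expected token for zero. Returns false when the slot
 * does not hold a valid token, the case reported as -EINVAL above.
 */
static bool model_consume_token(_Atomic uint64_t *slot)
{
	uint64_t expected = model_token_for(slot);

	return atomic_compare_exchange_strong(slot, &expected, 0);
}
```

Because the exchange clears the slot, a second attempt to start a thread on
the same token fails, which is the property that stops a new thread being
pointed at a shadow stack that is already in use.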

If the architecture does not support shadow stacks the shadow stack
pointer must not be specified; architectures that do support the
feature are expected to enforce the same requirement on individual
systems that lack shadow stack support.

Update the existing arm64 and x86 implementations to pay attention to
the newly added arguments; in order to maintain compatibility we use the
existing behaviour if no shadow stack is specified. Since we are now
using more fields from the kernel_clone_args we pass that into the
shadow stack code rather than individual fields.

Portions of the x86 architecture code were written by Rick Edgecombe.

Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Tested-by: Yury Khrustalev <yury.khrustalev@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/mm/gcs.c              | 47 +++++++++++++++++++-
 arch/x86/include/asm/shstk.h     | 11 +++--
 arch/x86/kernel/process.c        |  2 +-
 arch/x86/kernel/shstk.c          | 53 ++++++++++++++++++++---
 include/asm-generic/cacheflush.h | 11 +++++
 include/linux/sched/task.h       | 17 ++++++++
 include/uapi/linux/sched.h       |  9 ++--
 kernel/fork.c                    | 93 ++++++++++++++++++++++++++++++++++------
 8 files changed, 217 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
index 3abcbf9adb5c..249ff05bca45 100644
--- a/arch/arm64/mm/gcs.c
+++ b/arch/arm64/mm/gcs.c
@@ -43,8 +43,23 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
 {
 	unsigned long addr, size;
 
-	if (!system_supports_gcs())
+	if (!system_supports_gcs()) {
+		if (args->shadow_stack_token)
+			return -EINVAL;
+
 		return 0;
+	}
+
+	/*
+	 * If the user specified a GCS then use it, otherwise fall
+	 * back to a default allocation strategy. Validation is done
+	 * in arch_shstk_validate_clone().
+	 */
+	if (args->shadow_stack_token) {
+		tsk->thread.gcs_base = 0;
+		tsk->thread.gcs_size = 0;
+		return 0;
+	}
 
 	if (!task_gcs_el0_enabled(tsk))
 		return 0;
@@ -68,6 +83,36 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
 	return 0;
 }
 
+static bool gcs_consume_token(struct vm_area_struct *vma, struct page *page,
+			      unsigned long user_addr)
+{
+	u64 expected = GCS_CAP(user_addr);
+	u64 *token = page_address(page) + offset_in_page(user_addr);
+
+	if (!cmpxchg_to_user_page(vma, page, user_addr, token, expected, 0))
+		return false;
+	set_page_dirty_lock(page);
+
+	return true;
+}
+
+int arch_shstk_validate_clone(struct task_struct *tsk,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args)
+{
+	unsigned long gcspr_el0;
+	int ret = 0;
+
+	gcspr_el0 = args->shadow_stack_token;
+	if (!gcs_consume_token(vma, page, gcspr_el0))
+		return -EINVAL;
+
+	tsk->thread.gcspr_el0 = gcspr_el0 + sizeof(u64);
+
+	return ret;
+}
+
 SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
 {
 	unsigned long alloc_size;
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index ba6f2fe43848..827e983430aa 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct kernel_clone_args;
 struct ksignal;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
@@ -16,8 +17,8 @@ struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
 void reset_thread_features(void);
-unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
-				       unsigned long stack_size);
+unsigned long shstk_alloc_thread_stack(struct task_struct *p,
+				       const struct kernel_clone_args *args);
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
@@ -28,8 +29,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
 static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
-						     unsigned long clone_flags,
-						     unsigned long stack_size) { return 0; }
+						     const struct kernel_clone_args *args)
+{
+	return 0;
+}
 static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1b7960cf6eb0..0a54af6c60df 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -209,7 +209,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
 	 * update it.
 	 */
-	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	new_ssp = shstk_alloc_thread_stack(p, args);
 	if (IS_ERR_VALUE(new_ssp))
 		return PTR_ERR((void *)new_ssp);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 2ddf23387c7e..9926d58e5d41 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -191,18 +191,61 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
-unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
-				       unsigned long stack_size)
+int arch_shstk_validate_clone(struct task_struct *t,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args)
+{
+	void *maddr = page_address(page);
+	unsigned long token;
+	int offset;
+	u64 expected;
+
+	/*
+	 * kernel_clone_args() verification assures token address is 8
+	 * byte aligned.
+	 */
+	token = args->shadow_stack_token;
+	expected = (token + SS_FRAME_SIZE) | BIT(0);
+	offset = offset_in_page(token);
+
+	if (!cmpxchg_to_user_page(vma, page, token, (unsigned long *)(maddr + offset),
+				  expected, 0))
+		return -EINVAL;
+	set_page_dirty_lock(page);
+
+	return 0;
+}
+
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+				       const struct kernel_clone_args *args)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long clone_flags = args->flags;
 	unsigned long addr, size;
 
 	/*
 	 * If shadow stack is not enabled on the new thread, skip any
-	 * switch to a new shadow stack.
+	 * implicit switch to a new shadow stack and reject attempts to
+	 * explicitly specify one.
 	 */
-	if (!features_enabled(ARCH_SHSTK_SHSTK))
+	if (!features_enabled(ARCH_SHSTK_SHSTK)) {
+		if (args->shadow_stack_token)
+			return (unsigned long)ERR_PTR(-EINVAL);
+
 		return 0;
+	}
+
+	/*
+	 * If the user specified a shadow stack then use it, otherwise
+	 * fall back to a default allocation strategy. Validation is
+	 * done in arch_shstk_validate_clone().
+	 */
+	if (args->shadow_stack_token) {
+		shstk->base = 0;
+		shstk->size = 0;
+		return args->shadow_stack_token + 8;
+	}
 
 	/*
 	 * For CLONE_VFORK the child will share the parents shadow stack.
@@ -222,7 +265,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl
 	if (!(clone_flags & CLONE_VM))
 		return 0;
 
-	size = adjust_shstk_size(stack_size);
+	size = adjust_shstk_size(args->stack_size);
 	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return addr;
diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
index 7ee8a179d103..96cc0c7a5c90 100644
--- a/include/asm-generic/cacheflush.h
+++ b/include/asm-generic/cacheflush.h
@@ -124,4 +124,15 @@ static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
 	} while (0)
 #endif
 
+#ifndef cmpxchg_to_user_page
+#define cmpxchg_to_user_page(vma, page, vaddr, ptr, old, new)  \
+({							  \
+	bool ret;						  \
+								  \
+	ret = try_cmpxchg(ptr, &old, new);			  \
+	flush_icache_user_page(vma, page, vaddr, sizeof(*ptr));	  \
+	ret;							  \
+})
+#endif
+
 #endif /* _ASM_GENERIC_CACHEFLUSH_H */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ea41795a352b..b501f752fc9a 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -16,6 +16,7 @@ struct task_struct;
 struct rusage;
 union thread_union;
 struct css_set;
+struct vm_area_struct;
 
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
@@ -44,6 +45,7 @@ struct kernel_clone_args {
 	struct cgroup *cgrp;
 	struct css_set *cset;
 	unsigned int kill_seq;
+	unsigned long shadow_stack_token;
 };
 
 /*
@@ -226,4 +228,19 @@ static inline void task_unlock(struct task_struct *p)
 
 DEFINE_GUARD(task_lock, struct task_struct *, task_lock(_T), task_unlock(_T))
 
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
+int arch_shstk_validate_clone(struct task_struct *p,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args);
+#else
+static inline int arch_shstk_validate_clone(struct task_struct *p,
+					    struct vm_area_struct *vma,
+					    struct page *page,
+					    struct kernel_clone_args *args)
+{
+	return 0;
+}
+#endif
+
 #endif /* _LINUX_SCHED_TASK_H */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 359a14cc76a4..9cf5c419e109 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -84,6 +84,7 @@
  *                kernel's limit of nested PID namespaces.
  * @cgroup:       If CLONE_INTO_CGROUP is specified set this to
  *                a file descriptor for the cgroup.
+ * @shadow_stack_token: Pointer to shadow stack token at top of stack.
  *
  * The structure is versioned by size and thus extensible.
  * New struct members must go at the end of the struct and
@@ -101,12 +102,14 @@ struct clone_args {
 	__aligned_u64 set_tid;
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
+	__aligned_u64 shadow_stack_token;
 };
 #endif
 
-#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
-#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
-#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER0  64 /* sizeof first published struct */
+#define CLONE_ARGS_SIZE_VER1  80 /* sizeof second published struct */
+#define CLONE_ARGS_SIZE_VER2  88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER3  96 /* sizeof fourth published struct */
 
 /*
  * Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..d484ebeded33 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1907,6 +1907,51 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
 	return true;
 }
 
+static int shstk_validate_clone(struct task_struct *p,
+				struct kernel_clone_args *args)
+{
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct page *page;
+	unsigned long addr;
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
+		return 0;
+
+	if (!args->shadow_stack_token)
+		return 0;
+
+	mm = get_task_mm(p);
+	if (!mm)
+		return -EFAULT;
+
+	mmap_read_lock(mm);
+
+	addr = untagged_addr_remote(mm, args->shadow_stack_token);
+	page = get_user_page_vma_remote(mm, addr, FOLL_FORCE | FOLL_WRITE,
+					&vma);
+	if (IS_ERR(page)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (!(vma->vm_flags & VM_SHADOW_STACK) ||
+	    !(vma->vm_flags & VM_WRITE)) {
+		ret = -EFAULT;
+		goto out_page;
+	}
+
+	ret = arch_shstk_validate_clone(p, vma, page, args);
+
+out_page:
+	put_page(page);
+out:
+	mmap_read_unlock(mm);
+	mmput(mm);
+	return ret;
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -2182,6 +2227,9 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
 	retval = copy_thread(p, args);
+	if (retval)
+		goto bad_fork_cleanup_io;
+	retval = shstk_validate_clone(p, args);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -2763,7 +2811,9 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		     CLONE_ARGS_SIZE_VER1);
 	BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
 		     CLONE_ARGS_SIZE_VER2);
-	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
+	BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_token) !=
+		     CLONE_ARGS_SIZE_VER3);
+	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
 
 	if (unlikely(usize > PAGE_SIZE))
 		return -E2BIG;
@@ -2796,16 +2846,17 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		return -EINVAL;
 
 	*kargs = (struct kernel_clone_args){
-		.flags		= args.flags,
-		.pidfd		= u64_to_user_ptr(args.pidfd),
-		.child_tid	= u64_to_user_ptr(args.child_tid),
-		.parent_tid	= u64_to_user_ptr(args.parent_tid),
-		.exit_signal	= args.exit_signal,
-		.stack		= args.stack,
-		.stack_size	= args.stack_size,
-		.tls		= args.tls,
-		.set_tid_size	= args.set_tid_size,
-		.cgroup		= args.cgroup,
+		.flags			= args.flags,
+		.pidfd			= u64_to_user_ptr(args.pidfd),
+		.child_tid		= u64_to_user_ptr(args.child_tid),
+		.parent_tid		= u64_to_user_ptr(args.parent_tid),
+		.exit_signal		= args.exit_signal,
+		.stack			= args.stack,
+		.stack_size		= args.stack_size,
+		.tls			= args.tls,
+		.set_tid_size		= args.set_tid_size,
+		.cgroup			= args.cgroup,
+		.shadow_stack_token	= args.shadow_stack_token,
 	};
 
 	if (args.set_tid &&
@@ -2846,6 +2897,24 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
 	return true;
 }
 
+/**
+ * clone3_shadow_stack_valid - check and prepare shadow stack
+ * @kargs: kernel clone args
+ *
+ * Verify that shadow stacks are only enabled if supported.
+ */
+static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
+{
+	if (!kargs->shadow_stack_token)
+		return true;
+
+	if (!IS_ALIGNED(kargs->shadow_stack_token, sizeof(void *)))
+		return false;
+
+	/* Fail if the kernel wasn't built with shadow stacks */
+	return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
+}
+
 static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
@@ -2868,7 +2937,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	    kargs->exit_signal)
 		return false;
 
-	if (!clone3_stack_valid(kargs))
+	if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
 		return false;
 
 	return true;

-- 
2.39.5



* [PATCH v20 3/8] selftests: Provide helper header for shadow stack testing
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

While almost all users of shadow stacks should be relying on the dynamic
linker and libc to enable the feature there are several low level test
programs where it is useful to enable without any libc support, allowing
testing without full system enablement. This low level testing is helpful
during bringup of the support itself, and also in enabling coverage by
automated testing without needing all system components in the target root
filesystems to have enablement.

Provide a header with helpers for this purpose, intended for use only by
test programs directly exercising shadow stack interfaces.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/ksft_shstk.h | 98 ++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/tools/testing/selftests/ksft_shstk.h b/tools/testing/selftests/ksft_shstk.h
new file mode 100644
index 000000000000..fecf91218ea5
--- /dev/null
+++ b/tools/testing/selftests/ksft_shstk.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Helpers for shadow stack enablement, this is intended to only be
+ * used by low level test programs directly exercising interfaces for
+ * working with shadow stacks.
+ *
+ * Copyright (C) 2024 ARM Ltd.
+ */
+
+#ifndef __KSFT_SHSTK_H
+#define __KSFT_SHSTK_H
+
+#include <asm/mman.h>
+
+/* This is currently only defined for x86 */
+#ifndef SHADOW_STACK_SET_TOKEN
+#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
+#endif
+
+static bool shadow_stack_enabled;
+
+#ifdef __x86_64__
+#define ARCH_SHSTK_ENABLE	0x5001
+#define ARCH_SHSTK_SHSTK	(1ULL <<  0)
+
+#define ARCH_PRCTL(arg1, arg2)					\
+({								\
+	long _ret;						\
+	register long _num  asm("eax") = __NR_arch_prctl;	\
+	register long _arg1 asm("rdi") = (long)(arg1);		\
+	register long _arg2 asm("rsi") = (long)(arg2);		\
+								\
+	asm volatile (						\
+		"syscall\n"					\
+		: "=a"(_ret)					\
+		: "r"(_arg1), "r"(_arg2),			\
+		  "0"(_num)					\
+		: "rcx", "r11", "memory", "cc"			\
+	);							\
+	_ret;							\
+})
+
+#define ENABLE_SHADOW_STACK
+static __always_inline void enable_shadow_stack(void)
+{
+	int ret = ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
+	if (ret == 0)
+		shadow_stack_enabled = true;
+}
+
+#endif
+
+#ifdef __aarch64__
+#define PR_SET_SHADOW_STACK_STATUS      75
+# define PR_SHADOW_STACK_ENABLE         (1UL << 0)
+
+#define my_syscall2(num, arg1, arg2)                                          \
+({                                                                            \
+	register long _num  __asm__ ("x8") = (num);                           \
+	register long _arg1 __asm__ ("x0") = (long)(arg1);                    \
+	register long _arg2 __asm__ ("x1") = (long)(arg2);                    \
+	register long _arg3 __asm__ ("x2") = 0;                               \
+	register long _arg4 __asm__ ("x3") = 0;                               \
+	register long _arg5 __asm__ ("x4") = 0;                               \
+									      \
+	__asm__  volatile (                                                   \
+		"svc #0\n"                                                    \
+		: "=r"(_arg1)                                                 \
+		: "r"(_arg1), "r"(_arg2),                                     \
+		  "r"(_arg3), "r"(_arg4),                                     \
+		  "r"(_arg5), "r"(_num)					      \
+		: "memory", "cc"                                              \
+	);                                                                    \
+	_arg1;                                                                \
+})
+
+#define ENABLE_SHADOW_STACK
+static __always_inline void enable_shadow_stack(void)
+{
+	int ret;
+
+	ret = my_syscall2(__NR_prctl, PR_SET_SHADOW_STACK_STATUS,
+			  PR_SHADOW_STACK_ENABLE);
+	if (ret == 0)
+		shadow_stack_enabled = true;
+}
+
+#endif
+
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 453
+#endif
+
+#ifndef ENABLE_SHADOW_STACK
+static inline void enable_shadow_stack(void) { }
+#endif
+
+#endif

-- 
2.39.5



* [PATCH v20 2/8] Documentation: userspace-api: Add shadow stack API documentation
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

There are a number of architectures with shadow stack features which we are
presenting to userspace with as consistent an API as we can (though there
are some architecture specifics). Especially given that there are some
important considerations for userspace code interacting directly with the
feature let's provide some documentation covering the common aspects.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Deepak Gupta <debug@rivosinc.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 Documentation/userspace-api/index.rst        |  1 +
 Documentation/userspace-api/shadow_stack.rst | 44 ++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index b8c73be4fb11..0167e59b541e 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -62,6 +62,7 @@ Everything else
 
    ELF
    netlink/index
+   shadow_stack
    sysfs-platform_profile
    vduse
    futex2
diff --git a/Documentation/userspace-api/shadow_stack.rst b/Documentation/userspace-api/shadow_stack.rst
new file mode 100644
index 000000000000..42617d0470ba
--- /dev/null
+++ b/Documentation/userspace-api/shadow_stack.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Shadow Stacks
+=============
+
+Introduction
+============
+
+Several architectures have features which provide backward edge
+control flow protection through a hardware maintained stack, only
+writable by userspace through very limited operations.  This feature
+is referred to as shadow stacks on Linux, on x86 it is part of Intel
+Control Enforcement Technology (CET), on arm64 it is Guarded Control
+Stacks feature (FEAT_GCS) and for RISC-V it is the Zicfiss extension.
+It is expected that this feature will normally be managed by the
+system dynamic linker and libc in ways broadly transparent to
+application code, this document covers interfaces and considerations.
+
+
+Enabling
+========
+
+Shadow stacks default to disabled when a userspace process is
+executed, they can be enabled for the current thread with a syscall:
+
+ - For x86 the ARCH_SHSTK_ENABLE arch_prctl()
+ - For other architectures the PR_SET_SHADOW_STACK_STATUS prctl()
+
+It is expected that this will normally be done by the dynamic linker.
+Any new threads created by a thread with shadow stacks enabled will
+themselves have shadow stacks enabled.
+
+
+Enablement considerations
+=========================
+
+- Returning from the function that enables shadow stacks without first
+  disabling them will cause a shadow stack exception.  This includes
+  any syscall wrapper or other library functions, the syscall will need
+  to be inlined.
+- A lock feature allows userspace to prevent disabling of shadow stacks.
+- Operations that change the stack context, like longjmp() or ucontext
+  changes on signal return, will need support from libc.

-- 
2.39.5



* [PATCH v20 1/8] arm64/gcs: Return a success value from gcs_alloc_thread_stack()
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

Currently as a result of templating from x86 code gcs_alloc_thread_stack()
returns a pointer as an unsigned long however on arm64 we don't actually use
this pointer value as anything other than a pass/fail flag. Simplify the
interface to just return an int with 0 on success and a negative error code
on failure.

Acked-by: Deepak Gupta <debug@rivosinc.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/include/asm/gcs.h | 8 ++++----
 arch/arm64/kernel/process.c  | 8 ++++----
 arch/arm64/mm/gcs.c          | 8 ++++----
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/gcs.h b/arch/arm64/include/asm/gcs.h
index 5bc432234d3a..b4bbec9382a1 100644
--- a/arch/arm64/include/asm/gcs.h
+++ b/arch/arm64/include/asm/gcs.h
@@ -64,8 +64,8 @@ static inline bool task_gcs_el0_enabled(struct task_struct *task)
 void gcs_set_el0_mode(struct task_struct *task);
 void gcs_free(struct task_struct *task);
 void gcs_preserve_current_state(void);
-unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-				     const struct kernel_clone_args *args);
+int gcs_alloc_thread_stack(struct task_struct *tsk,
+			   const struct kernel_clone_args *args);
 
 static inline int gcs_check_locked(struct task_struct *task,
 				   unsigned long new_val)
@@ -91,8 +91,8 @@ static inline bool task_gcs_el0_enabled(struct task_struct *task)
 static inline void gcs_set_el0_mode(struct task_struct *task) { }
 static inline void gcs_free(struct task_struct *task) { }
 static inline void gcs_preserve_current_state(void) { }
-static inline unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-						   const struct kernel_clone_args *args)
+static inline int gcs_alloc_thread_stack(struct task_struct *tsk,
+					 const struct kernel_clone_args *args)
 {
 	return -ENOTSUPP;
 }
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 96482a1412c6..f0b1bea9c873 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -299,7 +299,7 @@ static void flush_gcs(void)
 static int copy_thread_gcs(struct task_struct *p,
 			   const struct kernel_clone_args *args)
 {
-	unsigned long gcs;
+	int ret;
 
 	if (!system_supports_gcs())
 		return 0;
@@ -310,9 +310,9 @@ static int copy_thread_gcs(struct task_struct *p,
 	p->thread.gcs_el0_mode = current->thread.gcs_el0_mode;
 	p->thread.gcs_el0_locked = current->thread.gcs_el0_locked;
 
-	gcs = gcs_alloc_thread_stack(p, args);
-	if (IS_ERR_VALUE(gcs))
-		return PTR_ERR((void *)gcs);
+	ret = gcs_alloc_thread_stack(p, args);
+	if (ret != 0)
+		return ret;
 
 	return 0;
 }
diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
index 6e93f78de79b..3abcbf9adb5c 100644
--- a/arch/arm64/mm/gcs.c
+++ b/arch/arm64/mm/gcs.c
@@ -38,8 +38,8 @@ static unsigned long gcs_size(unsigned long size)
 	return max(PAGE_SIZE, size);
 }
 
-unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-				     const struct kernel_clone_args *args)
+int gcs_alloc_thread_stack(struct task_struct *tsk,
+			   const struct kernel_clone_args *args)
 {
 	unsigned long addr, size;
 
@@ -59,13 +59,13 @@ unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
 	size = gcs_size(size);
 	addr = alloc_gcs(0, size);
 	if (IS_ERR_VALUE(addr))
-		return addr;
+		return PTR_ERR((void *)addr);
 
 	tsk->thread.gcs_base = addr;
 	tsk->thread.gcs_size = size;
 	tsk->thread.gcspr_el0 = addr + size - sizeof(u64);
 
-	return addr;
+	return 0;
 }
 
 SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)

-- 
2.39.5



* [PATCH v20 0/8] fork: Support shadow stacks in clone3()
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan

[ I think at this point everyone is OK with the ABI, and the x86
  implementation has been tested so hopefully we are near to being
  able to get this merged?  If there are any outstanding issues let
  me know and I can look at addressing them.  The one possible issue
  I am aware of is that the RISC-V shadow stack support was briefly
  in -next but got dropped along with the general RISC-V issues during
  the last merge window, rebasing for that is still in progress.  I
  guess ideally this could be applied on a branch and then pulled into
  the RISC-V tree? ]

The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1].  With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses.  This provides some
protection against ROP attacks and makes it easier to collect call
stacks.  These shadow stacks are allocated in the address space of the
userspace process.

Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled.  The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread.  This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces.  As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone().  The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stack by map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
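
As a rough illustration of the ABI change described above, the sketch
below mirrors the extended argument structure in userspace.  The names
clone_args_v3, clone_args_v3_init() and shadow_stack_token_is_aligned()
are hypothetical and exist only for this example; the field order and
the 88/96 byte figures are taken from the uapi change in patch 4/8, and
the alignment check is modelled on clone3_shadow_stack_valid().  Real
code would use the uapi header once it carries the new field.

```c
#include <assert.h>	/* static_assert */
#include <stddef.h>	/* offsetof */
#include <stdint.h>
#include <string.h>

/* Stand-in for the kernel's __aligned_u64 (assumes GCC or Clang). */
typedef uint64_t aligned_u64 __attribute__((aligned(8)));

/*
 * Hypothetical local mirror of struct clone_args with the new member
 * appended at the end, as in the uapi diff in this series.
 */
struct clone_args_v3 {
	aligned_u64 flags;
	aligned_u64 pidfd;
	aligned_u64 child_tid;
	aligned_u64 parent_tid;
	aligned_u64 exit_signal;
	aligned_u64 stack;
	aligned_u64 stack_size;
	aligned_u64 tls;
	aligned_u64 set_tid;
	aligned_u64 set_tid_size;
	aligned_u64 cgroup;
	aligned_u64 shadow_stack_token;	/* new in this series */
};

/* The struct is versioned by size: VER2 ends at 88, VER3 at 96. */
static_assert(offsetof(struct clone_args_v3, shadow_stack_token) == 88,
	      "new field must start where the VER2 struct ended");
static_assert(sizeof(struct clone_args_v3) == 96,
	      "VER3 struct must be 96 bytes");

/* Same test the series applies: the token address must be pointer aligned. */
static inline int shadow_stack_token_is_aligned(uint64_t token)
{
	return (token % sizeof(void *)) == 0;
}

/* Zero every member, then set only the flags and the token address. */
static inline void clone_args_v3_init(struct clone_args_v3 *args,
				      uint64_t flags, uint64_t token)
{
	memset(args, 0, sizeof(*args));
	args->flags = flags;
	args->shadow_stack_token = token;
}
```

A caller would pass the address of the token written by
map_shadow_stack() as the token argument and hand the structure to
clone3() with sizeof(struct clone_args_v3) as the size; leaving the
field at zero keeps the existing implicit allocation behaviour.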

Yury Khrustalev has raised questions from the libc side regarding
discoverability of extended clone3() structure sizes[2], this seems like
a general issue with clone3().  There was a suggestion to add a hwcap on
arm64 which isn't ideal but is doable there, though architecture
specific mechanisms would also be needed for x86 (and RISC-V if its
support gets merged before this does).  The idea has, however, had
strong pushback from the architecture maintainers and it is possible to
detect support for this in clone3() by attempting a call with a
misaligned shadow stack pointer specified so no hwcap has been added.
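
A minimal sketch of that detection approach follows, assuming the error
codes implied by patch 4/8: a kernel that does not know the field should
reject the oversized structure with E2BIG because the trailing bytes are
non-zero, while a kernel with this series applied should fail argument
validation with EINVAL because the token is misaligned.  The names
probe_clone3_shadow_stack(), struct clone_args_probe and the enum values
are hypothetical.  Note that EINVAL only indicates that the kernel knows
the field, it does not show that shadow stacks are enabled or usable in
the calling process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435	/* fallback for older libc headers */
#endif

/* Hypothetical local copy of the 96 byte argument layout. */
struct clone_args_probe {
	uint64_t flags, pidfd, child_tid, parent_tid;
	uint64_t exit_signal, stack, stack_size, tls;
	uint64_t set_tid, set_tid_size, cgroup;
	uint64_t shadow_stack_token;
};
static_assert(sizeof(struct clone_args_probe) == 96, "unexpected layout");

enum shstk_probe_result {
	SHSTK_PROBE_FIELD_KNOWN,	/* EINVAL: field parsed, token rejected */
	SHSTK_PROBE_FIELD_UNKNOWN,	/* E2BIG: kernel predates the field */
	SHSTK_PROBE_NO_CLONE3,		/* ENOSYS: no clone3(), or filtered */
	SHSTK_PROBE_OTHER,		/* anything else, treat as unknown */
};

static enum shstk_probe_result probe_clone3_shadow_stack(void)
{
	struct clone_args_probe args;
	long ret;

	memset(&args, 0, sizeof(args));
	args.exit_signal = SIGCHLD;
	args.shadow_stack_token = 1;	/* deliberately misaligned */

	ret = syscall(SYS_clone3, &args, sizeof(args));
	if (ret == 0)
		_exit(0);	/* not expected: we are a child, leave at once */
	if (ret > 0) {
		/* Not expected either: reap the child and report unknown. */
		waitpid((pid_t)ret, NULL, 0);
		return SHSTK_PROBE_OTHER;
	}

	switch (errno) {
	case EINVAL:
		return SHSTK_PROBE_FIELD_KNOWN;
	case E2BIG:
		return SHSTK_PROBE_FIELD_UNKNOWN;
	case ENOSYS:
		return SHSTK_PROBE_NO_CLONE3;
	default:
		return SHSTK_PROBE_OTHER;
	}
}
```

A libc could run such a probe once and cache the result, falling back
to the kernel's implicit shadow stack allocation unless the field is
known.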

[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org/T/#mc58f97f27461749ccf400ebabf6f9f937116a86b
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v20:
- Comment fixes and clarifications in x86 arch_shstk_validate_clone()
  from Rick Edgecombe.
- Spelling fix in documentation.
- Link to v19: https://lore.kernel.org/r/20250819-clone3-shadow-stack-v19-0-bc957075479b@kernel.org

Changes in v19:
- Rebase onto v6.17-rc1.
- Link to v18: https://lore.kernel.org/r/20250702-clone3-shadow-stack-v18-0-7965d2b694db@kernel.org

Changes in v18:
- Rebase onto v6.16-rc3.
- Thanks to pointers from Yury Khrustalev this version has been tested
  on x86 so I have removed the RFT tag.
- Clarify clone3_shadow_stack_valid() comment about the Kconfig check.
- Remove redundant GCSB DSYNCs in arm64 code.
- Fix token validation on x86.
- Link to v17: https://lore.kernel.org/r/20250609-clone3-shadow-stack-v17-0-8840ed97ff6f@kernel.org

Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@kernel.org

Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@kernel.org

Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@kernel.org

Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@kernel.org

Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@kernel.org

Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
  support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@kernel.org

Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
  integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
  base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@kernel.org

Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
  Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@kernel.org

Changes in v9:
- Pull token validation earlier and report problems with an error return
  to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@kernel.org

Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@kernel.org

Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@kernel.org

Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
  x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
  the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@kernel.org

Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
  map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
  other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@kernel.org

Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
  validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org

Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
  CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@kernel.org

Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
  desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@kernel.org

---
Mark Brown (8):
      arm64/gcs: Return a success value from gcs_alloc_thread_stack()
      Documentation: userspace-api: Add shadow stack API documentation
      selftests: Provide helper header for shadow stack testing
      fork: Add shadow stack support to clone3()
      selftests/clone3: Remove redundant flushes of output streams
      selftests/clone3: Factor more of main loop into test_clone3()
      selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
      selftests/clone3: Test shadow stack support

 Documentation/userspace-api/index.rst             |   1 +
 Documentation/userspace-api/shadow_stack.rst      |  44 +++++
 arch/arm64/include/asm/gcs.h                      |   8 +-
 arch/arm64/kernel/process.c                       |   8 +-
 arch/arm64/mm/gcs.c                               |  55 +++++-
 arch/x86/include/asm/shstk.h                      |  11 +-
 arch/x86/kernel/process.c                         |   2 +-
 arch/x86/kernel/shstk.c                           |  53 ++++-
 include/asm-generic/cacheflush.h                  |  11 ++
 include/linux/sched/task.h                        |  17 ++
 include/uapi/linux/sched.h                        |   9 +-
 kernel/fork.c                                     |  93 +++++++--
 tools/testing/selftests/clone3/clone3.c           | 226 ++++++++++++++++++----
 tools/testing/selftests/clone3/clone3_selftests.h |  65 ++++++-
 tools/testing/selftests/ksft_shstk.h              |  98 ++++++++++
 15 files changed, 620 insertions(+), 81 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,
--  
Mark Brown <broonie@kernel.org>


* Re: [PATCH v4 0/4] procfs: make reference pidns more user-visible
From: Christian Brauner @ 2025-09-02 10:02 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

On Tue, Aug 05, 2025 at 03:45:07PM +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> This implicit behaviour has historically meant that userspace was
> required to do some special dances in order to configure the pidns of a
> procfs mount as desired. Examples include:
> 
>  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
>    a procfs mount from an empty pidns so that user namespaced containers
>    can be nested (without this, the nested containers would fail to
>    mount procfs). But this requires forking off a helper process because
>    you cannot just one-shot this using mount(2).
> 
>  * Container runtimes in general need to fork into a container before
>    configuring its mounts, which can lead to security issues in the case
>    of shared-pidns containers (a privileged process in the pidns can
>    interact with your container runtime process). While
>    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
>    strict need for this due to a minor uAPI wart is kind of unfortunate.
> 
> Things would be much easier if there was a way for userspace to just
> specify the pidns they want. Patch 1 implements a new "pidns" argument
> which can be set using fsconfig(2):
> 
>     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> or classic mount(2) / mount(8):
> 
>     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> The initial security model I have in this RFC is to be as conservative
> as possible and just mirror the security model for setns(2) -- which
> means that you can only set pidns=... to pid namespaces that your
> current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
> privileges over the pid namespace. This fulfils the requirements of
> container runtimes, but I suspect that this may be too strict for some
> usecases.
> 
> The pidns argument is not displayed in mountinfo -- it's not clear to me
> what value it would make sense to show (maybe we could just use ns_dname
> to provide an identifier for the namespace, but this number would be
> fairly useless to userspace). I'm open to suggestions. Note that
> PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
> information about this outside of mountinfo.
> 
> Note that you cannot change the pidns of an already-created procfs
> instance. The primary reason is that allowing this to be changed would
> require RCU-protecting proc_pid_ns(sb) and thus auditing all of
> fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
> the pid namespace. Since creating procfs instances is very cheap, it
> seems unnecessary to overcomplicate this upfront. Trying to reconfigure
> procfs this way errors out with -EBUSY.
> 
> /* ioctl(PROCFS_GET_PID_NAMESPACE) */
> 
> In addition, being able to figure out what pid namespace is being used
> by a procfs mount is quite useful when you have an administrative
> process (such as a container runtime) which wants to figure out the
> correct way of mapping PIDs between its own namespace and the namespace
> for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> alternative ways to do this, but they all rely on ancillary information
> that third-party libraries and tools do not necessarily have access to.
> 
> To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> can be used to get a reference to the pidns that a procfs is using.
> 
> Rather than copying the (fairly strict) security model for setns(2),
> apply a slightly looser model to better match what userspace can already
> do:
> 
>  * Make the ioctl only valid on the root (meaning that a process without
>    access to the procfs root -- such as only having an fd to a procfs
>    file or some open_tree(2)-like subset -- cannot use this API). This
>    means that the process already has some level of access to the
>    /proc/$pid directories.
> 
>  * If the calling process is in an ancestor pidns, then they can already
>    create pidfd for processes inside the pidns, which is morally
>    equivalent to a pidns file descriptor according to setns(2). So it
>    seems reasonable to just allow it in this case. (The justification
>    for this model was suggested by Christian.)
> 
>  * If the process has access to /proc/1/ns/pid already (i.e. has
>    ptrace-read access to the pidns pid1), then this ioctl is equivalent
>    to just opening a handle to it that way.
> 
>    Ideally we would check for ptrace-read access against all processes
>    in the pidns (which is very likely to be true for at least one
>    process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
>    by most programs), but this would obviously not scale.
> 
> I'm open to suggestions for whether we need to make this stricter (or
> possibly allow more cases).
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks for the patchset. Being able to specify what pid namespace the
procfs instance is supposed to belong to is super useful and will make
things easier for userspace for sure.

The code you added contains a minor wrinkle that I disliked which I've
changed and you tell me if you can live with this restriction or not.

The way you've implemented it, specifying a pid namespace that the caller
holds privilege over would silently also override the user namespace the
filesystem is supposed to belong to.

Specifically, you did something like:

        put_pid_ns(ctx->pid_ns);
        ctx->pid_ns = get_pid_ns(target);
        put_user_ns(fc->user_ns);
        fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);

This silently overrides the user namespace recorded at fsopen() time. I
think that's too subtle and we should just not allow that at all for
now.

Instead I've changed this to:

        if (fc->user_ns != target->user_ns)
                return invalfc(fc, "owning user namespace of pid namespace doesn't match procfs user namespace");

        put_pid_ns(ctx->pid_ns);
        ctx->pid_ns = get_pid_ns(target);

so we just refuse different ownership.
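
For readers following along outside the kernel tree, the behavioural
difference can be modelled in plain userspace C. This is only a sketch
under assumptions: the demo_* types and demo_set_pidns() are hypothetical
stand-ins, not kernel API, and reference counting (get_pid_ns() /
put_pid_ns()) is deliberately left out. The one detail taken from the
kernel is that invalfc() results in -EINVAL.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; only the fields the check needs. */
struct demo_user_ns { int id; };
struct demo_pid_ns  { struct demo_user_ns *user_ns; };

struct demo_fs_context {
	struct demo_user_ns *user_ns;	/* recorded at fsopen() time */
	struct demo_pid_ns  *pid_ns;	/* pidns the procfs instance will use */
};

/*
 * Model of the "pidns" option handler after the change: refuse a target
 * pidns whose owning userns differs from the one already recorded,
 * instead of silently replacing fc->user_ns.
 */
static int demo_set_pidns(struct demo_fs_context *fc,
			  struct demo_pid_ns *target)
{
	if (fc->user_ns != target->user_ns)
		return -EINVAL;

	fc->pid_ns = target;	/* fc->user_ns is left untouched */
	return 0;
}
```

With this model, a mismatching target leaves both fc->user_ns and
fc->pid_ns exactly as they were, which is the property the stricter
check is meant to provide.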

I've also dropped the procfs ioctl because I'm not sure how much value
it will actually add given that you can do this via /proc/1/ns/pid.

If that is something that libpathrs desperately needs I would like to
do it as a separate patch anyways.

Thanks for the excellent cover letter. This was a pleasure merging!

* Re: (subset) [PATCH v4 0/4] procfs: make reference pidns more user-visible
From: Christian Brauner @ 2025-09-02  9:54 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Andy Lutomirski, linux-kernel, linux-fsdevel,
	linux-api, linux-doc, linux-kselftest, Alexander Viro, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

On Tue, 05 Aug 2025 15:45:07 +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> [...]

Applied to the vfs-6.18.procfs branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.procfs branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.procfs

[1/4] pidns: move is-ancestor logic to helper
      https://git.kernel.org/vfs/vfs/c/60d22c6ef41b
[2/4] procfs: add "pidns" mount option
      https://git.kernel.org/vfs/vfs/c/77e211dd1392
[4/4] selftests/proc: add tests for new pidns APIs
      https://git.kernel.org/vfs/vfs/c/568d4239002c

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Roberto Sassu @ 2025-09-02  8:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <3d89a03f31cacb53a2ed8017899f2dab10476b62.camel@huaweicloud.com>

On Mon, 2025-09-01 at 19:01 +0200, Roberto Sassu wrote:
> On Mon, 2025-09-01 at 09:25 -0700, Andy Lutomirski wrote:
> > Can you clarify this a bit for those of us who are not well-versed in
> > exactly what "measurement" does?

Ah, sorry, I missed that.

Measurement refers to the process of collecting the file digest and
storing it in the measurement list, as opposed to appraisal which
instead compares the collected file digest with a reference value
(assumed to be good), and denies access in case of a mismatch.

Integrity violations are detected and reported only for measurement.
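
The distinction can be sketched as a small userspace model. Everything
here is an assumption made for illustration: the demo_* names are
hypothetical, the digest is a toy FNV-1a hash standing in for a real
cryptographic hash, and the fixed-size list bears no relation to how IMA
actually stores measurements.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ENTRIES 16

/* Toy digest (64-bit FNV-1a); a stand-in for a real cryptographic hash. */
static uint64_t demo_digest(const unsigned char *data, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

struct demo_measurement_list {
	uint64_t digests[DEMO_MAX_ENTRIES];
	size_t count;
};

/* Measurement: record the digest. It never denies access by itself. */
static int demo_measure(struct demo_measurement_list *ml,
			const unsigned char *data, size_t len)
{
	if (ml->count == DEMO_MAX_ENTRIES)
		return -ENOSPC;	/* toy limitation of this sketch */

	ml->digests[ml->count++] = demo_digest(data, len);
	return 0;
}

/* Appraisal: compare with a known-good reference; deny on mismatch. */
static int demo_appraise(const unsigned char *data, size_t len,
			 uint64_t reference)
{
	return demo_digest(data, len) == reference ? 0 : -EACCES;
}
```

In this model a modified file is still measured successfully (its digest
simply lands in the list for a remote verifier to judge), whereas
appraisal of the same content fails locally.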

Roberto

> > On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
> > <roberto.sassu@huaweicloud.com> wrote:
> > > > Now, in cases where you have IMA or something and you only permit signed
> > > > binaries to execute, you could argue there is a different race here (an
> > > > attacker creates a malicious script, runs it, and then replaces it with
> > > > a valid script's contents and metadata after the fact to get
> > > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
> > > 
> > > Uhm, let's consider measurement, I'm more familiar with.
> > > 
> > > I think the race you wanted to express was that the attacker replaces
> > > the good script, verified with AT_EXECVE_CHECK, with the bad script
> > > after the IMA verification but before the interpreter reads it.
> > > 
> > > Fortunately, IMA is able to cope with this situation, since this race
> > > can happen for any file open, where of course a file can be not read-
> > > locked.
> > 
> > I assume you mean that this has nothing specifically to do with
> > scripts, as IMA tries to protect ordinary (non-"execute" file access)
> > as well.  Am I right?
> 
> Yes, correct, violations are checked for all open() and mmap()
> involving regular files. It would not be special to do it for scripts.
> 
> > > If the attacker tries to concurrently open the script for write in this
> > > race window, IMA will report this event (called violation) in the
> > > measurement list, and during remote attestation it will be clear that
> > > the interpreter did not read what was measured.
> > > 
> > > We just need to run the violation check for the BPRM_CHECK hook too
> > > (then, probably for us the O_DENY_WRITE flag or alternative solution
> > > would not be needed, for measurement).
> > 
> > This seems consistent with my interpretation above, but ...
> 
> The comment here [1] seems to be clear on why the violation check is
> not done for execution (BPRM_CHECK hook). Since the OS read-locks the
> files during execution, this implicitly guarantees that there will not
> be concurrent writes, and thus no IMA violations.
> 
> However, recently, we took advantage of AT_EXECVE_CHECK to also
> evaluate the integrity of scripts (when not executed via ./). Since we
> are using the same hook for both executed files (read-locked) and
> scripts (I guess non-read-locked), then we need to do a violation check
> for BPRM_CHECK too, although it will be redundant for the first
> category.
> 
> > > Please, let us know when you apply patches like 2a010c412853 ("fs:
> > > don't block i_writecount during exec"). We had a discussion [1], but
> > > probably I missed when it was decided to be applied (I saw now it was
> > > in the same thread, but didn't get that at the time). We would have
> > > needed to update our code accordingly. In the future, we will try to
> > > clarify better our expectations from the VFS.
> > 
> > ... I didn't follow this.
> > 
> > Suppose there's some valid contents of /bin/sleep.  I execute
> > /bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
> > for write, not by replacing it), and the kernel in question doesn't do
> > ETXTBSY.  Then the sleep process reads (and executes) the modified
> > contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
> > needed?
> 
> Ok, this is actually a good opportunity to explain what it will be
> missing. If you do the operations in the order you proposed, actually a
> violation will be emitted, because the violating operation is an open()
> and the check is done for this system call.
> 
> However, if you do the opposite, first open for write and then
> execution, IMA will not be aware of that since it trusts the OS to not
> make it happen and will not check for violations.
> 
> So yes, in your case the remote attestation will fail (actually it is
> up to the remote verifier to decide...). But in the opposite case, the
> writer could wait for IMA to measure the genuine content and then
> modify the content conveniently. The remote attestation will succeed.
> 
> Adding the violation check on BPRM_CHECK should be sufficient to avoid
> such situation, but I would try to think if there are other
> implications for IMA of not read-locking the files on execution.
> 
> Roberto
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/integrity/ima/ima_main.c?h=v6.17-rc4#n565
> 


* Re: [PATCH v2] uapi/fcntl: define RENAME_* and AT_RENAME_* macros
From: Amir Goldstein @ 2025-09-02  6:58 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Aleksa Sarai, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api
In-Reply-To: <20250901231457.1179748-1-rdunlap@infradead.org>

On Tue, Sep 2, 2025 at 1:14 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Define the RENAME_* and AT_RENAME_* macros exactly the same as in
> recent glibc <stdio.h> so that duplicate definition build errors in
> both samples/watch_queue/watch_test.c and samples/vfs/test-statx.c
> no longer happen. When they are defined in exactly the same way in
> multiple places, the build errors are prevented.
>
> Defining only the AT_RENAME_* macros is not sufficient since they
> depend on the RENAME_* macros, which may not be defined when the
> AT_RENAME_* macros are used.
>
> Build errors being fixed:
>
> for samples/vfs/test-statx.c:
>
> In file included from ../samples/vfs/test-statx.c:23:
> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>   159 | #define AT_RENAME_NOREPLACE     0x0001
> In file included from ../samples/vfs/test-statx.c:13:
> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>   160 | #define AT_RENAME_EXCHANGE      0x0002
> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>   161 | #define AT_RENAME_WHITEOUT      0x0004
> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>
> for samples/watch_queue/watch_test.c:
>
> In file included from usr/include/linux/watch_queue.h:6,
>                  from ../samples/watch_queue/watch_test.c:19:
> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>   159 | #define AT_RENAME_NOREPLACE     0x0001
> In file included from ../samples/watch_queue/watch_test.c:11:
> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>   160 | #define AT_RENAME_EXCHANGE      0x0002
> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>   161 | #define AT_RENAME_WHITEOUT      0x0004
> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>
> Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> ---
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Chuck Lever <chuck.lever@oracle.com>
> Cc: Alexander Aring <alex.aring@gmail.com>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: David Howells <dhowells@redhat.com>
> CC: linux-api@vger.kernel.org
> To: linux-fsdevel@vger.kernel.org
>
>  include/uapi/linux/fcntl.h |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
> +++ linux-next-20250819/include/uapi/linux/fcntl.h
> @@ -156,9 +156,12 @@
>   */
>
>  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> -#define AT_RENAME_NOREPLACE    0x0001
> -#define AT_RENAME_EXCHANGE     0x0002
> -#define AT_RENAME_WHITEOUT     0x0004
> +# define RENAME_NOREPLACE (1 << 0)
> +# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> +# define RENAME_EXCHANGE (1 << 1)
> +# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
> +# define RENAME_WHITEOUT (1 << 2)
> +# define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>

This solution, apart from being terribly wrong (adjusting the source to match
the value of its downstream copy), does not address the issue that Matthew
pointed out in the v1 discussion [1]:

$ grep -r AT_RENAME_NOREPLACE /usr/include
/usr/include/linux/fcntl.h:#define AT_RENAME_NOREPLACE  0x0001

It's not in stdio.h at all.  This is with libc6 2.41-10

[1] https://lore.kernel.org/linux-fsdevel/aKxfGix_o4glz8-Z@casper.infradead.org/

I don't know how to resolve the mess that glibc has created.

Perhaps like this:

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index f291ab4f94ebc..dde14fa3c2007 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -155,10 +155,16 @@
  * as possible, so we can use them for generic bits in the future if necessary.
  */

-/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
-#define AT_RENAME_NOREPLACE    0x0001
-#define AT_RENAME_EXCHANGE     0x0002
-#define AT_RENAME_WHITEOUT     0x0004
+/*
+ * The legacy renameat2(2) RENAME_* flags are conceptually also syscall-specific
+ * flags, so it could make sense to create the AT_RENAME_* aliases for them and
+ * maybe later add support for generic AT_* flags to this syscall.
+ * However, following a mismatch of definitions in glibc and since no kernel code
+ * currently uses the AT_RENAME_* aliases, we leave them undefined here.
+#define AT_RENAME_NOREPLACE    RENAME_NOREPLACE
+#define AT_RENAME_EXCHANGE     RENAME_EXCHANGE
+#define AT_RENAME_WHITEOUT     RENAME_WHITEOUT
+*/

 /* Flag for faccessat(2). */
 #define AT_EACCESS             0x200   /* Test access permitted for

* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Arjun Shankar @ 2025-09-02  2:41 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Aleksa Sarai, Adhemerval Zanella Netto, libc-alpha, linux-api
In-Reply-To: <cbbc9639-0443-4bf8-bbd1-9d3fdcb2fd37@cs.ucla.edu>

Hi Paul,

> On 2025-08-28 01:42, Aleksa Sarai wrote:
> >> I still fail to understand how a hypothetical "give me the supported flags"
> >> openat2 flag would be useful enough to justify complicating the openat2 API
> >> today.
> > My only concern is that it would break recompiles if/when we change it
> > back.
>
> OK, but from what I can see there's no identified possibility that
> openat2 will modify the objects its arguments point to, just as there's
> no identified possibility that plain openat will do so (in a
> hypothetical extension to remove unnecessary slashes from its filename
> argument, say).

While it is true that openat cannot be extended in this way, for
openat2 (whether or not it eventually materializes in Linux) there
already is the RFC patch series proposing CHECK_FIELDS that Aleksa
referred to earlier. And it's not just that: it has been mentioned as
a potential future direction even when the openat2 syscall was
implemented [1]. I think we should interpret this to mean that there
is indeed a possibility for openat2.

> In that case it's pretty clear that glibc should mark the open_how
> argument as pointer-to-const, just as glibc already marks the filename
> argument.

Unless the kernel marks open_how as const, glibc marking it as const
can lead to additional maintenance complications down the line: in the
future if the kernel starts modifying open_how, glibc's openat2
wrapper will no longer align with the kernel's behavior. At that
point, glibc will either need to discard the const (which will cause
any existing users of the wrapper to fail to recompile), or glibc will
need to handle the kernel's new behavior in the wrapper (which will
lead to further divergence from the behavior of the syscall that we
would claim to wrap). Neither of these seems problem-free. On the
other hand, following the kernel's declaration will mean that should
the kernel choose to mark it const, we can easily follow suit in glibc
without breaking recompiles.
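
To make the recompile concern concrete, here is a minimal sketch. All
names are hypothetical: struct demo_how is not struct open_how,
DEMO_CHECK_FIELDS is not a real flag, and demo_sys_open() only stands in
for a syscall that may one day write to its argument. The return values
are arbitrary placeholders.

```c
#include <stdint.h>

/* Hypothetical struct and flags; NOT the real struct open_how. */
struct demo_how {
	uint64_t flags;
	uint64_t mode;
};

#define DEMO_KNOWN_FLAGS	0x0fULL
#define DEMO_CHECK_FIELDS	(1ULL << 63)

/*
 * Stand-in for a syscall that may write to its argument: when asked, it
 * reports the supported flags back through the same struct.
 */
static int demo_sys_open(struct demo_how *how)
{
	if (how->flags & DEMO_CHECK_FIELDS) {
		how->flags = DEMO_KNOWN_FLAGS;	/* write-back to caller */
		return -1;			/* nothing was opened */
	}
	return 3;	/* pretend file descriptor */
}

/*
 * Wrapper mirroring the non-const declaration. Had it been declared as
 * taking "const struct demo_how *", forwarding to demo_sys_open() would
 * need a cast, and dropping the const later would break callers that
 * pass pointers to const objects.
 */
static int demo_open_wrapper(struct demo_how *how)
{
	return demo_sys_open(how);
}
```

A caller that sets DEMO_CHECK_FIELDS sees how->flags rewritten after the
call, which is exactly the behaviour a pointer-to-const parameter would
promise not to have.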

Earlier on in this thread, Aleksa mentioned sched_setattr as
establishing precedent for the kernel modifying non-const objects. It
looks like glibc actually does provide a sched_setattr wrapper since
2.41. The relevant argument hasn't been marked as const and the kernel
does modify the contents, and glibc's syscall wrapper simply passes it
through. So we already do this.

Based on all this, I feel that leaving open_how as-is is the easier
and more maintenance-friendly choice for the syscall wrapper.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179

--
Arjun Shankar
he/him/his

* [PATCH v2] uapi/fcntl: define RENAME_* and AT_RENAME_* macros
From: Randy Dunlap @ 2025-09-01 23:14 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: patches, Randy Dunlap, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, Matthew Wilcox, David Howells, linux-api

Define the RENAME_* and AT_RENAME_* macros exactly the same as in
recent glibc <stdio.h> so that duplicate definition build errors in
both samples/watch_queue/watch_test.c and samples/vfs/test-statx.c
no longer happen. When they are defined in exactly the same way in
multiple places, the build errors are prevented.

Defining only the AT_RENAME_* macros is not sufficient since they
depend on the RENAME_* macros, which may not be defined when the
AT_RENAME_* macros are used.
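
The underlying C rule is that a macro may be redefined without a
diagnostic only when the new replacement list is token-for-token
identical to the old one. A minimal sketch, using hypothetical DEMO_
prefixed names so that it cannot clash with any real header:

```c
/* First "header": glibc-style spelling. */
# define DEMO_RENAME_NOREPLACE (1 << 0)
# define DEMO_AT_RENAME_NOREPLACE DEMO_RENAME_NOREPLACE

/* Second "header": identical token sequences, so no diagnostic. */
# define DEMO_RENAME_NOREPLACE (1 << 0)
# define DEMO_AT_RENAME_NOREPLACE DEMO_RENAME_NOREPLACE

/*
 * By contrast, adding
 *	#define DEMO_AT_RENAME_NOREPLACE 0x0001
 * at this point would be diagnosed as a redefinition: the value is the
 * same, but the tokens are not.
 */

/* Either spelling still evaluates to the legacy flag value. */
_Static_assert(DEMO_AT_RENAME_NOREPLACE == 0x0001,
	       "alias must match the legacy RENAME_* value");
```

This is why spelling matters here and not just the numeric value: 0x0001
and RENAME_NOREPLACE are different token sequences even though they
evaluate to the same constant.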

Build errors being fixed:

for samples/vfs/test-statx.c:

In file included from ../samples/vfs/test-statx.c:23:
usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
  159 | #define AT_RENAME_NOREPLACE     0x0001
In file included from ../samples/vfs/test-statx.c:13:
/usr/include/stdio.h:171:10: note: this is the location of the previous definition
  171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
  160 | #define AT_RENAME_EXCHANGE      0x0002
/usr/include/stdio.h:173:10: note: this is the location of the previous definition
  173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
  161 | #define AT_RENAME_WHITEOUT      0x0004
/usr/include/stdio.h:175:10: note: this is the location of the previous definition
  175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT

for samples/watch_queue/watch_test.c:

In file included from usr/include/linux/watch_queue.h:6,
                 from ../samples/watch_queue/watch_test.c:19:
usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
  159 | #define AT_RENAME_NOREPLACE     0x0001
In file included from ../samples/watch_queue/watch_test.c:11:
/usr/include/stdio.h:171:10: note: this is the location of the previous definition
  171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
  160 | #define AT_RENAME_EXCHANGE      0x0002
/usr/include/stdio.h:173:10: note: this is the location of the previous definition
  173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
  161 | #define AT_RENAME_WHITEOUT      0x0004
/usr/include/stdio.h:175:10: note: this is the location of the previous definition
  175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT

Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
CC: linux-api@vger.kernel.org
To: linux-fsdevel@vger.kernel.org

 include/uapi/linux/fcntl.h |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- linux-next-20250819.orig/include/uapi/linux/fcntl.h
+++ linux-next-20250819/include/uapi/linux/fcntl.h
@@ -156,9 +156,12 @@
  */
 
 /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
-#define AT_RENAME_NOREPLACE	0x0001
-#define AT_RENAME_EXCHANGE	0x0002
-#define AT_RENAME_WHITEOUT	0x0004
+# define RENAME_NOREPLACE (1 << 0)
+# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
+# define RENAME_EXCHANGE (1 << 1)
+# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
+# define RENAME_WHITEOUT (1 << 2)
+# define AT_RENAME_WHITEOUT RENAME_WHITEOUT
 
 /* Flag for faccessat(2). */
 #define AT_EACCESS		0x200	/* Test access permitted for

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pasha Tatashin @ 2025-09-01 19:02 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Mike Rapoport, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs03496w0kk.fsf@kernel.org>

> >> > This really wants some luo helper
> >> >
> >> > 'luo alloc array'
> >> > 'luo restore array'
> >> > 'luo free array'
> >>
> >> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
> >
> > The patch looks okay to me, but it doesn't support holes in vmap
> > areas. While that is likely acceptable for vmalloc, it could be a
> > problem if we want to preserve memfd with holes and using vmap
> > preservation as a method, which would require a different approach.
> > Still, this would help with preserving memfd.
>
> I agree. I think we should do it the other way round. Build a sparse
> array first, and then use that to build vmap preservation. Our emails

Yes, sparse array support would help both: vmalloc and memfd preservation.

> seem to have crossed, but see my reply to Mike [0] that describes my
> idea a bit more, along with WIP code.
>
> [0] https://lore.kernel.org/lkml/mafs0ldmyw1hp.fsf@kernel.org/
>
> >
> > However, I wonder if we should add a separate preservation library on
> > top of the kho and not as part of kho (or at least keep them in a
> > separate file from core logic). This would allow us to preserve more
> > advanced data structures such as this and define preservation version
> > control, similar to Jason's store_object/restore_object proposal.
>
> This is how I have done it in my code: created a separate file called
> kho_array.c. If we have enough such data structures, we can probably
> move it under kernel/liveupdate/lib/.

Yes, let's place it under kernel/liveupdate/lib/. We will add more
preservation types over time.

> As for the store_object/restore_object proposal: see an alternate idea
> at [1].
>
> [1] https://lore.kernel.org/lkml/mafs0h5xmw12a.fsf@kernel.org/

What you are proposing makes sense. We can update the LUO API to be
responsible for passing the compatible string outside of the data
payload. However, I think we first need to settle on the actual API
for storing and restoring a versioned blob of data and place that code
into kernel/liveupdate/lib/. Depending on which API we choose, we can
then modify the LUO to work accordingly.

>
> --
> Regards,
> Pratyush Yadav


* Re: [RFC PATCH v4 1/7] kernel/api: introduce kernel API specification framework
From: Randy Dunlap @ 2025-09-01 17:23 UTC (permalink / raw)
  To: Sasha Levin, linux-api, linux-doc, linux-kernel, tools
In-Reply-To: <20250825181434.3340805-2-sashal@kernel.org>

Hi Sasha,


On 8/25/25 11:14 AM, Sasha Levin wrote:
> Add a comprehensive framework for formally documenting kernel APIs with
> inline specifications. This framework provides:
> 
> - Structured API documentation with parameter specifications, return
>   values, error conditions, and execution context requirements
> - Runtime validation capabilities for debugging (CONFIG_KAPI_RUNTIME_CHECKS)
> - Export of specifications via debugfs for tooling integration
> - Support for both internal kernel APIs and system calls
> 
> The framework stores specifications in a dedicated ELF section and
> provides infrastructure for:
> - Compile-time validation of specifications
> - Runtime querying of API documentation
> - Machine-readable export formats
> - Integration with existing SYSCALL_DEFINE macros
> 
> This commit introduces the core infrastructure without modifying any
> existing APIs. Subsequent patches will add specifications to individual
> subsystems.
> 
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>  .gitignore                                    |    1 +
>  Documentation/admin-guide/kernel-api-spec.rst |  507 ++++++

To me, none of this feels like Documentation/admin-guide/ material.
I don't think that many sysadmins will be using it.

Maybe Documentation/dev-tools/ ?
Closer to developer material than admin?


>  MAINTAINERS                                   |    9 +
>  arch/um/kernel/dyn.lds.S                      |    3 +
>  arch/um/kernel/uml.lds.S                      |    3 +
>  arch/x86/kernel/vmlinux.lds.S                 |    3 +
>  include/asm-generic/vmlinux.lds.h             |   20 +
>  include/linux/kernel_api_spec.h               | 1559 +++++++++++++++++
>  include/linux/syscall_api_spec.h              |  125 ++
>  include/linux/syscalls.h                      |   38 +
>  init/Kconfig                                  |    2 +
>  kernel/Makefile                               |    1 +
>  kernel/api/Kconfig                            |   35 +
>  kernel/api/Makefile                           |    7 +
>  kernel/api/kernel_api_spec.c                  | 1155 ++++++++++++
>  15 files changed, 3468 insertions(+)
>  create mode 100644 Documentation/admin-guide/kernel-api-spec.rst
>  create mode 100644 include/linux/kernel_api_spec.h
>  create mode 100644 include/linux/syscall_api_spec.h
>  create mode 100644 kernel/api/Kconfig
>  create mode 100644 kernel/api/Makefile
>  create mode 100644 kernel/api/kernel_api_spec.c
thanks.
-- 
~Randy



* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-01 17:21 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Mike Rapoport, Jason Gunthorpe, pratyush, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bC96fxHBb78DvNhyfdjsDfPCLY5J5cN8W0hUDt9KAPBJQ@mail.gmail.com>

Hi Pasha,

On Mon, Sep 01 2025, Pasha Tatashin wrote:

> On Mon, Sep 1, 2025 at 4:23 PM Mike Rapoport <rppt@kernel.org> wrote:
>>
>> On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
>> > On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>> >
>> > > +   /*
>> > > +    * Most of the space should be taken by preserved folios. So take its
>> > > +    * size, plus a page for other properties.
>> > > +    */
>> > > +   fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
>> > > +   if (!fdt) {
>> > > +           err = -ENOMEM;
>> > > +           goto err_unpin;
>> > > +   }
>> >
>> > This doesn't seem to have any versioning scheme, it really should..
>> >
>> > > +   err = fdt_property_placeholder(fdt, "folios", preserved_size,
>> > > +                                  (void **)&preserved_folios);
>> > > +   if (err) {
>> > > +           pr_err("Failed to reserve folios property in FDT: %s\n",
>> > > +                  fdt_strerror(err));
>> > > +           err = -ENOMEM;
>> > > +           goto err_free_fdt;
>> > > +   }
>> >
>> > Yuk.
>> >
>> > This really wants some luo helper
>> >
>> > 'luo alloc array'
>> > 'luo restore array'
>> > 'luo free array'
>>
>> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
>> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
>
> The patch looks okay to me, but it doesn't support holes in vmap
> areas. While that is likely acceptable for vmalloc, it could be a
> problem if we want to preserve memfd with holes and using vmap
> preservation as a method, which would require a different approach.
> Still, this would help with preserving memfd.

I agree. I think we should do it the other way round. Build a sparse
array first, and then use that to build vmap preservation. Our emails
seem to have crossed, but see my reply to Mike [0] that describes my
idea a bit more, along with WIP code.

[0] https://lore.kernel.org/lkml/mafs0ldmyw1hp.fsf@kernel.org/

>
> However, I wonder if we should add a separate preservation library on
> top of the kho and not as part of kho (or at least keep them in a
> separate file from core logic). This would allow us to preserve more
> advanced data structures such as this and define preservation version
> control, similar to Jason's store_object/restore_object proposal.

This is how I have done it in my code: created a separate file called
kho_array.c. If we have enough such data structures, we can probably
move it under kernel/liveupdate/lib/.

As for the store_object/restore_object proposal: see an alternate idea
at [1].

[1] https://lore.kernel.org/lkml/mafs0h5xmw12a.fsf@kernel.org/

-- 
Regards,
Pratyush Yadav


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Roberto Sassu @ 2025-09-01 17:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CALCETrUtJmWxKYSi6QQAGpQR_ETNfoBidCu_VEq8Lx9iJAOyEw@mail.gmail.com>

On Mon, 2025-09-01 at 09:25 -0700, Andy Lutomirski wrote:
> Can you clarify this a bit for those of us who are not well-versed in
> exactly what "measurement" does?
> 
> On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
> <roberto.sassu@huaweicloud.com> wrote:
> > > Now, in cases where you have IMA or something and you only permit signed
> > > binaries to execute, you could argue there is a different race here (an
> > > attacker creates a malicious script, runs it, and then replaces it with
> > > a valid script's contents and metadata after the fact to get
> > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
> > 
> > Uhm, let's consider measurement, which I'm more familiar with.
> > 
> > I think the race you wanted to express was that the attacker replaces
> > the good script, verified with AT_EXECVE_CHECK, with the bad script
> > after the IMA verification but before the interpreter reads it.
> > 
> > Fortunately, IMA is able to cope with this situation, since this race
> > can happen for any file open, where of course a file can be not read-
> > locked.
> 
> I assume you mean that this has nothing specifically to do with
> scripts, as IMA tries to protect ordinary (non-"execute" file access)
> as well.  Am I right?

Yes, correct, violations are checked for all open() and mmap()
involving regular files. It would not be special to do it for scripts.

> > If the attacker tries to concurrently open the script for write in this
> > race window, IMA will report this event (called violation) in the
> > measurement list, and during remote attestation it will be clear that
> > the interpreter did not read what was measured.
> > 
> > We just need to run the violation check for the BPRM_CHECK hook too
> > (then, probably for us the O_DENY_WRITE flag or alternative solution
> > would not be needed, for measurement).
> 
> This seems consistent with my interpretation above, but ...

The comment here [1] seems to be clear on why the violation check is
not done for execution (BPRM_CHECK hook). Since the OS read-locks the
files during execution, this implicitly guarantees that there will not
be concurrent writes, and thus no IMA violations.

However, recently, we took advantage of AT_EXECVE_CHECK to also
evaluate the integrity of scripts (when not executed via ./). Since we
are using the same hook for both executed files (read-locked) and
scripts (I guess non-read-locked), then we need to do a violation check
for BPRM_CHECK too, although it will be redundant for the first
category.

> > Please, let us know when you apply patches like 2a010c412853 ("fs:
> > don't block i_writecount during exec"). We had a discussion [1], but
> > probably I missed when it was decided to be applied (I saw now it was
> > in the same thread, but didn't get that at the time). We would have
> > needed to update our code accordingly. In the future, we will try to
> > clarify better our expectations from the VFS.
> 
> ... I didn't follow this.
> 
> Suppose there's some valid contents of /bin/sleep.  I execute
> /bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
> for write, not by replacing it), and the kernel in question doesn't do
> ETXTBSY.  Then the sleep process reads (and executes) the modified
> contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
> needed?

Ok, this is actually a good opportunity to explain what will be
missing. If you do the operations in the order you proposed, actually a
violation will be emitted, because the violating operation is an open()
and the check is done for this system call.

However, if you do the opposite, first open for write and then
execution, IMA will not be aware of that since it trusts the OS to not
make it happen and will not check for violations.

So yes, in your case the remote attestation will fail (actually it is
up to the remote verifier to decide...). But in the opposite case, the
writer could wait for IMA to measure the genuine content and then
modify the content conveniently. The remote attestation will succeed.

Adding the violation check on BPRM_CHECK should be sufficient to avoid
such a situation, but I would try to think about whether there are other
implications for IMA of not read-locking the files on execution.

Roberto

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/integrity/ima/ima_main.c?h=v6.17-rc4#n565


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-01 17:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250828124320.GB7333@nvidia.com>

Hi Jason,

On Thu, Aug 28 2025, Jason Gunthorpe wrote:

> On Wed, Aug 27, 2025 at 05:03:55PM +0200, Pratyush Yadav wrote:
>
>> I think we need something like a luo_xarray data structure that users like
>> memfd (and later hugetlb and guest_memfd and maybe others) can build to
>> make serialization easier. It will cover both contiguous arrays and
>> arrays with some holes in them.
>
> I'm not sure xarray is the right way to go, it is very complex data
> structure and building a kho variation of it seems like it is a huge
> amount of work.
>
> I'd stick with simple kvalloc type approaches until we really run into
> trouble.
>
> You can always map a sparse xarray into a kvalloc linear list by
> including the xarray index in each entry.
>
> Especially for memfd where we don't actually expect any sparsity in
> real use cases there is no reason to invest a huge effort to optimize
> for it..

Full xarray is too complex, sure. But I think a simple sparse array with
xarray-like properties (4-byte aligned pointers, values using xa_mk_value()) is
fairly simple to implement. More advanced features of xarray like
multi-index entries can be added later if needed.

In fact, I have a WIP version of such an array and have used it for
memfd preservation, and it looks quite alright to me. You can find the
code at [0]. It is roughly 300 lines of code. I still need to clean it
up to make it post-able, but it does work.

Building kvalloc on top of this becomes trivial.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=cf4c04c1e9ac854e3297018ad6dada17c54a59af
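
For comparison, the alternative Jason describes above (a plain linear list
where each entry carries its own index) can be sketched in userspace C. This
is only an illustration: struct flat_entry, flatten_sparse() and
restore_sparse() are hypothetical names, the "0 means hole" convention is an
assumption, and none of it is taken from the LUO or KHO patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout, not from the LUO/KHO patches. */
struct flat_entry {
	uint64_t index;	/* slot in the original sparse array */
	uint64_t value;	/* payload, e.g. a page frame number */
};

/*
 * Copy every non-empty slot into a dense list. A slot holding 0 is
 * treated as a hole (an assumption of this sketch). Returns the number
 * of entries written.
 */
static size_t flatten_sparse(const uint64_t *slots, size_t nr_slots,
			     struct flat_entry *out, size_t out_cap)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_slots; i++) {
		if (!slots[i])
			continue;
		assert(n < out_cap);
		out[n].index = i;
		out[n].value = slots[i];
		n++;
	}
	return n;
}

/* Rebuild the sparse array; slots not named in the list stay holes. */
static void restore_sparse(const struct flat_entry *in, size_t n,
			   uint64_t *slots, size_t nr_slots)
{
	for (size_t i = 0; i < nr_slots; i++)
		slots[i] = 0;
	for (size_t i = 0; i < n; i++) {
		assert(in[i].index < nr_slots);
		slots[in[i].index] = in[i].value;
	}
}
```

The list costs one extra index per entry, which is cheap when holes are
rare, as Jason expects for memfd.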

>
>> As I explained above, the versioning is already there. Beyond that, why
>> do you think a raw C struct is better than FDT? It is just another way
>> of expressing the same information. FDT is a bit more cumbersome to
>> write and read, but comes at the benefit of more introspect-ability.
>
> Doesn't have the size limitations, is easier to work with, runs
> faster.
>
>> >  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>> >  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);
>> 
>> I think what you describe here is essentially how LUO works currently,
>> just that the mechanisms are a bit different.
>
> The bit different is a very important bit though :)
>
> The versioning should be first class, not hidden away as some emergent
> property of registering multiple serializers or something like that.

That makes sense. How about some simple changes to the LUO interfaces to
make the version more prominent:

	int (*prepare)(struct liveupdate_file_handler *handler,
		       struct file *file, u64 *data, char **compatible);

This lets the subsystem fill in the compatible (AKA version; a string
here, but you can make it an integer if you want) when it serializes its
data.

And on restore side, LUO can pass in the compatible:

	int (*retrieve)(struct liveupdate_file_handler *handler,
			u64 data, char *compatible, struct file **file);
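
To make the intent concrete, here is a minimal userspace sketch of how a
handler might use such a pair of callbacks. Everything in it is assumed for
illustration: struct demo_state, the "demo-memfd-v1" string, and the
simplified signatures (no handler or file arguments, const-qualified
compatible) are not part of the proposal above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the state a subsystem carries over. */
struct demo_state {
	uint64_t size;
};

#define DEMO_COMPAT_V1 "demo-memfd-v1"	/* made-up compatible string */

/* Serialize: return an opaque u64 plus the format it was written in. */
static int demo_prepare(const struct demo_state *st, uint64_t *data,
			const char **compatible)
{
	assert(st && data && compatible);
	*data = st->size;
	*compatible = DEMO_COMPAT_V1;
	return 0;
}

/* Restore: reject payloads whose format this code does not understand. */
static int demo_retrieve(uint64_t data, const char *compatible,
			 struct demo_state *st)
{
	if (!compatible || strcmp(compatible, DEMO_COMPAT_V1) != 0)
		return -EINVAL;
	st->size = data;
	return 0;
}
```

Because the compatible travels outside the payload, the restore side can
refuse an unknown format before it parses any of the data.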


-- 
Regards,
Pratyush Yadav


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-01 17:01 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jason Gunthorpe, Pasha Tatashin, pratyush, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <aLXIcUwt0HVzRpYW@kernel.org>

Hi Mike,

On Mon, Sep 01 2025, Mike Rapoport wrote:

> On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
>> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>> 
>> > +	/*
>> > +	 * Most of the space should be taken by preserved folios. So take its
>> > +	 * size, plus a page for other properties.
>> > +	 */
>> > +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
>> > +	if (!fdt) {
>> > +		err = -ENOMEM;
>> > +		goto err_unpin;
>> > +	}
>> 
>> This doesn't seem to have any versioning scheme, it really should..
>> 
>> > +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
>> > +				       (void **)&preserved_folios);
>> > +	if (err) {
>> > +		pr_err("Failed to reserve folios property in FDT: %s\n",
>> > +		       fdt_strerror(err));
>> > +		err = -ENOMEM;
>> > +		goto err_free_fdt;
>> > +	}
>> 
>> Yuk.
>> 
>> This really wants some luo helper
>> 
>> 'luo alloc array'
>> 'luo restore array'
>> 'luo free array'
>
> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
>
> Will wait for kbuild and then send proper patches.

I have been working on something similar, but in a more generic way.

I have implemented a sparse KHO-preservable array (called kho_array)
with xarray like properties. It can take in 4-byte aligned pointers and
supports saving non-pointer values similar to xa_mk_value(). For now it
doesn't support multi-index entries, but if needed the data format can
be extended to support it as well.

The structure is very similar to what you have implemented. It uses a
linked list of pages with some metadata at the head of each page.

I have used it for memfd preservation, and I think it is quite
versatile. For example, your kho_preserve_vmalloc() can be very easily
built on top of this kho_array by simply saving each physical page
address at consecutive indices in the array.
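
As a rough illustration of how aligned pointers and plain values can share
one slot, here is a userspace sketch that mirrors the low-bit tagging
xa_mk_value() uses in the kernel XArray. The demo_* names are hypothetical,
and the actual kho_array entry format may well differ from this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One slot: either a 4-byte aligned pointer or a tagged integer. */
typedef uintptr_t demo_entry_t;

/* Store an integer; bit 0 marks the entry as a value (top bit is lost). */
static demo_entry_t demo_mk_value(uintptr_t v)
{
	return (v << 1) | 1;
}

static bool demo_is_value(demo_entry_t e)
{
	return e & 1;
}

static uintptr_t demo_to_value(demo_entry_t e)
{
	assert(demo_is_value(e));
	return e >> 1;
}

/* Store a pointer; alignment guarantees that the low two bits are 0. */
static demo_entry_t demo_mk_pointer(const void *p)
{
	assert(((uintptr_t)p & 3) == 0);
	return (uintptr_t)p;
}

static void *demo_to_pointer(demo_entry_t e)
{
	assert(!demo_is_value(e));
	return (void *)e;
}
```

The alignment requirement is what frees the low bits for tagging, so no
separate type field is needed per entry.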

The code is still WIP and currently a bit hacky, but I will clean it up
in a couple days and I think it should be ready for posting. You can
find the current version at [0][1]. Would be good to hear your thoughts,
and if you agree with the approach, I can also port
kho_preserve_vmalloc() to work on top of kho_array as well.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=cf4c04c1e9ac854e3297018ad6dada17c54a59af
[1] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=5eb0d7316274a9c87acaeedd86941979fc4baf96

-- 
Regards,
Pratyush Yadav


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pasha Tatashin @ 2025-09-01 16:54 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jason Gunthorpe, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <aLXIcUwt0HVzRpYW@kernel.org>

On Mon, Sep 1, 2025 at 4:23 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> >
> > > +   /*
> > > +    * Most of the space should be taken by preserved folios. So take its
> > > +    * size, plus a page for other properties.
> > > +    */
> > > +   fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> > > +   if (!fdt) {
> > > +           err = -ENOMEM;
> > > +           goto err_unpin;
> > > +   }
> >
> > This doesn't seem to have any versioning scheme, it really should..
> >
> > > +   err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > > +                                  (void **)&preserved_folios);
> > > +   if (err) {
> > > +           pr_err("Failed to reserve folios property in FDT: %s\n",
> > > +                  fdt_strerror(err));
> > > +           err = -ENOMEM;
> > > +           goto err_free_fdt;
> > > +   }
> >
> > Yuk.
> >
> > This really wants some luo helper
> >
> > 'luo alloc array'
> > 'luo restore array'
> > 'luo free array'
>
> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1

The patch looks okay to me, but it doesn't support holes in vmap
areas. While that is likely acceptable for vmalloc, it could be a
problem if we want to preserve memfd with holes and using vmap
preservation as a method, which would require a different approach.
Still, this would help with preserving memfd.

However, I wonder if we should add a separate preservation library on
top of the kho and not as part of kho (or at least keep them in a
separate file from core logic). This would allow us to preserve more
advanced data structures such as this and define preservation version
control, similar to Jason's store_object/restore_object proposal.

>
> Will wait for kbuild and then send proper patches.
>
>
> --
> Sincerely yours,
> Mike.


* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-09-01 16:25 UTC (permalink / raw)
  To: Roberto Sassu
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <54e27d05bae55749a975bc7cbe109b237b2b1323.camel@huaweicloud.com>

Can you clarify this a bit for those of us who are not well-versed in
exactly what "measurement" does?

On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
<roberto.sassu@huaweicloud.com> wrote:
> > Now, in cases where you have IMA or something and you only permit signed
> > binaries to execute, you could argue there is a different race here (an
> > attacker creates a malicious script, runs it, and then replaces it with
> > a valid script's contents and metadata after the fact to get
> > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
>
> Uhm, let's consider measurement, which I'm more familiar with.
>
> I think the race you wanted to express was that the attacker replaces
> the good script, verified with AT_EXECVE_CHECK, with the bad script
> after the IMA verification but before the interpreter reads it.
>
> Fortunately, IMA is able to cope with this situation, since this race
> can happen for any file open, where of course a file can be not read-
> locked.

I assume you mean that this has nothing specifically to do with
scripts, as IMA tries to protect ordinary (non-"execute" file access)
as well.  Am I right?

>
> If the attacker tries to concurrently open the script for write in this
> race window, IMA will report this event (called violation) in the
> measurement list, and during remote attestation it will be clear that
> the interpreter did not read what was measured.
>
> We just need to run the violation check for the BPRM_CHECK hook too
> (then, probably for us the O_DENY_WRITE flag or alternative solution
> would not be needed, for measurement).

This seems consistent with my interpretation above, but ...

>
> Please, let us know when you apply patches like 2a010c412853 ("fs:
> don't block i_writecount during exec"). We had a discussion [1], but
> probably I missed when it was decided to be applied (I saw now it was
> in the same thread, but didn't get that at the time). We would have
> needed to update our code accordingly. In the future, we will try to
> clarify better our expectations from the VFS.

... I didn't follow this.

Suppose there's some valid contents of /bin/sleep.  I execute
/bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
for write, not by replacing it), and the kernel in question doesn't do
ETXTBSY.  Then the sleep process reads (and executes) the modified
contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
needed?
