Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 11:47 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198cc299cd9.eec1817f85794.4679093070969175955@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1361 bytes --]

On 2025-08-21, Askar Safin <safinaskar@zohomail.com> wrote:
> There is a convention: you can pass invalid fd (such as -1) as dfd to *at-syscalls to enforce that the path is absolute.
> This is documented. "man openat" says: "Specifying an invalid file descriptor number in dirfd can be used as a means to ensure that pathname is absolute".
> But fsconfig with FSCONFIG_SET_PATH breaks this convention due to this line: https://elixir.bootlin.com/linux/v6.16/source/fs/fsopen.c#L377 .
> I think this is a bug, and it should be fixed in kernel. Also, it is possible there are a lot of similarly buggy syscalls. All of them should be fixed,
> and moreover a warning should be added to https://docs.kernel.org/process/adding-syscalls.html . And then new fsconfig behavior should be documented.
> (Of course, I'm not saying that *you* should do all these. I'm just saying that this bug exists.) (I tested this.)

Indeed, good catch! I think we discussed this before --
FSCONFIG_SET_PATH actually doesn't work with any parameters today so
it's not very surprising nobody has noticed this until now. I'll include
it in the set of fixes I have for fscontext.

(FWIW, the convention I see more commonly is -EBADF but that's just a
stylistic I suppose.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-21 11:44 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198cc025823.ea44e3f585444.6907980660506284461@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1750 bytes --]

On 2025-08-21, Askar Safin <safinaskar@zohomail.com> wrote:
>  ---- On Tue, 12 Aug 2025 22:25:40 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
>  > On 2025-08-09, Aleksa Sarai <cyphar@cyphar.com> wrote:
>  > > +Note that the Linux kernel reuses filesystem instances
>  > > +for many filesystems,
>  > > +so (depending on the filesystem being configured and parameters used)
>  > > +it is possible for the filesystem instance "created" by
>  > > +.B \%FSCONFIG_CMD_CREATE
>  > > +to, in fact, be a reference
>  > > +to an existing filesystem instance in the kernel.
>  > > +The kernel will attempt to merge the specified parameters
>  > > +of this filesystem configuration context
>  > > +with those of the filesystem instance being reused,
>  > > +but some parameters may be
>  > > +.IR "silently ignored" .
>  > 
>  > While looking at this again, I realised this explanation is almost
>  > certainly incorrect in a few places (and was based on a misunderstanding
>  > of how sget_fc() works and how it interacts with vfs_get_tree()).
>  > 
>  > I'll rewrite this in the next version.
> 
> This recent patch seems to be relevant:
> https://lore.kernel.org/all/20250816-debugfs-mount-opts-v3-1-d271dad57b5b@posteo.net/

I'm aware of that, I was in one of the previous threads. There are some
deeper consistency issues that I'm writing patches for at the moment.

I'm of two minds whether I should fix the behaviour and then re-send
man-pages with updated text (delaying the next round of man-page reviews
by a month) or just reduce the specificity of this text and then add
more details after it has been fixed.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 09/12] man/man2/open_tree.2: document "new" mount API
From: Christian Brauner @ 2025-08-21 11:33 UTC (permalink / raw)
  To: Askar Safin
  Cc: Aleksa Sarai, Alejandro Colomar, Michael T. Kerrisk,
	Alexander Viro, Jan Kara, G. Branden Robinson, linux-man,
	linux-api, linux-fsdevel, linux-kernel, David Howells
In-Reply-To: <198cc623944.11ea2eb5d86377.2604785241030508275@zohomail.com>

On Thu, Aug 21, 2025 at 03:27:26PM +0400, Askar Safin wrote:
> man open_tree says:
> > mount propagation
> > (as described in
> > .BR mount_namespaces (7))
> > will not be applied to bind-mounts created by
> > .BR open_tree ()
> > until the bind-mount is attached with
> > .BR move_mount (2),
> 
> It seems this is wrong, because this commit exists: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=06b1ce966e3f8bfef261c111feb3d4b33ede0cd8 .
> I'm not sure about this. (I didn't test this.)

No, it's correct. I reverted this because it broke userspace that relies
on this behavior.

^ permalink raw reply

* Re: [PATCH v3 09/12] man/man2/open_tree.2: document "new" mount API
From: Askar Safin @ 2025-08-21 11:27 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-9-f61405c80f34@cyphar.com>

man open_tree says:
> mount propagation
> (as described in
> .BR mount_namespaces (7))
> will not be applied to bind-mounts created by
> .BR open_tree ()
> until the bind-mount is attached with
> .BR move_mount (2),

It seems this is wrong, because this commit exists: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=06b1ce966e3f8bfef261c111feb3d4b33ede0cd8 .
I'm not sure about this. (I didn't test this.)



--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Askar Safin @ 2025-08-21 10:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250809-new-mount-api-v3-6-f61405c80f34@cyphar.com>

There is a convention: you can pass invalid fd (such as -1) as dfd to *at-syscalls to enforce that the path is absolute.
This is documented. "man openat" says: "Specifying an invalid file descriptor number in dirfd can be used as a means to ensure that pathname is absolute".
But fsconfig with FSCONFIG_SET_PATH breaks this convention due to this line: https://elixir.bootlin.com/linux/v6.16/source/fs/fsopen.c#L377 .
I think this is a bug, and it should be fixed in kernel. Also, it is possible there are a lot of similarly buggy syscalls. All of them should be fixed,
and moreover a warning should be added to https://docs.kernel.org/process/adding-syscalls.html . And then new fsconfig behavior should be documented.
(Of course, I'm not saying that *you* should do all these. I'm just saying that this bug exists.) (I tested this.)

--
Askar Safin
https://types.pl/@safinaskar

^ permalink raw reply

* Re: [PATCH v3 06/12] man/man2/fsconfig.2: document "new" mount API
From: Askar Safin @ 2025-08-21  9:42 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-12.1755022847-yummy-native-bandage-dorm-8U46ME@cyphar.com>

 ---- On Tue, 12 Aug 2025 22:25:40 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > On 2025-08-09, Aleksa Sarai <cyphar@cyphar.com> wrote:
 > > +Note that the Linux kernel reuses filesystem instances
 > > +for many filesystems,
 > > +so (depending on the filesystem being configured and parameters used)
 > > +it is possible for the filesystem instance "created" by
 > > +.B \%FSCONFIG_CMD_CREATE
 > > +to, in fact, be a reference
 > > +to an existing filesystem instance in the kernel.
 > > +The kernel will attempt to merge the specified parameters
 > > +of this filesystem configuration context
 > > +with those of the filesystem instance being reused,
 > > +but some parameters may be
 > > +.IR "silently ignored" .
 > 
 > While looking at this again, I realised this explanation is almost
 > certainly incorrect in a few places (and was based on a misunderstanding
 > of how sget_fc() works and how it interacts with vfs_get_tree()).
 > 
 > I'll rewrite this in the next version.

This recent patch seems to be relevant:
https://lore.kernel.org/all/20250816-debugfs-mount-opts-v3-1-d271dad57b5b@posteo.net/

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Christian Brauner @ 2025-08-21  8:24 UTC (permalink / raw)
  To: Lichen Liu
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, safinaskar, kexec,
	rob, weilongchen, cyphar, linux-api, zohar, stefanb, initramfs,
	corbet, linux-doc, viro, jack
In-Reply-To: <20250815121459.3391223-1-lichliu@redhat.com>

On Fri, 15 Aug 2025 20:14:59 +0800, Lichen Liu wrote:
> When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> By default, a tmpfs mount is limited to using 50% of the available RAM
> for its content. This can be problematic in memory-constrained
> environments, particularly during a kdump capture.
> 
> In a kdump scenario, the capture kernel boots with a limited amount of
> memory specified by the 'crashkernel' parameter. If the initramfs is
> large, it may fail to unpack into the tmpfs rootfs due to insufficient
> space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> memory must be available for the mount. This leads to an OOM failure
> during the early boot process, preventing a successful crash dump.
> 
> [...]

This seems rather useful but I've renamed "rootfsflags" to
"initramfs_options" because "rootfsflags" is ambiguous and it's not
really just about flags.

Other than that I think it would make sense to just raise the limit to
90% for the root_fs_type mount. I'm not sure why this super privileged
code would only be allowed 50% by default.

---

Applied to the vfs-6.18.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.misc

[1/1] fs: Add 'rootfsflags' to set rootfs mount options
      https://git.kernel.org/vfs/vfs/c/278033a225e1

^ permalink raw reply

* Re: [PATCH v19 2/8] Documentation: userspace-api: Add shadow stack API documentation
From: Randy Dunlap @ 2025-08-20 23:15 UTC (permalink / raw)
  To: Mark Brown, Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy,
	H.J. Lu, Florian Weimer, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Christian Brauner,
	Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api,
	Kees Cook, Shuah Khan
In-Reply-To: <20250819-clone3-shadow-stack-v19-2-bc957075479b@kernel.org>



On 8/19/25 9:21 AM, Mark Brown wrote:
> There are a number of architectures with shadow stack features which we are
> presenting to userspace with as consistent an API as we can (though there
> are some architecture specifics). Especially given that there are some
> important considerations for userspace code interacting directly with the
> feature let's provide some documentation covering the common aspects.
> 

> ---
>  Documentation/userspace-api/index.rst        |  1 +
>  Documentation/userspace-api/shadow_stack.rst | 44 ++++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+)
> 

> diff --git a/Documentation/userspace-api/shadow_stack.rst b/Documentation/userspace-api/shadow_stack.rst
> new file mode 100644
> index 000000000000..65c665496624
> --- /dev/null
> +++ b/Documentation/userspace-api/shadow_stack.rst
> @@ -0,0 +1,44 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Shadow Stacks
> +=============
> +
> +Introduction
> +============
> +
> +Several architectures have features which provide backward edge
> +control flow protection through a hardware maintained stack, only
> +writeable by userspace through very limited operations.  This feature

$internet says "writable"

> +is referred to as shadow stacks on Linux, on x86 it is part of Intel

                                      Linux. On

> +Control Enforcement Technology (CET), on arm64 it is Guarded Control
> +Stacks feature (FEAT_GCS) and for RISC-V it is the Zicfiss extension.
> +It is expected that this feature will normally be managed by the
> +system dynamic linker and libc in ways broadly transparent to
> +application code, this document covers interfaces and considerations.

               code. This

> +
> +
> +Enabling
> +========
> +
> +Shadow stacks default to disabled when a userspace process is
> +executed, they can be enabled for the current thread with a syscall:

   executed. They

> +
> + - For x86 the ARCH_SHSTK_ENABLE arch_prctl()
> + - For other architectures the PR_SET_SHADOW_STACK_ENABLE prctl()
> +
> +It is expected that this will normally be done by the dynamic linker.
> +Any new threads created by a thread with shadow stacks enabled will
> +themselves have shadow stacks enabled.
> +
> +
> +Enablement considerations
> +=========================
> +
> +- Returning from the function that enables shadow stacks without first
> +  disabling them will cause a shadow stack exception.  This includes
> +  any syscall wrapper or other library functions, the syscall will need

                                          functions; the

> +  to be inlined.
> +- A lock feature allows userspace to prevent disabling of shadow stacks.
> +- Those that change the stack context like longjmp() or use of ucontext
> +  changes on signal return will need support from libc.
> 
-- 
~Randy


^ permalink raw reply

* Re: [PATCH v5 2/3] lsm: introduce security_lsm_config_*_policy hooks
From: Casey Schaufler @ 2025-08-20 15:30 UTC (permalink / raw)
  To: Mickaël Salaün, Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, kees,
	stephen.smalley.work, takedakn, penguin-kernel, song, rdunlap,
	linux-api, apparmor, linux-kernel, Casey Schaufler
In-Reply-To: <20250820.Ao3iquoshaiB@digikod.net>

On 8/20/2025 7:21 AM, Mickaël Salaün wrote:
> On Wed, Jul 09, 2025 at 10:00:55AM +0200, Maxime Bélair wrote:
>> Define two new LSM hooks: security_lsm_config_self_policy and
>> security_lsm_config_system_policy and wire them into the corresponding
>> lsm_config_*_policy() syscalls so that LSMs can register a unified
>> interface for policy management. This initial, minimal implementation
>> only supports the LSM_POLICY_LOAD operation to limit changes.
>>
>> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
>> ---
>>  include/linux/lsm_hook_defs.h |  4 +++
>>  include/linux/security.h      | 20 ++++++++++++
>>  include/uapi/linux/lsm.h      |  8 +++++
>>  security/lsm_syscalls.c       | 17 ++++++++--
>>  security/security.c           | 60 +++++++++++++++++++++++++++++++++++
>>  5 files changed, 107 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
>> index bf3bbac4e02a..fca490444643 100644
>> --- a/include/linux/lsm_hook_defs.h
>> +++ b/include/linux/lsm_hook_defs.h
>> @@ -464,3 +464,7 @@ LSM_HOOK(int, 0, bdev_alloc_security, struct block_device *bdev)
>>  LSM_HOOK(void, LSM_RET_VOID, bdev_free_security, struct block_device *bdev)
>>  LSM_HOOK(int, 0, bdev_setintegrity, struct block_device *bdev,
>>  	 enum lsm_integrity_type type, const void *value, size_t size)
>> +LSM_HOOK(int, -EINVAL, lsm_config_self_policy, u32 lsm_id, u32 op,
>> +	 void __user *buf, size_t size, u32 flags)
>> +LSM_HOOK(int, -EINVAL, lsm_config_system_policy, u32 lsm_id, u32 op,
>> +	 void __user *buf, size_t size, u32 flags)
>> diff --git a/include/linux/security.h b/include/linux/security.h
>> index cc9b54d95d22..54acaee4a994 100644
>> --- a/include/linux/security.h
>> +++ b/include/linux/security.h
>> @@ -581,6 +581,11 @@ void security_bdev_free(struct block_device *bdev);
>>  int security_bdev_setintegrity(struct block_device *bdev,
>>  			       enum lsm_integrity_type type, const void *value,
>>  			       size_t size);
>> +int security_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
>> +				    size_t size, u32 flags);
>> +int security_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
>> +				      size_t size, u32 flags);
>> +
>>  #else /* CONFIG_SECURITY */
>>  
>>  /**
>> @@ -1603,6 +1608,21 @@ static inline int security_bdev_setintegrity(struct block_device *bdev,
>>  	return 0;
>>  }
>>  
>> +static inline int security_lsm_config_self_policy(u32 lsm_id, u32 op,
>> +						  void __user *buf,
>> +						  size_t size, u32 flags)
>> +{
>> +
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static inline int security_lsm_config_system_policy(u32 lsm_id, u32 op,
>> +						    void __user *buf,
>> +						    size_t size, u32 flags)
>> +{
>> +
>> +	return -EOPNOTSUPP;
>> +}
>>  #endif	/* CONFIG_SECURITY */
>>  
>>  #if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
>> diff --git a/include/uapi/linux/lsm.h b/include/uapi/linux/lsm.h
>> index 938593dfd5da..2b9432a30cdc 100644
>> --- a/include/uapi/linux/lsm.h
>> +++ b/include/uapi/linux/lsm.h
>> @@ -90,4 +90,12 @@ struct lsm_ctx {
>>   */
>>  #define LSM_FLAG_SINGLE	0x0001
>>  
>> +/*
>> + * LSM_POLICY_XXX definitions identify the different operations
>> + * to configure LSM policies
>> + */
>> +
>> +#define LSM_POLICY_UNDEF	0
>> +#define LSM_POLICY_LOAD		100
> Why the gap between 0 and 100?

It's conventional in LSM syscalls to start identifiers at 100.
No compelling reason other than to appease the LSM maintainer.

>
>> +
>>  #endif /* _UAPI_LINUX_LSM_H */
>> diff --git a/security/lsm_syscalls.c b/security/lsm_syscalls.c
>> index a3cb6dab8102..dd016ba6976c 100644
>> --- a/security/lsm_syscalls.c
>> +++ b/security/lsm_syscalls.c
>> @@ -122,11 +122,24 @@ SYSCALL_DEFINE3(lsm_list_modules, u64 __user *, ids, u32 __user *, size,
>>  SYSCALL_DEFINE5(lsm_config_self_policy, u32, lsm_id, u32, op, void __user *,
>>  		buf, u32 __user *, size, u32, flags)
> Given these are a multiplexor syscalls, I'm wondering if they should not
> have common flags and LSM-specific flags.  Alternatively, the op
> argument could also contains some optional flags.  In either case, the
> documentation should guide LSM developers for flags that may be shared
> amongst LSMs.
>
> Examples of such flags could be to restrict the whole process instead of
> the calling thread.
>
>>  {
>> -	return 0;
>> +	size_t usize;
>> +
>> +	if (get_user(usize, size))
> Size should just be u32, not a pointer.
>
>> +		return -EFAULT;
>> +
>> +	return security_lsm_config_self_policy(lsm_id, op, buf, usize, flags);
>>  }
>>  
>>  SYSCALL_DEFINE5(lsm_config_system_policy, u32, lsm_id, u32, op, void __user *,
>>  		buf, u32 __user *, size, u32, flags)
>>  {
>> -	return 0;
>> +	size_t usize;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
> I like this mandatory capability check for this specific syscall.  This
> makes the semantic clearer.  However, to avoid the superpower of
> CAP_SYS_ADMIN, I'm wondering how we could use the CAP_MAC_ADMIN instead.
> This syscall could require CAP_MAC_ADMIN, and current LSMs (relying on a
> filesystem interface for policy configuration) could also enforce
> CAP_SYS_ADMIN for compatibility reasons.
>
> In fact, this "system" syscall could be a "namespace" syscall, which
> would take a security/LSM namespace file descriptor as argument.  If the
> namespace is not the initial namespace, any CAP_SYS_ADMIN implemented by
> current LSMs could be avoided.  See
> https://lore.kernel.org/r/CAHC9VhRGMmhxbajwQNfGFy+ZFF1uN=UEBjqQZQ4UBy7yds3eVQ@mail.gmail.com
>
>> +
>> +	if (get_user(usize, size))
> ditto
>
>> +		return -EFAULT;
>> +
>> +	return security_lsm_config_system_policy(lsm_id, op, buf, usize, flags);
>>  }
>> diff --git a/security/security.c b/security/security.c
>> index fb57e8fddd91..166d7d9936d0 100644
>> --- a/security/security.c
>> +++ b/security/security.c
>> @@ -5883,6 +5883,66 @@ int security_bdev_setintegrity(struct block_device *bdev,
>>  }
>>  EXPORT_SYMBOL(security_bdev_setintegrity);
>>  
>> +/**
>> + * security_lsm_config_self_policy() - Configure caller's LSM policies
>> + * @lsm_id: id of the LSM to target
>> + * @op: Operation to perform (one of the LSM_POLICY_XXX values)
>> + * @buf: userspace pointer to policy data
>> + * @size: size of @buf
>> + * @flags: lsm policy configuration flags
>> + *
>> + * Configure the policies of a LSM for the current domain/user. This notably
>> + * allows to update them even when the lsmfs is unavailable or restricted.
>> + * Currently, only LSM_POLICY_LOAD is supported.
>> + *
>> + * Return: Returns 0 on success, error on failure.
>> + */
>> +int security_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
>> +				 size_t size, u32 flags)
>> +{
>> +	int rc = LSM_RET_DEFAULT(lsm_config_self_policy);
>> +	struct lsm_static_call *scall;
>> +
>> +	lsm_for_each_hook(scall, lsm_config_self_policy) {
>> +		if ((scall->hl->lsmid->id) == lsm_id) {
>> +			rc = scall->hl->hook.lsm_config_self_policy(lsm_id, op, buf, size, flags);
> The lsm_id should not be passed to the hook.
>
> The LSM syscall should manage the argument copy and buffer allocation
> instead of duplicating this code in each LSM hook implementation (see
> other LSM syscalls).
>
>> +			break;
>> +		}
>> +	}
>> +
>> +	return rc;
>> +}
>> +
>> +/**
>> + * security_lsm_config_system_policy() - Configure system LSM policies
>> + * @lsm_id: id of the lsm to target
>> + * @op: Operation to perform (one of the LSM_POLICY_XXX values)
>> + * @buf: userspace pointer to policy data
>> + * @size: size of @buf
>> + * @flags: lsm policy configuration flags
>> + *
>> + * Configure the policies of a LSM for the whole system. This notably allows
>> + * to update them even when the lsmfs is unavailable or restricted. Currently,
>> + * only LSM_POLICY_LOAD is supported.
>> + *
>> + * Return: Returns 0 on success, error on failure.
>> + */
>> +int security_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
>> +				   size_t size, u32 flags)
>> +{
>> +	int rc = LSM_RET_DEFAULT(lsm_config_system_policy);
>> +	struct lsm_static_call *scall;
>> +
>> +	lsm_for_each_hook(scall, lsm_config_system_policy) {
>> +		if ((scall->hl->lsmid->id) == lsm_id) {
>> +			rc = scall->hl->hook.lsm_config_system_policy(lsm_id, op, buf, size, flags);
> ditto
>
>> +			break;
>> +		}
>> +	}
>> +
>> +	return rc;
>> +}
>> +
>>  #ifdef CONFIG_PERF_EVENTS
>>  /**
>>   * security_perf_event_open() - Check if a perf event open is allowed
>> -- 
>> 2.48.1
>>
>>

^ permalink raw reply

* Re: [PATCH v2] fs: Add 'rootfsflags' to set rootfs mount options
From: Lichen Liu @ 2025-08-20 15:17 UTC (permalink / raw)
  To: viro, brauner, jack
  Cc: linux-fsdevel, linux-kernel, safinaskar, kexec, rob, weilongchen,
	cyphar, linux-api, zohar, stefanb, initramfs, corbet, linux-doc
In-Reply-To: <20250815121459.3391223-1-lichliu@redhat.com>

Hi all, do you have any comments for this v2 patch?

Thanks,
Lichen

On Fri, Aug 15, 2025 at 8:15 PM Lichen Liu <lichliu@redhat.com> wrote:
>
> When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
> By default, a tmpfs mount is limited to using 50% of the available RAM
> for its content. This can be problematic in memory-constrained
> environments, particularly during a kdump capture.
>
> In a kdump scenario, the capture kernel boots with a limited amount of
> memory specified by the 'crashkernel' parameter. If the initramfs is
> large, it may fail to unpack into the tmpfs rootfs due to insufficient
> space. This is because to get X MB of usable space in tmpfs, 2*X MB of
> memory must be available for the mount. This leads to an OOM failure
> during the early boot process, preventing a successful crash dump.
>
> This patch introduces a new kernel command-line parameter, rootfsflags,
> which allows passing specific mount options directly to the rootfs when
> it is first mounted. This gives users control over the rootfs behavior.
>
> For example, a user can now specify rootfsflags=size=75% to allow the
> tmpfs to use up to 75% of the available memory. This can significantly
> reduce the memory pressure for kdump.
>
> Consider a practical example:
>
> To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
> the default 50% limit, this requires a memory pool of 96MB to be
> available for the tmpfs mount. The total memory requirement is therefore
> approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
> kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
>
> By using rootfsflags=size=75%, the memory pool required for the 48MB
> tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory
> requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a
> smaller crashkernel size, such as 192MB.
>
> An alternative approach of reusing the existing rootflags parameter was
> considered. However, a new, dedicated rootfsflags parameter was chosen
> to avoid altering the current behavior of rootflags (which applies to
> the final root filesystem) and to prevent any potential regressions.
>
> Also add documentation for the new kernel parameter "rootfsflags"
>
> This approach is inspired by prior discussions and patches on the topic.
> Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
> Ref: https://landley.net/notes-2015.html#01-01-2015
> Ref: https://lkml.org/lkml/2021/6/29/783
> Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
>
> Signed-off-by: Lichen Liu <lichliu@redhat.com>
> Tested-by: Rob Landley <rob@landley.net>
> ---
> Changes in v2:
>   - Add documentation for the new kernel parameter.
>
>  Documentation/admin-guide/kernel-parameters.txt |  3 +++
>  fs/namespace.c                                  | 11 ++++++++++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index fb8752b42ec8..0c00f651d431 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6220,6 +6220,9 @@
>
>         rootflags=      [KNL] Set root filesystem mount option string
>
> +       rootfsflags=    [KNL] Set initial root filesystem mount option string
> +                       (e.g. tmpfs for initramfs)
> +
>         rootfstype=     [KNL] Set root filesystem type
>
>         rootwait        [KNL] Wait (indefinitely) for root device to show up.
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 8f1000f9f3df..e484c26d5e3f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -65,6 +65,15 @@ static int __init set_mphash_entries(char *str)
>  }
>  __setup("mphash_entries=", set_mphash_entries);
>
> +static char * __initdata rootfs_flags;
> +static int __init rootfs_flags_setup(char *str)
> +{
> +       rootfs_flags = str;
> +       return 1;
> +}
> +
> +__setup("rootfsflags=", rootfs_flags_setup);
> +
>  static u64 event;
>  static DEFINE_XARRAY_FLAGS(mnt_id_xa, XA_FLAGS_ALLOC);
>  static DEFINE_IDA(mnt_group_ida);
> @@ -5677,7 +5686,7 @@ static void __init init_mount_tree(void)
>         struct mnt_namespace *ns;
>         struct path root;
>
> -       mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL);
> +       mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", rootfs_flags);
>         if (IS_ERR(mnt))
>                 panic("Can't create rootfs");
>
> --
> 2.47.0
>


^ permalink raw reply

* Re: [PATCH v5 2/3] lsm: introduce security_lsm_config_*_policy hooks
From: Mickaël Salaün @ 2025-08-20 14:21 UTC (permalink / raw)
  To: Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, kees,
	stephen.smalley.work, casey, takedakn, penguin-kernel, song,
	rdunlap, linux-api, apparmor, linux-kernel
In-Reply-To: <20250709080220.110947-3-maxime.belair@canonical.com>

On Wed, Jul 09, 2025 at 10:00:55AM +0200, Maxime Bélair wrote:
> Define two new LSM hooks: security_lsm_config_self_policy and
> security_lsm_config_system_policy and wire them into the corresponding
> lsm_config_*_policy() syscalls so that LSMs can register a unified
> interface for policy management. This initial, minimal implementation
> only supports the LSM_POLICY_LOAD operation to limit changes.
> 
> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
> ---
>  include/linux/lsm_hook_defs.h |  4 +++
>  include/linux/security.h      | 20 ++++++++++++
>  include/uapi/linux/lsm.h      |  8 +++++
>  security/lsm_syscalls.c       | 17 ++++++++--
>  security/security.c           | 60 +++++++++++++++++++++++++++++++++++
>  5 files changed, 107 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
> index bf3bbac4e02a..fca490444643 100644
> --- a/include/linux/lsm_hook_defs.h
> +++ b/include/linux/lsm_hook_defs.h
> @@ -464,3 +464,7 @@ LSM_HOOK(int, 0, bdev_alloc_security, struct block_device *bdev)
>  LSM_HOOK(void, LSM_RET_VOID, bdev_free_security, struct block_device *bdev)
>  LSM_HOOK(int, 0, bdev_setintegrity, struct block_device *bdev,
>  	 enum lsm_integrity_type type, const void *value, size_t size)
> +LSM_HOOK(int, -EINVAL, lsm_config_self_policy, u32 lsm_id, u32 op,
> +	 void __user *buf, size_t size, u32 flags)
> +LSM_HOOK(int, -EINVAL, lsm_config_system_policy, u32 lsm_id, u32 op,
> +	 void __user *buf, size_t size, u32 flags)
> diff --git a/include/linux/security.h b/include/linux/security.h
> index cc9b54d95d22..54acaee4a994 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -581,6 +581,11 @@ void security_bdev_free(struct block_device *bdev);
>  int security_bdev_setintegrity(struct block_device *bdev,
>  			       enum lsm_integrity_type type, const void *value,
>  			       size_t size);
> +int security_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
> +				    size_t size, u32 flags);
> +int security_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
> +				      size_t size, u32 flags);
> +
>  #else /* CONFIG_SECURITY */
>  
>  /**
> @@ -1603,6 +1608,21 @@ static inline int security_bdev_setintegrity(struct block_device *bdev,
>  	return 0;
>  }
>  
> +static inline int security_lsm_config_self_policy(u32 lsm_id, u32 op,
> +						  void __user *buf,
> +						  size_t size, u32 flags)
> +{
> +
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline int security_lsm_config_system_policy(u32 lsm_id, u32 op,
> +						    void __user *buf,
> +						    size_t size, u32 flags)
> +{
> +
> +	return -EOPNOTSUPP;
> +}
>  #endif	/* CONFIG_SECURITY */
>  
>  #if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
> diff --git a/include/uapi/linux/lsm.h b/include/uapi/linux/lsm.h
> index 938593dfd5da..2b9432a30cdc 100644
> --- a/include/uapi/linux/lsm.h
> +++ b/include/uapi/linux/lsm.h
> @@ -90,4 +90,12 @@ struct lsm_ctx {
>   */
>  #define LSM_FLAG_SINGLE	0x0001
>  
> +/*
> + * LSM_POLICY_XXX definitions identify the different operations
> + * to configure LSM policies
> + */
> +
> +#define LSM_POLICY_UNDEF	0
> +#define LSM_POLICY_LOAD		100

Why the gap between 0 and 100?

> +
>  #endif /* _UAPI_LINUX_LSM_H */
> diff --git a/security/lsm_syscalls.c b/security/lsm_syscalls.c
> index a3cb6dab8102..dd016ba6976c 100644
> --- a/security/lsm_syscalls.c
> +++ b/security/lsm_syscalls.c
> @@ -122,11 +122,24 @@ SYSCALL_DEFINE3(lsm_list_modules, u64 __user *, ids, u32 __user *, size,
>  SYSCALL_DEFINE5(lsm_config_self_policy, u32, lsm_id, u32, op, void __user *,
>  		buf, u32 __user *, size, u32, flags)

Given these are a multiplexor syscalls, I'm wondering if they should not
have common flags and LSM-specific flags.  Alternatively, the op
argument could also contains some optional flags.  In either case, the
documentation should guide LSM developers for flags that may be shared
amongst LSMs.

Examples of such flags could be to restrict the whole process instead of
the calling thread.

>  {
> -	return 0;
> +	size_t usize;
> +
> +	if (get_user(usize, size))

Size should just be u32, not a pointer.

> +		return -EFAULT;
> +
> +	return security_lsm_config_self_policy(lsm_id, op, buf, usize, flags);
>  }
>  
>  SYSCALL_DEFINE5(lsm_config_system_policy, u32, lsm_id, u32, op, void __user *,
>  		buf, u32 __user *, size, u32, flags)
>  {
> -	return 0;
> +	size_t usize;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;

I like this mandatory capability check for this specific syscall.  This
makes the semantic clearer.  However, to avoid the superpower of
CAP_SYS_ADMIN, I'm wondering how we could use the CAP_MAC_ADMIN instead.
This syscall could require CAP_MAC_ADMIN, and current LSMs (relying on a
filesystem interface for policy configuration) could also enforce
CAP_SYS_ADMIN for compatibility reasons.

In fact, this "system" syscall could be a "namespace" syscall, which
would take a security/LSM namespace file descriptor as argument.  If the
namespace is not the initial namespace, any CAP_SYS_ADMIN implemented by
current LSMs could be avoided.  See
https://lore.kernel.org/r/CAHC9VhRGMmhxbajwQNfGFy+ZFF1uN=UEBjqQZQ4UBy7yds3eVQ@mail.gmail.com

> +
> +	if (get_user(usize, size))

ditto

> +		return -EFAULT;
> +
> +	return security_lsm_config_system_policy(lsm_id, op, buf, usize, flags);
>  }
> diff --git a/security/security.c b/security/security.c
> index fb57e8fddd91..166d7d9936d0 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -5883,6 +5883,66 @@ int security_bdev_setintegrity(struct block_device *bdev,
>  }
>  EXPORT_SYMBOL(security_bdev_setintegrity);
>  
> +/**
> + * security_lsm_config_self_policy() - Configure caller's LSM policies
> + * @lsm_id: id of the LSM to target
> + * @op: Operation to perform (one of the LSM_POLICY_XXX values)
> + * @buf: userspace pointer to policy data
> + * @size: size of @buf
> + * @flags: lsm policy configuration flags
> + *
> + * Configure the policies of a LSM for the current domain/user. This notably
> + * allows to update them even when the lsmfs is unavailable or restricted.
> + * Currently, only LSM_POLICY_LOAD is supported.
> + *
> + * Return: Returns 0 on success, error on failure.
> + */
> +int security_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
> +				 size_t size, u32 flags)
> +{
> +	int rc = LSM_RET_DEFAULT(lsm_config_self_policy);
> +	struct lsm_static_call *scall;
> +
> +	lsm_for_each_hook(scall, lsm_config_self_policy) {
> +		if ((scall->hl->lsmid->id) == lsm_id) {
> +			rc = scall->hl->hook.lsm_config_self_policy(lsm_id, op, buf, size, flags);

The lsm_id should not be passed to the hook.

The LSM syscall should manage the argument copy and buffer allocation
instead of duplicating this code in each LSM hook implementation (see
other LSM syscalls).

> +			break;
> +		}
> +	}
> +
> +	return rc;
> +}
> +
> +/**
> + * security_lsm_config_system_policy() - Configure system LSM policies
> + * @lsm_id: id of the lsm to target
> + * @op: Operation to perform (one of the LSM_POLICY_XXX values)
> + * @buf: userspace pointer to policy data
> + * @size: size of @buf
> + * @flags: lsm policy configuration flags
> + *
> + * Configure the policies of a LSM for the whole system. This notably allows
> + * to update them even when the lsmfs is unavailable or restricted. Currently,
> + * only LSM_POLICY_LOAD is supported.
> + *
> + * Return: Returns 0 on success, error on failure.
> + */
> +int security_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
> +				   size_t size, u32 flags)
> +{
> +	int rc = LSM_RET_DEFAULT(lsm_config_system_policy);
> +	struct lsm_static_call *scall;
> +
> +	lsm_for_each_hook(scall, lsm_config_system_policy) {
> +		if ((scall->hl->lsmid->id) == lsm_id) {
> +			rc = scall->hl->hook.lsm_config_system_policy(lsm_id, op, buf, size, flags);

ditto

> +			break;
> +		}
> +	}
> +
> +	return rc;
> +}
> +
>  #ifdef CONFIG_PERF_EVENTS
>  /**
>   * security_perf_event_open() - Check if a perf event open is allowed
> -- 
> 2.48.1
> 
> 

^ permalink raw reply

* Re: [PATCH v5 3/3] AppArmor: add support for lsm_config_self_policy and lsm_config_system_policy
From: Mickaël Salaün @ 2025-08-20 14:21 UTC (permalink / raw)
  To: Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, kees,
	stephen.smalley.work, casey, takedakn, penguin-kernel, song,
	rdunlap, linux-api, apparmor, linux-kernel
In-Reply-To: <20250709080220.110947-4-maxime.belair@canonical.com>

On Wed, Jul 09, 2025 at 10:00:56AM +0200, Maxime Bélair wrote:
> Enable users to manage AppArmor policies through the new hooks
> lsm_config_self_policy and lsm_config_system_policy.
> 
> lsm_config_self_policy allows stacking existing policies in the kernel.
> This ensures that it can only further restrict the caller and can never
> be used to gain new privileges.
> 
> lsm_config_system_policy allows loading or replacing AppArmor policies in
> any AppArmor namespace.
> 
> Signed-off-by: Maxime Bélair <maxime.belair@canonical.com>
> ---
>  security/apparmor/apparmorfs.c         | 31 ++++++++++
>  security/apparmor/include/apparmor.h   |  4 ++
>  security/apparmor/include/apparmorfs.h |  3 +
>  security/apparmor/lsm.c                | 84 ++++++++++++++++++++++++++
>  4 files changed, 122 insertions(+)
> 

> diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
> index 9b6c2f157f83..0ce40290f44e 100644
> --- a/security/apparmor/lsm.c
> +++ b/security/apparmor/lsm.c
> @@ -1275,6 +1275,86 @@ static int apparmor_socket_shutdown(struct socket *sock, int how)
>  	return aa_sock_perm(OP_SHUTDOWN, AA_MAY_SHUTDOWN, sock);
>  }
>  
> +/**
> + * apparmor_lsm_config_self_policy - Stack a profile
> + * @lsm_id: AppArmor ID (LSM_ID_APPARMOR). Unused here
> + * @op: operation to perform. Currently, only LSM_POLICY_LOAD is supported
> + * @buf: buffer containing the user-provided name of the profile to stack
> + * @size: size of @buf
> + * @flags: reserved for future use; must be zero
> + *
> + * Returns: 0 on success, negative value on error
> + */
> +static int apparmor_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
> +				      size_t size, u32 flags)
> +{
> +	char *name;
> +	long name_size;
> +	int ret;
> +


> +	if (op != LSM_POLICY_LOAD || flags)
> +		return -EOPNOTSUPP;
> +	if (size == 0)
> +		return -EINVAL;
> +	if (size > AA_PROFILE_NAME_MAX_SIZE)
> +		return -E2BIG;
> +
> +	name = kmalloc(size, GFP_KERNEL);
> +	if (!name)
> +		return -ENOMEM;

This hunk should be part of the syscall code and shared amongst LSMs.

> +
> +
> +	name_size = strncpy_from_user(name, buf, size);
> +	if (name_size < 0) {
> +		kfree(name);
> +		return name_size;
> +	}
> +
> +	ret = aa_change_profile(name, AA_CHANGE_STACK);
> +
> +	kfree(name);
> +
> +	return ret;
> +}

^ permalink raw reply

* Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
From: Askar Safin @ 2025-08-20 11:53 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-20.1755686261-lurid-sleepy-lime-quarry-j42HLU@cyphar.com>

 ---- On Wed, 20 Aug 2025 14:38:48 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > The reason I wanted to include the comparison is that you can create
 > multiple mount objects from the same underlying object using
 > open_tree(2) but that's not possible with fsmount(2) (at least, not
 > without creating a new filesystem context each time).

Okay, you may write that.

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v3 00/12] man2: document "new" mount API
From: Askar Safin @ 2025-08-20 11:37 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: alx, brauner, dhowells, g.branden.robinson, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-man, mtk.manpages, viro,
	Ian Kent, autofs mailing list
In-Reply-To: <2025-08-17.1755446479-rotten-curled-charms-robe-vWOBH5@cyphar.com>

 ---- On Sun, 17 Aug 2025 20:16:04 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > They are not tested by fstests AFAICS, but that's more of a flaw in
 > fstests (automount requires you to have a running autofs daemon, which
 > probably makes testing it in fstests or selftests impractical) not the
 > feature itself.

I suggest testing automounts in fstests/selftests using "tracing" automount.
This is what I do in my reproducers.

 > The automount behaviour of tracefs is different to the general automount
 > mechanism which is managed by userspace with the autofs daemon.

Yes. But I still was able to write reproducers using "tracing", so this
automount point is totally okay for tests. (At least for some tests,
such as RESOLVE_NO_XDEV.)

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v5 3/3] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Lorenzo Stoakes @ 2025-08-20 11:24 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <jmjlamoqau6rs32ldggra4ojiwim7b7h7xtgzh63m64xejsid4@cdnae7mjdeh3>

On Wed, Aug 20, 2025 at 12:50:25PM +0200, Alejandro Colomar wrote:
> Hi Lorenzo,
>
> On Tue, Aug 19, 2025 at 09:37:39PM +0100, Lorenzo Stoakes wrote:
> > > > diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> > > > index 6d14bf627..53d4eda29 100644
> > > > --- a/man/man2/mremap.2
> > > > +++ b/man/man2/mremap.2
> > > > @@ -47,8 +47,35 @@ The
> > > >  .B MREMAP_DONTUNMAP
> > > >  flag may also be specified.
> > > >  .P
> > > > -If the operation is not
> > > > -simply moving mappings,
> > > > +Equally, if the operation performs a shrink,
>
> I've changed s/Equally/Similarly/
>
> > > > +that is if
> > >
> > > Missing comma.
> >
> > Could you fix that up? Thnks!
> >
> > >
> > > > +.I old_size
> > > > +is greater than
> > > > +.IR new_size ,
> > > > +then
> > > > +.I old_size
> > > > +may also span multiple mappings
> > > > +which do not have to be
> > > > +adjacent to one another.
> > >
> > > I'm wondering if there's a missing comma or not before 'which'.
> > > The meaning of the sentence would be different.
> > >
> > > So, I should ask:
> > >
> > > Does old_size > new_size mean that old_size may span multiple mappings
> > > and you're commenting that multiple mappings need not be adjacent?
> >
> > Yes.
> >
> > >
> > > Or are multiple mappings always allowed and old_size > new_size allows
> > > non-adjacent ones?
> >
> > No.
> >
> > >
> > > I suspect it's the former, right?  Then, it's missing a comma, right?
> >
> > Yes could you fix that up?
>
> Yup.
>
> >
> > >
> > >
> > > Other than this, the patch looks good.
> >
> > Thanks!
>
> Thanks for the patch!  I've applied it with those amendments.
> <https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=21b1c7188ce1d66ef294eb1681361315a35e8070>

All sounds good, thanks for doing that and also for applies for other patches in
series :)

Cheers, Lorenzo

>
>
> Have a lovely day!
> Alex
>
> --
> <https://www.alejandro-colomar.es/>

^ permalink raw reply

* Re: [PATCH v3 08/12] man/man2/move_mount.2: document "new" mount API
From: Askar Safin @ 2025-08-20 11:18 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-12.1755009210-quick-best-oranges-coats-BNJpCV@cyphar.com>

 ---- On Tue, 12 Aug 2025 18:36:53 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > > "Filesystem root" can be understood as "root of superblock".
 > > So, please, change this to "root directory" or something.
 > 
 > Maybe I should borrow the "root mount" terminology from pivot_root(2)?

I don't like this. For me "root mount" is initial root mount, i. e. initramfs.
It is not what you mean here.

 > I didn't like using "rootfs" as
 > shorthand in a man-page.

I agree.

What you mean by "filesystem root" here? "Thing, which is changed by chroot(2)", right?
Then, please, write "root directory" (or "root"), this is standard term for that thing.
Or you can just write "/".

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v5 3/3] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Alejandro Colomar @ 2025-08-20 10:50 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <fb880acd-25df-4386-a13c-9b68640477ef@lucifer.local>

[-- Attachment #1: Type: text/plain, Size: 1714 bytes --]

Hi Lorenzo,

On Tue, Aug 19, 2025 at 09:37:39PM +0100, Lorenzo Stoakes wrote:
> > > diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> > > index 6d14bf627..53d4eda29 100644
> > > --- a/man/man2/mremap.2
> > > +++ b/man/man2/mremap.2
> > > @@ -47,8 +47,35 @@ The
> > >  .B MREMAP_DONTUNMAP
> > >  flag may also be specified.
> > >  .P
> > > -If the operation is not
> > > -simply moving mappings,
> > > +Equally, if the operation performs a shrink,

I've changed s/Equally/Similarly/

> > > +that is if
> >
> > Missing comma.
> 
> Could you fix that up? Thnks!
> 
> >
> > > +.I old_size
> > > +is greater than
> > > +.IR new_size ,
> > > +then
> > > +.I old_size
> > > +may also span multiple mappings
> > > +which do not have to be
> > > +adjacent to one another.
> >
> > I'm wondering if there's a missing comma or not before 'which'.
> > The meaning of the sentence would be different.
> >
> > So, I should ask:
> >
> > Does old_size > new_size mean that old_size may span multiple mappings
> > and you're commenting that multiple mappings need not be adjacent?
> 
> Yes.
> 
> >
> > Or are multiple mappings always allowed and old_size > new_size allows
> > non-adjacent ones?
> 
> No.
> 
> >
> > I suspect it's the former, right?  Then, it's missing a comma, right?
> 
> Yes could you fix that up?

Yup.

> 
> >
> >
> > Other than this, the patch looks good.
> 
> Thanks!

Thanks for the patch!  I've applied it with those amendments.
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=21b1c7188ce1d66ef294eb1681361315a35e8070>


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
From: Aleksa Sarai @ 2025-08-20 10:38 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <198c6e76d3e.113f774e874302.5490092759974557634@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 831 bytes --]

On 2025-08-20, Askar Safin <safinaskar@zohomail.com> wrote:
>  ---- On Tue, 12 Aug 2025 18:33:04 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
>  >   Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called
>  >   once in the lifetime of a filesystem context.
> 
> Weird. open_tree doesn't get filesystem context as argument at all.
> I suggest just this:
> 
>   fsmount() can only be called
>   once in the lifetime of a filesystem context.

The reason I wanted to include the comparison is that you can create
multiple mount objects from the same underlying object using
open_tree(2) but that's not possible with fsmount(2) (at least, not
without creating a new filesystem context each time).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
From: Askar Safin @ 2025-08-20  9:55 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-12.1755007445-rural-feudal-spacebar-forehead-28QkCN@cyphar.com>

 ---- On Tue, 12 Aug 2025 18:33:04 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 >   Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called
 >   once in the lifetime of a filesystem context.

Weird. open_tree doesn't get filesystem context as argument at all.
I suggest just this:

  fsmount() can only be called
  once in the lifetime of a filesystem context.

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v5 3/3] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Lorenzo Stoakes @ 2025-08-19 20:37 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <uvh2e2jjdk44tdwrhmnd46atwgdzwwmny4kczxqv2vm33gjqpp@63lsupn6y2u6>

On Fri, Aug 15, 2025 at 11:36:26PM +0200, Alejandro Colomar wrote:
> Hi Lorenzo,
>
> On Mon, Aug 11, 2025 at 03:59:39PM +0100, Lorenzo Stoakes wrote:
> > There is pre-existing logic that appears to be undocumented for an mremap()
> > shrink operation, where it turns out that the usual 'input range must span
> > a single mapping' requirement no longer applies.
> >
> > In fact, it turns out that the input range specified by [old_address,
> > old_address + old_size) may span any number of mappings.
> >
> > If shrinking in-place (that is, neither the MREMAP_FIXED nor
> > MREMAP_DONTUNMAP flags are specified), then the new span may also span any
> > number of VMAs - [old_address, old_address + new_size).
> >
> > If shrinking and moving, the range specified by [old_address, old_address +
> > new_size) must span a single VMA.
> >
> > There must be at least one VMA contained within the [old_address,
> > old_address + old_size) range, and old_address must be within the range of
> > a VMA.
> >
> > Explicitly document this.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  man/man2/mremap.2 | 31 +++++++++++++++++++++++++++++--
> >  1 file changed, 29 insertions(+), 2 deletions(-)
> >
> > diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> > index 6d14bf627..53d4eda29 100644
> > --- a/man/man2/mremap.2
> > +++ b/man/man2/mremap.2
> > @@ -47,8 +47,35 @@ The
> >  .B MREMAP_DONTUNMAP
> >  flag may also be specified.
> >  .P
> > -If the operation is not
> > -simply moving mappings,
> > +Equally, if the operation performs a shrink,
> > +that is if
>
> Missing comma.

Could you fix that up? Thnks!

>
> > +.I old_size
> > +is greater than
> > +.IR new_size ,
> > +then
> > +.I old_size
> > +may also span multiple mappings
> > +which do not have to be
> > +adjacent to one another.
>
> I'm wondering if there's a missing comma or not before 'which'.
> The meaning of the sentence would be different.
>
> So, I should ask:
>
> Does old_size > new_size mean that old_size may span multiple mappings
> and you're commenting that multiple mappings need not be adjacent?

Yes.

>
> Or are multiple mappings always allowed and old_size > new_size allows
> non-adjacent ones?

No.

>
> I suspect it's the former, right?  Then, it's missing a comma, right?

Yes could you fix that up?

>
>
> Other than this, the patch looks good.

Thanks!

>
>
> Have a lovely night!
> Alex
>
> > +If this shrink is performed
> > +in-place,
> > +that is,
> > +neither
> > +.BR MREMAP_FIXED ,
> > +nor
> > +.B MREMAP_DONTUNMAP
> > +are specified,
> > +.I new_size
> > +may also span multiple VMAs.
> > +However, if the range is moved,
> > +then
> > +.I new_size
> > +must span only a single mapping.
> > +.P
> > +If the operation is neither a
> > +.B MREMAP_FIXED
> > +move
> > +nor a shrink,
> >  then
> >  .I old_size
> >  must span only a single mapping.
> > --
> > 2.50.1
> >
>
> --
> <https://www.alejandro-colomar.es/>

^ permalink raw reply

* [PATCH v19 8/8] selftests/clone3: Test shadow stack support
From: Mark Brown @ 2025-08-19 16:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Shuah Khan
In-Reply-To: <20250819-clone3-shadow-stack-v19-0-bc957075479b@kernel.org>

Add basic test coverage for specifying the shadow stack for a newly
created thread via clone3(), including coverage of the newly extended
argument structure.  We check that a user specified shadow stack can be
provided, and that invalid combinations of parameters are rejected.

In order to facilitate testing on systems without userspace shadow stack
support we manually enable shadow stacks on startup, this is architecture
specific due to the use of an arch_prctl() on x86. Due to interactions with
potential userspace locking of features we actually detect support for
shadow stacks on the running system by attempting to allocate a shadow
stack page during initialisation using map_shadow_stack(), warning if this
succeeds when the enable failed.

In order to allow testing of user configured shadow stacks on
architectures with that feature we need to ensure that we do not return
from the function where the clone3() syscall is called in the child
process, doing so would trigger a shadow stack underflow.  To do this we
use inline assembly rather than the standard syscall wrapper to call
clone3().  In order to avoid surprises we also use a syscall rather than
the libc exit() function., this should be overly cautious.

Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c           | 143 +++++++++++++++++++++-
 tools/testing/selftests/clone3/clone3_selftests.h |  63 ++++++++++
 2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 5b8b7d640e70..6fd2b3238e2c 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -3,6 +3,7 @@
 /* Based on Christian Brauner's clone3() example */
 
 #define _GNU_SOURCE
+#include <asm/mman.h>
 #include <errno.h>
 #include <inttypes.h>
 #include <linux/types.h>
@@ -11,6 +12,7 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <sys/mman.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
 #include <sys/un.h>
@@ -19,8 +21,12 @@
 #include <sched.h>
 
 #include "../kselftest.h"
+#include "../ksft_shstk.h"
 #include "clone3_selftests.h"
 
+static bool shadow_stack_supported;
+static size_t max_supported_args_size;
+
 enum test_mode {
 	CLONE3_ARGS_NO_TEST,
 	CLONE3_ARGS_ALL_0,
@@ -28,6 +34,10 @@ enum test_mode {
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG,
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG,
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
+	CLONE3_ARGS_SHADOW_STACK,
+	CLONE3_ARGS_SHADOW_STACK_MISALIGNED,
+	CLONE3_ARGS_SHADOW_STACK_NO_TOKEN,
+	CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
 };
 
 typedef bool (*filter_function)(void);
@@ -44,6 +54,44 @@ struct test {
 	filter_function filter;
 };
 
+
+/*
+ * We check for shadow stack support by attempting to use
+ * map_shadow_stack() since features may have been locked by the
+ * dynamic linker resulting in spurious errors when we attempt to
+ * enable on startup.  We warn if the enable failed.
+ */
+static void test_shadow_stack_supported(void)
+{
+	long ret;
+
+	ret = syscall(__NR_map_shadow_stack, 0, getpagesize(), 0);
+	if (ret == -1) {
+		ksft_print_msg("map_shadow_stack() not supported\n");
+	} else if ((void *)ret == MAP_FAILED) {
+		ksft_print_msg("Failed to map shadow stack\n");
+	} else {
+		ksft_print_msg("Shadow stack supportd\n");
+		shadow_stack_supported = true;
+
+		if (!shadow_stack_enabled)
+			ksft_print_msg("Mapped but did not enable shadow stack\n");
+	}
+}
+
+static void *get_shadow_stack_page(unsigned long flags)
+{
+	unsigned long long page;
+
+	page = syscall(__NR_map_shadow_stack, 0, getpagesize(), flags);
+	if ((void *)page == MAP_FAILED) {
+		ksft_print_msg("map_shadow_stack() failed: %d\n", errno);
+		return 0;
+	}
+
+	return (void *)page;
+}
+
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
 	struct __clone_args args = {
@@ -57,6 +105,7 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	} args_ext;
 
 	pid_t pid = -1;
+	void *p;
 	int status;
 
 	memset(&args_ext, 0, sizeof(args_ext));
@@ -89,6 +138,26 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	case CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG:
 		args.exit_signal = 0x00000000000000f0ULL;
 		break;
+	case CLONE3_ARGS_SHADOW_STACK:
+		p = get_shadow_stack_page(SHADOW_STACK_SET_TOKEN);
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_MISALIGNED:
+		p = get_shadow_stack_page(SHADOW_STACK_SET_TOKEN);
+		p += getpagesize() - sizeof(void *) - 1;
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY:
+		p = malloc(getpagesize());
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_NO_TOKEN:
+		p = get_shadow_stack_page(0);
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
 	}
 
 	memcpy(&args_ext.args, &args, sizeof(struct __clone_args));
@@ -102,7 +171,12 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 
 	if (pid == 0) {
 		ksft_print_msg("I am the child, my PID is %d\n", getpid());
-		_exit(EXIT_SUCCESS);
+		/*
+		 * Use a raw syscall to ensure we don't get issues
+		 * with manually specified shadow stack and exit handlers.
+		 */
+		syscall(__NR_exit, EXIT_SUCCESS);
+		ksft_print_msg("CHILD FAILED TO EXIT PID is %d\n", getpid());
 	}
 
 	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
@@ -184,6 +258,26 @@ static bool no_timenamespace(void)
 	return true;
 }
 
+static bool have_shadow_stack(void)
+{
+	if (shadow_stack_supported) {
+		ksft_print_msg("Shadow stack supported\n");
+		return true;
+	}
+
+	return false;
+}
+
+static bool no_shadow_stack(void)
+{
+	if (!shadow_stack_supported) {
+		ksft_print_msg("Shadow stack not supported\n");
+		return true;
+	}
+
+	return false;
+}
+
 static size_t page_size_plus_8(void)
 {
 	return getpagesize() + 8;
@@ -327,6 +421,50 @@ static const struct test tests[] = {
 		.expected = -EINVAL,
 		.test_mode = CLONE3_ARGS_NO_TEST,
 	},
+	{
+		.name = "Shadow stack on system with shadow stack",
+		.size = 0,
+		.expected = 0,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with misaligned address",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EINVAL,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_MISALIGNED,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with normal memory",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EFAULT,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with no token",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EINVAL,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NO_TOKEN,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack on system without shadow stack",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EFAULT,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
+		.filter = have_shadow_stack,
+	},
 };
 
 int main(int argc, char *argv[])
@@ -334,9 +472,12 @@ int main(int argc, char *argv[])
 	size_t size;
 	int i;
 
+	enable_shadow_stack();
+
 	ksft_print_header();
 	ksft_set_plan(ARRAY_SIZE(tests));
 	test_clone3_supported();
+	test_shadow_stack_supported();
 
 	for (i = 0; i < ARRAY_SIZE(tests); i++)
 		test_clone3(&tests[i]);
diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index 939b26c86d42..8151c4fc971a 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -31,12 +31,75 @@ struct __clone_args {
 	__aligned_u64 set_tid;
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
+#ifndef CLONE_ARGS_SIZE_VER2
+#define CLONE_ARGS_SIZE_VER2 88	/* sizeof third published struct */
+#endif
+	__aligned_u64 shadow_stack_token;
+#ifndef CLONE_ARGS_SIZE_VER3
+#define CLONE_ARGS_SIZE_VER3 96 /* sizeof fourth published struct */
+#endif
 };
 
+/*
+ * For architectures with shadow stack support we need to be
+ * absolutely sure that the clone3() syscall will be inline and not a
+ * function call so we open code.
+ */
+#ifdef __x86_64__
+static __always_inline pid_t sys_clone3(struct __clone_args *args, size_t size)
+{
+	register long _num  __asm__ ("rax") = __NR_clone3;
+	register long _args __asm__ ("rdi") = (long)(args);
+	register long _size __asm__ ("rsi") = (long)(size);
+	long ret;
+
+	__asm__ volatile (
+		"syscall\n"
+		: "=a"(ret)
+		: "r"(_args), "r"(_size),
+		  "0"(_num)
+		: "rcx", "r11", "memory", "cc"
+	);
+
+	if (ret < 0) {
+		errno = -ret;
+		return -1;
+	}
+
+	return ret;
+}
+#elif defined(__aarch64__)
+static __always_inline pid_t sys_clone3(struct __clone_args *args, size_t size)
+{
+	register long _num  __asm__ ("x8") = __NR_clone3;
+	register long _args __asm__ ("x0") = (long)(args);
+	register long _size __asm__ ("x1") = (long)(size);
+	register long arg2 __asm__ ("x2") = 0;
+	register long arg3 __asm__ ("x3") = 0;
+	register long arg4 __asm__ ("x4") = 0;
+
+	__asm__ volatile (
+		"svc #0\n"
+		: "=r"(_args)
+		: "r"(_args), "r"(_size),
+		  "r"(_num), "r"(arg2),
+		  "r"(arg3), "r"(arg4)
+		: "memory", "cc"
+	);
+
+	if ((int)_args < 0) {
+		errno = -((int)_args);
+		return -1;
+	}
+
+	return _args;
+}
+#else
 static pid_t sys_clone3(struct __clone_args *args, size_t size)
 {
 	return syscall(__NR_clone3, args, size);
 }
+#endif
 
 static inline void test_clone3_supported(void)
 {

-- 
2.39.5


^ permalink raw reply related

* [PATCH v19 7/8] selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
From: Mark Brown @ 2025-08-19 16:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250819-clone3-shadow-stack-v19-0-bc957075479b@kernel.org>

The clone_args structure is extensible, with the syscall passing in the
length of the structure. Inside the kernel we use copy_struct_from_user()
to read the struct but this has the unfortunate side effect of silently
accepting some overrun in the structure size providing the extra data is
all zeros. This means that we can't discover the clone3() features that
the running kernel supports by simply probing with various struct sizes.
We need to check this for the benefit of test systems which run newer
kselftests on old kernels.

Add a flag which can be set on a test to indicate that clone3() may return
-E2BIG due to the use of newer struct versions. Currently no tests need
this but it will become an issue for testing clone3() support for shadow
stacks, the support for shadow stacks is already present on x86.

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e066b201fa64..5b8b7d640e70 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -39,6 +39,7 @@ struct test {
 	size_t size;
 	size_function size_function;
 	int expected;
+	bool e2big_valid;
 	enum test_mode test_mode;
 	filter_function filter;
 };
@@ -146,6 +147,11 @@ static void test_clone3(const struct test *test)
 	ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
 			getpid(), ret, test->expected);
 	if (ret != test->expected) {
+		if (test->e2big_valid && ret == -E2BIG) {
+			ksft_print_msg("Test reported -E2BIG\n");
+			ksft_test_result_skip("%s\n", test->name);
+			return;
+		}
 		ksft_print_msg(
 			"[%d] Result (%d) is different than expected (%d)\n",
 			getpid(), ret, test->expected);

-- 
2.39.5


^ permalink raw reply related

* [PATCH v19 6/8] selftests/clone3: Factor more of main loop into test_clone3()
From: Mark Brown @ 2025-08-19 16:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250819-clone3-shadow-stack-v19-0-bc957075479b@kernel.org>

In order to make it easier to add more configuration for the tests and
more support for runtime detection of when tests can be run pass the
structure describing the tests into test_clone3() rather than picking
the arguments out of it and have that function do all the per-test work.

No functional change.

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c | 77 ++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e61f07973ce5..e066b201fa64 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -30,6 +30,19 @@ enum test_mode {
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
 };
 
+typedef bool (*filter_function)(void);
+typedef size_t (*size_function)(void);
+
+struct test {
+	const char *name;
+	uint64_t flags;
+	size_t size;
+	size_function size_function;
+	int expected;
+	enum test_mode test_mode;
+	filter_function filter;
+};
+
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
 	struct __clone_args args = {
@@ -109,30 +122,40 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	return 0;
 }
 
-static bool test_clone3(uint64_t flags, size_t size, int expected,
-			enum test_mode test_mode)
+static void test_clone3(const struct test *test)
 {
+	size_t size;
 	int ret;
 
+	if (test->filter && test->filter()) {
+		ksft_test_result_skip("%s\n", test->name);
+		return;
+	}
+
+	if (test->size_function)
+		size = test->size_function();
+	else
+		size = test->size;
+
+	ksft_print_msg("Running test '%s'\n", test->name);
+
 	ksft_print_msg(
 		"[%d] Trying clone3() with flags %#" PRIx64 " (size %zu)\n",
-		getpid(), flags, size);
-	ret = call_clone3(flags, size, test_mode);
+		getpid(), test->flags, size);
+	ret = call_clone3(test->flags, size, test->test_mode);
 	ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
-			getpid(), ret, expected);
-	if (ret != expected) {
+			getpid(), ret, test->expected);
+	if (ret != test->expected) {
 		ksft_print_msg(
 			"[%d] Result (%d) is different than expected (%d)\n",
-			getpid(), ret, expected);
-		return false;
+			getpid(), ret, test->expected);
+		ksft_test_result_fail("%s\n", test->name);
+		return;
 	}
 
-	return true;
+	ksft_test_result_pass("%s\n", test->name);
 }
 
-typedef bool (*filter_function)(void);
-typedef size_t (*size_function)(void);
-
 static bool not_root(void)
 {
 	if (getuid() != 0) {
@@ -160,16 +183,6 @@ static size_t page_size_plus_8(void)
 	return getpagesize() + 8;
 }
 
-struct test {
-	const char *name;
-	uint64_t flags;
-	size_t size;
-	size_function size_function;
-	int expected;
-	enum test_mode test_mode;
-	filter_function filter;
-};
-
 static const struct test tests[] = {
 	{
 		.name = "simple clone3()",
@@ -319,24 +332,8 @@ int main(int argc, char *argv[])
 	ksft_set_plan(ARRAY_SIZE(tests));
 	test_clone3_supported();
 
-	for (i = 0; i < ARRAY_SIZE(tests); i++) {
-		if (tests[i].filter && tests[i].filter()) {
-			ksft_test_result_skip("%s\n", tests[i].name);
-			continue;
-		}
-
-		if (tests[i].size_function)
-			size = tests[i].size_function();
-		else
-			size = tests[i].size;
-
-		ksft_print_msg("Running test '%s'\n", tests[i].name);
-
-		ksft_test_result(test_clone3(tests[i].flags, size,
-					     tests[i].expected,
-					     tests[i].test_mode),
-				 "%s\n", tests[i].name);
-	}
+	for (i = 0; i < ARRAY_SIZE(tests); i++)
+		test_clone3(&tests[i]);
 
 	ksft_finished();
 }

-- 
2.39.5


^ permalink raw reply related

* [PATCH v19 5/8] selftests/clone3: Remove redundant flushes of output streams
From: Mark Brown @ 2025-08-19 16:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250819-clone3-shadow-stack-v19-0-bc957075479b@kernel.org>

Since there were widespread issues with output not being flushed the
kselftest framework was modified to explicitly set the output streams
unbuffered in commit 58e2847ad2e6 ("selftests: line buffer test
program's stdout") so there is no need to explicitly flush in the clone3
tests.

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3_selftests.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index eeca8005723f..939b26c86d42 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -35,8 +35,6 @@ struct __clone_args {
 
 static pid_t sys_clone3(struct __clone_args *args, size_t size)
 {
-	fflush(stdout);
-	fflush(stderr);
 	return syscall(__NR_clone3, args, size);
 }
 

-- 
2.39.5


^ permalink raw reply related

* [PATCH v19 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-08-19 16:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook
In-Reply-To: <20250819-clone3-shadow-stack-v19-0-bc957075479b@kernel.org>

Unlike with the normal stack there is no API for configuring the shadow
stack for a new thread, instead the kernel will dynamically allocate a
new shadow stack with the same size as the normal stack. This appears to
be due to the shadow stack series having been in development since
before the more extensible clone3() was added rather than anything more
deliberate.

Add a parameter to clone3() specifying a shadow stack pointer to use
for the new thread, this is inconsistent with the way we specify the
normal stack but during review concerns were expressed about having to
identify where the shadow stack pointer should be placed especially in
cases where the shadow stack has been previously active.  If no shadow
stack is specified then the existing implicit allocation behaviour is
maintained.

If a shadow stack pointer is specified then it is required to have an
architecture defined token placed on the stack, this will be consumed by
the new task, the shadow stack is specified by pointing to this token.  If
no valid token is present then this will be reported with -EINVAL.  This
token prevents new threads being created pointing at the shadow stack of
an existing running thread.  On architectures with support for userspace
pivoting of shadow stacks it is expected that the same format and placement
of tokens will be used, this is the case for arm64 and x86.

If the architecture does not support shadow stacks the shadow stack
pointer must be not be specified, architectures that do support the
feature are expected to enforce the same requirement on individual
systems that lack shadow stack support.

Update the existing arm64 and x86 implementations to pay attention to
the newly added arguments, in order to maintain compatibility we use the
existing behaviour if no shadow stack is specified. Since we are now
using more fields from the kernel_clone_args we pass that into the
shadow stack code rather than individual fields.

Portions of the x86 architecture code were written by Rick Edgecombe.

Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Tested-by: Yury Khrustalev <yury.khrustalev@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/mm/gcs.c              | 47 +++++++++++++++++++-
 arch/x86/include/asm/shstk.h     | 11 +++--
 arch/x86/kernel/process.c        |  2 +-
 arch/x86/kernel/shstk.c          | 53 ++++++++++++++++++++---
 include/asm-generic/cacheflush.h | 11 +++++
 include/linux/sched/task.h       | 17 ++++++++
 include/uapi/linux/sched.h       |  9 ++--
 kernel/fork.c                    | 93 ++++++++++++++++++++++++++++++++++------
 8 files changed, 217 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
index 3abcbf9adb5c..249ff05bca45 100644
--- a/arch/arm64/mm/gcs.c
+++ b/arch/arm64/mm/gcs.c
@@ -43,8 +43,23 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
 {
 	unsigned long addr, size;
 
-	if (!system_supports_gcs())
+	if (!system_supports_gcs()) {
+		if (args->shadow_stack_token)
+			return -EINVAL;
+
 		return 0;
+	}
+
+	/*
+	 * If the user specified a GCS then use it, otherwise fall
+	 * back to a default allocation strategy. Validation is done
+	 * in arch_shstk_validate_clone().
+	 */
+	if (args->shadow_stack_token) {
+		tsk->thread.gcs_base = 0;
+		tsk->thread.gcs_size = 0;
+		return 0;
+	}
 
 	if (!task_gcs_el0_enabled(tsk))
 		return 0;
@@ -68,6 +83,36 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
 	return 0;
 }
 
+static bool gcs_consume_token(struct vm_area_struct *vma, struct page *page,
+			      unsigned long user_addr)
+{
+	u64 expected = GCS_CAP(user_addr);
+	u64 *token = page_address(page) + offset_in_page(user_addr);
+
+	if (!cmpxchg_to_user_page(vma, page, user_addr, token, expected, 0))
+		return false;
+	set_page_dirty_lock(page);
+
+	return true;
+}
+
+int arch_shstk_validate_clone(struct task_struct *tsk,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args)
+{
+	unsigned long gcspr_el0;
+	int ret = 0;
+
+	gcspr_el0 = args->shadow_stack_token;
+	if (!gcs_consume_token(vma, page, gcspr_el0))
+		return -EINVAL;
+
+	tsk->thread.gcspr_el0 = gcspr_el0 + sizeof(u64);
+
+	return ret;
+}
+
 SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
 {
 	unsigned long alloc_size;
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index ba6f2fe43848..827e983430aa 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct kernel_clone_args;
 struct ksignal;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
@@ -16,8 +17,8 @@ struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
 void reset_thread_features(void);
-unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
-				       unsigned long stack_size);
+unsigned long shstk_alloc_thread_stack(struct task_struct *p,
+				       const struct kernel_clone_args *args);
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
@@ -28,8 +29,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
 static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
-						     unsigned long clone_flags,
-						     unsigned long stack_size) { return 0; }
+						     const struct kernel_clone_args *args)
+{
+	return 0;
+}
 static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1b7960cf6eb0..0a54af6c60df 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -209,7 +209,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
 	 * update it.
 	 */
-	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	new_ssp = shstk_alloc_thread_stack(p, args);
 	if (IS_ERR_VALUE(new_ssp))
 		return PTR_ERR((void *)new_ssp);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 2ddf23387c7e..88ca9eaebc96 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -191,18 +191,61 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
-unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
-				       unsigned long stack_size)
+int arch_shstk_validate_clone(struct task_struct *t,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args)
+{
+	/*
+	 * SSP is aligned, so reserved bits and mode bit are a zero, just mark
+	 * the token 64-bit.
+	 */
+	void *maddr = page_address(page);
+	unsigned long token;
+	int offset;
+	u64 expected;
+
+	token = args->shadow_stack_token;
+	expected = (token + SS_FRAME_SIZE) | BIT(0);
+	offset = offset_in_page(token);
+
+	if (!cmpxchg_to_user_page(vma, page, token, (unsigned long *)(maddr + offset),
+				  expected, 0))
+		return -EINVAL;
+	set_page_dirty_lock(page);
+
+	return 0;
+}
+
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+				       const struct kernel_clone_args *args)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long clone_flags = args->flags;
 	unsigned long addr, size;
 
 	/*
 	 * If shadow stack is not enabled on the new thread, skip any
-	 * switch to a new shadow stack.
+	 * implicit switch to a new shadow stack and reject attempts to
+	 * explicitly specify one.
 	 */
-	if (!features_enabled(ARCH_SHSTK_SHSTK))
+	if (!features_enabled(ARCH_SHSTK_SHSTK)) {
+		if (args->shadow_stack_token)
+			return (unsigned long)ERR_PTR(-EINVAL);
+
 		return 0;
+	}
+
+	/*
+	 * If the user specified a shadow stack then use it, otherwise
+	 * fall back to a default allocation strategy. Validation is
+	 * done in arch_shstk_validate_clone().
+	 */
+	if (args->shadow_stack_token) {
+		shstk->base = 0;
+		shstk->size = 0;
+		return args->shadow_stack_token + 8;
+	}
 
 	/*
 	 * For CLONE_VFORK the child will share the parents shadow stack.
@@ -222,7 +265,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl
 	if (!(clone_flags & CLONE_VM))
 		return 0;
 
-	size = adjust_shstk_size(stack_size);
+	size = adjust_shstk_size(args->stack_size);
 	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return addr;
diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
index 7ee8a179d103..96cc0c7a5c90 100644
--- a/include/asm-generic/cacheflush.h
+++ b/include/asm-generic/cacheflush.h
@@ -124,4 +124,15 @@ static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
 	} while (0)
 #endif
 
+#ifndef cmpxchg_to_user_page
+#define cmpxchg_to_user_page(vma, page, vaddr, ptr, old, new)  \
+({							  \
+	bool ret;						  \
+								  \
+	ret = try_cmpxchg(ptr, &old, new);			  \
+	flush_icache_user_page(vma, page, vaddr, sizeof(*ptr));	  \
+	ret;							  \
+})
+#endif
+
 #endif /* _ASM_GENERIC_CACHEFLUSH_H */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ea41795a352b..b501f752fc9a 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -16,6 +16,7 @@ struct task_struct;
 struct rusage;
 union thread_union;
 struct css_set;
+struct vm_area_struct;
 
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
@@ -44,6 +45,7 @@ struct kernel_clone_args {
 	struct cgroup *cgrp;
 	struct css_set *cset;
 	unsigned int kill_seq;
+	unsigned long shadow_stack_token;
 };
 
 /*
@@ -226,4 +228,19 @@ static inline void task_unlock(struct task_struct *p)
 
 DEFINE_GUARD(task_lock, struct task_struct *, task_lock(_T), task_unlock(_T))
 
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
+int arch_shstk_validate_clone(struct task_struct *p,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args);
+#else
+static inline int arch_shstk_validate_clone(struct task_struct *p,
+					    struct vm_area_struct *vma,
+					    struct page *page,
+					    struct kernel_clone_args *args)
+{
+	return 0;
+}
+#endif
+
 #endif /* _LINUX_SCHED_TASK_H */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 359a14cc76a4..9cf5c419e109 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -84,6 +84,7 @@
  *                kernel's limit of nested PID namespaces.
  * @cgroup:       If CLONE_INTO_CGROUP is specified set this to
  *                a file descriptor for the cgroup.
+ * @shadow_stack_token: Pointer to shadow stack token at top of stack.
  *
  * The structure is versioned by size and thus extensible.
  * New struct members must go at the end of the struct and
@@ -101,12 +102,14 @@ struct clone_args {
 	__aligned_u64 set_tid;
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
+	__aligned_u64 shadow_stack_token;
 };
 #endif
 
-#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
-#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
-#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER0  64 /* sizeof first published struct */
+#define CLONE_ARGS_SIZE_VER1  80 /* sizeof second published struct */
+#define CLONE_ARGS_SIZE_VER2  88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER3  96 /* sizeof fourth published struct */
 
 /*
  * Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..d484ebeded33 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1907,6 +1907,51 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
 	return true;
 }
 
+static int shstk_validate_clone(struct task_struct *p,
+				struct kernel_clone_args *args)
+{
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct page *page;
+	unsigned long addr;
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
+		return 0;
+
+	if (!args->shadow_stack_token)
+		return 0;
+
+	mm = get_task_mm(p);
+	if (!mm)
+		return -EFAULT;
+
+	mmap_read_lock(mm);
+
+	addr = untagged_addr_remote(mm, args->shadow_stack_token);
+	page = get_user_page_vma_remote(mm, addr, FOLL_FORCE | FOLL_WRITE,
+					&vma);
+	if (IS_ERR(page)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (!(vma->vm_flags & VM_SHADOW_STACK) ||
+	    !(vma->vm_flags & VM_WRITE)) {
+		ret = -EFAULT;
+		goto out_page;
+	}
+
+	ret = arch_shstk_validate_clone(p, vma, page, args);
+
+out_page:
+	put_page(page);
+out:
+	mmap_read_unlock(mm);
+	mmput(mm);
+	return ret;
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -2182,6 +2227,9 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
 	retval = copy_thread(p, args);
+	if (retval)
+		goto bad_fork_cleanup_io;
+	retval = shstk_validate_clone(p, args);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -2763,7 +2811,9 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		     CLONE_ARGS_SIZE_VER1);
 	BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
 		     CLONE_ARGS_SIZE_VER2);
-	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
+	BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_token) !=
+		     CLONE_ARGS_SIZE_VER3);
+	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
 
 	if (unlikely(usize > PAGE_SIZE))
 		return -E2BIG;
@@ -2796,16 +2846,17 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		return -EINVAL;
 
 	*kargs = (struct kernel_clone_args){
-		.flags		= args.flags,
-		.pidfd		= u64_to_user_ptr(args.pidfd),
-		.child_tid	= u64_to_user_ptr(args.child_tid),
-		.parent_tid	= u64_to_user_ptr(args.parent_tid),
-		.exit_signal	= args.exit_signal,
-		.stack		= args.stack,
-		.stack_size	= args.stack_size,
-		.tls		= args.tls,
-		.set_tid_size	= args.set_tid_size,
-		.cgroup		= args.cgroup,
+		.flags			= args.flags,
+		.pidfd			= u64_to_user_ptr(args.pidfd),
+		.child_tid		= u64_to_user_ptr(args.child_tid),
+		.parent_tid		= u64_to_user_ptr(args.parent_tid),
+		.exit_signal		= args.exit_signal,
+		.stack			= args.stack,
+		.stack_size		= args.stack_size,
+		.tls			= args.tls,
+		.set_tid_size		= args.set_tid_size,
+		.cgroup			= args.cgroup,
+		.shadow_stack_token	= args.shadow_stack_token,
 	};
 
 	if (args.set_tid &&
@@ -2846,6 +2897,24 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
 	return true;
 }
 
+/**
+ * clone3_shadow_stack_valid - check and prepare shadow stack
+ * @kargs: kernel clone args
+ *
+ * Verify that shadow stacks are only enabled if supported.
+ */
+static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
+{
+	if (!kargs->shadow_stack_token)
+		return true;
+
+	if (!IS_ALIGNED(kargs->shadow_stack_token, sizeof(void *)))
+		return false;
+
+	/* Fail if the kernel wasn't built with shadow stacks */
+	return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
+}
+
 static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
@@ -2868,7 +2937,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	    kargs->exit_signal)
 		return false;
 
-	if (!clone3_stack_valid(kargs))
+	if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
 		return false;
 
 	return true;

-- 
2.39.5


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox