Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 2/2] vfs: output mount_too_revealing() errors to fscontext
From: Aleksa Sarai @ 2025-08-06  6:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, David Howells, linux-api,
	linux-kernel, linux-fsdevel
In-Reply-To: <20250806054116.GE222315@ZenIV>

[-- Attachment #1: Type: text/plain, Size: 597 bytes --]

On 2025-08-06, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Aug 06, 2025 at 02:48:30PM +1000, Aleksa Sarai wrote:
> 
> >  	error = security_sb_kern_mount(sb);
> > -	if (!error && mount_too_revealing(sb, &mnt_flags))
> > +	if (!error && mount_too_revealing(sb, &mnt_flags)) {
> >  		error = -EPERM;
> > +		errorfcp(fc, "VFS", "Mount too revealing");
> > +	}
> 
> Hmm...  For aesthetics sake, I'd probably do logging first; otherwise
> fine by me.

Good point, I'll send a v2.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] vfs: output mount_too_revealing() errors to fscontext
From: Al Viro @ 2025-08-06  5:41 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Jan Kara, David Howells, linux-api,
	linux-kernel, linux-fsdevel
In-Reply-To: <20250806-errorfc-mount-too-revealing-v1-2-536540f51560@cyphar.com>

On Wed, Aug 06, 2025 at 02:48:30PM +1000, Aleksa Sarai wrote:

>  	error = security_sb_kern_mount(sb);
> -	if (!error && mount_too_revealing(sb, &mnt_flags))
> +	if (!error && mount_too_revealing(sb, &mnt_flags)) {
>  		error = -EPERM;
> +		errorfcp(fc, "VFS", "Mount too revealing");
> +	}

Hmm...  For aesthetics sake, I'd probably do logging first; otherwise
fine by me.

^ permalink raw reply

* [PATCH 2/2] vfs: output mount_too_revealing() errors to fscontext
From: Aleksa Sarai @ 2025-08-06  4:48 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara
  Cc: David Howells, linux-api, linux-kernel, linux-fsdevel,
	Aleksa Sarai
In-Reply-To: <20250806-errorfc-mount-too-revealing-v1-0-536540f51560@cyphar.com>

It makes little sense for fsmount() to output the warning message when
mount_too_revealing() is violated to kmsg. Instead, the warning should
be output (with a "VFS" prefix) to the fscontext log. In addition,
include the same log message for mount_too_revealing() when doing a
regular mount for consistency.

With the newest fsopen()-based mount(8) from util-linux, the error
messages now look like

  # mount -t proc proc /tmp
  mount: /tmp: fsmount() failed: VFS: Mount too revealing.
         dmesg(1) may have more information after failed mount system call.

which could finally result in mount_too_revealing() errors being easier
for users to detect and understand.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 fs/namespace.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 55f28cebbe7d..b2146857cbbd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3820,8 +3820,10 @@ static int do_new_mount_fc(struct fs_context *fc, struct path *mountpoint,
 	int error;
 
 	error = security_sb_kern_mount(sb);
-	if (!error && mount_too_revealing(sb, &mnt_flags))
+	if (!error && mount_too_revealing(sb, &mnt_flags)) {
 		error = -EPERM;
+		errorfcp(fc, "VFS", "Mount too revealing");
+	}
 
 	if (unlikely(error)) {
 		fc_drop_locked(fc);
@@ -4547,7 +4549,7 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
 
 	ret = -EPERM;
 	if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
-		pr_warn("VFS: Mount too revealing\n");
+		errorfcp(fc, "VFS", "Mount too revealing");
 		goto err_unlock;
 	}
 

-- 
2.50.1


^ permalink raw reply related

* [PATCH 1/2] fscontext: add custom-prefix log helpers
From: Aleksa Sarai @ 2025-08-06  4:48 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara
  Cc: David Howells, linux-api, linux-kernel, linux-fsdevel,
	Aleksa Sarai
In-Reply-To: <20250806-errorfc-mount-too-revealing-v1-0-536540f51560@cyphar.com>

Sometimes, errors associated with an fscontext come from the VFS or
otherwise outside of the filesystem driver itself. However, the default
logging of errorfc will always prefix the message with the filesystem
name.

So, add some *fcp() wrappers that allow for custom prefixes to be used
when emitting information to the fscontext log.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/fs_context.h | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 7773eb870039..671f031be173 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -186,10 +186,12 @@ struct fc_log {
 extern __attribute__((format(printf, 4, 5)))
 void logfc(struct fc_log *log, const char *prefix, char level, const char *fmt, ...);
 
-#define __logfc(fc, l, fmt, ...) logfc((fc)->log.log, NULL, \
-					l, fmt, ## __VA_ARGS__)
-#define __plog(p, l, fmt, ...) logfc((p)->log, (p)->prefix, \
-					l, fmt, ## __VA_ARGS__)
+#define __logfc(fc, l, fmt, ...) \
+	logfc((fc)->log.log, NULL, (l), (fmt), ## __VA_ARGS__)
+#define __plogp(p, prefix, l, fmt, ...) \
+	logfc((p)->log, (prefix), (l), (fmt), ## __VA_ARGS__)
+#define __plog(p, l, fmt, ...) __plogp(p, (p)->prefix, l, fmt, ## __VA_ARGS__)
+
 /**
  * infof - Store supplementary informational message
  * @fc: The context in which to log the informational message
@@ -201,6 +203,8 @@ void logfc(struct fc_log *log, const char *prefix, char level, const char *fmt,
 #define infof(fc, fmt, ...) __logfc(fc, 'i', fmt, ## __VA_ARGS__)
 #define info_plog(p, fmt, ...) __plog(p, 'i', fmt, ## __VA_ARGS__)
 #define infofc(fc, fmt, ...) __plog((&(fc)->log), 'i', fmt, ## __VA_ARGS__)
+#define infofcp(fc, prefix, fmt, ...) \
+	__plogp((&(fc)->log), prefix, 'i', fmt, ## __VA_ARGS__)
 
 /**
  * warnf - Store supplementary warning message
@@ -213,6 +217,8 @@ void logfc(struct fc_log *log, const char *prefix, char level, const char *fmt,
 #define warnf(fc, fmt, ...) __logfc(fc, 'w', fmt, ## __VA_ARGS__)
 #define warn_plog(p, fmt, ...) __plog(p, 'w', fmt, ## __VA_ARGS__)
 #define warnfc(fc, fmt, ...) __plog((&(fc)->log), 'w', fmt, ## __VA_ARGS__)
+#define warnfcp(fc, prefix, fmt, ...) \
+	__plogp((&(fc)->log), prefix, 'w', fmt, ## __VA_ARGS__)
 
 /**
  * errorf - Store supplementary error message
@@ -225,6 +231,8 @@ void logfc(struct fc_log *log, const char *prefix, char level, const char *fmt,
 #define errorf(fc, fmt, ...) __logfc(fc, 'e', fmt, ## __VA_ARGS__)
 #define error_plog(p, fmt, ...) __plog(p, 'e', fmt, ## __VA_ARGS__)
 #define errorfc(fc, fmt, ...) __plog((&(fc)->log), 'e', fmt, ## __VA_ARGS__)
+#define errorfcp(fc, prefix, fmt, ...) \
+	__plogp((&(fc)->log), prefix, 'e', fmt, ## __VA_ARGS__)
 
 /**
  * invalf - Store supplementary invalid argument error message
@@ -237,5 +245,7 @@ void logfc(struct fc_log *log, const char *prefix, char level, const char *fmt,
 #define invalf(fc, fmt, ...) (errorf(fc, fmt, ## __VA_ARGS__), -EINVAL)
 #define inval_plog(p, fmt, ...) (error_plog(p, fmt, ## __VA_ARGS__), -EINVAL)
 #define invalfc(fc, fmt, ...) (errorfc(fc, fmt, ## __VA_ARGS__), -EINVAL)
+#define invalfcp(fc, prefix, fmt, ...) \
+	(errorfcp(fc, prefix, fmt, ## __VA_ARGS__), -EINVAL)
 
 #endif /* _LINUX_FS_CONTEXT_H */

-- 
2.50.1


^ permalink raw reply related

* [PATCH 0/2] vfs: output mount_too_revealing() errors to fscontext
From: Aleksa Sarai @ 2025-08-06  4:48 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara
  Cc: David Howells, linux-api, linux-kernel, linux-fsdevel,
	Aleksa Sarai

It makes little sense for fsmount() to output the warning message when
mount_too_revealing() is violated to kmsg. Instead, the warning should
be output (with a "VFS" prefix) to the fscontext log. In addition,
include the same log message for mount_too_revealing() when doing a
regular mount for consistency.

With the newest fsopen()-based mount(8) from util-linux, the error
messages now look like

  # mount -t proc proc /tmp
  mount: /tmp: fsmount() failed: VFS: Mount too revealing.
	 dmesg(1) may have more information after failed mount system call.

which could finally result in mount_too_revealing() errors being easier
for users to detect and understand.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Aleksa Sarai (2):
      fscontext: add custom-prefix log helpers
      vfs: output mount_too_revealing() errors to fscontext

 fs/namespace.c             |  6 ++++--
 include/linux/fs_context.h | 18 ++++++++++++++----
 2 files changed, 18 insertions(+), 6 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250805-errorfc-mount-too-revealing-5d9f670ba770

Best regards,
-- 
Aleksa Sarai <cyphar@cyphar.com>


^ permalink raw reply

* Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
From: Randy Dunlap @ 2025-08-06  0:25 UTC (permalink / raw)
  To: Aleksa Sarai, Alexander Viro, Christian Brauner, Jan Kara,
	Jonathan Corbet, Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest
In-Reply-To: <20250805-procfs-pidns-api-v4-3-705f984940e7@cyphar.com>



On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> /proc has historically had very opaque semantics about PID namespaces,
> which is a little unfortunate for container runtimes and other programs
> that deal with switching namespaces very often. One common issue is that
> of converting between PIDs in the process's namespace and PIDs in the
> namespace of /proc.
> 
> In principle, it is possible to do this today by opening a pidfd with
> pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> contain a PID value translated to the pid namespace associated with that
> procfs superblock). However, allocating a new file for each PID to be
> converted is less than ideal for programs that may need to scan procfs,
> and it is generally useful for userspace to be able to finally get this
> information from procfs.
> 
> So, add a new API to get the pid namespace of a procfs instance, in the
> form of an ioctl(2) you can call on the root directory of said procfs.
> The returned file descriptor will have O_CLOEXEC set. This acts as a
> sister feature to the new "pidns" mount option, finally allowing
> userspace full control of the pid namespaces associated with procfs
> instances.
> 
> The permission model for this is a bit looser than that of the "pidns"
> mount option (and also setns(2)) because /proc/1/ns/pid provides the
> same information, so as long as you have access to that magic-link (or
> something equivalently reasonable such as being in an ancestor pid
> namespace) it makes sense to allow userspace to grab a handle. Ideally
> we would check for ptrace-read access against all processes in the pidns
> (which is very likely to be true for at least one process, as
> SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
> programs), but this would obviously not scale.
> 
> setns(2) will still have their own permission checks, so being able to
> open a pidns handle doesn't really provide too many other capabilities.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  Documentation/filesystems/proc.rst |  4 +++
>  fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/fs.h            |  4 +++
>  3 files changed, 74 insertions(+), 2 deletions(-)
> 


> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 0bd678a4a10e..68e65e6d7d6b 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
>  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
>  			 RWF_DONTCACHE)
>  
> +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
>  #define PROCFS_IOCTL_MAGIC 'f'
>  
> +/* procfs root ioctls */
> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)

Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
should be updated like:

-'f'   00-0F  linux/fs.h                                                conflict!
+'f'   00-1F  linux/fs.h                                                conflict!

(17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
have update the Doc/rst file.)

> +
>  /* Pagemap ioctl */
>  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
>  
> 
Thanks.
-- 
~Randy


^ permalink raw reply

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
From: Randy Dunlap @ 2025-08-06  0:19 UTC (permalink / raw)
  To: Aleksa Sarai, Alexander Viro, Christian Brauner, Jan Kara,
	Jonathan Corbet, Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest
In-Reply-To: <20250805-procfs-pidns-api-v4-2-705f984940e7@cyphar.com>

Hi,

On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> Since the introduction of pid namespaces, their interaction with procfs
> has been entirely implicit in ways that require a lot of dancing around
> by programs that need to construct sandboxes with different PID
> namespaces.
> 
> Being able to explicitly specify the pid namespace to use when
> constructing a procfs super block will allow programs to no longer need
> to fork off a process which does then does unshare(2) / setns(2) and
> forks again in order to construct a procfs in a pidns.
> 
> So, provide a "pidns" mount option which allows such users to just
> explicitly state which pid namespace they want that procfs instance to
> use. This interface can be used with fsconfig(2) either with a file
> descriptor or a path:
> 
>   fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>   fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> or with classic mount(2) / mount(8):
> 
>   // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>   mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> As this new API is effectively shorthand for setns(2) followed by
> mount(2), the permission model for this mirrors pidns_install() to avoid
> opening up new attack surfaces by loosening the existing permission
> model.
> 
> In order to avoid having to RCU-protect all users of proc_pid_ns() (to
> avoid UAFs), attempting to reconfigure an existing procfs instance's pid
> namespace will error out with -EBUSY. Creating new procfs instances is
> quite cheap, so this should not be an impediment to most users, and lets
> us avoid a lot of churn in fs/proc/* for a feature that it seems
> unlikely userspace would use.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  Documentation/filesystems/proc.rst |  8 ++++
>  fs/proc/root.c                     | 98 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 100 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 5236cb52e357..5a157dadea0b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -2360,6 +2360,7 @@ The following mount options are supported:
>  	hidepid=	Set /proc/<pid>/ access mode.
>  	gid=		Set the group authorized to learn processes information.
>  	subset=		Show only the specified subset of procfs.
> +	pidns=		Specify a the namespace used by this procfs.

			drop ^^ a

>  	=========	========================================================
>  
>  hidepid=off or hidepid=0 means classic mode - everybody may access all
> @@ -2392,6 +2393,13 @@ information about processes information, just add identd to this group.
>  subset=pid hides all top level files and directories in the procfs that
>  are not related to tasks.
>  
> +pidns= specifies a pid namespace (either as a string path to something like
> +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
> +will be used by the procfs instance when translating pids. By default, procfs
> +will use the calling process's active pid namespace. Note that the pid
> +namespace of an existing procfs instance cannot be modified (attempting to do
> +so will give an `-EBUSY` error).
> +
>  Chapter 5: Filesystem behavior
>  ==============================
>  
-- 
~Randy


^ permalink raw reply

* Re: [PATCH v2 31/32] libluo: introduce luoctl
From: Pasha Tatashin @ 2025-08-05 18:24 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Steven Rostedt, Jason Gunthorpe, Thomas Gleixner, jasonmiu, graf,
	changyuanl, rppt, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, anna.schumaker, song, zhangguopeng, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, mingo, bp, dave.hansen,
	x86, hpa, rafael, dakr, bartosz.golaszewski, cw00.choi,
	myungjoo.ham, yesanishhere, Jonathan.Cameron, quic_zijuhu,
	aleksander.lobakin, ira.weiny, andriy.shevchenko, leon, lukas,
	bhelgaas, wagi, djeffery, stuart.w.hayes, lennart, brauner,
	linux-api, linux-fsdevel, saeedm, ajayachandra, parav, leonro,
	witu
In-Reply-To: <mafs07bzqeg3x.fsf@kernel.org>

> To add some context: one of the reasons to include it in the series as
> an RFC at the end was to showcase the userspace side of the API and have
> a way for people to see how it can be used. Seeing an API in action
> provides useful context for reviewing patches.
>
> I think Pasha forgot to add the RFC tags when he created v2, since it is
> only meant to be RFC right now and not proper patches.

Correct, I accidently removed RFC from memfd patches in the version. I
will include memfd preservation as RFCv1 in v3 submission.

>
> The point of moving out of tree was also brought up in the live update
> call and I agree with Jason's feedback on it. The plan is to drop it
> from the series in the next revision, and just leave a reference to it
> in the cover letter instead.

 I will drop libluo/luoctl and will add a pointer to an external repo
where they can be accessed from.

Pasha

^ permalink raw reply

* Re: [PATCH v2 16/32] liveupdate: luo_ioctl: add ioctl interface
From: Pasha Tatashin @ 2025-08-05 18:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250729163536.GN36037@nvidia.com>

On Tue, Jul 29, 2025 at 12:35 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Jul 23, 2025 at 02:46:29PM +0000, Pasha Tatashin wrote:
> > Introduce the user-space interface for the Live Update Orchestrator
> > via ioctl commands, enabling external control over the live update
> > process and management of preserved resources.
>
> I strongly recommend copying something like fwctl (which is copying
> iommufd, which is copying some other best practices). I will try to
> outline the main points below.
>
> The design of the fwctl scheme allows alot of options for ABI
> compatible future extensions and I very strongly recommend that
> complex ioctl style APIs be built with that in mind. I have so many
> scars from trying to undo fixed ABI design :)

Thank you for bringing this up, I have reviewed fwctl ioctl
implementation, and also iommufd ioctl, and I made the necessary
changes to make luo similar.

> > +/**
> > + * struct liveupdate_fd - Holds parameters for preserving and restoring file
> > + * descriptors across live update.
> > + * @fd:    Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: The user-space file
> > + *         descriptor to be preserved.
> > + *         Output for %LIVEUPDATE_IOCTL_FD_RESTORE: The new file descriptor
> > + *         representing the fully restored kernel resource.
> > + * @flags: Unused, reserved for future expansion, must be set to 0.
> > + * @token: Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: An opaque, unique token
> > + *         preserved for preserved resource.
> > + *         Input for %LIVEUPDATE_IOCTL_FD_RESTORE: The token previously
> > + *         provided to the preserve ioctl for the resource to be restored.
> > + *
> > + * This structure is used as the argument for the %LIVEUPDATE_IOCTL_FD_PRESERVE
> > + * and %LIVEUPDATE_IOCTL_FD_RESTORE ioctls. These ioctls allow specific types
> > + * of file descriptors (for example memfd, kvm, iommufd, and VFIO) to have their
> > + * underlying kernel state preserved across a live update cycle.
> > + *
> > + * To preserve an FD, user space passes this struct to
> > + * %LIVEUPDATE_IOCTL_FD_PRESERVE with the @fd field set. On success, the
> > + * kernel uses the @token field to uniquly associate the preserved FD.
> > + *
> > + * After the live update transition, user space passes the struct populated with
> > + * the *same* @token to %LIVEUPDATE_IOCTL_FD_RESTORE. The kernel uses the @token
> > + * to find the preserved state and, on success, populates the @fd field with a
> > + * new file descriptor referring to the restored resource.
> > + */
> > +struct liveupdate_fd {
> > +     int             fd;
>
> 'int' should not appear in uapi structs. Fds are __s32

done

>
> > +     __u32           flags;
> > +     __aligned_u64   token;
> > +};
> > +
> > +/* The ioctl type, documented in ioctl-number.rst */
> > +#define LIVEUPDATE_IOCTL_TYPE                0xBA
>
> I have found it very helpful to organize the ioctl numbering like this:
>
> #define IOMMUFD_TYPE (';')
>
> enum {
>         IOMMUFD_CMD_BASE = 0x80,
>         IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
>         IOMMUFD_CMD_IOAS_ALLOC = 0x81,
>         IOMMUFD_CMD_IOAS_ALLOW_IOVAS = 0x82,
> [..]
>
> #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
>
> The numbers should be tightly packed and non-overlapping. It becomes
> difficult to manage this if the numbers are sprinkled all over the
> file. The above structuring will enforce git am conflicts if things
> get muddled up. Saved me a few times already in iommufd.

Done

>
> > +/**
> > + * LIVEUPDATE_IOCTL_FD_PRESERVE - Validate and initiate preservation for a file
> > + * descriptor.
> > + *
> > + * Argument: Pointer to &struct liveupdate_fd.
> > + *
> > + * User sets the @fd field identifying the file descriptor to preserve
> > + * (e.g., memfd, kvm, iommufd, VFIO). The kernel validates if this FD type
> > + * and its dependencies are supported for preservation. If validation passes,
> > + * the kernel marks the FD internally and *initiates the process* of preparing
> > + * its state for saving. The actual snapshotting of the state typically occurs
> > + * during the subsequent %LIVEUPDATE_IOCTL_PREPARE execution phase, though
> > + * some finalization might occur during freeze.
> > + * On successful validation and initiation, the kernel uses the @token
> > + * field with an opaque identifier representing the resource being preserved.
> > + * This token confirms the FD is targeted for preservation and is required for
> > + * the subsequent %LIVEUPDATE_IOCTL_FD_RESTORE call after the live update.
> > + *
> > + * Return: 0 on success (validation passed, preservation initiated), negative
> > + * error code on failure (e.g., unsupported FD type, dependency issue,
> > + * validation failed).
> > + */
> > +#define LIVEUPDATE_IOCTL_FD_PRESERVE                                 \
> > +     _IOW(LIVEUPDATE_IOCTL_TYPE, 0x00, struct liveupdate_fd)
>
> From a kdoc perspective I find it works much better to attach the kdoc
> to the struct, not the ioctl:
>
> /**
>  * struct iommu_destroy - ioctl(IOMMU_DESTROY)
>  * @size: sizeof(struct iommu_destroy)
>  * @id: iommufd object ID to destroy. Can be any destroyable object type.
>  *
>  * Destroy any object held within iommufd.
>  */
> struct iommu_destroy {
>         __u32 size;
>         __u32 id;
> };
> #define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
>
> Generates this kdoc:
>
> https://docs.kernel.org/userspace-api/iommufd.html#c.iommu_destroy

Agreed, done the same as above.

>
> You should also make sure to link the uapi header into the kdoc build
> under the "userspace API" chaper.
>
> The structs should also be self-describing. I am fairly strongly
> against using the size mechanism in the _IOW macro, it is instantly
> ABI incompatible and basically impossible to deal with from userspace.
>
> Hence why the IOMMFD version is _IO().

Right, I came to the same conclusion while reviewing fwctl, I replaced
everything with pure _IO().

>
> This means stick a size member in the first 4 bytes of every
> struct. More on this later..
>
> > +/**
> > + * LIVEUPDATE_IOCTL_FD_UNPRESERVE - Remove a file descriptor from the
> > + * preservation list.
> > + *
> > + * Argument: Pointer to __u64 token.
>
> Every ioctl should have a struct, with the size header. If you want to
> do more down the road you can not using this structure.

Done

>
> > +#define LIVEUPDATE_IOCTL_FD_RESTORE                                  \
> > +     _IOWR(LIVEUPDATE_IOCTL_TYPE, 0x02, struct liveupdate_fd)
>
> Strongly recommend that every ioctl have a unique struct. Sharing
> structs makes future extend-ability harder.

Done

>
> > +/**
> > + * LIVEUPDATE_IOCTL_PREPARE - Initiate preparation phase and trigger state
> > + * saving.
>
> Perhaps these just want to be a single 'set state' ioctl with an enum
> input argument?

Added a IOCTL: LIVEUPDATE_SET_EVENT, and all events
PREPARE/FINISH/CANCEL are now done through it.

>
> > @@ -7,4 +7,5 @@ obj-$(CONFIG_KEXEC_HANDOVER)          += kexec_handover.o
> >  obj-$(CONFIG_KEXEC_HANDOVER_DEBUG)   += kexec_handover_debug.o
> >  obj-$(CONFIG_LIVEUPDATE)             += luo_core.o
> >  obj-$(CONFIG_LIVEUPDATE)             += luo_files.o
> > +obj-$(CONFIG_LIVEUPDATE)             += luo_ioctl.o
> >  obj-$(CONFIG_LIVEUPDATE)             += luo_subsystems.o
>
> I don't think luo is modular, but I think it is generally better to
> write the kbuilds as though it was anyhow if it has a lot of files:
>
> iommufd-y := \
>         device.o \
>         eventq.o \
>         hw_pagetable.o \
>         io_pagetable.o \
>         ioas.o \
>         main.o \
>         pages.o \
>         vfio_compat.o \
>         viommu.o
> obj-$(CONFIG_IOMMUFD) += iommufd.o

Done

>
> Basically don't repeat obj-$(CONFIG_LIVEUPDATE), every one of those
> lines creates a new module (if it was modular)
>
> > +static int luo_open(struct inode *inodep, struct file *filep)
> > +{
> > +     if (!capable(CAP_SYS_ADMIN))
> > +             return -EACCES;
>
> IMHO file system permissions should control permission to open. No
> capable check.

Removed

>
> > +     if (filep->f_flags & O_EXCL)
> > +             return -EINVAL;
>
> O_EXCL doesn't really do anything for cdev, I'd drop this.
>
> The open should have an atomic to check for single open though.

Removed, and added an enforcement for a single open.

>
> > +static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> > +{
> > +     void __user *argp = (void __user *)arg;
> > +     struct liveupdate_fd luo_fd;
> > +     enum liveupdate_state state;
> > +     int ret = 0;
> > +     u64 token;
> > +
> > +     if (_IOC_TYPE(cmd) != LIVEUPDATE_IOCTL_TYPE)
> > +             return -ENOTTY;
>
> The generic parse/disptach from fwctl is a really good idea here, you
> can cut and paste it, change the names. It makes it really easy to manage future extensibility:
>
> List the ops and their structs:
>
> static const struct fwctl_ioctl_op fwctl_ioctl_ops[] = {
>         IOCTL_OP(FWCTL_INFO, fwctl_cmd_info, struct fwctl_info, out_device_data),
>         IOCTL_OP(FWCTL_RPC, fwctl_cmd_rpc, struct fwctl_rpc, out),
> };
>
> Index the list and copy_from_user the struct desribing the opt:
>
> static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
>                                unsigned long arg)
> {
>         struct fwctl_uctx *uctx = filp->private_data;
>         const struct fwctl_ioctl_op *op;
>         struct fwctl_ucmd ucmd = {};
>         union fwctl_ucmd_buffer buf;
>         unsigned int nr;
>         int ret;
>
>         nr = _IOC_NR(cmd);
>         if ((nr - FWCTL_CMD_BASE) >= ARRAY_SIZE(fwctl_ioctl_ops))
>                 return -ENOIOCTLCMD;
>
>         op = &fwctl_ioctl_ops[nr - FWCTL_CMD_BASE];
>         if (op->ioctl_num != cmd)
>                 return -ENOIOCTLCMD;
>
>         ucmd.uctx = uctx;
>         ucmd.cmd = &buf;
>         ucmd.ubuffer = (void __user *)arg;
>         // This is reading/checking the standard 4 byte size header:
>         ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
>         if (ret)
>                 return ret;
>
>         if (ucmd.user_size < op->min_size)
>                 return -EINVAL;
>
>         ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
>                                     ucmd.user_size);
>
>
> Removes a bunch of boiler plate and easy to make wrong copy_from_users
> in the ioctls. Centralizes size validation, zero padding checking/etc.

Yeap, implemented as  above.

>
> > +             ret = luo_register_file(luo_fd.token, luo_fd.fd);
> > +             if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd))) {
> > +                     WARN_ON_ONCE(luo_unregister_file(luo_fd.token));
> > +                     ret = -EFAULT;
>
> Then for extensibility you'd copy back the struct:
>
> static int ucmd_respond(struct fwctl_ucmd *ucmd, size_t cmd_len)
> {
>         if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
>                          min_t(size_t, ucmd->user_size, cmd_len)))
>                 return -EFAULT;
>         return 0;
> }
>
> Which truncates it/etc according to some ABI extensibility rules.
>
> > +static int __init liveupdate_init(void)
> > +{
> > +     int err;
> > +
> > +     if (!liveupdate_enabled())
> > +             return 0;
> > +
> > +     err = misc_register(&liveupdate_miscdev);
> > +     if (err < 0) {
> > +             pr_err("Failed to register misc device '%s': %d\n",
> > +                    liveupdate_miscdev.name, err);
>
> Should remove most of the pr_err's, here too IMHO..

Removed.

>
> Jason

Thanks a lot for the thorough review!

Pasha

^ permalink raw reply

* [PATCH v4 2/2] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Lorenzo Stoakes @ 2025-08-05 17:31 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1754414738.git.lorenzo.stoakes@oracle.com>

There is pre-existing logic that appears to be undocumented for an mremap()
shrink operation, where it turns out that the usual 'input range must span
a single mapping' requirement no longer applies.

In fact, it turns out that the input range specified by [old_address,
old_address + old_size) may span any number of mappings.

If shrinking in-place (that is, neither the MREMAP_FIXED nor
MREMAP_DONTUNMAP flags are specified), then the new span may also span any
number of VMAs - [old_address, old_address + new_size).

If shrinking and moving, the range specified by [old_address, old_address +
new_size) must span a single VMA.

There must be at least one VMA contained within the [old_address,
old_address + old_size) range, and old_address must be within the range of
a VMA.

Explicitly document this.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index 6ba51310c..631c835b8 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -48,8 +48,35 @@ The
 .B MREMAP_DONTUNMAP
 flag may be specified.
 .P
-If the operation is not
-simply moving mappings,
+Equally, if the operation performs a shrink,
+that is if
+.I old_size
+is greater than
+.IR new_size ,
+then
+.I old_size
+may also span multiple mappings
+which do not have to be
+adjacent to one another.
+If this shrink is performed
+in-place,
+that is,
+neither
+.BR MREMAP_FIXED ,
+nor
+.B MREMAP_DONTUNMAP
+are specified,
+.I new_size
+may also span multiple VMAs.
+However, if the range is moved,
+then
+.I new_size
+must span only a single mapping.
+.P
+If the operation is neither a
+.B MREMAP_FIXED
+move
+nor a shrink,
 then
 .I old_size
 must span only a single mapping.
-- 
2.50.1

^ permalink raw reply related

* [PATCH v4 1/2] man/man2/mremap.2: describe multiple mapping move
From: Lorenzo Stoakes @ 2025-08-05 17:31 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1754414738.git.lorenzo.stoakes@oracle.com>

Document the new behaviour introduced in Linux 6.17 whereby it is now
possible to move multiple mappings in a single operation, as long as the
operation is purely a move, that is old_size is equal to new_size and
MREMAP_FIXED is specified.

To make things clearer, also describe this 'pure move' operation, before
expanding upon it to describe the newly introduced behaviour.

This change also explains the limitations of of this method and the
possibility of partial failure.

Finally, we pluralise language where it makes sense to so the documentation
does not contradict either this new capability nor the pre-existing edge
case.

Example code is enclosed below demonstrating the behaviour which is now
possible:

int main(void)
{
	unsigned long page_size = sysconf(_SC_PAGESIZE);
	void *ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
			 MAP_ANON | MAP_PRIVATE, -1, 0);
	void *tgt_ptr = mmap(NULL, 10 * page_size, PROT_NONE,
			     MAP_ANON | MAP_PRIVATE, -1, 0);
	int i;

	if (ptr == MAP_FAILED || tgt_ptr == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Unmap every other page. */
	for (i = 1; i < 10; i += 2)
		munmap(ptr + i * page_size, page_size);

	/* Now move all 5 distinct mappings to tgt_ptr. */
	ptr = mremap(ptr, 10 * page_size, 10 * page_size,
		     MREMAP_MAYMOVE | MREMAP_FIXED, tgt_ptr);
	if (ptr == MAP_FAILED) {
		perror("mremap");
		return EXIT_FAILURE;
	}

	return EXIT_SUCCESS;
}

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 84 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index 2168ca728..6ba51310c 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -25,18 +25,47 @@ moving it at the same time (controlled by the
 argument and
 the available virtual address space).
 .P
+Mappings can also simply be moved
+(without any resizing)
+by specifying equal
+.I old_size
+and
+.I new_size
+and using the
+.B MREMAP_FIXED
+flag
+(see below).
+Since Linux 6.17,
+while
+.I old_address
+must reside within a mapping,
+.I old_size
+may span multiple mappings
+which do not have to be
+adjacent to one another when
+performing a move like this.
+The
+.B MREMAP_DONTUNMAP
+flag may be specified.
+.P
+If the operation is not
+simply moving mappings,
+then
+.I old_size
+must span only a single mapping.
+.P
 .I old_address
-is the old address of the virtual memory block that you
-want to expand (or shrink).
+is the old address of the first virtual memory block that you
+want to expand, shrink, and/or move.
 Note that
 .I old_address
 has to be page aligned.
 .I old_size
-is the old size of the
-virtual memory block.
+is the size of the range containing
+virtual memory blocks to be manipulated.
 .I new_size
 is the requested size of the
-virtual memory block after the resize.
+virtual memory blocks after the resize.
 An optional fifth argument,
 .IR new_address ,
 may be provided; see the description of
@@ -105,13 +134,43 @@ If
 is specified, then
 .B MREMAP_MAYMOVE
 must also be specified.
+.IP
+Since Linux 6.17,
+if
+.I old_size
+is equal to
+.I new_size
+and
+.B MREMAP_FIXED
+is specified, then
+.I old_size
+may span beyond the mapping in which
+.I old_address
+resides.
+In this case,
+gaps between mappings in the original range
+are maintained in the new range.
+The whole operation is performed atomically
+unless an error arises,
+in which case the operation may be partially
+completed,
+that is,
+some mappings may be moved and others not.
+.IP
+
+Moving multiple mappings is not permitted if
+any of those mappings have either
+been registered with
+.BR userfaultfd (2) ,
+or map drivers that
+specify their own custom address mapping logic.
 .TP
 .BR MREMAP_DONTUNMAP " (since Linux 5.7)"
 .\" commit e346b3813067d4b17383f975f197a9aa28a3b077
 This flag, which must be used in conjunction with
 .BR MREMAP_MAYMOVE ,
-remaps a mapping to a new address but does not unmap the mapping at
-.IR old_address .
+remaps mappings to a new address but does not unmap them
+from their original address.
 .IP
 The
 .B MREMAP_DONTUNMAP
@@ -149,13 +208,13 @@ mapped.
 See NOTES for some possible applications of
 .BR MREMAP_DONTUNMAP .
 .P
-If the memory segment specified by
+If the memory segments specified by
 .I old_address
 and
 .I old_size
-is locked (using
+are locked (using
 .BR mlock (2)
-or similar), then this lock is maintained when the segment is
+or similar), then this lock is maintained when the segments are
 resized and/or relocated.
 As a consequence, the amount of memory locked by the process may change.
 .SH RETURN VALUE
@@ -188,7 +247,10 @@ virtual memory address for this process.
 You can also get
 .B EFAULT
 even if there exist mappings that cover the
-whole address space requested, but those mappings are of different types.
+whole address space requested, but those mappings are of different types,
+and the
+.BR mremap ()
+operation being performed does not support this.
 .TP
 .B EINVAL
 An invalid argument was given.
-- 
2.50.1


^ permalink raw reply related

* [PATCH v4 0/2] man/man2/mremap.2: describe multiple mapping move, shrink
From: Lorenzo Stoakes @ 2025-08-05 17:31 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api

We have added new functionality to mremap() in Linux 6.17, permitting the
move of multiple VMAs when performing a move alone (that is - providing
MREMAP_MAYMOVE | MREMAP_FIXED flags and specifying old_size == new_size).

We document this new feature.

Additionally, we document previously undocumented behaviour around
shrinking of input VMA ranges which permits the input range to span
multiple VMAs.

v4:
* Update description of newly discovered mremap() behaviour to highlight the
  fact that, if in-place, [old_address, old_address + new_length) may span
  multiple VMAs also.
* Fix up commit message for 2/2 to correct typo on specified range.
* Added code sample to 1/2 as per Alejandro.

v3:
* Use more precise language around mremap() move description as per Jann.
* Fix some typos in commit messages.
https://lore.kernel.org/all/cover.1753795807.git.lorenzo.stoakes@oracle.com/

v2:
* Split out the two man page changes as requested by Alejandro.
https://lore.kernel.org/all/cover.1753711160.git.lorenzo.stoakes@oracle.com/

v1:
https://lore.kernel.org/all/20250723174634.75054-1-lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (2):
  man/man2/mremap.2: describe multiple mapping move
  man/man2/mremap.2: describe previously undocumented shrink behaviour

 man/man2/mremap.2 | 111 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 100 insertions(+), 11 deletions(-)

--
2.50.1

^ permalink raw reply

* Re: [PATCH 1/5] Add manpage for open_tree(2)
From: Aleksa Sarai @ 2025-08-05 16:45 UTC (permalink / raw)
  To: Askar Safin
  Cc: Aleksa Sarai, Christian Brauner, David Howells, Alexander Viro,
	linux-api, linux-fsdevel, linux-kernel, linux-man
In-Reply-To: <20250201024322.2625842-1-safinaskar@zohomail.com>

On 2025-02-01, Askar Safin said:
> David Howells, ping! You still didn't add all these manpages.

In case you're interested, Christian Brauner maintained these man pages
in a markdown format from 2024[1]. I sat down this last weekend and
rewrote them mostly from scratch as man pages, and resubmitted them to
man-pages again today[2].

PTAL.

[1]: https://github.com/brauner/man-pages-md
[2]: https://lore.kernel.org/all/20250806-new-mount-api-v1-0-8678f56c6ee0@cyphar.com/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply

* Re: [PATCH v2 10/32] liveupdate: luo_core: Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-05 12:31 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bCrfVef_sFWCQpdwe9N_go8F_pU4O-w+XBJZ6yEuXRj9g@mail.gmail.com>

On Sun, Aug 03, 2025 at 09:11:20PM -0400, Pasha Tatashin wrote:

> Having a global state is necessary for performance optimizations. This
> is similar to why we export the state to userspace via sysfs: it
> allows other subsystems to behave differently during a
> performance-optimized live update versus a normal boot.
 
> For example, in our code base we have a driver that doesn't
> participate in the live update itself (it has no state to preserve).
> However, during boot, it checks this global state. If it's a live
> update boot, the driver skips certain steps, like loading firmware, to
> accelerate the overall boot time.

TBH, I'm against this. Give the driver a 0 byte state if it wants to
behave differently during live update. We should not be making
implicit things like this.

Plus the usual complaining about building core kernel infrastructure
around weird out of tree drivers.

If userspace wants a device to participate in live update, even just
"optimizations", then it has to opt in.

Frankly, this driver has no idea what the prior kernel did, and by
"optimizing" I think you are actually assuming that the prior kernel
had it bound to a normal kernel driver that left it in some
predictable configuration.

Vs say bound to VFIO and completely messed up.

So this should be represented by a LUO serialization that says "the
prior kernel left this device in well defined state X" even if it
takes 0 bytes to describe that state.

So no globals, there should be a way for a driver to tell if it is
participating in LUO, but not some global 'is luo' boot fla.g

> > +       ret = liveupdate_register_subsystem(&luo_file_subsys);
> > +       if (ret) {
> > +               pr_warn("Failed to register luo_file subsystem [%d]\n", ret);
> > +               return ret;
> > +       }
> > +
> > +       if (liveupdate_state_updated()) {
> >
> > Thats going to be a standard pattern - I would expect that
> > liveupdate_register_subsystem() would do the check for updated and
> > then arrange to call back something like
> > liveupdate_subsystem.ops.post_update()
> >
> > And then post_update() would get the info that is currently under
> > liveupdate_get_subsystem_data() as arguments instead of having to make
> > more functions calls.
> >
> > Maybe even the fdt_node_check_compatible() can be hoisted.
> >
> > That would remove a bunch more liveupdate_state_updated() calls.
> 
> That's a good suggestion for a potential refactor. For now, the
> state-check call is inexpensive and is not in a performance-critical
> path. We can certainly implement this optimization later if it becomes
> necessary.

It is not an optimization, it is having a proper logical code
structure that doesn't rely on globals. I'm strongly against
sprinkling globals everywhere in the code when there are simple
logical APIs that entirely avoid it.

When I looked at where there were globals I didn't find any good
justifications, just thinks like this that are poor API design.

Jason

^ permalink raw reply

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
From: Aleksa Sarai @ 2025-08-05  7:29 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Amir Goldstein
In-Reply-To: <20250805-procfs-pidns-api-v4-2-705f984940e7@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 8905 bytes --]

On 2025-08-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> Since the introduction of pid namespaces, their interaction with procfs
> has been entirely implicit in ways that require a lot of dancing around
> by programs that need to construct sandboxes with different PID
> namespaces.
> 
> Being able to explicitly specify the pid namespace to use when
> constructing a procfs super block will allow programs to no longer need
> to fork off a process which does then does unshare(2) / setns(2) and
> forks again in order to construct a procfs in a pidns.
> 
> So, provide a "pidns" mount option which allows such users to just
> explicitly state which pid namespace they want that procfs instance to
> use. This interface can be used with fsconfig(2) either with a file
> descriptor or a path:
> 
>   fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>   fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> or with classic mount(2) / mount(8):
> 
>   // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>   mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> As this new API is effectively shorthand for setns(2) followed by
> mount(2), the permission model for this mirrors pidns_install() to avoid
> opening up new attack surfaces by loosening the existing permission
> model.
> 
> In order to avoid having to RCU-protect all users of proc_pid_ns() (to
> avoid UAFs), attempting to reconfigure an existing procfs instance's pid
> namespace will error out with -EBUSY. Creating new procfs instances is
> quite cheap, so this should not be an impediment to most users, and lets
> us avoid a lot of churn in fs/proc/* for a feature that it seems
> unlikely userspace would use.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  Documentation/filesystems/proc.rst |  8 ++++
>  fs/proc/root.c                     | 98 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 100 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 5236cb52e357..5a157dadea0b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -2360,6 +2360,7 @@ The following mount options are supported:
>  	hidepid=	Set /proc/<pid>/ access mode.
>  	gid=		Set the group authorized to learn processes information.
>  	subset=		Show only the specified subset of procfs.
> +	pidns=		Specify a the namespace used by this procfs.
>  	=========	========================================================
>  
>  hidepid=off or hidepid=0 means classic mode - everybody may access all
> @@ -2392,6 +2393,13 @@ information about processes information, just add identd to this group.
>  subset=pid hides all top level files and directories in the procfs that
>  are not related to tasks.
>  
> +pidns= specifies a pid namespace (either as a string path to something like
> +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
> +will be used by the procfs instance when translating pids. By default, procfs
> +will use the calling process's active pid namespace. Note that the pid
> +namespace of an existing procfs instance cannot be modified (attempting to do
> +so will give an `-EBUSY` error).
> +
>  Chapter 5: Filesystem behavior
>  ==============================
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index ed86ac710384..fd1f1c8a939a 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -38,12 +38,14 @@ enum proc_param {
>  	Opt_gid,
>  	Opt_hidepid,
>  	Opt_subset,
> +	Opt_pidns,
>  };
>  
>  static const struct fs_parameter_spec proc_fs_parameters[] = {
> -	fsparam_u32("gid",	Opt_gid),
> +	fsparam_u32("gid",		Opt_gid),
>  	fsparam_string("hidepid",	Opt_hidepid),
>  	fsparam_string("subset",	Opt_subset),
> +	fsparam_file_or_string("pidns",	Opt_pidns),
>  	{}
>  };
>  
> @@ -109,11 +111,66 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PID_NS
> +static int proc_parse_pidns_param(struct fs_context *fc,
> +				  struct fs_parameter *param,
> +				  struct fs_parse_result *result)
> +{
> +	struct proc_fs_context *ctx = fc->fs_private;
> +	struct pid_namespace *target, *active = task_active_pid_ns(current);
> +	struct ns_common *ns;
> +	struct file *ns_filp __free(fput) = NULL;
> +
> +	switch (param->type) {
> +	case fs_value_is_file:
> +		/* came through fsconfig, steal the file reference */
> +		ns_filp = no_free_ptr(param->file);
> +		break;
> +	case fs_value_is_string:
> +		ns_filp = filp_open(param->string, O_RDONLY, 0);
> +		break;

I just realised that we probably also want to support FSCONFIG_SET_PATH
here, but fsparam_file_or_string() doesn't handle that at the moment. I
think we probably want to have fsparam_file_or_path() which would act
like:

 1. A path with FSCONFIG_SET_STRING and FSCONFIG_SET_PATH.
 2. A file with FSCONFIG_SET_FD.

These are the semantics I would already expect from these kinds of
flags, but at the moment FSCONFIG_SET_PATH is entirely disallowed.

@Amir:

I wonder if overlayfs (the only other user of fsparam_file_or_string())
would also prefer having these semantics? We could just migrate
fsparam_file_or_string() to fsparam_file_or_path() everwhere, since I'm
pretty sure these are the semantics userspace expects anyway.

> +	default:
> +		WARN_ON_ONCE(true);
> +		break;
> +	}
> +	if (!ns_filp)
> +		ns_filp = ERR_PTR(-EBADF);
> +	if (IS_ERR(ns_filp)) {
> +		errorfc(fc, "could not get file from pidns argument");
> +		return PTR_ERR(ns_filp);
> +	}
> +
> +	if (!proc_ns_file(ns_filp))
> +		return invalfc(fc, "pidns argument is not an nsfs file");
> +	ns = get_proc_ns(file_inode(ns_filp));
> +	if (ns->ops->type != CLONE_NEWPID)
> +		return invalfc(fc, "pidns argument is not a pidns file");
> +	target = container_of(ns, struct pid_namespace, ns);
> +
> +	/*
> +	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
> +	 * permission model should be the same as pidns_install().
> +	 */
> +	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
> +		errorfc(fc, "insufficient permissions to set pidns");
> +		return -EPERM;
> +	}
> +	if (!pidns_is_ancestor(target, active))
> +		return invalfc(fc, "cannot set pidns to non-descendant pidns");
> +
> +	put_pid_ns(ctx->pid_ns);
> +	ctx->pid_ns = get_pid_ns(target);
> +	put_user_ns(fc->user_ns);
> +	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> +	return 0;
> +}
> +#endif /* CONFIG_PID_NS */
> +
>  static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  {
>  	struct proc_fs_context *ctx = fc->fs_private;
>  	struct fs_parse_result result;
> -	int opt;
> +	int opt, err;
>  
>  	opt = fs_parse(fc, proc_fs_parameters, param, &result);
>  	if (opt < 0)
> @@ -125,14 +182,38 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  		break;
>  
>  	case Opt_hidepid:
> -		if (proc_parse_hidepid_param(fc, param))
> -			return -EINVAL;
> +		err = proc_parse_hidepid_param(fc, param);
> +		if (err)
> +			return err;
>  		break;
>  
>  	case Opt_subset:
> -		if (proc_parse_subset_param(fc, param->string) < 0)
> -			return -EINVAL;
> +		err = proc_parse_subset_param(fc, param->string);
> +		if (err)
> +			return err;
> +		break;
> +
> +	case Opt_pidns:
> +#ifdef CONFIG_PID_NS
> +		/*
> +		 * We would have to RCU-protect every proc_pid_ns() or
> +		 * proc_sb_info() access if we allowed this to be reconfigured
> +		 * for an existing procfs instance. Luckily, procfs instances
> +		 * are cheap to create, and mount-beneath would let you
> +		 * atomically replace an instance even with overmounts.
> +		 */
> +		if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) {
> +			errorfc(fc, "cannot reconfigure pidns for existing procfs");
> +			return -EBUSY;
> +		}
> +		err = proc_parse_pidns_param(fc, param, &result);
> +		if (err)
> +			return err;
>  		break;
> +#else
> +		errorfc(fc, "pidns mount flag not supported on this system");
> +		return -EOPNOTSUPP;
> +#endif
>  
>  	default:
>  		return -EINVAL;
> @@ -154,6 +235,11 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
>  		fs_info->hide_pid = ctx->hidepid;
>  	if (ctx->mask & (1 << Opt_subset))
>  		fs_info->pidonly = ctx->pidonly;
> +	if (ctx->mask & (1 << Opt_pidns) &&
> +	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
> +		put_pid_ns(fs_info->pid_ns);
> +		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> +	}
>  }
>  
>  static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> 
> -- 
> 2.50.1
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* [PATCH v4 0/4] procfs: make reference pidns more user-visible
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai

Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).

/* pidns mount option for procfs */

This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:

 * In order to bypass the mnt_too_revealing() check, Kubernetes creates
   a procfs mount from an empty pidns so that user namespaced containers
   can be nested (without this, the nested containers would fail to
   mount procfs). But this requires forking off a helper process because
   you cannot just one-shot this using mount(2).

 * Container runtimes in general need to fork into a container before
   configuring its mounts, which can lead to security issues in the case
   of shared-pidns containers (a privileged process in the pidns can
   interact with your container runtime process). While
   SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
   strict need for this due to a minor uAPI wart is kind of unfortunate.

Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):

    fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
    fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

    // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
    mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
usecases.

The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.

Note that you cannot change the pidns of an already-created procfs
instance. The primary reason is that allowing this to be changed would
require RCU-protecting proc_pid_ns(sb) and thus auditing all of
fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
the pid namespace. Since creating procfs instances is very cheap, it
seems unnecessary to overcomplicate this upfront. Trying to reconfigure
procfs this way errors out with -EBUSY.

/* ioctl(PROCFS_GET_PID_NAMESPACE) */

In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.

To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.

Rather than copying the (fairly strict) security model for setns(2),
apply a slightly looser model to better match what userspace can already
do:

 * Make the ioctl only valid on the root (meaning that a process without
   access to the procfs root -- such as only having an fd to a procfs
   file or some open_tree(2)-like subset -- cannot use this API). This
   means that the process already has some level of access to the
   /proc/$pid directories.

 * If the calling process is in an ancestor pidns, then they can already
   create pidfd for processes inside the pidns, which is morally
   equivalent to a pidns file descriptor according to setns(2). So it
   seems reasonable to just allow it in this case. (The justification
   for this model was suggested by Christian.)

 * If the process has access to /proc/1/ns/pid already (i.e. has
   ptrace-read access to the pidns pid1), then this ioctl is equivalent
   to just opening a handle to it that way.

   Ideally we would check for ptrace-read access against all processes
   in the pidns (which is very likely to be true for at least one
   process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
   by most programs), but this would obviously not scale.

I'm open to suggestions for whether we need to make this stricter (or
possibly allow more cases).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v4:
- Remove unneeded EXPORT_SYMBOL_GPL. [Christian Brauner]
- Return -EOPNOTSUPP for new APIs for CONFIG_PID_NS=n rather than
  pretending they don't exist entirely. [Christian Brauner]
- PROCFS_IOCTL_MAGIC conflicts with XSDFEC_MAGIC, so we need to allocate
  subvalues more carefully (switch to _IO(PROCFS_IOCTL_MAGIC, 32)).
- Add some more selftests for PROCFS_GET_PID_NAMESPACE.
- Reword argument for PROCFS_GET_PID_NAMESPACE security model based on
  Christian's suggestion, and remove CAP_SYS_ADMIN edge-case (in most
  cases, such a process would also have ptrace-read credentials over the
  pidns pid1).
- v3: <https://lore.kernel.org/r/20250724-procfs-pidns-api-v3-0-4c685c910923@cyphar.com>

Changes in v3:
- Disallow changing pidns for existing procfs instances, as we'd
  probably have to RCU-protect everything that touches the pinned pidns
  reference.
- Improve tests with slightly nicer ASSERT_ERRNO* macros.
- v2: <https://lore.kernel.org/r/20250723-procfs-pidns-api-v2-0-621e7edd8e40@cyphar.com>

Changes in v2:
- #ifdef CONFIG_PID_NS
- Improve cover letter wording to make it clear we're talking about two
  separate features with different permission models. [Andy Lutomirski]
- Fix build warnings in pidns_is_ancestor() patch. [kernel test robot]
- v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cyphar.com>

---
Aleksa Sarai (4):
      pidns: move is-ancestor logic to helper
      procfs: add "pidns" mount option
      procfs: add PROCFS_GET_PID_NAMESPACE ioctl
      selftests/proc: add tests for new pidns APIs

 Documentation/filesystems/proc.rst        |  12 ++
 fs/proc/root.c                            | 166 +++++++++++++++-
 include/linux/pid_namespace.h             |   9 +
 include/uapi/linux/fs.h                   |   4 +
 kernel/pid_namespace.c                    |  22 ++-
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 315 ++++++++++++++++++++++++++++++
 8 files changed, 514 insertions(+), 16 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250717-procfs-pidns-api-8ed1583431f0

Best regards,
-- 
Aleksa Sarai <cyphar@cyphar.com>

^ permalink raw reply

* [PATCH v4 4/4] selftests/proc: add tests for new pidns APIs
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 315 ++++++++++++++++++++++++++++++
 3 files changed, 317 insertions(+)

diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index 973968f45bba..2dced03e9e0e 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -17,6 +17,7 @@
 /proc-tid0
 /proc-uptime-001
 /proc-uptime-002
+/proc-pidns
 /read
 /self
 /setns-dcache
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index b12921b9794b..c6f7046b9860 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -27,5 +27,6 @@ TEST_GEN_PROGS += setns-sysvipc
 TEST_GEN_PROGS += thread-self
 TEST_GEN_PROGS += proc-multiple-procfs
 TEST_GEN_PROGS += proc-fsconfig-hidepid
+TEST_GEN_PROGS += proc-pidns
 
 include ../lib.mk
diff --git a/tools/testing/selftests/proc/proc-pidns.c b/tools/testing/selftests/proc/proc-pidns.c
new file mode 100644
index 000000000000..f7dd80a2c150
--- /dev/null
+++ b/tools/testing/selftests/proc/proc-pidns.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2025 SUSE LLC.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+
+#include "../kselftest_harness.h"
+
+#define ASSERT_ERRNO(expected, _t, seen)				\
+	__EXPECT(expected, #expected,					\
+		({__typeof__(seen) _tmp_seen = (seen);			\
+		  _tmp_seen >= 0 ? _tmp_seen : -errno; }), #seen, _t, 1)
+
+#define ASSERT_ERRNO_EQ(expected, seen) \
+	ASSERT_ERRNO(expected, ==, seen)
+
+#define ASSERT_SUCCESS(seen) \
+	ASSERT_ERRNO(0, <=, seen)
+
+static int touch(char *path)
+{
+	int fd = open(path, O_WRONLY|O_CREAT|O_CLOEXEC, 0644);
+	if (fd < 0)
+		return -1;
+	return close(fd);
+}
+
+FIXTURE(ns)
+{
+	int host_mntns, host_pidns;
+	int dummy_pidns;
+};
+
+FIXTURE_SETUP(ns)
+{
+	/* Stash the old mntns. */
+	self->host_mntns = open("/proc/self/ns/mnt", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_mntns);
+
+	/* Create a new mount namespace and make it private. */
+	ASSERT_SUCCESS(unshare(CLONE_NEWNS));
+	ASSERT_SUCCESS(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
+
+	/*
+	 * Create a proper tmpfs that we can use and will disappear once we
+	 * leave this mntns.
+	 */
+	ASSERT_SUCCESS(mount("tmpfs", "/tmp", "tmpfs", 0, NULL));
+
+	/*
+	 * Create a pidns we can use for later tests. We need to fork off a
+	 * child so that we get a usable nsfd that we can bind-mount and open.
+	 */
+	ASSERT_SUCCESS(mkdir("/tmp/dummy", 0755));
+	ASSERT_SUCCESS(touch("/tmp/dummy/pidns"));
+	ASSERT_SUCCESS(mkdir("/tmp/dummy/proc", 0755));
+
+	self->host_pidns = open("/proc/self/ns/pid", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_pidns);
+	ASSERT_SUCCESS(unshare(CLONE_NEWPID));
+
+	pid_t pid = fork();
+	ASSERT_SUCCESS(pid);
+	if (!pid) {
+		prctl(PR_SET_PDEATHSIG, SIGKILL);
+		ASSERT_SUCCESS(mount("/proc/self/ns/pid", "/tmp/dummy/pidns", NULL, MS_BIND, NULL));
+		ASSERT_SUCCESS(mount("proc", "/tmp/dummy/proc", "proc", 0, NULL));
+		exit(0);
+	}
+
+	int wstatus;
+	ASSERT_EQ(waitpid(pid, &wstatus, 0), pid);
+	ASSERT_TRUE(WIFEXITED(wstatus));
+	ASSERT_EQ(WEXITSTATUS(wstatus), 0);
+
+	ASSERT_SUCCESS(setns(self->host_pidns, CLONE_NEWPID));
+
+	self->dummy_pidns = open("/tmp/dummy/pidns", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->dummy_pidns);
+}
+
+FIXTURE_TEARDOWN(ns)
+{
+	ASSERT_SUCCESS(setns(self->host_mntns, CLONE_NEWNS));
+	ASSERT_SUCCESS(close(self->host_mntns));
+
+	ASSERT_SUCCESS(close(self->host_pidns));
+	ASSERT_SUCCESS(close(self->dummy_pidns));
+}
+
+TEST_F(ns, pidns_mount_string_path)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc-host", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-host", "proc", 0, "pidns=/proc/self/ns/pid"));
+	ASSERT_SUCCESS(access("/tmp/proc-host/self/", X_OK));
+
+	ASSERT_SUCCESS(mkdir("/tmp/proc-dummy", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-dummy", "proc", 0, "pidns=/tmp/dummy/pidns"));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/1/", X_OK));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/self/", X_OK));
+}
+
+TEST_F(ns, pidns_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy/pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_remount)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc", "proc", 0, ""));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+
+	ASSERT_ERRNO_EQ(-EBUSY, mount(NULL, "/tmp/proc", NULL, MS_REMOUNT, "pidns=/tmp/dummy/pidns"));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy/pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+int is_same_inode(int fd1, int fd2)
+{
+	struct stat stat1, stat2;
+
+	assert(fstat(fd1, &stat1) == 0);
+	assert(fstat(fd2, &stat2) == 0);
+
+	return stat1.st_ino == stat2.st_ino && stat1.st_dev == stat2.st_dev;
+}
+
+#define PROCFS_IOCTL_MAGIC 'f'
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
+
+TEST_F(ns, host_get_pidns_ioctl)
+{
+	int procfs = open("/proc", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(procfs);
+
+	int procfs_pidns = ioctl(procfs, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(procfs_pidns);
+
+	ASSERT_TRUE(is_same_inode(self->host_pidns, procfs_pidns));
+	ASSERT_FALSE(is_same_inode(self->dummy_pidns, procfs_pidns));
+
+	ASSERT_SUCCESS(close(procfs));
+	ASSERT_SUCCESS(close(procfs_pidns));
+}
+
+TEST_F(ns, mount_implicit_get_pidns_ioctl)
+{
+	int procfs = open("/tmp/dummy/proc", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(procfs);
+
+	int procfs_pidns = ioctl(procfs, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(procfs_pidns);
+
+	ASSERT_FALSE(is_same_inode(self->host_pidns, procfs_pidns));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, procfs_pidns));
+
+	ASSERT_SUCCESS(close(procfs));
+	ASSERT_SUCCESS(close(procfs_pidns));
+}
+
+TEST_F(ns, mount_pidns_get_pidns_ioctl)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc-host", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-host", "proc", 0, "pidns=/proc/self/ns/pid"));
+
+	int host_procfs = open("/tmp/proc-host", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(host_procfs);
+	int host_procfs_pidns = ioctl(host_procfs, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(host_procfs_pidns);
+
+	ASSERT_TRUE(is_same_inode(self->host_pidns, host_procfs_pidns));
+	ASSERT_FALSE(is_same_inode(self->dummy_pidns, host_procfs_pidns));
+
+	ASSERT_SUCCESS(mkdir("/tmp/proc-dummy", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-dummy", "proc", 0, "pidns=/tmp/dummy/pidns"));
+
+	int dummy_procfs = open("/tmp/proc-dummy", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(dummy_procfs);
+	int dummy_procfs_pidns = ioctl(dummy_procfs, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(dummy_procfs_pidns);
+
+	ASSERT_FALSE(is_same_inode(self->host_pidns, dummy_procfs_pidns));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, dummy_procfs_pidns));
+
+	ASSERT_SUCCESS(close(host_procfs));
+	ASSERT_SUCCESS(close(host_procfs_pidns));
+	ASSERT_SUCCESS(close(dummy_procfs));
+	ASSERT_SUCCESS(close(dummy_procfs_pidns));
+}
+
+TEST_F(ns, fsconfig_pidns_get_pidns_ioctl)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	/* fsmount returns an O_PATH, which ioctl(2) doesn't accept. */
+	int new_mountfd = openat(mountfd, ".", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
+	ASSERT_SUCCESS(new_mountfd);
+
+	ASSERT_SUCCESS(close(mountfd));
+	mountfd = -EBADF;
+
+	int procfs_pidns = ioctl(new_mountfd, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(procfs_pidns);
+
+	ASSERT_NE(self->dummy_pidns, procfs_pidns);
+	ASSERT_FALSE(is_same_inode(self->host_pidns, procfs_pidns));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, procfs_pidns));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(new_mountfd));
+	ASSERT_SUCCESS(close(procfs_pidns));
+}
+
+TEST_HARNESS_MAIN

-- 
2.50.1


^ permalink raw reply related

* [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

/proc has historically had very opaque semantics about PID namespaces,
which is a little unfortunate for container runtimes and other programs
that deal with switching namespaces very often. One common issue is that
of converting between PIDs in the process's namespace and PIDs in the
namespace of /proc.

In principle, it is possible to do this today by opening a pidfd with
pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
contain a PID value translated to the pid namespace associated with that
procfs superblock). However, allocating a new file for each PID to be
converted is less than ideal for programs that may need to scan procfs,
and it is generally useful for userspace to be able to finally get this
information from procfs.

So, add a new API to get the pid namespace of a procfs instance, in the
form of an ioctl(2) you can call on the root directory of said procfs.
The returned file descriptor will have O_CLOEXEC set. This acts as a
sister feature to the new "pidns" mount option, finally allowing
userspace full control of the pid namespaces associated with procfs
instances.

The permission model for this is a bit looser than that of the "pidns"
mount option (and also setns(2)) because /proc/1/ns/pid provides the
same information, so as long as you have access to that magic-link (or
something equivalently reasonable such as being in an ancestor pid
namespace) it makes sense to allow userspace to grab a handle. Ideally
we would check for ptrace-read access against all processes in the pidns
(which is very likely to be true for at least one process, as
SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
programs), but this would obviously not scale.

setns(2) will still have their own permission checks, so being able to
open a pidns handle doesn't really provide too many other capabilities.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/proc.rst |  4 +++
 fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/fs.h            |  4 +++
 3 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5a157dadea0b..840f820fb467 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2400,6 +2400,10 @@ will use the calling process's active pid namespace. Note that the pid
 namespace of an existing procfs instance cannot be modified (attempting to do
 so will give an `-EBUSY` error).
 
+Processes can check which pid namespace is used by a procfs instance by using
+the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
+instance.
+
 Chapter 5: Filesystem behavior
 ==============================
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index fd1f1c8a939a..ac9b115fad7b 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,8 +23,10 @@
 #include <linux/cred.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
+#include <linux/ptrace.h>
 
 #include "internal.h"
+#include "../internal.h"
 
 struct proc_fs_context {
 	struct pid_namespace	*pid_ns;
@@ -426,15 +428,77 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
 	return proc_pid_readdir(file, ctx);
 }
 
+static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case PROCFS_GET_PID_NAMESPACE: {
+#ifdef CONFIG_PID_NS
+		struct pid_namespace *active = task_active_pid_ns(current);
+		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
+		bool can_access_pidns = false;
+
+		/*
+		 * Having a handle to a pidns is not sufficient to do anything
+		 * particularly harmful, as setns(2) has its own separate
+		 * privilege checks. So, we can loosen the privilege
+		 * requirements here a little to make this more ergonomic.
+		 *
+		 * If we are in an ancestor pidns of the pidns, then we can
+		 * already address any process in the pidns. From a setns(2)
+		 * privileges perspective, we can create a pidfd which setns(2)
+		 * would also accept (pending any privilege checks).
+		 *
+		 * If we are not in an ancestor pidns, because this operation
+		 * is being done on the root of the /proc instance, the caller
+		 * can try to access /proc/1/ns/pid which is equivalent to this
+		 * ioctl and so we should copy the PTRACE_MODE_READ_FSCREDS
+		 * permission model use by proc_ns_get_link(). Ideally we would
+		 * check for ptrace-read access against all processes in the
+		 * pidns (which is very likely to be true for at least one
+		 * process, as SUID_DUMP_DISABLE is cleared on exec(2) and is
+		 * rarely set by most programs), but this would obviously not
+		 * scale.
+		 *
+		 * If there is no root process, then there is no real downside
+		 * to unprivileged users to open a handle to it.
+		 */
+		can_access_pidns = pidns_is_ancestor(ns, active);
+		if (!can_access_pidns) {
+			bool cannot_ptrace_pid1 = false;
+
+			read_lock(&tasklist_lock);
+			if (ns->child_reaper)
+				cannot_ptrace_pid1 = ptrace_may_access(ns->child_reaper,
+								       PTRACE_MODE_READ_FSCREDS);
+			read_unlock(&tasklist_lock);
+			can_access_pidns = !cannot_ptrace_pid1;
+		}
+		if (!can_access_pidns)
+			return -EPERM;
+
+		/* open_namespace() unconditionally consumes the reference. */
+		get_pid_ns(ns);
+		return open_namespace(to_ns_common(ns));
+#else
+		return -EOPNOTSUPP;
+#endif
+	}
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
 /*
  * The root /proc directory is special, as it has the
  * <pid> directories. Thus we don't use the generic
  * directory handling functions for that..
  */
 static const struct file_operations proc_root_operations = {
-	.read		 = generic_read_dir,
-	.iterate_shared	 = proc_root_readdir,
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_root_readdir,
 	.llseek		= generic_file_llseek,
+	.unlocked_ioctl = proc_root_ioctl,
+	.compat_ioctl   = compat_ptr_ioctl,
 };
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0bd678a4a10e..68e65e6d7d6b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
 			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
 			 RWF_DONTCACHE)
 
+/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
 #define PROCFS_IOCTL_MAGIC 'f'
 
+/* procfs root ioctls */
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
+
 /* Pagemap ioctl */
 #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
 

-- 
2.50.1


^ permalink raw reply related

* [PATCH v4 2/4] procfs: add "pidns" mount option
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

Since the introduction of pid namespaces, their interaction with procfs
has been entirely implicit in ways that require a lot of dancing around
by programs that need to construct sandboxes with different PID
namespaces.

Being able to explicitly specify the pid namespace to use when
constructing a procfs super block will allow programs to no longer need
to fork off a process which does then does unshare(2) / setns(2) and
forks again in order to construct a procfs in a pidns.

So, provide a "pidns" mount option which allows such users to just
explicitly state which pid namespace they want that procfs instance to
use. This interface can be used with fsconfig(2) either with a file
descriptor or a path:

  fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
  fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or with classic mount(2) / mount(8):

  // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
  mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

As this new API is effectively shorthand for setns(2) followed by
mount(2), the permission model for this mirrors pidns_install() to avoid
opening up new attack surfaces by loosening the existing permission
model.

In order to avoid having to RCU-protect all users of proc_pid_ns() (to
avoid UAFs), attempting to reconfigure an existing procfs instance's pid
namespace will error out with -EBUSY. Creating new procfs instances is
quite cheap, so this should not be an impediment to most users, and lets
us avoid a lot of churn in fs/proc/* for a feature that it seems
unlikely userspace would use.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/proc.rst |  8 ++++
 fs/proc/root.c                     | 98 +++++++++++++++++++++++++++++++++++---
 2 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5236cb52e357..5a157dadea0b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2360,6 +2360,7 @@ The following mount options are supported:
 	hidepid=	Set /proc/<pid>/ access mode.
 	gid=		Set the group authorized to learn processes information.
 	subset=		Show only the specified subset of procfs.
+	pidns=		Specify a the namespace used by this procfs.
 	=========	========================================================
 
 hidepid=off or hidepid=0 means classic mode - everybody may access all
@@ -2392,6 +2393,13 @@ information about processes information, just add identd to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+pidns= specifies a pid namespace (either as a string path to something like
+`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
+will be used by the procfs instance when translating pids. By default, procfs
+will use the calling process's active pid namespace. Note that the pid
+namespace of an existing procfs instance cannot be modified (attempting to do
+so will give an `-EBUSY` error).
+
 Chapter 5: Filesystem behavior
 ==============================
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index ed86ac710384..fd1f1c8a939a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -38,12 +38,14 @@ enum proc_param {
 	Opt_gid,
 	Opt_hidepid,
 	Opt_subset,
+	Opt_pidns,
 };
 
 static const struct fs_parameter_spec proc_fs_parameters[] = {
-	fsparam_u32("gid",	Opt_gid),
+	fsparam_u32("gid",		Opt_gid),
 	fsparam_string("hidepid",	Opt_hidepid),
 	fsparam_string("subset",	Opt_subset),
+	fsparam_file_or_string("pidns",	Opt_pidns),
 	{}
 };
 
@@ -109,11 +111,66 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
 	return 0;
 }
 
+#ifdef CONFIG_PID_NS
+static int proc_parse_pidns_param(struct fs_context *fc,
+				  struct fs_parameter *param,
+				  struct fs_parse_result *result)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+	struct pid_namespace *target, *active = task_active_pid_ns(current);
+	struct ns_common *ns;
+	struct file *ns_filp __free(fput) = NULL;
+
+	switch (param->type) {
+	case fs_value_is_file:
+		/* came through fsconfig, steal the file reference */
+		ns_filp = no_free_ptr(param->file);
+		break;
+	case fs_value_is_string:
+		ns_filp = filp_open(param->string, O_RDONLY, 0);
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		break;
+	}
+	if (!ns_filp)
+		ns_filp = ERR_PTR(-EBADF);
+	if (IS_ERR(ns_filp)) {
+		errorfc(fc, "could not get file from pidns argument");
+		return PTR_ERR(ns_filp);
+	}
+
+	if (!proc_ns_file(ns_filp))
+		return invalfc(fc, "pidns argument is not an nsfs file");
+	ns = get_proc_ns(file_inode(ns_filp));
+	if (ns->ops->type != CLONE_NEWPID)
+		return invalfc(fc, "pidns argument is not a pidns file");
+	target = container_of(ns, struct pid_namespace, ns);
+
+	/*
+	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
+	 * permission model should be the same as pidns_install().
+	 */
+	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
+		errorfc(fc, "insufficient permissions to set pidns");
+		return -EPERM;
+	}
+	if (!pidns_is_ancestor(target, active))
+		return invalfc(fc, "cannot set pidns to non-descendant pidns");
+
+	put_pid_ns(ctx->pid_ns);
+	ctx->pid_ns = get_pid_ns(target);
+	put_user_ns(fc->user_ns);
+	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
+	return 0;
+}
+#endif /* CONFIG_PID_NS */
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 	struct fs_parse_result result;
-	int opt;
+	int opt, err;
 
 	opt = fs_parse(fc, proc_fs_parameters, param, &result);
 	if (opt < 0)
@@ -125,14 +182,38 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		break;
 
 	case Opt_hidepid:
-		if (proc_parse_hidepid_param(fc, param))
-			return -EINVAL;
+		err = proc_parse_hidepid_param(fc, param);
+		if (err)
+			return err;
 		break;
 
 	case Opt_subset:
-		if (proc_parse_subset_param(fc, param->string) < 0)
-			return -EINVAL;
+		err = proc_parse_subset_param(fc, param->string);
+		if (err)
+			return err;
+		break;
+
+	case Opt_pidns:
+#ifdef CONFIG_PID_NS
+		/*
+		 * We would have to RCU-protect every proc_pid_ns() or
+		 * proc_sb_info() access if we allowed this to be reconfigured
+		 * for an existing procfs instance. Luckily, procfs instances
+		 * are cheap to create, and mount-beneath would let you
+		 * atomically replace an instance even with overmounts.
+		 */
+		if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) {
+			errorfc(fc, "cannot reconfigure pidns for existing procfs");
+			return -EBUSY;
+		}
+		err = proc_parse_pidns_param(fc, param, &result);
+		if (err)
+			return err;
 		break;
+#else
+		errorfc(fc, "pidns mount flag not supported on this system");
+		return -EOPNOTSUPP;
+#endif
 
 	default:
 		return -EINVAL;
@@ -154,6 +235,11 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset))
 		fs_info->pidonly = ctx->pidonly;
+	if (ctx->mask & (1 << Opt_pidns) &&
+	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
+		put_pid_ns(fs_info->pid_ns);
+		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	}
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)

-- 
2.50.1


^ permalink raw reply related

* [PATCH v4 1/4] pidns: move is-ancestor logic to helper
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

This check will be needed in later patches, and there's no point
open-coding it each time.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/pid_namespace.h |  9 +++++++++
 kernel/pid_namespace.c        | 22 ++++++++++++++--------
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 7c67a5811199..17fdc059f8da 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -84,6 +84,9 @@ extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 extern void put_pid_ns(struct pid_namespace *ns);
 
+extern bool pidns_is_ancestor(struct pid_namespace *child,
+			      struct pid_namespace *ancestor);
+
 #else /* !CONFIG_PID_NS */
 #include <linux/err.h>
 
@@ -118,6 +121,12 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 {
 	return 0;
 }
+
+static inline bool pidns_is_ancestor(struct pid_namespace *child,
+				     struct pid_namespace *ancestor)
+{
+	return false;
+}
 #endif /* CONFIG_PID_NS */
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 7098ed44e717..b7b45c2597ec 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -390,11 +390,23 @@ static void pidns_put(struct ns_common *ns)
 	put_pid_ns(to_pid_ns(ns));
 }
 
+bool pidns_is_ancestor(struct pid_namespace *child,
+		       struct pid_namespace *ancestor)
+{
+	struct pid_namespace *ns;
+
+	if (child->level < ancestor->level)
+		return false;
+	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
+		;
+	return ns == ancestor;
+}
+
 static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 {
 	struct nsproxy *nsproxy = nsset->nsproxy;
 	struct pid_namespace *active = task_active_pid_ns(current);
-	struct pid_namespace *ancestor, *new = to_pid_ns(ns);
+	struct pid_namespace *new = to_pid_ns(ns);
 
 	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
 	    !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
@@ -408,13 +420,7 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 	 * this maintains the property that processes and their
 	 * children can not escape their current pid namespace.
 	 */
-	if (new->level < active->level)
-		return -EINVAL;
-
-	ancestor = new;
-	while (ancestor->level > active->level)
-		ancestor = ancestor->parent;
-	if (ancestor != active)
+	if (!pidns_is_ancestor(new, active))
 		return -EINVAL;
 
 	put_pid_ns(nsproxy->pid_ns_for_children);

-- 
2.50.1


^ permalink raw reply related

* Re: [PATCH v2 14/32] liveupdate: luo_files: add infrastructure for FDs
From: Pasha Tatashin @ 2025-08-04 23:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250729173318.GQ36037@nvidia.com>

> > +struct liveupdate_file_ops {
> > +     int (*prepare)(struct file *file, void *arg, u64 *data);
> > +     int (*freeze)(struct file *file, void *arg, u64 *data);
> > +     void (*cancel)(struct file *file, void *arg, u64 data);
> > +     void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
> > +     int (*retrieve)(void *arg, u64 data, struct file **file);
> > +     bool (*can_preserve)(struct file *file, void *arg);
> > +};
>
> ops structures often have an owner = THIS_MODULE

Added here, and to subsystems.

>
> It wouldn't hurt to add it here too, and some appropriate module_get's
> though I didn't try to figure what happens if userspace races a module
> unload with other luo operations.

I added try_module_get()/module_put() to register/unregister functions.

> > +
> > +/**
> > + * struct liveupdate_file_handler - Represents a handler for a live-updatable
> > + * file type.
> > + * @ops:           Callback functions
> > + * @compatible:    The compatibility string (e.g., "memfd-v1", "vfiofd-v1")
> > + *                 that uniquely identifies the file type this handler supports.
> > + *                 This is matched against the compatible string associated with
> > + *                 individual &struct liveupdate_file instances.
> > + * @arg:           An opaque pointer to implementation-specific context data
> > + *                 associated with this file handler registration.
>
> Why? This is not the normal way, if you want context data then
> allocate a struct driver_liveupdate_file_handler and embed a normal
> struct liveupdate_file_handler inside it, then use container_of.

Good point. I removed arg, and added handler as an argument to the
callback functions.

> > +     fdt_for_each_subnode(file_node_offset, luo_file_fdt_in, 0) {
> > +             bool handler_found = false;
> > +             u64 token;
> > +
> > +             node_name = fdt_get_name(luo_file_fdt_in, file_node_offset,
> > +                                      NULL);
> > +             if (!node_name) {
> > +                     panic("FDT subnode at offset %d: Cannot get name\n",
> > +                           file_node_offset);
>
> I think this approach will raise lots of questions..
>
> I'd introduce a new function "luo_deserialize_failure" that does panic
> internally.
>
> Only called by places that are parsing the FDT & related but run into
> trouble that cannot be savely recovered from.

Agreed. I added a new macro in luo_internal.h:

 11 /*
 12  * Handles a deserialization failure: devices and memory is in
unpredictable
 13  * state.
 14  *
 15  * Continuing the boot process after a failure is dangerous
because it could
 16  * lead to leaks of private data.
 17  */
 18 #define luo_restore_fail(__fmt, ...) panic(__fmt, ##__VA_ARGS__)

And use it in places where we panic during deserialization.

Pasha

^ permalink raw reply

* Re: [fuse-devel] copy_file_range return value on FUSE
From: Florian Weimer @ 2025-08-04 14:30 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: fuse-devel, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <CAJfpegur9fUQ8MaOqrE-XrGUDK40+PGQeMZ+AzzpX6hNV_BKsw@mail.gmail.com>

* Miklos Szeredi:

> On Mon, 4 Aug 2025 at 11:42, Florian Weimer via fuse-devel
> <fuse-devel@lists.sourceforge.net> wrote:
>>
>> The FUSE protocol uses struct fuse_write_out to convey the return value
>> of copy_file_range, which is restricted to uint32_t.  But the
>> copy_file_range interface supports a 64-bit copy operation.  Given that
>> copy_file_range is expected to clone huge files, large copies are not
>> unexpected, so this appears to be a real limitation.
>
> That's a nasty oversight.  Fixing with a new FUSE_COPY_FILE_RANGE_64
> op, fallback to the legacy FUSE_COPY_FILE_RANGE.

Or adding a capability flag to switch from struct fuse_write_out to
something that uses an uint64_t value.  One complication: The struct
fuse_write_out layout is too close to a potential 64-bit version of it
on little-endian systems, so that proper testing might be difficult with
the obvious approach.

>> There is another wrinkle: we'd need to check if the process runs in
>> 32-bit compat mode, and reject size_t arguments larger than INT_MAX in
>> this case (with EOVERFLOW presumably).  But perhaps this should be
>> handled on the kernel side?  Currently, this doesn't seem to happen, and
>> we can get copy_file_range results in the in-band error range.
>> Applications have no way to disambiguate this.
>
> That's not fuse specific, right?

In-kernel file systems can check if the request originated from a compat
process, using in_compat_syscall.  I don't think that's possible over
FUSE.

Thanks,
Florian


^ permalink raw reply

* Re: [PATCH v3 1/2] man/man2/mremap.2: describe multiple mapping move
From: Lorenzo Stoakes @ 2025-08-04 13:31 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <dizfk5jqwwzozotvkk6phb36blcpsve3yw53xrdjwqeut27djt@rftbd7lkgsjp>

On Sun, Aug 03, 2025 at 04:17:39PM +0200, Alejandro Colomar wrote:
> Hi Lorenzo,
>
> On Sun, Aug 03, 2025 at 12:15:15PM +0100, Lorenzo Stoakes wrote:
> > On Sun, Aug 03, 2025 at 08:47:28AM +0200, Alejandro Colomar wrote:
> > > Would it be possible to write a small C program that uses this new
> > > feature?
> >
> > I could do, but it's an unusual use of mremap() and we don't currently have
> > example C code for the _general_ usage so I wonder if it might be somewhat
> > misleading to have example code only for this?
>
> I didn't mean for the manual page.  It was more for helping review the
> changes.  If it's very small, it would be useful to include it in the
> commit message.  What do you think?

Ack I can do.

In discussion with colleagues I also realise I have to add a little more
information to 2/2 as well, so worth a respin anyway.

>
>
> Have a lovely day!
> Alex
>

Cheers, Lorenzo

^ permalink raw reply

* Re: [fuse-devel] copy_file_range return value on FUSE
From: Miklos Szeredi @ 2025-08-04 13:30 UTC (permalink / raw)
  To: Florian Weimer; +Cc: fuse-devel, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <lhuh5ynl8z5.fsf@oldenburg.str.redhat.com>

On Mon, 4 Aug 2025 at 11:42, Florian Weimer via fuse-devel
<fuse-devel@lists.sourceforge.net> wrote:
>
> The FUSE protocol uses struct fuse_write_out to convey the return value
> of copy_file_range, which is restricted to uint32_t.  But the
> copy_file_range interface supports a 64-bit copy operation.  Given that
> copy_file_range is expected to clone huge files, large copies are not
> unexpected, so this appears to be a real limitation.

That's a nasty oversight.  Fixing with a new FUSE_COPY_FILE_RANGE_64
op, fallback to the legacy FUSE_COPY_FILE_RANGE.

> There is another wrinkle: we'd need to check if the process runs in
> 32-bit compat mode, and reject size_t arguments larger than INT_MAX in
> this case (with EOVERFLOW presumably).  But perhaps this should be
> handled on the kernel side?  Currently, this doesn't seem to happen, and
> we can get copy_file_range results in the in-band error range.
> Applications have no way to disambiguate this.

That's not fuse specific, right?

Thanks,
Miklos

^ permalink raw reply

* copy_file_range return value on FUSE
From: Florian Weimer @ 2025-08-04  9:41 UTC (permalink / raw)
  To: fuse-devel, linux-api, linux-fsdevel, linux-kernel

The FUSE protocol uses struct fuse_write_out to convey the return value
of copy_file_range, which is restricted to uint32_t.  But the
copy_file_range interface supports a 64-bit copy operation.  Given that
copy_file_range is expected to clone huge files, large copies are not
unexpected, so this appears to be a real limitation.

There is another wrinkle: we'd need to check if the process runs in
32-bit compat mode, and reject size_t arguments larger than INT_MAX in
this case (with EOVERFLOW presumably).  But perhaps this should be
handled on the kernel side?  Currently, this doesn't seem to happen, and
we can get copy_file_range results in the in-band error range.
Applications have no way to disambiguate this.

Thanks,
Florian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox