[PATCH 0/4] pidfs: implement file handle support

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] pidfs: implement file handle support
@ 2024-11-01 13:54 Erin Shepherd
  2024-11-01 13:54 ` [PATCH 1/4] pseudofs: add support for export_ops Erin Shepherd
                   ` (4 more replies)
  0 siblings, 5 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-01 13:54 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: christian, paul, bluca, erin.shepherd

Since the introduction of pidfs, we have had 64-bit process identifiers 
that will not be reused for the entire uptime of the system. This greatly 
facilitates process tracking in userspace.

There are two limitations at present:

 * These identifiers are currently only exposed to processes on 64-bit 
   systems. On 32-bit systems, inode space is also limited to 32 bits and 
   therefore is subject to the same reuse issues.
 * There is no way to go from one of these unique identifiers to a pid or 
   pidfd.

Patch 1 & 2 in this stack implements fh_export for pidfs. This means 
userspace  can retrieve a unique process identifier even on 32-bit systems 
via name_to_handle_at.

Patch 3 & 4 in this stack implement fh_to_dentry for pidfs. This means 
userspace can convert back from a file handle to the corresponding pidfd. 
To support us going from a file handle to a pidfd, we have to store a pid 
inside the file handle. To ensure file handles are invariant and can move 
between pid namespaces, we stash a pid from the initial namespace inside 
the file handle.

I'm not quite sure if stashing an initial-namespace pid inside the file 
handle is the right approach here; if not, I think that patch 1 & 2 are 
useful on their own.

Erin Shepherd (4):
  pseudofs: add support for export_ops
  pidfs: implement file handle export support
  pid: introduce find_get_pid_ns
  pidfs: implement fh_to_dentry

 fs/libfs.c                |  1 +
 fs/pidfs.c                | 57 +++++++++++++++++++++++++++++++++++++++
 include/linux/pid.h       |  1 +
 include/linux/pseudo_fs.h |  1 +
 kernel/pid.c              | 10 +++++--
 5 files changed, 68 insertions(+), 2 deletions(-)

-- 
2.46.1

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 1/4] pseudofs: add support for export_ops
  2024-11-01 13:54 [PATCH 0/4] pidfs: implement file handle support Erin Shepherd
@ 2024-11-01 13:54 ` Erin Shepherd
  2024-11-12 15:56   ` Amir Goldstein
  2024-11-01 13:54 ` [PATCH 2/4] pidfs: implement file handle export support Erin Shepherd
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-01 13:54 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: christian, paul, bluca, erin.shepherd

Pseudo-filesystems might reasonably wish to implement the export ops
(particularly for name_to_handle_at/open_by_handle_at); plumb this
through pseudo_fs_context

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
 fs/libfs.c                | 1 +
 include/linux/pseudo_fs.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 46966fd8bcf9..698a2ddfd0cb 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -669,6 +669,7 @@ static int pseudo_fs_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_blocksize_bits = PAGE_SHIFT;
 	s->s_magic = ctx->magic;
 	s->s_op = ctx->ops ?: &simple_super_operations;
+	s->s_export_op = ctx->eops;
 	s->s_xattr = ctx->xattr;
 	s->s_time_gran = 1;
 	root = new_inode(s);
diff --git a/include/linux/pseudo_fs.h b/include/linux/pseudo_fs.h
index 730f77381d55..2503f7625d65 100644
--- a/include/linux/pseudo_fs.h
+++ b/include/linux/pseudo_fs.h
@@ -5,6 +5,7 @@
 
 struct pseudo_fs_context {
 	const struct super_operations *ops;
+	const struct export_operations *eops;
 	const struct xattr_handler * const *xattr;
 	const struct dentry_operations *dops;
 	unsigned long magic;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/4] pseudofs: add support for export_ops
  2024-11-01 13:54 ` [PATCH 1/4] pseudofs: add support for export_ops Erin Shepherd
@ 2024-11-12 15:56   ` Amir Goldstein
  0 siblings, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-12 15:56 UTC (permalink / raw)
  To: Erin Shepherd; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On Fri, Nov 1, 2024 at 2:55 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> Pseudo-filesystems might reasonably wish to implement the export ops
> (particularly for name_to_handle_at/open_by_handle_at); plumb this
> through pseudo_fs_context
>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  fs/libfs.c                | 1 +
>  include/linux/pseudo_fs.h | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 46966fd8bcf9..698a2ddfd0cb 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -669,6 +669,7 @@ static int pseudo_fs_fill_super(struct super_block *s, struct fs_context *fc)
>         s->s_blocksize_bits = PAGE_SHIFT;
>         s->s_magic = ctx->magic;
>         s->s_op = ctx->ops ?: &simple_super_operations;
> +       s->s_export_op = ctx->eops;
>         s->s_xattr = ctx->xattr;
>         s->s_time_gran = 1;
>         root = new_inode(s);
> diff --git a/include/linux/pseudo_fs.h b/include/linux/pseudo_fs.h
> index 730f77381d55..2503f7625d65 100644
> --- a/include/linux/pseudo_fs.h
> +++ b/include/linux/pseudo_fs.h
> @@ -5,6 +5,7 @@
>
>  struct pseudo_fs_context {
>         const struct super_operations *ops;
> +       const struct export_operations *eops;
>         const struct xattr_handler * const *xattr;
>         const struct dentry_operations *dops;
>         unsigned long magic;
> --
> 2.46.1
>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 2/4] pidfs: implement file handle export support
  2024-11-01 13:54 [PATCH 0/4] pidfs: implement file handle support Erin Shepherd
  2024-11-01 13:54 ` [PATCH 1/4] pseudofs: add support for export_ops Erin Shepherd
@ 2024-11-01 13:54 ` Erin Shepherd
  2024-11-12 15:55   ` Amir Goldstein
  2024-11-01 13:54 ` [PATCH 3/4] pid: introduce find_get_pid_ns Erin Shepherd
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-01 13:54 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: christian, paul, bluca, erin.shepherd

On 64-bit platforms, userspace can read the pidfd's inode in order to
get a never-repeated PID identifier. On 32-bit platforms this identifier
is not exposed, as inodes are limited to 32 bits. Instead expose the
identifier via export_fh, which makes it available to userspace via
name_to_handle_at

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
 fs/pidfs.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 80675b6bf884..c8e7e9011550 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/anon_inodes.h>
+#include <linux/exportfs.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/magic.h>
@@ -347,6 +348,25 @@ static const struct dentry_operations pidfs_dentry_operations = {
 	.d_prune	= stashed_dentry_prune,
 };
 
+static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
+			   struct inode *parent)
+{
+	struct pid *pid = inode->i_private;
+
+	if (*max_len < 2) {
+		*max_len = 2;
+		return FILEID_INVALID;
+	}
+
+	*max_len = 2;
+	*(u64 *)fh = pid->ino;
+	return FILEID_KERNFS;
+}
+
+static const struct export_operations pidfs_export_operations = {
+	.encode_fh = pidfs_encode_fh,
+};
+
 static int pidfs_init_inode(struct inode *inode, void *data)
 {
 	inode->i_private = data;
@@ -382,6 +402,7 @@ static int pidfs_init_fs_context(struct fs_context *fc)
 		return -ENOMEM;
 
 	ctx->ops = &pidfs_sops;
+	ctx->eops = &pidfs_export_operations;
 	ctx->dops = &pidfs_dentry_operations;
 	fc->s_fs_info = (void *)&pidfs_stashed_ops;
 	return 0;
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/4] pidfs: implement file handle export support
  2024-11-01 13:54 ` [PATCH 2/4] pidfs: implement file handle export support Erin Shepherd
@ 2024-11-12 15:55   ` Amir Goldstein
  0 siblings, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-12 15:55 UTC (permalink / raw)
  To: Erin Shepherd; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On Fri, Nov 1, 2024 at 2:55 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> On 64-bit platforms, userspace can read the pidfd's inode in order to
> get a never-repeated PID identifier. On 32-bit platforms this identifier
> is not exposed, as inodes are limited to 32 bits. Instead expose the
> identifier via export_fh, which makes it available to userspace via
> name_to_handle_at
>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  fs/pidfs.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 80675b6bf884..c8e7e9011550 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/anon_inodes.h>
> +#include <linux/exportfs.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
>  #include <linux/magic.h>
> @@ -347,6 +348,25 @@ static const struct dentry_operations pidfs_dentry_operations = {
>         .d_prune        = stashed_dentry_prune,
>  };
>
> +static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
> +                          struct inode *parent)
> +{
> +       struct pid *pid = inode->i_private;
> +
> +       if (*max_len < 2) {
> +               *max_len = 2;
> +               return FILEID_INVALID;
> +       }
> +
> +       *max_len = 2;
> +       *(u64 *)fh = pid->ino;
> +       return FILEID_KERNFS;
> +}
> +
> +static const struct export_operations pidfs_export_operations = {
> +       .encode_fh = pidfs_encode_fh,
> +};
> +
>  static int pidfs_init_inode(struct inode *inode, void *data)
>  {
>         inode->i_private = data;
> @@ -382,6 +402,7 @@ static int pidfs_init_fs_context(struct fs_context *fc)
>                 return -ENOMEM;
>
>         ctx->ops = &pidfs_sops;
> +       ctx->eops = &pidfs_export_operations;
>         ctx->dops = &pidfs_dentry_operations;
>         fc->s_fs_info = (void *)&pidfs_stashed_ops;
>         return 0;
> --
> 2.46.1
>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 3/4] pid: introduce find_get_pid_ns
  2024-11-01 13:54 [PATCH 0/4] pidfs: implement file handle support Erin Shepherd
  2024-11-01 13:54 ` [PATCH 1/4] pseudofs: add support for export_ops Erin Shepherd
  2024-11-01 13:54 ` [PATCH 2/4] pidfs: implement file handle export support Erin Shepherd
@ 2024-11-01 13:54 ` Erin Shepherd
  2024-11-12 15:59   ` Amir Goldstein
  2024-11-01 13:54 ` [PATCH 4/4] pidfs: implement fh_to_dentry Erin Shepherd
  2024-11-12 13:10 ` [PATCH 0/4] pidfs: implement file handle support Christian Brauner
  4 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-01 13:54 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: christian, paul, bluca, erin.shepherd

In some situations it is useful to be able to atomically get a PID
from a specific PID namespace.

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
 include/linux/pid.h |  1 +
 kernel/pid.c        | 10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index a3aad9b4074c..965f8b3ff9a8 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -124,6 +124,7 @@ extern struct pid *find_vpid(int nr);
 /*
  * Lookup a PID in the hash table, and return with it's count elevated.
  */
+extern struct pid *find_get_pid_ns(int nr, struct pid_namespace *ns);
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 2715afb77eab..2967f8a98330 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -470,16 +470,22 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 }
 EXPORT_SYMBOL_GPL(get_pid_task);
 
-struct pid *find_get_pid(pid_t nr)
+struct pid *find_get_pid_ns(pid_t nr, struct pid_namespace *ns)
 {
 	struct pid *pid;
 
 	rcu_read_lock();
-	pid = get_pid(find_vpid(nr));
+	pid = get_pid(find_pid_ns(nr, ns));
 	rcu_read_unlock();
 
 	return pid;
 }
+EXPORT_SYMBOL_GPL(find_get_pid_ns);
+
+struct pid *find_get_pid(pid_t nr)
+{
+	return find_get_pid_ns(nr, task_active_pid_ns(current));
+}
 EXPORT_SYMBOL_GPL(find_get_pid);
 
 pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/4] pid: introduce find_get_pid_ns
  2024-11-01 13:54 ` [PATCH 3/4] pid: introduce find_get_pid_ns Erin Shepherd
@ 2024-11-12 15:59   ` Amir Goldstein
  0 siblings, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-12 15:59 UTC (permalink / raw)
  To: Erin Shepherd; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On Fri, Nov 1, 2024 at 2:56 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> In some situations it is useful to be able to atomically get a PID
> from a specific PID namespace.
>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  include/linux/pid.h |  1 +
>  kernel/pid.c        | 10 ++++++++--
>  2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index a3aad9b4074c..965f8b3ff9a8 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -124,6 +124,7 @@ extern struct pid *find_vpid(int nr);
>  /*
>   * Lookup a PID in the hash table, and return with it's count elevated.
>   */
> +extern struct pid *find_get_pid_ns(int nr, struct pid_namespace *ns);
>  extern struct pid *find_get_pid(int nr);
>  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 2715afb77eab..2967f8a98330 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -470,16 +470,22 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
>  }
>  EXPORT_SYMBOL_GPL(get_pid_task);
>
> -struct pid *find_get_pid(pid_t nr)
> +struct pid *find_get_pid_ns(pid_t nr, struct pid_namespace *ns)
>  {
>         struct pid *pid;
>
>         rcu_read_lock();
> -       pid = get_pid(find_vpid(nr));
> +       pid = get_pid(find_pid_ns(nr, ns));
>         rcu_read_unlock();
>
>         return pid;
>  }
> +EXPORT_SYMBOL_GPL(find_get_pid_ns);
> +
> +struct pid *find_get_pid(pid_t nr)
> +{
> +       return find_get_pid_ns(nr, task_active_pid_ns(current));
> +}
>  EXPORT_SYMBOL_GPL(find_get_pid);
>
>  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
> --
> 2.46.1
>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-01 13:54 [PATCH 0/4] pidfs: implement file handle support Erin Shepherd
                   ` (2 preceding siblings ...)
  2024-11-01 13:54 ` [PATCH 3/4] pid: introduce find_get_pid_ns Erin Shepherd
@ 2024-11-01 13:54 ` Erin Shepherd
  2024-11-12 16:33   ` Amir Goldstein
                     ` (2 more replies)
  2024-11-12 13:10 ` [PATCH 0/4] pidfs: implement file handle support Christian Brauner
  4 siblings, 3 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-01 13:54 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: christian, paul, bluca, erin.shepherd

This enables userspace to use name_to_handle_at to recover a pidfd
to a process.

We stash the process' PID in the root pid namespace inside the handle,
and use that to recover the pid (validating that pid->ino matches the
value in the handle, i.e. that the pid has not been reused).

We use the root namespace in order to ensure that file handles can be
moved across namespaces; however, we validate that the PID exists in
the current namespace before returning the inode.

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
 fs/pidfs.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index c8e7e9011550..2d66610ef385 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -348,23 +348,59 @@ static const struct dentry_operations pidfs_dentry_operations = {
 	.d_prune	= stashed_dentry_prune,
 };
 
-static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
+#define PIDFD_FID_LEN 3
+
+struct pidfd_fid {
+	u64 ino;
+	s32 pid;
+} __packed;
+
+static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
 			   struct inode *parent)
 {
 	struct pid *pid = inode->i_private;
-	
-	if (*max_len < 2) {
-		*max_len = 2;
+	struct pidfd_fid *fid = (struct pidfd_fid *)fh;
+
+	if (*max_len < PIDFD_FID_LEN) {
+		*max_len = PIDFD_FID_LEN;
 		return FILEID_INVALID;
 	}
 
-	*max_len = 2;
-	*(u64 *)fh = pid->ino;
-	return FILEID_KERNFS;
+	fid->ino = pid->ino;
+	fid->pid = pid_nr(pid);
+	*max_len = PIDFD_FID_LEN;
+	return FILEID_INO64_GEN;
+}
+
+static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
+					 struct fid *gen_fid,
+					 int fh_len, int fh_type)
+{
+	int ret;
+	struct path path;
+	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
+	struct pid *pid;
+
+	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
+		return NULL;
+
+	pid = find_get_pid_ns(fid->pid, &init_pid_ns);
+	if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0) {
+		put_pid(pid);
+		return NULL;
+	}
+
+	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	mntput(path.mnt);
+	return path.dentry;
 }
 
 static const struct export_operations pidfs_export_operations = {
 	.encode_fh = pidfs_encode_fh,
+	.fh_to_dentry = pidfs_fh_to_dentry,
 };
 
 static int pidfs_init_inode(struct inode *inode, void *data)
-- 
2.46.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-01 13:54 ` [PATCH 4/4] pidfs: implement fh_to_dentry Erin Shepherd
@ 2024-11-12 16:33   ` Amir Goldstein
  2024-11-12 23:51   ` Jeff Layton
  2024-11-13 12:09   ` Christian Brauner
  2 siblings, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-12 16:33 UTC (permalink / raw)
  To: Erin Shepherd; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On Fri, Nov 1, 2024 at 3:05 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> This enables userspace to use name_to_handle_at to recover a pidfd
> to a process.
>
> We stash the process' PID in the root pid namespace inside the handle,
> and use that to recover the pid (validating that pid->ino matches the
> value in the handle, i.e. that the pid has not been reused).
>
> We use the root namespace in order to ensure that file handles can be
> moved across namespaces; however, we validate that the PID exists in
> the current namespace before returning the inode.
>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>

Functionally, this looks correct to me, so you may add:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

But I can't say that I am a good person to judge if this new functionality
can expose new information to unpriv users or allow them to get access
to processes that they could not get access to before.

Thanks,
Amir.

> ---
>  fs/pidfs.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 43 insertions(+), 7 deletions(-)
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index c8e7e9011550..2d66610ef385 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -348,23 +348,59 @@ static const struct dentry_operations pidfs_dentry_operations = {
>         .d_prune        = stashed_dentry_prune,
>  };
>
> -static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
> +#define PIDFD_FID_LEN 3
> +
> +struct pidfd_fid {
> +       u64 ino;
> +       s32 pid;
> +} __packed;
> +
> +static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
>                            struct inode *parent)
>  {
>         struct pid *pid = inode->i_private;
> -
> -       if (*max_len < 2) {
> -               *max_len = 2;
> +       struct pidfd_fid *fid = (struct pidfd_fid *)fh;
> +
> +       if (*max_len < PIDFD_FID_LEN) {
> +               *max_len = PIDFD_FID_LEN;
>                 return FILEID_INVALID;
>         }
>
> -       *max_len = 2;
> -       *(u64 *)fh = pid->ino;
> -       return FILEID_KERNFS;
> +       fid->ino = pid->ino;
> +       fid->pid = pid_nr(pid);
> +       *max_len = PIDFD_FID_LEN;
> +       return FILEID_INO64_GEN;
> +}
> +
> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
> +                                        struct fid *gen_fid,
> +                                        int fh_len, int fh_type)
> +{
> +       int ret;
> +       struct path path;
> +       struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
> +       struct pid *pid;
> +
> +       if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
> +               return NULL;
> +
> +       pid = find_get_pid_ns(fid->pid, &init_pid_ns);
> +       if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0) {
> +               put_pid(pid);
> +               return NULL;
> +       }
> +
> +       ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
> +       if (ret < 0)
> +               return ERR_PTR(ret);
> +
> +       mntput(path.mnt);
> +       return path.dentry;
>  }
>
>  static const struct export_operations pidfs_export_operations = {
>         .encode_fh = pidfs_encode_fh,
> +       .fh_to_dentry = pidfs_fh_to_dentry,
>  };
>
>  static int pidfs_init_inode(struct inode *inode, void *data)
> --
> 2.46.1
>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-01 13:54 ` [PATCH 4/4] pidfs: implement fh_to_dentry Erin Shepherd
  2024-11-12 16:33   ` Amir Goldstein
@ 2024-11-12 23:51   ` Jeff Layton
  2024-11-13  8:01     ` Amir Goldstein
  2024-11-13 12:09   ` Christian Brauner
  2 siblings, 1 reply; 57+ messages in thread
From: Jeff Layton @ 2024-11-12 23:51 UTC (permalink / raw)
  To: Erin Shepherd, linux-kernel, linux-fsdevel; +Cc: christian, paul, bluca

On Fri, 2024-11-01 at 13:54 +0000, Erin Shepherd wrote:
> This enables userspace to use name_to_handle_at to recover a pidfd
> to a process.
> 
> We stash the process' PID in the root pid namespace inside the handle,
> and use that to recover the pid (validating that pid->ino matches the
> value in the handle, i.e. that the pid has not been reused).
> 
> We use the root namespace in order to ensure that file handles can be
> moved across namespaces; however, we validate that the PID exists in
> the current namespace before returning the inode.
> 
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> ---
>  fs/pidfs.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 43 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index c8e7e9011550..2d66610ef385 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -348,23 +348,59 @@ static const struct dentry_operations pidfs_dentry_operations = {
>  	.d_prune	= stashed_dentry_prune,
>  };
>  
> -static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
> +#define PIDFD_FID_LEN 3
> +
> +struct pidfd_fid {
> +	u64 ino;
> +	s32 pid;
> +} __packed;
> +
> +static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
>  			   struct inode *parent)
>  {
>  	struct pid *pid = inode->i_private;
> -	
> -	if (*max_len < 2) {
> -		*max_len = 2;
> +	struct pidfd_fid *fid = (struct pidfd_fid *)fh;
> +
> +	if (*max_len < PIDFD_FID_LEN) {
> +		*max_len = PIDFD_FID_LEN;
>  		return FILEID_INVALID;
>  	}
>  
> -	*max_len = 2;
> -	*(u64 *)fh = pid->ino;
> -	return FILEID_KERNFS;
> +	fid->ino = pid->ino;
> +	fid->pid = pid_nr(pid);

I worry a little here. A container being able to discover its pid in
the root namespace seems a little sketchy. This makes that fairly
simple to figure out.

Maybe generate a random 32 bit value at boot time, and then xor that
into this? Then you could just reverse that and look up the pid.

> +	*max_len = PIDFD_FID_LEN;
> +	return FILEID_INO64_GEN;
> +}
> +
> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
> +					 struct fid *gen_fid,
> +					 int fh_len, int fh_type)
> +{
> +	int ret;
> +	struct path path;
> +	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
> +	struct pid *pid;
> +
> +	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
> +		return NULL;
> +
> +	pid = find_get_pid_ns(fid->pid, &init_pid_ns);
> +	if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0) {
> +		put_pid(pid);
> +		return NULL;
> +	}
> +
> +	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	mntput(path.mnt);
> +	return path.dentry;
>  }
>  
>  static const struct export_operations pidfs_export_operations = {
>  	.encode_fh = pidfs_encode_fh,
> +	.fh_to_dentry = pidfs_fh_to_dentry,
>  };
>  
>  static int pidfs_init_inode(struct inode *inode, void *data)

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-12 23:51   ` Jeff Layton
@ 2024-11-13  8:01     ` Amir Goldstein
  2024-11-13 10:11       ` Erin Shepherd
  0 siblings, 1 reply; 57+ messages in thread
From: Amir Goldstein @ 2024-11-13  8:01 UTC (permalink / raw)
  To: Jeff Layton, Erin Shepherd
  Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On Wed, Nov 13, 2024 at 1:34 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2024-11-01 at 13:54 +0000, Erin Shepherd wrote:
> > This enables userspace to use name_to_handle_at to recover a pidfd
> > to a process.
> >
> > We stash the process' PID in the root pid namespace inside the handle,
> > and use that to recover the pid (validating that pid->ino matches the
> > value in the handle, i.e. that the pid has not been reused).
> >
> > We use the root namespace in order to ensure that file handles can be
> > moved across namespaces; however, we validate that the PID exists in
> > the current namespace before returning the inode.
> >
> > Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> > ---
> >  fs/pidfs.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 43 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > index c8e7e9011550..2d66610ef385 100644
> > --- a/fs/pidfs.c
> > +++ b/fs/pidfs.c
> > @@ -348,23 +348,59 @@ static const struct dentry_operations pidfs_dentry_operations = {
> >       .d_prune        = stashed_dentry_prune,
> >  };
> >
> > -static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
> > +#define PIDFD_FID_LEN 3
> > +
> > +struct pidfd_fid {
> > +     u64 ino;
> > +     s32 pid;
> > +} __packed;
> > +
> > +static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
> >                          struct inode *parent)
> >  {
> >       struct pid *pid = inode->i_private;
> > -
> > -     if (*max_len < 2) {
> > -             *max_len = 2;
> > +     struct pidfd_fid *fid = (struct pidfd_fid *)fh;
> > +
> > +     if (*max_len < PIDFD_FID_LEN) {
> > +             *max_len = PIDFD_FID_LEN;
> >               return FILEID_INVALID;
> >       }
> >
> > -     *max_len = 2;
> > -     *(u64 *)fh = pid->ino;
> > -     return FILEID_KERNFS;
> > +     fid->ino = pid->ino;
> > +     fid->pid = pid_nr(pid);
>
> I worry a little here. A container being able to discover its pid in
> the root namespace seems a little sketchy. This makes that fairly
> simple to figure out.
>
> Maybe generate a random 32 bit value at boot time, and then xor that
> into this? Then you could just reverse that and look up the pid.
>

I don't like playing pseudo cryptographic games, we are not
crypto experts so we are bound to lose in this game.

My thinking is the other way around -
- encode FILEID_INO32_GEN with pid_nr + i_generation
- pid_nr is obviously not unique across pidns and reusable
  but that makes it just like i_ino across filesystems
- the resulting file handle is thus usable only in the pidns where
  it was encoded - is that a bad thing?

Erin,

You write that "To ensure file handles are invariant and can move
between pid namespaces, we stash a pid from the initial namespace
inside the file handle."

Why is it a requirement for userspace that pidfs file handles are
invariant to pidns?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-13  8:01     ` Amir Goldstein
@ 2024-11-13 10:11       ` Erin Shepherd
  2024-11-13 12:21         ` Christian Brauner
  0 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-13 10:11 UTC (permalink / raw)
  To: Amir Goldstein, Jeff Layton
  Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On 13/11/2024 09:01, Amir Goldstein wrote:

> I don't like playing pseudo cryptographic games, we are not
> crypto experts so we are bound to lose in this game.

I agree. It would be one thing to obfusficate things in order to prevent
userspace from relying upon something that's not ABI; it would be another
to do so with the intent of hiding data. If we wanted to do that, we'd
need to actually encrypt the PID (with e.g. AES-CTR(key, iv=inode_nr))

> My thinking is the other way around -
> - encode FILEID_INO32_GEN with pid_nr + i_generation
> - pid_nr is obviously not unique across pidns and reusable
>   but that makes it just like i_ino across filesystems
> - the resulting file handle is thus usable only in the pidns where
>   it was encoded - is that a bad thing?
>
> Erin,
>
> You write that "To ensure file handles are invariant and can move
> between pid namespaces, we stash a pid from the initial namespace
> inside the file handle."
>
> Why is it a requirement for userspace that pidfs file handles are
> invariant to pidns?

I don't think it's a requirement, but I do think its useful - it is nice if
a service inside a pidns can pass you a file handle and you can restore it and
things are fine (consider also handles stored on the filesystem, as a better
analog for PID files)

But I too was uncertain about exposing root namespace PIDs to containers. I
have no objections to limiting restore of file handles to the same pid ns -
though I think we should defnitely document that such a limitation may be
lifted in the future.

- Erin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-13 10:11       ` Erin Shepherd
@ 2024-11-13 12:21         ` Christian Brauner
  0 siblings, 0 replies; 57+ messages in thread
From: Christian Brauner @ 2024-11-13 12:21 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Amir Goldstein, Jeff Layton, linux-kernel, linux-fsdevel,
	christian, paul, bluca

On Wed, Nov 13, 2024 at 11:11:47AM +0100, Erin Shepherd wrote:
> On 13/11/2024 09:01, Amir Goldstein wrote:
> 
> > I don't like playing pseudo cryptographic games, we are not
> > crypto experts so we are bound to lose in this game.
> 
> I agree. It would be one thing to obfusficate things in order to prevent
> userspace from relying upon something that's not ABI; it would be another
> to do so with the intent of hiding data. If we wanted to do that, we'd
> need to actually encrypt the PID (with e.g. AES-CTR(key, iv=inode_nr))
> 
> > My thinking is the other way around -
> > - encode FILEID_INO32_GEN with pid_nr + i_generation
> > - pid_nr is obviously not unique across pidns and reusable
> >   but that makes it just like i_ino across filesystems
> > - the resulting file handle is thus usable only in the pidns where
> >   it was encoded - is that a bad thing?
> >
> > Erin,
> >
> > You write that "To ensure file handles are invariant and can move
> > between pid namespaces, we stash a pid from the initial namespace
> > inside the file handle."
> >
> > Why is it a requirement for userspace that pidfs file handles are
> > invariant to pidns?
> 
> I don't think it's a requirement, but I do think its useful - it is nice if

It kind of is though, no? Because you need a reliable way to decode the
pidfs file handle to a struct pid. If one encodes pid numbers as seen
from the encoders pid namespace the decoder has no way of knowing what
pid namespace to resolve it in as the same pid number can obviously be
present in multiple pid namespace. So not encoding the global pid number
would be inherently ambiguous.

> a service inside a pidns can pass you a file handle and you can restore it and
> things are fine (consider also handles stored on the filesystem, as a better
> analog for PID files)
> 
> But I too was uncertain about exposing root namespace PIDs to containers. I
> have no objections to limiting restore of file handles to the same pid ns -
> though I think we should defnitely document that such a limitation may be
> lifted in the future.

The point is really just the provided pid needs to be resolvable in the
pid namespace of the caller. Encoding a global pid number means that
internally we can always resolve it as we know that we always encode
pids in the init pid namespace.

In a second step we can then verify that the struct pid we found based
on the pid number is a member of the pid namespace hierarchy of the
caller. If that is the case then the caller is allowed to get a pidfd
from open_by_handle_at() as they would also be able to get a pidfd via
pidfd_open().

So a container will never be able to a pidfd from a pid number that
resolves to a struct pid that is outside its pid namespace hierarchy.

Let me know if I misunderstood the concerns.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-01 13:54 ` [PATCH 4/4] pidfs: implement fh_to_dentry Erin Shepherd
  2024-11-12 16:33   ` Amir Goldstein
  2024-11-12 23:51   ` Jeff Layton
@ 2024-11-13 12:09   ` Christian Brauner
  2024-11-13 13:06     ` Erin Shepherd
  2 siblings, 1 reply; 57+ messages in thread
From: Christian Brauner @ 2024-11-13 12:09 UTC (permalink / raw)
  To: Erin Shepherd; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On Fri, Nov 01, 2024 at 01:54:52PM +0000, Erin Shepherd wrote:
> This enables userspace to use name_to_handle_at to recover a pidfd
> to a process.
> 
> We stash the process' PID in the root pid namespace inside the handle,
> and use that to recover the pid (validating that pid->ino matches the
> value in the handle, i.e. that the pid has not been reused).
> 
> We use the root namespace in order to ensure that file handles can be
> moved across namespaces; however, we validate that the PID exists in
> the current namespace before returning the inode.
> 
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> ---
>  fs/pidfs.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 43 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index c8e7e9011550..2d66610ef385 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -348,23 +348,59 @@ static const struct dentry_operations pidfs_dentry_operations = {
>  	.d_prune	= stashed_dentry_prune,
>  };
>  
> -static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
> +#define PIDFD_FID_LEN 3
> +
> +struct pidfd_fid {
> +	u64 ino;
> +	s32 pid;
> +} __packed;
> +
> +static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
>  			   struct inode *parent)
>  {
>  	struct pid *pid = inode->i_private;
> -	
> -	if (*max_len < 2) {
> -		*max_len = 2;
> +	struct pidfd_fid *fid = (struct pidfd_fid *)fh;
> +
> +	if (*max_len < PIDFD_FID_LEN) {
> +		*max_len = PIDFD_FID_LEN;
>  		return FILEID_INVALID;
>  	}
>  
> -	*max_len = 2;
> -	*(u64 *)fh = pid->ino;
> -	return FILEID_KERNFS;
> +	fid->ino = pid->ino;
> +	fid->pid = pid_nr(pid);

Hm, a pidfd comes in two flavours:

(1) thread-group leader pidfd: pidfd_open(<pid>, 0)
(2) thread pidfd:              pidfd_open(<pid>, PIDFD_THREAD)

In your current scheme fid->pid = pid_nr(pid) means that you always
encode a pidfs file handle for a thread pidfd no matter if the provided
pidfd was a thread-group leader pidfd or a thread pidfd. This is very
likely wrong as it means users that use a thread-group pidfd get a
thread-specific pid back.

I think we need to encode (1) and (2) in the pidfs file handle so users
always get back the correct type of pidfd.

That very likely means name_to_handle_at() needs to encode this into the
pidfs file handle.

We need to think a bit how to do this as we need access to the file so
we can tell (1) and (2) apart. It shouldn't be that big of a deal. For
pidfds we don't need any path-based lookup anyway. IOW, AT_EMPTY_PATH is
the only valid case. Starting with v6.13 we'll have getname_maybe_null()
so access to the file is roughly:

struct path path;
struct filename *fname;
unsigned in f_flags = 0;

fname = getname_maybe_null(name, flag & AT_EMPTY_PATH);
if (fname) {
        ret = filename_lookup(dfd, fname, lookup_flags, &path, NULL);
        if (ret)
                return ret;
} else {
	CLASS(fd, f)(dfd);
	if (fd_empty(f))
		return -EBADF;
	path = fd_file(f)->f_path;
	if (pidfd_pid(fd_file(f))
		f_flags = fd_file(f)->f_flags;
	path_get(&path);
}

and then a thread pidfd is reconginzable as f_flags & PIDFD_THREAD/O_EXCL.

The question again is how to plumb this through to the export_operations
encoding function.

> +	*max_len = PIDFD_FID_LEN;
> +	return FILEID_INO64_GEN;
> +}
> +
> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
> +					 struct fid *gen_fid,
> +					 int fh_len, int fh_type)
> +{
> +	int ret;
> +	struct path path;
> +	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
> +	struct pid *pid;
> +
> +	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
> +		return NULL;
> +
> +	pid = find_get_pid_ns(fid->pid, &init_pid_ns);
> +	if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0) {
> +		put_pid(pid);
> +		return NULL;
> +	}

I think we can avoid the premature reference bump and do:

scoped_guard(rcu) {
        struct pid *pid;

	pid = find_pid_ns(fid->pid, &init_pid_ns);
	if (!pid)
		return NULL;

	/* Did the pid get recycled? */
	if (pid->ino != fid->ino)
		return NULL;

	/* Must be resolvable in the caller's pid namespace. */
	if (pid_vnr(pid) == 0)
		return NULL;

	/* Ok, this is the pid we want. */
	get_pid(pid);
}

> +
> +	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	mntput(path.mnt);
> +	return path.dentry;
>  }
>  
>  static const struct export_operations pidfs_export_operations = {
>  	.encode_fh = pidfs_encode_fh,
> +	.fh_to_dentry = pidfs_fh_to_dentry,
>  };
>  
>  static int pidfs_init_inode(struct inode *inode, void *data)
> -- 
> 2.46.1
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-13 12:09   ` Christian Brauner
@ 2024-11-13 13:06     ` Erin Shepherd
  2024-11-13 13:26       ` Christian Brauner
  0 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-13 13:06 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On 13/11/2024 13:09, Christian Brauner wrote:

> Hm, a pidfd comes in two flavours:
>
> (1) thread-group leader pidfd: pidfd_open(<pid>, 0)
> (2) thread pidfd:              pidfd_open(<pid>, PIDFD_THREAD)
>
> In your current scheme fid->pid = pid_nr(pid) means that you always
> encode a pidfs file handle for a thread pidfd no matter if the provided
> pidfd was a thread-group leader pidfd or a thread pidfd. This is very
> likely wrong as it means users that use a thread-group pidfd get a
> thread-specific pid back.
>
> I think we need to encode (1) and (2) in the pidfs file handle so users
> always get back the correct type of pidfd.
>
> That very likely means name_to_handle_at() needs to encode this into the
> pidfs file handle.

I guess a question here is whether a pidfd handle encodes a handle to a pid
in a specific mode, or just to a pid in general? The thought had occurred
to me while I was working on this initially, but I felt like perhaps treating
it as a property of the file descriptor in general was better.

Currently open_by_handle_at always returns a thread-group pidfd (since
PIDFD_THREAD) isn't set, regardless of what type of pidfd you passed to
name_to_handle_at. I had thought that PIDFD_THREAD/O_EXCL would have been
passed through to f->f_flags on the restored pidfd, but upon checking I see that
it gets filtered out in do_dentry_open.

I feel like leaving it up to the caller of open_by_handle_at might be better
(because they are probably better informed about whether they want poll() to
inform them of thread or process exit) but I could lean either way.

>> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
>> +					 struct fid *gen_fid,
>> +					 int fh_len, int fh_type)
>> +{
>> +	int ret;
>> +	struct path path;
>> +	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
>> +	struct pid *pid;
>> +
>> +	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
>> +		return NULL;
>> +
>> +	pid = find_get_pid_ns(fid->pid, &init_pid_ns);
>> +	if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0) {
>> +		put_pid(pid);
>> +		return NULL;
>> +	}
> I think we can avoid the premature reference bump and do:
>
> scoped_guard(rcu) {
>         struct pid *pid;
>
> 	pid = find_pid_ns(fid->pid, &init_pid_ns);
> 	if (!pid)
> 		return NULL;
>
> 	/* Did the pid get recycled? */
> 	if (pid->ino != fid->ino)
> 		return NULL;
>
> 	/* Must be resolvable in the caller's pid namespace. */
> 	if (pid_vnr(pid) == 0)
> 		return NULL;
>
> 	/* Ok, this is the pid we want. */
> 	get_pid(pid);
> }

I can go with that if preferred. I was worried a bit about making the RCU
critical section too large, but of course I'm sure there are much larger
sections inside the kernel.

>> +
>> +	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
>> +	if (ret < 0)
>> +		return ERR_PTR(ret);
>> +
>> +	mntput(path.mnt);
>> +	return path.dentry;
>>  }

Similarly here i should probably refactor this into dentry_from_stashed in
order to avoid a needless bump-then-drop of path.mnt's reference count


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-13 13:06     ` Erin Shepherd
@ 2024-11-13 13:26       ` Christian Brauner
  2024-11-13 13:48         ` Erin Shepherd
  0 siblings, 1 reply; 57+ messages in thread
From: Christian Brauner @ 2024-11-13 13:26 UTC (permalink / raw)
  To: Erin Shepherd; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On Wed, Nov 13, 2024 at 02:06:56PM +0100, Erin Shepherd wrote:
> On 13/11/2024 13:09, Christian Brauner wrote:
> 
> > Hm, a pidfd comes in two flavours:
> >
> > (1) thread-group leader pidfd: pidfd_open(<pid>, 0)
> > (2) thread pidfd:              pidfd_open(<pid>, PIDFD_THREAD)
> >
> > In your current scheme fid->pid = pid_nr(pid) means that you always
> > encode a pidfs file handle for a thread pidfd no matter if the provided
> > pidfd was a thread-group leader pidfd or a thread pidfd. This is very
> > likely wrong as it means users that use a thread-group pidfd get a
> > thread-specific pid back.
> >
> > I think we need to encode (1) and (2) in the pidfs file handle so users
> > always get back the correct type of pidfd.
> >
> > That very likely means name_to_handle_at() needs to encode this into the
> > pidfs file handle.
> 
> I guess a question here is whether a pidfd handle encodes a handle to a pid
> in a specific mode, or just to a pid in general? The thought had occurred
> to me while I was working on this initially, but I felt like perhaps treating
> it as a property of the file descriptor in general was better.
> 
> Currently open_by_handle_at always returns a thread-group pidfd (since
> PIDFD_THREAD) isn't set, regardless of what type of pidfd you passed to
> name_to_handle_at. I had thought that PIDFD_THREAD/O_EXCL would have been

I don't think you're returning a thread-groupd pidfd from
open_by_handle_at() in your scheme. After all you're encoding the tid in
pid_nr() so you'll always find the struct pid for the thread afaict. If
I'm wrong could you please explain how you think this works? I might
just be missing something obvious.

> passed through to f->f_flags on the restored pidfd, but upon checking I see that
> it gets filtered out in do_dentry_open.

It does, but note that __pidfd_prepare() raises it explicitly on the
file afterwards. So it works fine.

> 
> I feel like leaving it up to the caller of open_by_handle_at might be better
> (because they are probably better informed about whether they want poll() to
> inform them of thread or process exit) but I could lean either way.

So in order to decode a pidfs file handle you want the caller to have to
specify O_EXCL in the flags argument of open_by_handle_at()? Is that
your idea?

> 
> >> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
> >> +					 struct fid *gen_fid,
> >> +					 int fh_len, int fh_type)
> >> +{
> >> +	int ret;
> >> +	struct path path;
> >> +	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
> >> +	struct pid *pid;
> >> +
> >> +	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
> >> +		return NULL;
> >> +
> >> +	pid = find_get_pid_ns(fid->pid, &init_pid_ns);
> >> +	if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0) {
> >> +		put_pid(pid);
> >> +		return NULL;
> >> +	}
> > I think we can avoid the premature reference bump and do:
> >
> > scoped_guard(rcu) {
> >         struct pid *pid;
> >
> > 	pid = find_pid_ns(fid->pid, &init_pid_ns);
> > 	if (!pid)
> > 		return NULL;
> >
> > 	/* Did the pid get recycled? */
> > 	if (pid->ino != fid->ino)
> > 		return NULL;
> >
> > 	/* Must be resolvable in the caller's pid namespace. */
> > 	if (pid_vnr(pid) == 0)
> > 		return NULL;
> >
> > 	/* Ok, this is the pid we want. */
> > 	get_pid(pid);
> > }
> 
> I can go with that if preferred. I was worried a bit about making the RCU
> critical section too large, but of course I'm sure there are much larger
> sections inside the kernel.

This is perfectly fine. Don't worry about it.

> 
> >> +
> >> +	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
> >> +	if (ret < 0)
> >> +		return ERR_PTR(ret);
> >> +
> >> +	mntput(path.mnt);
> >> +	return path.dentry;
> >>  }
> 
> Similarly here i should probably refactor this into dentry_from_stashed in
> order to avoid a needless bump-then-drop of path.mnt's reference count

No, what you have now is fine. I wouldn't add a specific helper for
this. In contrast to the pid the pidfs mount never goes away.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-13 13:26       ` Christian Brauner
@ 2024-11-13 13:48         ` Erin Shepherd
  2024-11-14 10:29           ` Christian Brauner
  0 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-13 13:48 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On 13/11/2024 14:26, Christian Brauner wrote:

> On Wed, Nov 13, 2024 at 02:06:56PM +0100, Erin Shepherd wrote:
>> On 13/11/2024 13:09, Christian Brauner wrote:
>>
>>> Hm, a pidfd comes in two flavours:
>>>
>>> (1) thread-group leader pidfd: pidfd_open(<pid>, 0)
>>> (2) thread pidfd:              pidfd_open(<pid>, PIDFD_THREAD)
>>>
>>> In your current scheme fid->pid = pid_nr(pid) means that you always
>>> encode a pidfs file handle for a thread pidfd no matter if the provided
>>> pidfd was a thread-group leader pidfd or a thread pidfd. This is very
>>> likely wrong as it means users that use a thread-group pidfd get a
>>> thread-specific pid back.
>>>
>>> I think we need to encode (1) and (2) in the pidfs file handle so users
>>> always get back the correct type of pidfd.
>>>
>>> That very likely means name_to_handle_at() needs to encode this into the
>>> pidfs file handle.
>> I guess a question here is whether a pidfd handle encodes a handle to a pid
>> in a specific mode, or just to a pid in general? The thought had occurred
>> to me while I was working on this initially, but I felt like perhaps treating
>> it as a property of the file descriptor in general was better.
>>
>> Currently open_by_handle_at always returns a thread-group pidfd (since
>> PIDFD_THREAD) isn't set, regardless of what type of pidfd you passed to
>> name_to_handle_at. I had thought that PIDFD_THREAD/O_EXCL would have been
> I don't think you're returning a thread-groupd pidfd from
> open_by_handle_at() in your scheme. After all you're encoding the tid in
> pid_nr() so you'll always find the struct pid for the thread afaict. If
> I'm wrong could you please explain how you think this works? I might
> just be missing something obvious.

Moudlo namespaces, the pid in fid->pid is the same one passed to pidfd_open().
In the root namespace, you could replace name_to_handle_at(...) with
pidfd_open(fid->pid, 0) and get the same result (if both are successful, at least).

The resulting pidfd points to the same struct pid. The only thing that should differ
is whether PIDFD_THREAD is set in f->f_flags.

>> I feel like leaving it up to the caller of open_by_handle_at might be better
>> (because they are probably better informed about whether they want poll() to
>> inform them of thread or process exit) but I could lean either way.
> So in order to decode a pidfs file handle you want the caller to have to
> specify O_EXCL in the flags argument of open_by_handle_at()? Is that
> your idea?

If they want a PIDFD_THREAD pidfd, yes. I see it as similar to O_RDONLY, where its a
flag that applies to the file descriptor but not to the underlying file.

While ideally we'd implement it from an API completeness perspective, practically I'm
not sure how often the option would ever be used. While there are hundreds of reasons
why you might want to track the state of another process, I struggle to think of cases
where Process A needs to track Process B's threads besides a debugger (and a debugger
is probably better off using ptrace), and it can happily track its own threads by just
holding onto the pidfd.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-13 13:48         ` Erin Shepherd
@ 2024-11-14 10:29           ` Christian Brauner
  2024-11-14 12:21             ` Erin Shepherd
  0 siblings, 1 reply; 57+ messages in thread
From: Christian Brauner @ 2024-11-14 10:29 UTC (permalink / raw)
  To: Erin Shepherd; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca

On Wed, Nov 13, 2024 at 02:48:43PM +0100, Erin Shepherd wrote:
> On 13/11/2024 14:26, Christian Brauner wrote:
> 
> > On Wed, Nov 13, 2024 at 02:06:56PM +0100, Erin Shepherd wrote:
> >> On 13/11/2024 13:09, Christian Brauner wrote:
> >>
> >>> Hm, a pidfd comes in two flavours:
> >>>
> >>> (1) thread-group leader pidfd: pidfd_open(<pid>, 0)
> >>> (2) thread pidfd:              pidfd_open(<pid>, PIDFD_THREAD)
> >>>
> >>> In your current scheme fid->pid = pid_nr(pid) means that you always
> >>> encode a pidfs file handle for a thread pidfd no matter if the provided
> >>> pidfd was a thread-group leader pidfd or a thread pidfd. This is very
> >>> likely wrong as it means users that use a thread-group pidfd get a
> >>> thread-specific pid back.
> >>>
> >>> I think we need to encode (1) and (2) in the pidfs file handle so users
> >>> always get back the correct type of pidfd.
> >>>
> >>> That very likely means name_to_handle_at() needs to encode this into the
> >>> pidfs file handle.
> >> I guess a question here is whether a pidfd handle encodes a handle to a pid
> >> in a specific mode, or just to a pid in general? The thought had occurred
> >> to me while I was working on this initially, but I felt like perhaps treating
> >> it as a property of the file descriptor in general was better.
> >>
> >> Currently open_by_handle_at always returns a thread-group pidfd (since
> >> PIDFD_THREAD) isn't set, regardless of what type of pidfd you passed to
> >> name_to_handle_at. I had thought that PIDFD_THREAD/O_EXCL would have been
> > I don't think you're returning a thread-groupd pidfd from
> > open_by_handle_at() in your scheme. After all you're encoding the tid in
> > pid_nr() so you'll always find the struct pid for the thread afaict. If
> > I'm wrong could you please explain how you think this works? I might
> > just be missing something obvious.
> 
> Moudlo namespaces, the pid in fid->pid is the same one passed to pidfd_open().
> In the root namespace, you could replace name_to_handle_at(...) with
> pidfd_open(fid->pid, 0) and get the same result (if both are successful, at least).
> 
> The resulting pidfd points to the same struct pid. The only thing that should differ
> is whether PIDFD_THREAD is set in f->f_flags.

I see what you mean but then there's another problem afaict.

Two cases:

(1) @pidfd_thread_group = pidfd_open(1234, 0)

    The pidfd_open() will succeed if the struct pid that 1234 resolves
    to is used as a thread-group leader.

(2) @pidfd_thread = pidfd_open(5678, PIDFD_THREAD)

    The pidfd_open() will succeed even if the struct pid that 5678
    resolves to isn't used as a thread-group leader.

    The resulting struct file will be marked as being a thread pidfd by
    raising O_EXCL.

(1') If (1) is passed to name_to_handle_at() a pidfs file handle is
     encoded for 1234. If later open_by_hande_at() is called then by
     default a thread-group leader pidfd is created. This is fine

(2') If (2) is passed to name_to_handle_at() a pidfs file handle is
     encoded for 5678. If later open_by_handle_at() is called then a
     thread-group leader pidfd will be created again.

So in (2') the caller has managed to create a thread-group leader pidfd
even though the struct pid isn't used as a thread-group leader pidfd.
Consequently, that pidfd is useless when passed to any of the pidfd_*()
system calls.

So basically, you need to verify that if O_EXCL isn't specified with
open_by_handle_at() that the struct pid that is resolved is used as a
thread-group leader and if not, refuse to create a pidfd.

Am I making sense?

> 
> >> I feel like leaving it up to the caller of open_by_handle_at might be better
> >> (because they are probably better informed about whether they want poll() to
> >> inform them of thread or process exit) but I could lean either way.
> > So in order to decode a pidfs file handle you want the caller to have to
> > specify O_EXCL in the flags argument of open_by_handle_at()? Is that
> > your idea?
> 
> If they want a PIDFD_THREAD pidfd, yes. I see it as similar to O_RDONLY, where its a
> flag that applies to the file descriptor but not to the underlying file.

This is probably fine.

> 
> While ideally we'd implement it from an API completeness perspective, practically I'm
> not sure how often the option would ever be used. While there are hundreds of reasons
> why you might want to track the state of another process, I struggle to think of cases
> where Process A needs to track Process B's threads besides a debugger (and a debugger
> is probably better off using ptrace), and it can happily track its own threads by just
> holding onto the pidfd.

We recently imlemented PIDFD_THREAD support because it is used inside
Netflix. I forgot the details thought tbh. So it's actually used. We
only implemented it once people requested it.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 4/4] pidfs: implement fh_to_dentry
  2024-11-14 10:29           ` Christian Brauner
@ 2024-11-14 12:21             ` Erin Shepherd
  0 siblings, 0 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-14 12:21 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, linux-fsdevel, christian, paul, bluca


On 14/11/2024 11:29, Christian Brauner wrote:
>> Moudlo namespaces, the pid in fid->pid is the same one passed to pidfd_open().
>> In the root namespace, you could replace name_to_handle_at(...) with
>> pidfd_open(fid->pid, 0) and get the same result (if both are successful, at least).
>>
>> The resulting pidfd points to the same struct pid. The only thing that should differ
>> is whether PIDFD_THREAD is set in f->f_flags.
> I see what you mean but then there's another problem afaict.
>
> Two cases:
>
> (1) @pidfd_thread_group = pidfd_open(1234, 0)
>
>     The pidfd_open() will succeed if the struct pid that 1234 resolves
>     to is used as a thread-group leader.
>
> (2) @pidfd_thread = pidfd_open(5678, PIDFD_THREAD)
>
>     The pidfd_open() will succeed even if the struct pid that 5678
>     resolves to isn't used as a thread-group leader.
>
>     The resulting struct file will be marked as being a thread pidfd by
>     raising O_EXCL.
>
> (1') If (1) is passed to name_to_handle_at() a pidfs file handle is
>      encoded for 1234. If later open_by_hande_at() is called then by
>      default a thread-group leader pidfd is created. This is fine
>
> (2') If (2) is passed to name_to_handle_at() a pidfs file handle is
>      encoded for 5678. If later open_by_handle_at() is called then a
>      thread-group leader pidfd will be created again.
>
> So in (2') the caller has managed to create a thread-group leader pidfd
> even though the struct pid isn't used as a thread-group leader pidfd.
> Consequently, that pidfd is useless when passed to any of the pidfd_*()
> system calls.
>
> So basically, you need to verify that if O_EXCL isn't specified with
> open_by_handle_at() that the struct pid that is resolved is used as a
> thread-group leader and if not, refuse to create a pidfd.
>
> Am I making sense?

Ah, I fully see what you mean now.

I could implement pidfs_file_operations.open and check the flags there, but
that runs into the issue of vfs_open resetting the flags afterwards so its
entirely pointless. If PIDFD_THREAD wasn't in the set O_CREAT / O_EXCL /
O_NOCTTY / O_TRUNC then this would be much easier, but alas; and its ABI now
too.

I guess the options are

1. Let an FS specify that it doesn't want O_EXCL cleared, but this is getting
   to be some gnarly VFS surgery, or
2. We just detect we're working on a pidfd early in open_by_handle_at and
   skip straight into dedicated logic.

I know you suggested (2) earlier and I increasingly think you're right about
it being the best approach. It also fits better with the special casing PIDFD_SELF
will want when that lands.

So I'll see what an implementation with that approach looks like.

>> If they want a PIDFD_THREAD pidfd, yes. I see it as similar to O_RDONLY, where its a
>> flag that applies to the file descriptor but not to the underlying file.
> This is probably fine.
>> While ideally we'd implement it from an API completeness perspective, practically I'm
>> not sure how often the option would ever be used. While there are hundreds of reasons
>> why you might want to track the state of another process, I struggle to think of cases
>> where Process A needs to track Process B's threads besides a debugger (and a debugger
>> is probably better off using ptrace), and it can happily track its own threads by just
>> holding onto the pidfd.
> We recently imlemented PIDFD_THREAD support because it is used inside
> Netflix. I forgot the details thought tbh. So it's actually used. We
> only implemented it once people requested it.
Oh, I entirely understand the utility of PIDFD_THREAD - I'm just not sure how mnay of
those use cases are cross-process (and in the cases where they are cross process, how many
of those uses would benefit from file handles vs fd passing)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-01 13:54 [PATCH 0/4] pidfs: implement file handle support Erin Shepherd
                   ` (3 preceding siblings ...)
  2024-11-01 13:54 ` [PATCH 4/4] pidfs: implement fh_to_dentry Erin Shepherd
@ 2024-11-12 13:10 ` Christian Brauner
  2024-11-12 13:57   ` Jeff Layton
  2024-11-12 23:03   ` [PATCH 0/4] " Erin Shepherd
  4 siblings, 2 replies; 57+ messages in thread
From: Christian Brauner @ 2024-11-12 13:10 UTC (permalink / raw)
  To: Erin Shepherd, Jeff Layton, Amir Goldstein
  Cc: linux-kernel, linux-fsdevel, christian, paul, bluca, Chuck Lever

On Fri, Nov 01, 2024 at 01:54:48PM +0000, Erin Shepherd wrote:
> Since the introduction of pidfs, we have had 64-bit process identifiers 
> that will not be reused for the entire uptime of the system. This greatly 
> facilitates process tracking in userspace.
> 
> There are two limitations at present:
> 
>  * These identifiers are currently only exposed to processes on 64-bit 
>    systems. On 32-bit systems, inode space is also limited to 32 bits and 
>    therefore is subject to the same reuse issues.
>  * There is no way to go from one of these unique identifiers to a pid or 
>    pidfd.
> 
> Patch 1 & 2 in this stack implements fh_export for pidfs. This means 
> userspace  can retrieve a unique process identifier even on 32-bit systems 
> via name_to_handle_at.
> 
> Patch 3 & 4 in this stack implement fh_to_dentry for pidfs. This means 
> userspace can convert back from a file handle to the corresponding pidfd. 
> To support us going from a file handle to a pidfd, we have to store a pid 
> inside the file handle. To ensure file handles are invariant and can move 
> between pid namespaces, we stash a pid from the initial namespace inside 
> the file handle.
> 
> I'm not quite sure if stashing an initial-namespace pid inside the file 
> handle is the right approach here; if not, I think that patch 1 & 2 are 
> useful on their own.

Sorry for the delayed reply (I'm recovering from a lengthy illness.).

I like the idea in general. I think this is really useful. A few of my
thoughts but I need input from Amir and Jeff:

* In the last patch of the series you already implement decoding of
  pidfd file handles by adding a .fh_to_dentry export_operations method.

  There are a few things to consider because of how open_by_handle_at()
  works.

  - open_by_handle_at() needs to be restricted so it only creates pidfds
    from pidfs file handles that resolve to a struct pid that is
    reachable in the caller's pid namespace. In other words, it should
    mirror pidfd_open().

    Put another way, open_by_handle_at() must not be usable to open
    arbitrary pids to prevent a container from constructing a pidfd file
    handle for a process that lives outside it's pid namespace
    hierarchy.

    With this restriction in place open_by_handle_at() can be available
    to let unprivileged processes open pidfd file handles.

    Related to that, I don't think we need to make open_by_handle_at()
    open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply
    because any process in the initial pid namespace can open any other
    process via pidfd_open() anyway because pid namespaces are
    hierarchical.

    IOW, CAP_DAC_READ_SEARCH must not override the restriction that the
    provided pidfs file handle must be reachable from the caller's pid
    namespace.

  - open_by_handle_at() uses may_decode_fh() to determine whether it's
    possible to decode a file handle as an unprivileged user. The
    current checks don't make sense for pidfs. Conceptually, I think
    there don't need to place any restrictions based on global
    CAP_DAC_READ_SEARCH, owning user namespace of the superblock or
    mount on pidfs file handles.

    The only restriction that matters is that the requested pidfs file
    handle is reachable from the caller's pid namespace.

  - A pidfd always has exactly a single inode and a single dentry.
    There's no aliases.

  - Generally, in my naive opinion, I think that decoding pidfs file
    handles should be a lot simpler than decoding regular path based
    file handles. Because there should be no need to verify any
    ancestors, or reconnect paths. Pidfs also doesn't have directory
    inodes, only regular inodes. In other words, any dentry is
    acceptable.

    Essentially, the only thing we need is for exportfs_decode_fh_raw()
    to verify that the provided pidfs file handle is resolvable in the
    caller's pid namespace. If so we're done. The challenge is how to
    nicely plumb this into the code without it sticking out like a sore
    thumb.

  - Pidfs should not be exportable via NFS. It doesn't make sense.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-12 13:10 ` [PATCH 0/4] pidfs: implement file handle support Christian Brauner
@ 2024-11-12 13:57   ` Jeff Layton
  2024-11-12 22:43     ` Erin Shepherd
  2024-11-12 23:03   ` [PATCH 0/4] " Erin Shepherd
  1 sibling, 1 reply; 57+ messages in thread
From: Jeff Layton @ 2024-11-12 13:57 UTC (permalink / raw)
  To: Christian Brauner, Erin Shepherd, Amir Goldstein
  Cc: linux-kernel, linux-fsdevel, christian, paul, bluca, Chuck Lever

On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote:
> On Fri, Nov 01, 2024 at 01:54:48PM +0000, Erin Shepherd wrote:
> > Since the introduction of pidfs, we have had 64-bit process identifiers 
> > that will not be reused for the entire uptime of the system. This greatly 
> > facilitates process tracking in userspace.
> > 
> > There are two limitations at present:
> > 
> >  * These identifiers are currently only exposed to processes on 64-bit 
> >    systems. On 32-bit systems, inode space is also limited to 32 bits and 
> >    therefore is subject to the same reuse issues.

We should really just move to storing 64-bit inode numbers internally
on 32-bit machines. That would at least make statx() give you all 64
bits on 32-bit host.

> >  * There is no way to go from one of these unique identifiers to a pid or 
> >    pidfd.
> > 
> > Patch 1 & 2 in this stack implements fh_export for pidfs. This means 
> > userspace  can retrieve a unique process identifier even on 32-bit systems 
> > via name_to_handle_at.
> > 
> > Patch 3 & 4 in this stack implement fh_to_dentry for pidfs. This means 
> > userspace can convert back from a file handle to the corresponding pidfd. 
> > To support us going from a file handle to a pidfd, we have to store a pid 
> > inside the file handle. To ensure file handles are invariant and can move 
> > between pid namespaces, we stash a pid from the initial namespace inside 
> > the file handle.
> > 
> > I'm not quite sure if stashing an initial-namespace pid inside the file 
> > handle is the right approach here; if not, I think that patch 1 & 2 are 
> > useful on their own.

Hmm... I guess pid namespaces don't have a convenient 64-bit ID like
mount namespaces do? In that case, stashing the pid from init_ns is
probably the next best thing.

> 
> Sorry for the delayed reply (I'm recovering from a lengthy illness.).
> 
> I like the idea in general. I think this is really useful. A few of my
> thoughts but I need input from Amir and Jeff:
> 
> * In the last patch of the series you already implement decoding of
>   pidfd file handles by adding a .fh_to_dentry export_operations method.
> 
>   There are a few things to consider because of how open_by_handle_at()
>   works.
> 
>   - open_by_handle_at() needs to be restricted so it only creates pidfds
>     from pidfs file handles that resolve to a struct pid that is
>     reachable in the caller's pid namespace. In other words, it should
>     mirror pidfd_open().
> 
>     Put another way, open_by_handle_at() must not be usable to open
>     arbitrary pids to prevent a container from constructing a pidfd file
>     handle for a process that lives outside it's pid namespace
>     hierarchy.
> 
>     With this restriction in place open_by_handle_at() can be available
>     to let unprivileged processes open pidfd file handles.
> 
>     Related to that, I don't think we need to make open_by_handle_at()
>     open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply
>     because any process in the initial pid namespace can open any other
>     process via pidfd_open() anyway because pid namespaces are
>     hierarchical.
> 
>     IOW, CAP_DAC_READ_SEARCH must not override the restriction that the
>     provided pidfs file handle must be reachable from the caller's pid
>     namespace.
> 
>   - open_by_handle_at() uses may_decode_fh() to determine whether it's
>     possible to decode a file handle as an unprivileged user. The
>     current checks don't make sense for pidfs. Conceptually, I think
>     there don't need to place any restrictions based on global
>     CAP_DAC_READ_SEARCH, owning user namespace of the superblock or
>     mount on pidfs file handles.
> 
>     The only restriction that matters is that the requested pidfs file
>     handle is reachable from the caller's pid namespace.
>
>   - A pidfd always has exactly a single inode and a single dentry.
>     There's no aliases.
> 
>   - Generally, in my naive opinion, I think that decoding pidfs file
>     handles should be a lot simpler than decoding regular path based
>     file handles. Because there should be no need to verify any
>     ancestors, or reconnect paths. Pidfs also doesn't have directory
>     inodes, only regular inodes. In other words, any dentry is
>     acceptable.
> 
>     Essentially, the only thing we need is for exportfs_decode_fh_raw()
>     to verify that the provided pidfs file handle is resolvable in the
>     caller's pid namespace. If so we're done. The challenge is how to
>     nicely plumb this into the code without it sticking out like a sore
>     thumb.
> 
>   - Pidfs should not be exportable via NFS. It doesn't make sense.

I haven't looked over the patchset yet, but those restrictions all
sound pretty reasonable to me. Special casing the may_decode_fh
permission checks may be the tricky bit. I'm not sure what that should
look like, tbqh.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-12 13:57   ` Jeff Layton
@ 2024-11-12 22:43     ` Erin Shepherd
  2024-11-13  0:37       ` Darrick J. Wong
                         ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-12 22:43 UTC (permalink / raw)
  To: Jeff Layton, Christian Brauner, Amir Goldstein
  Cc: linux-kernel, linux-fsdevel, christian, paul, bluca, Chuck Lever

On 12/11/2024 14:57, Jeff Layton wrote:
> On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote:
> We should really just move to storing 64-bit inode numbers internally
> on 32-bit machines. That would at least make statx() give you all 64
> bits on 32-bit host.

I think that would be ideal from the perspective of exposing it to
userspace.
It does leave the question of going back from inode to pidfd unsolved
though.I like the name_to_handle_at/open_by_handle_at approach because
it neatly solves both sides of the problem with APIs we already have and
understand

> Hmm... I guess pid namespaces don't have a convenient 64-bit ID like
> mount namespaces do? In that case, stashing the pid from init_ns is
> probably the next best thing.

Not that I could identify, no; so stashing the PID seemed like the most
pragmatic
approach.

I'm not 100% sure it should be a documented property of the file handle
format; I
somewhat think that everything after the PID inode should be opaque to
userspace
and subject to change in the future (to the point I considered xoring it
with a
magic constant to make it less obvious to userspace/make it more obvious
that its
not to be relied upon; but that to my knowledge is not something that
the kernel
has done elsewhere).

- Erin

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-12 22:43     ` Erin Shepherd
@ 2024-11-13  0:37       ` Darrick J. Wong
  2024-11-13 11:35       ` Christian Brauner
  2024-11-13 17:55       ` [PATCH v2 0/3] " Erin Shepherd
  2 siblings, 0 replies; 57+ messages in thread
From: Darrick J. Wong @ 2024-11-13  0:37 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Jeff Layton, Christian Brauner, Amir Goldstein, linux-kernel,
	linux-fsdevel, christian, paul, bluca, Chuck Lever

On Tue, Nov 12, 2024 at 11:43:13PM +0100, Erin Shepherd wrote:
> On 12/11/2024 14:57, Jeff Layton wrote:
> > On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote:
> > We should really just move to storing 64-bit inode numbers internally
> > on 32-bit machines. That would at least make statx() give you all 64
> > bits on 32-bit host.
> 
> I think that would be ideal from the perspective of exposing it to
> userspace.
> It does leave the question of going back from inode to pidfd unsolved
> though.I like the name_to_handle_at/open_by_handle_at approach because
> it neatly solves both sides of the problem with APIs we already have and
> understand
> 
> > Hmm... I guess pid namespaces don't have a convenient 64-bit ID like
> > mount namespaces do? In that case, stashing the pid from init_ns is
> > probably the next best thing.
> 
> Not that I could identify, no; so stashing the PID seemed like the most
> pragmatic
> approach.
> 
> I'm not 100% sure it should be a documented property of the file
> handle format; I somewhat think that everything after the PID inode
> should be opaque to userspace and subject to change in the future (to
> the point I considered xoring it with a magic constant to make it less
> obvious to userspace/make it more obvious that its not to be relied
> upon; but that to my knowledge is not something that the kernel has
> done elsewhere).

It's a handle, the internal details of its layout of it is supposed to
be opaque to userspace.  I wonder how well userspace deals with weirdly
sized handles though...

--D

> - Erin
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-12 22:43     ` Erin Shepherd
  2024-11-13  0:37       ` Darrick J. Wong
@ 2024-11-13 11:35       ` Christian Brauner
  2024-11-13 17:55       ` [PATCH v2 0/3] " Erin Shepherd
  2 siblings, 0 replies; 57+ messages in thread
From: Christian Brauner @ 2024-11-13 11:35 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Jeff Layton, Amir Goldstein, linux-kernel, linux-fsdevel,
	christian, paul, bluca, Chuck Lever

On Tue, Nov 12, 2024 at 11:43:13PM +0100, Erin Shepherd wrote:
> On 12/11/2024 14:57, Jeff Layton wrote:
> > On Tue, 2024-11-12 at 14:10 +0100, Christian Brauner wrote:
> > We should really just move to storing 64-bit inode numbers internally
> > on 32-bit machines. That would at least make statx() give you all 64
> > bits on 32-bit host.
> 
> I think that would be ideal from the perspective of exposing it to
> userspace.
> It does leave the question of going back from inode to pidfd unsolved
> though.I like the name_to_handle_at/open_by_handle_at approach because

Indeed it doesn't solve it because it's possible that a given struct pid
never had a pidfd created for it and thus no inode actually does exist.
So when you're decoding a pidfs file handle you need to go to a struct
pid based on some property. The pid is fine for that and it is
equivalen to how pidfd_open() works.

> it neatly solves both sides of the problem with APIs we already have and
> understand
> 
> > Hmm... I guess pid namespaces don't have a convenient 64-bit ID like
> > mount namespaces do? In that case, stashing the pid from init_ns is
> > probably the next best thing.
> 
> Not that I could identify, no; so stashing the PID seemed like the most
> pragmatic
> approach.
> 
> I'm not 100% sure it should be a documented property of the file handle
> format; I
> somewhat think that everything after the PID inode should be opaque to
> userspace
> and subject to change in the future (to the point I considered xoring it
> with a
> magic constant to make it less obvious to userspace/make it more obvious
> that its
> not to be relied upon; but that to my knowledge is not something that
> the kernel
> has done elsewhere).
> 
> - Erin
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v2 0/3] pidfs: implement file handle support
  2024-11-12 22:43     ` Erin Shepherd
  2024-11-13  0:37       ` Darrick J. Wong
  2024-11-13 11:35       ` Christian Brauner
@ 2024-11-13 17:55       ` Erin Shepherd
  2024-11-13 17:55         ` [PATCH v2 1/3] pseudofs: add support for export_ops Erin Shepherd
                           ` (3 more replies)
  2 siblings, 4 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-13 17:55 UTC (permalink / raw)
  To: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, Jeff Layton, Amir Goldstein,
	linux-nfs, Erin Shepherd

Since the introduction of pidfs, we have had 64-bit process identifiers
that will not be reused for the entire uptime of the system. This greatly
facilitates process tracking in userspace.

There are two limitations at present:

 * These identifiers are currently only exposed to processes on 64-bit
   systems. On 32-bit systems, inode space is also limited to 32 bits and
   therefore is subject to the same reuse issues.
 * There is no way to go from one of these unique identifiers to a pid or
   pidfd.

This patch implements fh_export and fh_to_dentry which enables userspace to
convert PIDs to and from PID file handles. A process can convert a pidfd into
a file handle using name_to_handle_at, store it (in memory, on disk, or 
elsewhere) and then convert it back into a pidfd suing open_by_handle_at.

To support us going from a file handle to a pidfd, we have to store a pid
inside the file handle. To ensure file handles are invariant and can move
between pid namespaces, we stash a pid from the initial namespace inside
the file handle.
	
  (There has been some discussion as to whether or not it is OK to include
  the PID in the initial pid namespace, but so far there hasn't been any
  conclusive reason given as to why this would be a bad idea)	

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
Changes in v2:
- Permit filesystems to opt out of CAP_DAC_READ_SEARCH
- Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
- Squash fh_export & fh_to_dentry into one commit
- Link to v1: https://lore.kernel.org/r/2aa94713-c12a-4344-a45c-a01f26e16a0d@e43.eu

Remaining: To address the PIDFD_THREAD question
 - Do we want to stash it in file handles (IMO no but there may be merits to
   doing so)
 - If not do we want PIDFD_THREAD/O_EXCL to open_by_handle_at to work or do we
   do something else?

   (Perhaps we could just add an ioctl which lets an app change the PIDFD_THREAD flag?)

If PIDFD_SELF lands <https://lore.kernel.org/r/cover.1727644404.git.lorenzo.stoakes@oracle.com>:
 - Support for PIDFD_SELF(_THREAD) as reference fd in open_by_handle_at?
   (We can even detect that special case early there and bypass most of the file handle logic)

---
Erin Shepherd (3):
      pseudofs: add support for export_ops
      exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
      pidfs: implement file handle support

 fs/fhandle.c              | 36 +++++++++++++++------------
 fs/libfs.c                |  1 +
 fs/pidfs.c                | 62 ++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/exportfs.h  |  3 +++
 include/linux/pseudo_fs.h |  1 +
 5 files changed, 87 insertions(+), 16 deletions(-)
---
base-commit: 14b6320953a3f856a3f93bf9a0e423395baa593d
change-id: 20241112-pidfs_fh-fbb04b108a2b

Best regards,
-- 
Erin Shepherd <erin.shepherd@e43.eu>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v2 1/3] pseudofs: add support for export_ops
  2024-11-13 17:55       ` [PATCH v2 0/3] " Erin Shepherd
@ 2024-11-13 17:55         ` Erin Shepherd
  2024-11-13 17:55         ` [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check Erin Shepherd
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-13 17:55 UTC (permalink / raw)
  To: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, Jeff Layton, Amir Goldstein,
	linux-nfs, Erin Shepherd

Pseudo-filesystems might reasonably wish to implement the export ops
(particularly for name_to_handle_at/open_by_handle_at); plumb this
through pseudo_fs_context

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
 fs/libfs.c                | 1 +
 include/linux/pseudo_fs.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 46966fd8bcf9f042e85d0b66134e59fbef83abfd..698a2ddfd0cb94a8927d1d8a3bb3b3226d6d5476 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -669,6 +669,7 @@ static int pseudo_fs_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_blocksize_bits = PAGE_SHIFT;
 	s->s_magic = ctx->magic;
 	s->s_op = ctx->ops ?: &simple_super_operations;
+	s->s_export_op = ctx->eops;
 	s->s_xattr = ctx->xattr;
 	s->s_time_gran = 1;
 	root = new_inode(s);
diff --git a/include/linux/pseudo_fs.h b/include/linux/pseudo_fs.h
index 730f77381d55f1816ef14adf7dd2cf1d62bb912c..2503f7625d65e7b1fbe9e64d5abf06cd8f017b5f 100644
--- a/include/linux/pseudo_fs.h
+++ b/include/linux/pseudo_fs.h
@@ -5,6 +5,7 @@
 
 struct pseudo_fs_context {
 	const struct super_operations *ops;
+	const struct export_operations *eops;
 	const struct xattr_handler * const *xattr;
 	const struct dentry_operations *dops;
 	unsigned long magic;

-- 
2.46.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
  2024-11-13 17:55       ` [PATCH v2 0/3] " Erin Shepherd
  2024-11-13 17:55         ` [PATCH v2 1/3] pseudofs: add support for export_ops Erin Shepherd
@ 2024-11-13 17:55         ` Erin Shepherd
  2024-11-13 22:50           ` kernel test robot
                             ` (3 more replies)
  2024-11-13 17:55         ` [PATCH v2 3/3] pidfs: implement file handle support Erin Shepherd
  2024-11-14  7:02         ` [PATCH v2 0/3] " Amir Goldstein
  3 siblings, 4 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-13 17:55 UTC (permalink / raw)
  To: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, Jeff Layton, Amir Goldstein,
	linux-nfs, Erin Shepherd

For pidfs, there is no reason to restrict file handle decoding by
CAP_DAC_READ_SEARCH. Introduce an export_ops flag that can indicate
this

Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
 fs/fhandle.c             | 36 +++++++++++++++++++++---------------
 include/linux/exportfs.h |  3 +++
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/fs/fhandle.c b/fs/fhandle.c
index 82df28d45cd70a7df525f50bbb398d646110cd99..056116e58f43983bc7bb86da170fb554c7a2fac7 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -235,26 +235,32 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
 	return 0;
 }
 
-/*
- * Allow relaxed permissions of file handles if the caller has the
- * ability to mount the filesystem or create a bind-mount of the
- * provided @mountdirfd.
- *
- * In both cases the caller may be able to get an unobstructed way to
- * the encoded file handle. If the caller is only able to create a
- * bind-mount we need to verify that there are no locked mounts on top
- * of it that could prevent us from getting to the encoded file.
- *
- * In principle, locked mounts can prevent the caller from mounting the
- * filesystem but that only applies to procfs and sysfs neither of which
- * support decoding file handles.
- */
 static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
 				 unsigned int o_flags)
 {
 	struct path *root = &ctx->root;
+	struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
+
+	if (nop && nop->flags & EXPORT_OP_UNRESTRICTED_OPEN)
+		return true;
+
+	if (capable(CAP_DAC_READ_SEARCH))
+		return true;
 
 	/*
+	 * Allow relaxed permissions of file handles if the caller has the
+	 * ability to mount the filesystem or create a bind-mount of the
+	 * provided @mountdirfd.
+	 *
+	 * In both cases the caller may be able to get an unobstructed way to
+	 * the encoded file handle. If the caller is only able to create a
+	 * bind-mount we need to verify that there are no locked mounts on top
+	 * of it that could prevent us from getting to the encoded file.
+	 *
+	 * In principle, locked mounts can prevent the caller from mounting the
+	 * filesystem but that only applies to procfs and sysfs neither of which
+	 * support decoding file handles.
+	 *
 	 * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
 	 * confusing api in the face of disconnected non-dir dentries.
 	 *
@@ -293,7 +299,7 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
 	if (retval)
 		goto out_err;
 
-	if (!capable(CAP_DAC_READ_SEARCH) && !may_decode_fh(&ctx, o_flags)) {
+	if (!may_decode_fh(&ctx, o_flags)) {
 		retval = -EPERM;
 		goto out_path;
 	}
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index 893a1d21dc1c4abc7e52325d7a4cf0adb407f039..459508b53e77ed0597cee217ffe3d82cc7cc11a4 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -247,6 +247,9 @@ struct export_operations {
 						*/
 #define EXPORT_OP_FLUSH_ON_CLOSE	(0x20) /* fs flushes file data on close */
 #define EXPORT_OP_ASYNC_LOCK		(0x40) /* fs can do async lock request */
+#define EXPORT_OP_UNRESTRICTED_OPEN	(0x80) /* FS allows open_by_handle_at
+						  without CAP_DAC_READ_SEARCH
+						*/
 	unsigned long	flags;
 };
 

-- 
2.46.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
  2024-11-13 17:55         ` [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check Erin Shepherd
@ 2024-11-13 22:50           ` kernel test robot
  2024-11-14  1:29           ` kernel test robot
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 57+ messages in thread
From: kernel test robot @ 2024-11-13 22:50 UTC (permalink / raw)
  To: Erin Shepherd, Christian Brauner, Alexander Viro, Jan Kara,
	Chuck Lever
  Cc: llvm, oe-kbuild-all, linux-fsdevel, linux-kernel, Jeff Layton,
	Amir Goldstein, linux-nfs, Erin Shepherd

Hi Erin,

kernel test robot noticed the following build errors:

[auto build test ERROR on 14b6320953a3f856a3f93bf9a0e423395baa593d]

url:    https://github.com/intel-lab-lkp/linux/commits/Erin-Shepherd/pseudofs-add-support-for-export_ops/20241114-020539
base:   14b6320953a3f856a3f93bf9a0e423395baa593d
patch link:    https://lore.kernel.org/r/20241113-pidfs_fh-v2-2-9a4d28155a37%40e43.eu
patch subject: [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20241114/202411140603.E03vXsdz-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241114/202411140603.E03vXsdz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411140603.E03vXsdz-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from fs/fhandle.c:2:
   In file included from include/linux/syscalls.h:93:
   In file included from include/trace/syscall.h:7:
   In file included from include/linux/trace_events.h:6:
   In file included from include/linux/ring_buffer.h:5:
   In file included from include/linux/mm.h:2213:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
>> fs/fhandle.c:242:28: error: initializing 'struct export_operations *' with an expression of type 'const struct export_operations *' discards qualifiers [-Werror,-Wincompatible-pointer-types-discards-qualifiers]
     242 |         struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
         |                                   ^     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   1 warning and 1 error generated.


vim +242 fs/fhandle.c

   237	
   238	static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
   239					 unsigned int o_flags)
   240	{
   241		struct path *root = &ctx->root;
 > 242		struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
   243	
   244		if (nop && nop->flags & EXPORT_OP_UNRESTRICTED_OPEN)
   245			return true;
   246	
   247		if (capable(CAP_DAC_READ_SEARCH))
   248			return true;
   249	
   250		/*
   251		 * Allow relaxed permissions of file handles if the caller has the
   252		 * ability to mount the filesystem or create a bind-mount of the
   253		 * provided @mountdirfd.
   254		 *
   255		 * In both cases the caller may be able to get an unobstructed way to
   256		 * the encoded file handle. If the caller is only able to create a
   257		 * bind-mount we need to verify that there are no locked mounts on top
   258		 * of it that could prevent us from getting to the encoded file.
   259		 *
   260		 * In principle, locked mounts can prevent the caller from mounting the
   261		 * filesystem but that only applies to procfs and sysfs neither of which
   262		 * support decoding file handles.
   263		 *
   264		 * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
   265		 * confusing api in the face of disconnected non-dir dentries.
   266		 *
   267		 * There's only one dentry for each directory inode (VFS rule)...
   268		 */
   269		if (!(o_flags & O_DIRECTORY))
   270			return false;
   271	
   272		if (ns_capable(root->mnt->mnt_sb->s_user_ns, CAP_SYS_ADMIN))
   273			ctx->flags = HANDLE_CHECK_PERMS;
   274		else if (is_mounted(root->mnt) &&
   275			 ns_capable(real_mount(root->mnt)->mnt_ns->user_ns,
   276				    CAP_SYS_ADMIN) &&
   277			 !has_locked_children(real_mount(root->mnt), root->dentry))
   278			ctx->flags = HANDLE_CHECK_PERMS | HANDLE_CHECK_SUBTREE;
   279		else
   280			return false;
   281	
   282		/* Are we able to override DAC permissions? */
   283		if (!ns_capable(current_user_ns(), CAP_DAC_READ_SEARCH))
   284			return false;
   285	
   286		ctx->fh_flags = EXPORT_FH_DIR_ONLY;
   287		return true;
   288	}
   289	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
  2024-11-13 17:55         ` [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check Erin Shepherd
  2024-11-13 22:50           ` kernel test robot
@ 2024-11-14  1:29           ` kernel test robot
  2024-11-14  4:37           ` Christoph Hellwig
  2024-11-14  6:37           ` Amir Goldstein
  3 siblings, 0 replies; 57+ messages in thread
From: kernel test robot @ 2024-11-14  1:29 UTC (permalink / raw)
  To: Erin Shepherd, Christian Brauner, Alexander Viro, Jan Kara,
	Chuck Lever
  Cc: oe-kbuild-all, linux-fsdevel, linux-kernel, Jeff Layton,
	Amir Goldstein, linux-nfs, Erin Shepherd

Hi Erin,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 14b6320953a3f856a3f93bf9a0e423395baa593d]

url:    https://github.com/intel-lab-lkp/linux/commits/Erin-Shepherd/pseudofs-add-support-for-export_ops/20241114-020539
base:   14b6320953a3f856a3f93bf9a0e423395baa593d
patch link:    https://lore.kernel.org/r/20241113-pidfs_fh-v2-2-9a4d28155a37%40e43.eu
patch subject: [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
config: openrisc-defconfig (https://download.01.org/0day-ci/archive/20241114/202411140905.a0ntnQQG-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241114/202411140905.a0ntnQQG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411140905.a0ntnQQG-lkp@intel.com/

All warnings (new ones prefixed by >>):

   fs/fhandle.c: In function 'may_decode_fh':
>> fs/fhandle.c:242:41: warning: initialization discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
     242 |         struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
         |                                         ^~~~


vim +/const +242 fs/fhandle.c

   237	
   238	static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
   239					 unsigned int o_flags)
   240	{
   241		struct path *root = &ctx->root;
 > 242		struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
   243	
   244		if (nop && nop->flags & EXPORT_OP_UNRESTRICTED_OPEN)
   245			return true;
   246	
   247		if (capable(CAP_DAC_READ_SEARCH))
   248			return true;
   249	
   250		/*
   251		 * Allow relaxed permissions of file handles if the caller has the
   252		 * ability to mount the filesystem or create a bind-mount of the
   253		 * provided @mountdirfd.
   254		 *
   255		 * In both cases the caller may be able to get an unobstructed way to
   256		 * the encoded file handle. If the caller is only able to create a
   257		 * bind-mount we need to verify that there are no locked mounts on top
   258		 * of it that could prevent us from getting to the encoded file.
   259		 *
   260		 * In principle, locked mounts can prevent the caller from mounting the
   261		 * filesystem but that only applies to procfs and sysfs neither of which
   262		 * support decoding file handles.
   263		 *
   264		 * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
   265		 * confusing api in the face of disconnected non-dir dentries.
   266		 *
   267		 * There's only one dentry for each directory inode (VFS rule)...
   268		 */
   269		if (!(o_flags & O_DIRECTORY))
   270			return false;
   271	
   272		if (ns_capable(root->mnt->mnt_sb->s_user_ns, CAP_SYS_ADMIN))
   273			ctx->flags = HANDLE_CHECK_PERMS;
   274		else if (is_mounted(root->mnt) &&
   275			 ns_capable(real_mount(root->mnt)->mnt_ns->user_ns,
   276				    CAP_SYS_ADMIN) &&
   277			 !has_locked_children(real_mount(root->mnt), root->dentry))
   278			ctx->flags = HANDLE_CHECK_PERMS | HANDLE_CHECK_SUBTREE;
   279		else
   280			return false;
   281	
   282		/* Are we able to override DAC permissions? */
   283		if (!ns_capable(current_user_ns(), CAP_DAC_READ_SEARCH))
   284			return false;
   285	
   286		ctx->fh_flags = EXPORT_FH_DIR_ONLY;
   287		return true;
   288	}
   289	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
  2024-11-13 17:55         ` [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check Erin Shepherd
  2024-11-13 22:50           ` kernel test robot
  2024-11-14  1:29           ` kernel test robot
@ 2024-11-14  4:37           ` Christoph Hellwig
  2024-11-14 12:56             ` Erin Shepherd
  2024-11-14  6:37           ` Amir Goldstein
  3 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2024-11-14  4:37 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, Amir Goldstein,
	linux-nfs

On Wed, Nov 13, 2024 at 05:55:24PM +0000, Erin Shepherd wrote:
> For pidfs, there is no reason to restrict file handle decoding by
> CAP_DAC_READ_SEARCH.

Why is there no reason, i.e. why do you think it is safe.

>Introduce an export_ops flag that can indicate
> this

Also why is is desirable?

To be this looks more than sketchy with the actual exporting hat on,
but I guess that's now how the cool kids use open by handle these days.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
  2024-11-14  4:37           ` Christoph Hellwig
@ 2024-11-14 12:56             ` Erin Shepherd
  0 siblings, 0 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-14 12:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, Amir Goldstein,
	linux-nfs


On 14/11/2024 05:37, Christoph Hellwig wrote:
> On Wed, Nov 13, 2024 at 05:55:24PM +0000, Erin Shepherd wrote:
>> For pidfs, there is no reason to restrict file handle decoding by
>> CAP_DAC_READ_SEARCH.
> Why is there no reason, i.e. why do you think it is safe.

A process can use both open_by_handle_at to open the exact same set of
pidfds as they can by pidfd_open. i.e. there is no reason to additionally
restrict access to the former API.

>> Introduce an export_ops flag that can indicate
>> this
> Also why is is desirable?
>
> To be this looks more than sketchy with the actual exporting hat on,
> but I guess that's now how the cool kids use open by handle these days.
Right - we have a bunch of API file systems where userspace wants stable
non-reused file references for the same reasons network filesystems do.
The first example of this was cgroupfs, but the same rationale exists for
pidfs and process tracking.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
  2024-11-13 17:55         ` [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check Erin Shepherd
                             ` (2 preceding siblings ...)
  2024-11-14  4:37           ` Christoph Hellwig
@ 2024-11-14  6:37           ` Amir Goldstein
  2024-11-14 14:16             ` Christian Brauner
  3 siblings, 1 reply; 57+ messages in thread
From: Amir Goldstein @ 2024-11-14  6:37 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs

On Wed, Nov 13, 2024 at 8:11 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> For pidfs, there is no reason to restrict file handle decoding by
> CAP_DAC_READ_SEARCH. Introduce an export_ops flag that can indicate
> this
>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> ---
>  fs/fhandle.c             | 36 +++++++++++++++++++++---------------
>  include/linux/exportfs.h |  3 +++
>  2 files changed, 24 insertions(+), 15 deletions(-)
>
> diff --git a/fs/fhandle.c b/fs/fhandle.c
> index 82df28d45cd70a7df525f50bbb398d646110cd99..056116e58f43983bc7bb86da170fb554c7a2fac7 100644
> --- a/fs/fhandle.c
> +++ b/fs/fhandle.c
> @@ -235,26 +235,32 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
>         return 0;
>  }
>
> -/*
> - * Allow relaxed permissions of file handles if the caller has the
> - * ability to mount the filesystem or create a bind-mount of the
> - * provided @mountdirfd.
> - *
> - * In both cases the caller may be able to get an unobstructed way to
> - * the encoded file handle. If the caller is only able to create a
> - * bind-mount we need to verify that there are no locked mounts on top
> - * of it that could prevent us from getting to the encoded file.
> - *
> - * In principle, locked mounts can prevent the caller from mounting the
> - * filesystem but that only applies to procfs and sysfs neither of which
> - * support decoding file handles.
> - */
>  static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
>                                  unsigned int o_flags)
>  {
>         struct path *root = &ctx->root;
> +       struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
> +
> +       if (nop && nop->flags & EXPORT_OP_UNRESTRICTED_OPEN)
> +               return true;
> +
> +       if (capable(CAP_DAC_READ_SEARCH))
> +               return true;
>
>         /*
> +        * Allow relaxed permissions of file handles if the caller has the
> +        * ability to mount the filesystem or create a bind-mount of the
> +        * provided @mountdirfd.
> +        *
> +        * In both cases the caller may be able to get an unobstructed way to
> +        * the encoded file handle. If the caller is only able to create a
> +        * bind-mount we need to verify that there are no locked mounts on top
> +        * of it that could prevent us from getting to the encoded file.
> +        *
> +        * In principle, locked mounts can prevent the caller from mounting the
> +        * filesystem but that only applies to procfs and sysfs neither of which
> +        * support decoding file handles.
> +        *
>          * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
>          * confusing api in the face of disconnected non-dir dentries.
>          *
> @@ -293,7 +299,7 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
>         if (retval)
>                 goto out_err;
>
> -       if (!capable(CAP_DAC_READ_SEARCH) && !may_decode_fh(&ctx, o_flags)) {
> +       if (!may_decode_fh(&ctx, o_flags)) {
>                 retval = -EPERM;
>                 goto out_path;
>         }
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 893a1d21dc1c4abc7e52325d7a4cf0adb407f039..459508b53e77ed0597cee217ffe3d82cc7cc11a4 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -247,6 +247,9 @@ struct export_operations {
>                                                 */
>  #define EXPORT_OP_FLUSH_ON_CLOSE       (0x20) /* fs flushes file data on close */
>  #define EXPORT_OP_ASYNC_LOCK           (0x40) /* fs can do async lock request */
> +#define EXPORT_OP_UNRESTRICTED_OPEN    (0x80) /* FS allows open_by_handle_at
> +                                                 without CAP_DAC_READ_SEARCH
> +                                               */

Don't love the name, but I wonder, isn't SB_NOUSER already a good
enough indication that CAP_DAC_READ_SEARCH is irrelevant?

Essentially, mnt_fd is the user's proof that they can access the mount
and CAP_DAC_READ_SEARCH is the legacy "proof" that the user can
reach from mount the inode by path lookup.

Which reminds me, what is the mnt_fd expected for opening a pidfd
file by handle?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check
  2024-11-14  6:37           ` Amir Goldstein
@ 2024-11-14 14:16             ` Christian Brauner
  0 siblings, 0 replies; 57+ messages in thread
From: Christian Brauner @ 2024-11-14 14:16 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Erin Shepherd, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs

On Thu, Nov 14, 2024 at 07:37:19AM +0100, Amir Goldstein wrote:
> On Wed, Nov 13, 2024 at 8:11 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
> >
> > For pidfs, there is no reason to restrict file handle decoding by
> > CAP_DAC_READ_SEARCH. Introduce an export_ops flag that can indicate
> > this
> >
> > Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> > ---
> >  fs/fhandle.c             | 36 +++++++++++++++++++++---------------
> >  include/linux/exportfs.h |  3 +++
> >  2 files changed, 24 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/fhandle.c b/fs/fhandle.c
> > index 82df28d45cd70a7df525f50bbb398d646110cd99..056116e58f43983bc7bb86da170fb554c7a2fac7 100644
> > --- a/fs/fhandle.c
> > +++ b/fs/fhandle.c
> > @@ -235,26 +235,32 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
> >         return 0;
> >  }
> >
> > -/*
> > - * Allow relaxed permissions of file handles if the caller has the
> > - * ability to mount the filesystem or create a bind-mount of the
> > - * provided @mountdirfd.
> > - *
> > - * In both cases the caller may be able to get an unobstructed way to
> > - * the encoded file handle. If the caller is only able to create a
> > - * bind-mount we need to verify that there are no locked mounts on top
> > - * of it that could prevent us from getting to the encoded file.
> > - *
> > - * In principle, locked mounts can prevent the caller from mounting the
> > - * filesystem but that only applies to procfs and sysfs neither of which
> > - * support decoding file handles.
> > - */
> >  static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
> >                                  unsigned int o_flags)
> >  {
> >         struct path *root = &ctx->root;
> > +       struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
> > +
> > +       if (nop && nop->flags & EXPORT_OP_UNRESTRICTED_OPEN)
> > +               return true;
> > +
> > +       if (capable(CAP_DAC_READ_SEARCH))
> > +               return true;
> >
> >         /*
> > +        * Allow relaxed permissions of file handles if the caller has the
> > +        * ability to mount the filesystem or create a bind-mount of the
> > +        * provided @mountdirfd.
> > +        *
> > +        * In both cases the caller may be able to get an unobstructed way to
> > +        * the encoded file handle. If the caller is only able to create a
> > +        * bind-mount we need to verify that there are no locked mounts on top
> > +        * of it that could prevent us from getting to the encoded file.
> > +        *
> > +        * In principle, locked mounts can prevent the caller from mounting the
> > +        * filesystem but that only applies to procfs and sysfs neither of which
> > +        * support decoding file handles.
> > +        *
> >          * Restrict to O_DIRECTORY to provide a deterministic API that avoids a
> >          * confusing api in the face of disconnected non-dir dentries.
> >          *
> > @@ -293,7 +299,7 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
> >         if (retval)
> >                 goto out_err;
> >
> > -       if (!capable(CAP_DAC_READ_SEARCH) && !may_decode_fh(&ctx, o_flags)) {
> > +       if (!may_decode_fh(&ctx, o_flags)) {
> >                 retval = -EPERM;
> >                 goto out_path;
> >         }
> > diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> > index 893a1d21dc1c4abc7e52325d7a4cf0adb407f039..459508b53e77ed0597cee217ffe3d82cc7cc11a4 100644
> > --- a/include/linux/exportfs.h
> > +++ b/include/linux/exportfs.h
> > @@ -247,6 +247,9 @@ struct export_operations {
> >                                                 */
> >  #define EXPORT_OP_FLUSH_ON_CLOSE       (0x20) /* fs flushes file data on close */
> >  #define EXPORT_OP_ASYNC_LOCK           (0x40) /* fs can do async lock request */
> > +#define EXPORT_OP_UNRESTRICTED_OPEN    (0x80) /* FS allows open_by_handle_at
> > +                                                 without CAP_DAC_READ_SEARCH
> > +                                               */
> 
> Don't love the name, but I wonder, isn't SB_NOUSER already a good
> enough indication that CAP_DAC_READ_SEARCH is irrelevant?
> 
> Essentially, mnt_fd is the user's proof that they can access the mount
> and CAP_DAC_READ_SEARCH is the legacy "proof" that the user can
> reach from mount the inode by path lookup.
> 
> Which reminds me, what is the mnt_fd expected for opening a pidfd
> file by handle?

int pidfd_self = pidfd_open(getpid(), 0);
open_by_handle_at(pidfd_self, ...);

is sufficient.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v2 3/3] pidfs: implement file handle support
  2024-11-13 17:55       ` [PATCH v2 0/3] " Erin Shepherd
  2024-11-13 17:55         ` [PATCH v2 1/3] pseudofs: add support for export_ops Erin Shepherd
  2024-11-13 17:55         ` [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check Erin Shepherd
@ 2024-11-13 17:55         ` Erin Shepherd
  2024-11-14  7:07           ` Amir Goldstein
  2024-11-14 12:52           ` Christian Brauner
  2024-11-14  7:02         ` [PATCH v2 0/3] " Amir Goldstein
  3 siblings, 2 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-13 17:55 UTC (permalink / raw)
  To: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, Jeff Layton, Amir Goldstein,
	linux-nfs, Erin Shepherd

On 64-bit platforms, userspace can read the pidfd's inode in order to
get a never-repeated PID identifier. On 32-bit platforms this identifier
is not exposed, as inodes are limited to 32 bits. Instead expose the
identifier via export_fh, which makes it available to userspace via
name_to_handle_at

In addition we implement fh_to_dentry, which allows userspace to
recover a pidfd from a PID file handle.

We stash the process' PID in the root pid namespace inside the handle,
and use that to recover the pid (validating that pid->ino matches the
value in the handle, i.e. that the pid has not been reused).

We use the root namespace in order to ensure that file handles can be
moved across namespaces; however, we validate that the PID exists in
the current namespace before returning the inode.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
---
 fs/pidfs.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 80675b6bf88459c22787edaa68db360bdc0d0782..0684a9b8fe71c5205fb153b2714bc9c672045fd5 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/anon_inodes.h>
+#include <linux/exportfs.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/magic.h>
@@ -347,11 +348,69 @@ static const struct dentry_operations pidfs_dentry_operations = {
 	.d_prune	= stashed_dentry_prune,
 };
 
+#define PIDFD_FID_LEN 3
+
+struct pidfd_fid {
+	u64 ino;
+	s32 pid;
+} __packed;
+
+static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
+			   struct inode *parent)
+{
+	struct pid *pid = inode->i_private;
+	struct pidfd_fid *fid = (struct pidfd_fid *)fh;
+
+	if (*max_len < PIDFD_FID_LEN) {
+		*max_len = PIDFD_FID_LEN;
+		return FILEID_INVALID;
+	}
+
+	fid->ino = pid->ino;
+	fid->pid = pid_nr(pid);
+	*max_len = PIDFD_FID_LEN;
+	return FILEID_INO64_GEN;
+}
+
+static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
+					 struct fid *gen_fid,
+					 int fh_len, int fh_type)
+{
+	int ret;
+	struct path path;
+	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
+	struct pid *pid;
+
+	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
+		return NULL;
+
+	scoped_guard(rcu) {
+		pid = find_pid_ns(fid->pid, &init_pid_ns);
+		if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0)
+			return NULL;
+
+		pid = get_pid(pid);
+	}
+
+	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	mntput(path.mnt);
+	return path.dentry;
+}
+
+static const struct export_operations pidfs_export_operations = {
+	.encode_fh = pidfs_encode_fh,
+	.fh_to_dentry = pidfs_fh_to_dentry,
+	.flags = EXPORT_OP_UNRESTRICTED_OPEN,
+};
+
 static int pidfs_init_inode(struct inode *inode, void *data)
 {
 	inode->i_private = data;
 	inode->i_flags |= S_PRIVATE;
-	inode->i_mode |= S_IRWXU;
+	inode->i_mode |= S_IRWXU | S_IRWXG | S_IRWXO;
 	inode->i_op = &pidfs_inode_operations;
 	inode->i_fop = &pidfs_file_operations;
 	/*
@@ -382,6 +441,7 @@ static int pidfs_init_fs_context(struct fs_context *fc)
 		return -ENOMEM;
 
 	ctx->ops = &pidfs_sops;
+	ctx->eops = &pidfs_export_operations;
 	ctx->dops = &pidfs_dentry_operations;
 	fc->s_fs_info = (void *)&pidfs_stashed_ops;
 	return 0;

-- 
2.46.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 3/3] pidfs: implement file handle support
  2024-11-13 17:55         ` [PATCH v2 3/3] pidfs: implement file handle support Erin Shepherd
@ 2024-11-14  7:07           ` Amir Goldstein
  2024-11-14 12:42             ` Erin Shepherd
  2024-11-14 12:52           ` Christian Brauner
  1 sibling, 1 reply; 57+ messages in thread
From: Amir Goldstein @ 2024-11-14  7:07 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs

On Wed, Nov 13, 2024 at 7:11 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> On 64-bit platforms, userspace can read the pidfd's inode in order to
> get a never-repeated PID identifier. On 32-bit platforms this identifier
> is not exposed, as inodes are limited to 32 bits. Instead expose the
> identifier via export_fh, which makes it available to userspace via
> name_to_handle_at
>
> In addition we implement fh_to_dentry, which allows userspace to
> recover a pidfd from a PID file handle.

"In addition" is a good indication that a separate patch was a good idea..

>
> We stash the process' PID in the root pid namespace inside the handle,
> and use that to recover the pid (validating that pid->ino matches the
> value in the handle, i.e. that the pid has not been reused).
>
> We use the root namespace in order to ensure that file handles can be
> moved across namespaces; however, we validate that the PID exists in
> the current namespace before returning the inode.
>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>

This patch has changed enough that you should not have kept my RVB.

BTW, please refrain from using the same subject for the cover letter and
a single patch. They are not the same thing, so if they get the same
name, one of them has an inaccurate description.

> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> ---
>  fs/pidfs.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 61 insertions(+), 1 deletion(-)
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 80675b6bf88459c22787edaa68db360bdc0d0782..0684a9b8fe71c5205fb153b2714bc9c672045fd5 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/anon_inodes.h>
> +#include <linux/exportfs.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
>  #include <linux/magic.h>
> @@ -347,11 +348,69 @@ static const struct dentry_operations pidfs_dentry_operations = {
>         .d_prune        = stashed_dentry_prune,
>  };
>
> +#define PIDFD_FID_LEN 3
> +
> +struct pidfd_fid {
> +       u64 ino;
> +       s32 pid;
> +} __packed;
> +
> +static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
> +                          struct inode *parent)
> +{
> +       struct pid *pid = inode->i_private;
> +       struct pidfd_fid *fid = (struct pidfd_fid *)fh;
> +
> +       if (*max_len < PIDFD_FID_LEN) {
> +               *max_len = PIDFD_FID_LEN;
> +               return FILEID_INVALID;
> +       }
> +
> +       fid->ino = pid->ino;
> +       fid->pid = pid_nr(pid);
> +       *max_len = PIDFD_FID_LEN;
> +       return FILEID_INO64_GEN;
> +}
> +
> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
> +                                        struct fid *gen_fid,
> +                                        int fh_len, int fh_type)
> +{
> +       int ret;
> +       struct path path;
> +       struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
> +       struct pid *pid;
> +
> +       if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
> +               return NULL;
> +
> +       scoped_guard(rcu) {
> +               pid = find_pid_ns(fid->pid, &init_pid_ns);
> +               if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0)
> +                       return NULL;
> +
> +               pid = get_pid(pid);
> +       }
> +
> +       ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
> +       if (ret < 0)
> +               return ERR_PTR(ret);

How come no need to put_pid() in case of error?

> +
> +       mntput(path.mnt);
> +       return path.dentry;
> +}
> +
> +static const struct export_operations pidfs_export_operations = {
> +       .encode_fh = pidfs_encode_fh,
> +       .fh_to_dentry = pidfs_fh_to_dentry,
> +       .flags = EXPORT_OP_UNRESTRICTED_OPEN,
> +};
> +
>  static int pidfs_init_inode(struct inode *inode, void *data)
>  {
>         inode->i_private = data;
>         inode->i_flags |= S_PRIVATE;
> -       inode->i_mode |= S_IRWXU;
> +       inode->i_mode |= S_IRWXU | S_IRWXG | S_IRWXO;

This change is not explained.
Why is it here?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 3/3] pidfs: implement file handle support
  2024-11-14  7:07           ` Amir Goldstein
@ 2024-11-14 12:42             ` Erin Shepherd
  0 siblings, 0 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-14 12:42 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs


On 14/11/2024 08:07, Amir Goldstein wrote:
> On Wed, Nov 13, 2024 at 7:11 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>> On 64-bit platforms, userspace can read the pidfd's inode in order to
>> get a never-repeated PID identifier. On 32-bit platforms this identifier
>> is not exposed, as inodes are limited to 32 bits. Instead expose the
>> identifier via export_fh, which makes it available to userspace via
>> name_to_handle_at
>>
>> In addition we implement fh_to_dentry, which allows userspace to
>> recover a pidfd from a PID file handle.
> "In addition" is a good indication that a separate patch was a good idea..
>
>> We stash the process' PID in the root pid namespace inside the handle,
>> and use that to recover the pid (validating that pid->ino matches the
>> value in the handle, i.e. that the pid has not been reused).
>>
>> We use the root namespace in order to ensure that file handles can be
>> moved across namespaces; however, we validate that the PID exists in
>> the current namespace before returning the inode.
>>
>> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> This patch has changed enough that you should not have kept my RVB.
>
> BTW, please refrain from using the same subject for the cover letter and
> a single patch. They are not the same thing, so if they get the same
> name, one of them has an inaccurate description.
>
ACK to all three.


>> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
>> ---
>>  fs/pidfs.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 61 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/pidfs.c b/fs/pidfs.c
>> index 80675b6bf88459c22787edaa68db360bdc0d0782..0684a9b8fe71c5205fb153b2714bc9c672045fd5 100644
>> --- a/fs/pidfs.c
>> +++ b/fs/pidfs.c
>> @@ -1,5 +1,6 @@
>>  // SPDX-License-Identifier: GPL-2.0
>>  #include <linux/anon_inodes.h>
>> +#include <linux/exportfs.h>
>>  #include <linux/file.h>
>>  #include <linux/fs.h>
>>  #include <linux/magic.h>
>> @@ -347,11 +348,69 @@ static const struct dentry_operations pidfs_dentry_operations = {
>>         .d_prune        = stashed_dentry_prune,
>>  };
>>
>> +#define PIDFD_FID_LEN 3
>> +
>> +struct pidfd_fid {
>> +       u64 ino;
>> +       s32 pid;
>> +} __packed;
>> +
>> +static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
>> +                          struct inode *parent)
>> +{
>> +       struct pid *pid = inode->i_private;
>> +       struct pidfd_fid *fid = (struct pidfd_fid *)fh;
>> +
>> +       if (*max_len < PIDFD_FID_LEN) {
>> +               *max_len = PIDFD_FID_LEN;
>> +               return FILEID_INVALID;
>> +       }
>> +
>> +       fid->ino = pid->ino;
>> +       fid->pid = pid_nr(pid);
>> +       *max_len = PIDFD_FID_LEN;
>> +       return FILEID_INO64_GEN;
>> +}
>> +
>> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
>> +                                        struct fid *gen_fid,
>> +                                        int fh_len, int fh_type)
>> +{
>> +       int ret;
>> +       struct path path;
>> +       struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
>> +       struct pid *pid;
>> +
>> +       if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
>> +               return NULL;
>> +
>> +       scoped_guard(rcu) {
>> +               pid = find_pid_ns(fid->pid, &init_pid_ns);
>> +               if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0)
>> +                       return NULL;
>> +
>> +               pid = get_pid(pid);
>> +       }
>> +
>> +       ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
>> +       if (ret < 0)
>> +               return ERR_PTR(ret);
> How come no need to put_pid() in case of error?

This one confused me at first too, but path_from_stashed frees it (via
stashed_ops.put_data) on error. You can see the same pattern in
pidfs_alloc_file.

(It already needs to know how to free it for the case where a stashed
dentry already exists)

>> +
>> +       mntput(path.mnt);
>> +       return path.dentry;
>> +}
>> +
>> +static const struct export_operations pidfs_export_operations = {
>> +       .encode_fh = pidfs_encode_fh,
>> +       .fh_to_dentry = pidfs_fh_to_dentry,
>> +       .flags = EXPORT_OP_UNRESTRICTED_OPEN,
>> +};
>> +
>>  static int pidfs_init_inode(struct inode *inode, void *data)
>>  {
>>         inode->i_private = data;
>>         inode->i_flags |= S_PRIVATE;
>> -       inode->i_mode |= S_IRWXU;
>> +       inode->i_mode |= S_IRWXU | S_IRWXG | S_IRWXO;
> This change is not explained.
> Why is it here?

open_by_handle_at eventually passes through the may_open permission check.
The existing permissions only permits root to open them (since the owning
uid & gid is 0). So, it was necessary to widen them to align with how
pidfd_open works.

If I stick with this approach (see [1]) I'll ensure to document this change
better in the commit message.

[1] https://lore.kernel.org/all/6a3ed633-311d-47ff-8a7e-5121d6186139@e43.eu/



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 3/3] pidfs: implement file handle support
  2024-11-13 17:55         ` [PATCH v2 3/3] pidfs: implement file handle support Erin Shepherd
  2024-11-14  7:07           ` Amir Goldstein
@ 2024-11-14 12:52           ` Christian Brauner
  2024-11-14 13:13             ` Erin Shepherd
  1 sibling, 1 reply; 57+ messages in thread
From: Christian Brauner @ 2024-11-14 12:52 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Alexander Viro, Jan Kara, Chuck Lever, linux-fsdevel,
	linux-kernel, Jeff Layton, Amir Goldstein, linux-nfs

On Wed, Nov 13, 2024 at 05:55:25PM +0000, Erin Shepherd wrote:
> On 64-bit platforms, userspace can read the pidfd's inode in order to
> get a never-repeated PID identifier. On 32-bit platforms this identifier
> is not exposed, as inodes are limited to 32 bits. Instead expose the
> identifier via export_fh, which makes it available to userspace via
> name_to_handle_at
> 
> In addition we implement fh_to_dentry, which allows userspace to
> recover a pidfd from a PID file handle.
> 
> We stash the process' PID in the root pid namespace inside the handle,
> and use that to recover the pid (validating that pid->ino matches the
> value in the handle, i.e. that the pid has not been reused).
> 
> We use the root namespace in order to ensure that file handles can be
> moved across namespaces; however, we validate that the PID exists in
> the current namespace before returning the inode.
> 
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> ---

I think you need at least something like the following completely
untested draft on top:

- the pidfs_finish_open_by_handle_at() is somewhat of a clutch to handle
  thread vs thread-group pidfds but it works.

- In contrast to pidfd_open() that uses dentry_open() to create a pidfd
  open_by_handle_at() uses file_open_root(). That's overall fine but
  makes pidfds subject to security hooks which they aren't via
  pidfd_open(). It also necessitats pidfs_finish_open_by_handle_at().
  There's probably other solutions I'm not currently seeing.

- The exportfs_decode_fh_raw() call that's used to decode the pidfd is
  passed vfs_dentry_acceptable() as acceptability callback. For pidfds
  we don't need any of that functionality and we don't need any of the
  disconnected dentry handling logic. So the easiest way to fix that is
  to rely on EXPORT_OP_UNRESTRICTED_OPEN to skip everything. That in
  turns means the only acceptability we have is the nop->fh_to_dentry()
  callback for pidfs.

- This all really needs rigorous selftests before we can even think of
  merging any of this.

Anway, here's the _completely untested, quickly drafted_ diff on top:

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 4f2dd4ab4486..65c93f7132d4 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -450,6 +450,13 @@ exportfs_decode_fh_raw(struct vfsmount *mnt, struct fid *fid, int fh_len,
 		goto err_result;
 	}
 
+	/*
+	 * The filesystem has no acceptance criteria other than those in
+	 * nop->fh_to_dentry().
+	 */
+	if (nop->flags & EXPORT_OP_UNRESTRICTED_OPEN)
+		return result;
+
 	/*
 	 * If no acceptance criteria was specified by caller, a disconnected
 	 * dentry is also accepatable. Callers may use this mode to query if
diff --git a/fs/fhandle.c b/fs/fhandle.c
index 056116e58f43..89c2efacc0c3 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -11,6 +11,7 @@
 #include <linux/personality.h>
 #include <linux/uaccess.h>
 #include <linux/compat.h>
+#include <linux/pidfs.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -218,20 +219,21 @@ static int do_handle_to_path(struct file_handle *handle, struct path *path,
 {
 	int handle_dwords;
 	struct vfsmount *mnt = ctx->root.mnt;
+	struct dentry *dentry;
 
 	/* change the handle size to multiple of sizeof(u32) */
 	handle_dwords = handle->handle_bytes >> 2;
-	path->dentry = exportfs_decode_fh_raw(mnt,
-					  (struct fid *)handle->f_handle,
-					  handle_dwords, handle->handle_type,
-					  ctx->fh_flags,
-					  vfs_dentry_acceptable, ctx);
-	if (IS_ERR_OR_NULL(path->dentry)) {
-		if (path->dentry == ERR_PTR(-ENOMEM))
+	dentry = exportfs_decode_fh_raw(mnt, (struct fid *)handle->f_handle,
+					handle_dwords, handle->handle_type,
+					ctx->fh_flags, vfs_dentry_acceptable,
+					ctx);
+	if (IS_ERR_OR_NULL(dentry)) {
+		if (dentry == ERR_PTR(-ENOMEM))
 			return -ENOMEM;
 		return -ESTALE;
 	}
 	path->mnt = mntget(mnt);
+	path->dentry = dentry;
 	return 0;
 }
 
@@ -239,7 +241,7 @@ static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
 				 unsigned int o_flags)
 {
 	struct path *root = &ctx->root;
-	struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
+	const struct export_operations *nop = root->mnt->mnt_sb->s_export_op;
 
 	if (nop && nop->flags & EXPORT_OP_UNRESTRICTED_OPEN)
 		return true;
@@ -342,7 +344,7 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
 			   int open_flag)
 {
 	long retval = 0;
-	struct path path;
+	struct path path __free(path_put) = {};
 	struct file *file;
 	int fd;
 
@@ -351,19 +353,24 @@ static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
 		return retval;
 
 	fd = get_unused_fd_flags(open_flag);
-	if (fd < 0) {
-		path_put(&path);
+	if (fd < 0)
 		return fd;
-	}
+
 	file = file_open_root(&path, "", open_flag, 0);
 	if (IS_ERR(file)) {
 		put_unused_fd(fd);
-		retval =  PTR_ERR(file);
-	} else {
-		retval = fd;
-		fd_install(fd, file);
+		return PTR_ERR(file);
 	}
-	path_put(&path);
+
+	retval = pidfs_finish_open_by_handle_at(file, open_flag);
+	if (retval) {
+		put_unused_fd(fd);
+		fput(file);
+		return retval;
+	}
+
+	retval = fd;
+	fd_install(fd, file);
 	return retval;
 }
 
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 0684a9b8fe71..19948002f395 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -237,6 +237,24 @@ struct pid *pidfd_pid(const struct file *file)
 	return file_inode(file)->i_private;
 }
 
+int pidfs_finish_open_by_handle_at(struct file *file, unsigned int oflags)
+{
+	struct pid *pid;
+	bool thread = oflags & PIDFD_THREAD;
+
+	pid = pidfd_pid(file);
+	if (IS_ERR(pid))
+		return 0;
+
+	if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
+		return -EINVAL;
+
+	if (thread)
+		file->f_flags |= PIDFD_THREAD;
+
+	return 0;
+}
+
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
 #if BITS_PER_LONG == 32
@@ -377,7 +395,7 @@ static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
 					 int fh_len, int fh_type)
 {
 	int ret;
-	struct path path;
+	struct path path __free(path_put) = {};
 	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
 	struct pid *pid;
 
@@ -393,11 +411,12 @@ static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
 	}
 
 	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
-	if (ret < 0)
+	if (ret < 0) {
+		put_pid(pid);
 		return ERR_PTR(ret);
+	}
 
-	mntput(path.mnt);
-	return path.dentry;
+	return dget(path.dentry);
 }
 
 static const struct export_operations pidfs_export_operations = {
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 75bdf9807802..9a4130056e7d 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -3,6 +3,7 @@
 #define _LINUX_PID_FS_H
 
 struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
+int pidfs_finish_open_by_handle_at(struct file *file, unsigned int oflags);
 void __init pidfs_init(void);
 
 #endif /* _LINUX_PID_FS_H */


>  fs/pidfs.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 61 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 80675b6bf88459c22787edaa68db360bdc0d0782..0684a9b8fe71c5205fb153b2714bc9c672045fd5 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/anon_inodes.h>
> +#include <linux/exportfs.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
>  #include <linux/magic.h>
> @@ -347,11 +348,69 @@ static const struct dentry_operations pidfs_dentry_operations = {
>  	.d_prune	= stashed_dentry_prune,
>  };
>  
> +#define PIDFD_FID_LEN 3
> +
> +struct pidfd_fid {
> +	u64 ino;
> +	s32 pid;
> +} __packed;
> +
> +static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
> +			   struct inode *parent)
> +{
> +	struct pid *pid = inode->i_private;
> +	struct pidfd_fid *fid = (struct pidfd_fid *)fh;
> +
> +	if (*max_len < PIDFD_FID_LEN) {
> +		*max_len = PIDFD_FID_LEN;
> +		return FILEID_INVALID;
> +	}
> +
> +	fid->ino = pid->ino;
> +	fid->pid = pid_nr(pid);
> +	*max_len = PIDFD_FID_LEN;
> +	return FILEID_INO64_GEN;
> +}
> +
> +static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
> +					 struct fid *gen_fid,
> +					 int fh_len, int fh_type)
> +{
> +	int ret;
> +	struct path path;
> +	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
> +	struct pid *pid;
> +
> +	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
> +		return NULL;
> +
> +	scoped_guard(rcu) {
> +		pid = find_pid_ns(fid->pid, &init_pid_ns);
> +		if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0)
> +			return NULL;
> +
> +		pid = get_pid(pid);
> +	}
> +
> +	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
> +	if (ret < 0)
> +		return ERR_PTR(ret);
> +
> +	mntput(path.mnt);
> +	return path.dentry;
> +}
> +
> +static const struct export_operations pidfs_export_operations = {
> +	.encode_fh = pidfs_encode_fh,
> +	.fh_to_dentry = pidfs_fh_to_dentry,
> +	.flags = EXPORT_OP_UNRESTRICTED_OPEN,
> +};
> +
>  static int pidfs_init_inode(struct inode *inode, void *data)
>  {
>  	inode->i_private = data;
>  	inode->i_flags |= S_PRIVATE;
> -	inode->i_mode |= S_IRWXU;
> +	inode->i_mode |= S_IRWXU | S_IRWXG | S_IRWXO;
>  	inode->i_op = &pidfs_inode_operations;
>  	inode->i_fop = &pidfs_file_operations;
>  	/*
> @@ -382,6 +441,7 @@ static int pidfs_init_fs_context(struct fs_context *fc)
>  		return -ENOMEM;
>  
>  	ctx->ops = &pidfs_sops;
> +	ctx->eops = &pidfs_export_operations;
>  	ctx->dops = &pidfs_dentry_operations;
>  	fc->s_fs_info = (void *)&pidfs_stashed_ops;
>  	return 0;
> 
> -- 
> 2.46.1
> 

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 3/3] pidfs: implement file handle support
  2024-11-14 12:52           ` Christian Brauner
@ 2024-11-14 13:13             ` Erin Shepherd
  2024-11-14 14:13               ` Christian Brauner
  0 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-14 13:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Chuck Lever, linux-fsdevel,
	linux-kernel, Jeff Layton, Amir Goldstein, linux-nfs

On 14/11/2024 13:52, Christian Brauner wrote:

> I think you need at least something like the following completely
> untested draft on top:
>
> - the pidfs_finish_open_by_handle_at() is somewhat of a clutch to handle
>   thread vs thread-group pidfds but it works.
>
> - In contrast to pidfd_open() that uses dentry_open() to create a pidfd
>   open_by_handle_at() uses file_open_root(). That's overall fine but
>   makes pidfds subject to security hooks which they aren't via
>   pidfd_open(). It also necessitats pidfs_finish_open_by_handle_at().
>   There's probably other solutions I'm not currently seeing.

These two concerns combined with the special flag make me wonder if pidfs
is so much of a special snowflake we should just special case it up front
and skip all of the shared handle decode logic?

> - The exportfs_decode_fh_raw() call that's used to decode the pidfd is
>   passed vfs_dentry_acceptable() as acceptability callback. For pidfds
>   we don't need any of that functionality and we don't need any of the
>   disconnected dentry handling logic. So the easiest way to fix that is
>   to rely on EXPORT_OP_UNRESTRICTED_OPEN to skip everything. That in
>   turns means the only acceptability we have is the nop->fh_to_dentry()
>   callback for pidfs.

With the current logic we go exportfs_decode_fh_raw(...) ->
find_acceptable_alias(result) -> vfs_dentry_acceptable(context, result).


vfs_dentry_acceptable immediately returns 1 if ctx->flags is 0, which will
always be the case if EXPORT_OP_UNRESTRICTED_OPEN was set, so we immediately
fall back out of the call tree and return result.

So I'm not 100% sure we actually need this special case but I'm not opposed.

> - This all really needs rigorous selftests before we can even think of
>   merging any of this.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 3/3] pidfs: implement file handle support
  2024-11-14 13:13             ` Erin Shepherd
@ 2024-11-14 14:13               ` Christian Brauner
  2024-11-14 21:52                 ` Erin Shepherd
  0 siblings, 1 reply; 57+ messages in thread
From: Christian Brauner @ 2024-11-14 14:13 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Alexander Viro, Jan Kara, Chuck Lever, linux-fsdevel,
	linux-kernel, Jeff Layton, Amir Goldstein, linux-nfs

On Thu, Nov 14, 2024 at 02:13:06PM +0100, Erin Shepherd wrote:
> On 14/11/2024 13:52, Christian Brauner wrote:
> 
> > I think you need at least something like the following completely
> > untested draft on top:
> >
> > - the pidfs_finish_open_by_handle_at() is somewhat of a clutch to handle
> >   thread vs thread-group pidfds but it works.
> >
> > - In contrast to pidfd_open() that uses dentry_open() to create a pidfd
> >   open_by_handle_at() uses file_open_root(). That's overall fine but
> >   makes pidfds subject to security hooks which they aren't via
> >   pidfd_open(). It also necessitats pidfs_finish_open_by_handle_at().
> >   There's probably other solutions I'm not currently seeing.
> 
> These two concerns combined with the special flag make me wonder if pidfs
> is so much of a special snowflake we should just special case it up front
> and skip all of the shared handle decode logic?

Care to try a patch and see what it looks like?

> 
> > - The exportfs_decode_fh_raw() call that's used to decode the pidfd is
> >   passed vfs_dentry_acceptable() as acceptability callback. For pidfds
> >   we don't need any of that functionality and we don't need any of the
> >   disconnected dentry handling logic. So the easiest way to fix that is
> >   to rely on EXPORT_OP_UNRESTRICTED_OPEN to skip everything. That in
> >   turns means the only acceptability we have is the nop->fh_to_dentry()
> >   callback for pidfs.
> 
> With the current logic we go exportfs_decode_fh_raw(...) ->
> find_acceptable_alias(result) -> vfs_dentry_acceptable(context, result).
> vfs_dentry_acceptable immediately returns 1 if ctx->flags is 0, which will
> always be the case if EXPORT_OP_UNRESTRICTED_OPEN was set, so we immediately
> fall back out of the call tree and return result.
> 
> So I'm not 100% sure we actually need this special case but I'm not opposed.

Oh right, I completely forgot that find_acceptable_alias() is a no-op
for pidfs.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 3/3] pidfs: implement file handle support
  2024-11-14 14:13               ` Christian Brauner
@ 2024-11-14 21:52                 ` Erin Shepherd
  2024-11-15  7:50                   ` Amir Goldstein
  0 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-14 21:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Chuck Lever, linux-fsdevel,
	linux-kernel, Jeff Layton, Amir Goldstein, linux-nfs

On 14/11/2024 15:13, Christian Brauner wrote:

> On Thu, Nov 14, 2024 at 02:13:06PM +0100, Erin Shepherd wrote:
>> These two concerns combined with the special flag make me wonder if pidfs
>> is so much of a special snowflake we should just special case it up front
>> and skip all of the shared handle decode logic?
> Care to try a patch and see what it looks like?

The following is a completely untested sketch on top of the existing patch series.
Some notes:

- I made heavy use of the cleanup macros. I'm happy to convert things back to
  goto out_xx style if preferred - writing things this way just made bashing out
  the code without dropping resources on the floor easier
- If you don't implement fh_to_dentry then name_to_handle_at will just return an error
  unless called with AT_HANDLE_FID. We need to decide what to do about that
- The GET_PATH_FD_IS_NORMAL/etc constants don't match (what I see as) usual kernel style
  but I'm not sure how to conventionally express something like that

Otherwise, I'm fairly happy with how it came out in the end. Maybe handle_to_path and
do_handle_to_path could be collapsed into each other at this point.

diff --git a/fs/fhandle.c b/fs/fhandle.c
index 056116e58f43..697246085b69 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -11,6 +11,7 @@
 #include <linux/personality.h>
 #include <linux/uaccess.h>
 #include <linux/compat.h>
+#include <linux/pidfs.h>
 #include "internal.h"
 #include "mount.h"

@@ -130,6 +131,11 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name,
     return err;
 }

+enum {
+    GET_PATH_FD_IS_NORMAL = 0,
+    GET_PATH_FD_IS_PIDFD  = 1,
+};
+
 static int get_path_from_fd(int fd, struct path *root)
 {
     if (fd == AT_FDCWD) {
@@ -139,15 +145,16 @@ static int get_path_from_fd(int fd, struct path *root)
         path_get(root);
         spin_unlock(&fs->lock);
     } else {
-        struct fd f = fdget(fd);
+        CLASS(fd, f)(fd);
         if (!fd_file(f))
             return -EBADF;
+        if (pidfd_pid(fd_file(f)))
+            return GET_PATH_FD_IS_PIDFD;
         *root = fd_file(f)->f_path;
         path_get(root);
-        fdput(f);
     }

-    return 0;
+    return GET_PATH_FD_IS_NORMAL;
 }

 enum handle_to_path_flags {
@@ -287,84 +294,94 @@ static inline bool may_decode_fh(struct handle_to_path_ctx *ctx,
     return true;
 }

-static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
-           struct path *path, unsigned int o_flags)
+static int copy_handle_from_user(struct file_handle __user *ufh, struct file_handle **ret)
 {
-    int retval = 0;
-    struct file_handle f_handle;
-    struct file_handle *handle = NULL;
-    struct handle_to_path_ctx ctx = {};
-
-    retval = get_path_from_fd(mountdirfd, &ctx.root);
-    if (retval)
-        goto out_err;
-
-    if (!may_decode_fh(&ctx, o_flags)) {
-        retval = -EPERM;
-        goto out_path;
-    }
-
+    struct file_handle f_handle, *handle;
     if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle))) {
-        retval = -EFAULT;
-        goto out_path;
+        return -EFAULT;
     }
     if ((f_handle.handle_bytes > MAX_HANDLE_SZ) ||
         (f_handle.handle_bytes == 0)) {
-        retval = -EINVAL;
-        goto out_path;
+        return -EINVAL;
     }
     handle = kmalloc(struct_size(handle, f_handle, f_handle.handle_bytes),
              GFP_KERNEL);
     if (!handle) {
-        retval = -ENOMEM;
-        goto out_path;
+        return -ENOMEM;
     }
+
     /* copy the full handle */
     *handle = f_handle;
     if (copy_from_user(&handle->f_handle,
                &ufh->f_handle,
                f_handle.handle_bytes)) {
-        retval = -EFAULT;
-        goto out_handle;
+        kfree(handle);
+        return -EFAULT;
     }

-    retval = do_handle_to_path(handle, path, &ctx);
+    *ret = handle;
+    return 0;
+}

-out_handle:
-    kfree(handle);
-out_path:
-    path_put(&ctx.root);
-out_err:
-    return retval;
+static int handle_to_path(struct path root, struct file_handle __user *ufh,
+           struct path *path, unsigned int o_flags)
+{
+    struct file_handle *handle = NULL;
+    struct handle_to_path_ctx ctx = {
+        .root = root,
+    };
+
+    if (!may_decode_fh(&ctx, o_flags)) {
+        return -EPERM;
+    }
+
+    return do_handle_to_path(handle, path, &ctx);
 }

 static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
                int open_flag)
 {
     long retval = 0;
-    struct path path;
-    struct file *file;
-    int fd;
+    struct file_handle *handle __free(kfree) = NULL;
+    struct path __free(path_put) root = {};
+    struct path __free(path_put) path = {};
+    struct file *file = NULL;
+    int root_type;

-    retval = handle_to_path(mountdirfd, ufh, &path, open_flag);
+    root_type = get_path_from_fd(mountdirfd, &root);
+    if (root_type < 0)
+        return root_type;
+
+    retval = copy_handle_from_user(ufh, &handle);
     if (retval)
         return retval;

-    fd = get_unused_fd_flags(open_flag);
-    if (fd < 0) {
-        path_put(&path);
+    CLASS(get_unused_fd, fd)(open_flag);
+    if (fd < 0)
         return fd;
+
+    switch (root_type) {
+    case GET_PATH_FD_IS_NORMAL:
+        retval = handle_to_path(root, handle, &path, open_flag);
+        if (retval)
+            return retval;
+        file = file_open_root(&path, "", open_flag, 0);
+        if (IS_ERR(file)) {
+            return PTR_ERR(file);
+        }
+        break;
+
+    case GET_PATH_FD_IS_PIDFD:
+        file = pidfd_open_by_handle(
+            handle->f_handle, handle->handle_bytes >> 2, handle->handle_type,
+            open_flag);
+        if (IS_ERR(file))
+            return retval;
+        break;
     }
-    file = file_open_root(&path, "", open_flag, 0);
-    if (IS_ERR(file)) {
-        put_unused_fd(fd);
-        retval =  PTR_ERR(file);
-    } else {
-        retval = fd;
-        fd_install(fd, file);
-    }
-    path_put(&path);
-    return retval;
+
+    fd_install(fd, file);
+    return take_fd(fd);
 }

 /**
diff --git a/fs/pidfs.c b/fs/pidfs.c
index 0684a9b8fe71..65b72dc05380 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -453,22 +453,53 @@ static struct file_system_type pidfs_type = {
     .kill_sb        = kill_anon_super,
 };

-struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
+static struct file *__pidfs_alloc_file(struct pid *pid, unsigned int flags)
 {
-
     struct file *pidfd_file;
     struct path path;
     int ret;

-    ret = path_from_stashed(&pid->stashed, pidfs_mnt, get_pid(pid), &path);
+    ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
     if (ret < 0)
         return ERR_PTR(ret);

     pidfd_file = dentry_open(&path, flags, current_cred());
+    if (!IS_ERR(pidfd_file))
+        pidfd_file->f_flags |= (flags & PIDFD_THREAD);
     path_put(&path);
     return pidfd_file;
 }

+struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+    return __pidfs_alloc_file(get_pid(pid), flags);
+}
+
+struct file *pidfd_open_by_handle(void *gen_fid, int fh_len, int fh_type,
+                  unsigned int flags)
+{
+    bool thread = flags & PIDFD_THREAD;
+    struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
+    struct pid *pid;
+
+    if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
+        return ERR_PTR(-ESTALE);
+
+    scoped_guard(rcu) {
+        pid = find_pid_ns(fid->pid, &init_pid_ns);
+        if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0)
+            return ERR_PTR(-ESTALE);
+        if(!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
+            return ERR_PTR(-ESTALE);
+            /* Something better? EINVAL like pidfd_open wouldn't be
+               very obvious... */
+
+        pid = get_pid(pid);
+    }
+
+    return __pidfs_alloc_file(pid, flags);
+}
+
 void __init pidfs_init(void)
 {
     pidfs_mnt = kern_mount(&pidfs_type);
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 75bdf9807802..fba2654ae60f 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -3,6 +3,8 @@
 #define _LINUX_PID_FS_H

 struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
+struct file *pidfd_open_by_handle(void *gen_fid, int fh_len, int fh_type,
+                  unsigned int flags);
 void __init pidfs_init(void);

 #endif /* _LINUX_PID_FS_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 22f43721d031..fc47c76e4ff4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2015,11 +2015,6 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
         put_unused_fd(pidfd);
         return PTR_ERR(pidfd_file);
     }
-    /*
-     * anon_inode_getfile() ignores everything outside of the
-     * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
-     */
-    pidfd_file->f_flags |= (flags & PIDFD_THREAD);
     *ret = pidfd_file;
     return pidfd;
 }


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 3/3] pidfs: implement file handle support
  2024-11-14 21:52                 ` Erin Shepherd
@ 2024-11-15  7:50                   ` Amir Goldstein
  0 siblings, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-15  7:50 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs

On Thu, Nov 14, 2024 at 11:51 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> On 14/11/2024 15:13, Christian Brauner wrote:
>
> > On Thu, Nov 14, 2024 at 02:13:06PM +0100, Erin Shepherd wrote:
> >> These two concerns combined with the special flag make me wonder if pidfs
> >> is so much of a special snowflake we should just special case it up front
> >> and skip all of the shared handle decode logic?
> > Care to try a patch and see what it looks like?
>
> The following is a completely untested sketch on top of the existing patch series.
> Some notes:
>
> - I made heavy use of the cleanup macros. I'm happy to convert things back to
>   goto out_xx style if preferred - writing things this way just made bashing out
>   the code without dropping resources on the floor easier

Your cleanup is very welcome, just please! not in the same patch as refactoring
and logic changes. Please do these 3 different things in different commits.
This patch is unreviewable as far as I am concerned.

> - If you don't implement fh_to_dentry then name_to_handle_at will just return an error
>   unless called with AT_HANDLE_FID. We need to decide what to do about that

What's to decide? I did not understand the problem.

> - The GET_PATH_FD_IS_NORMAL/etc constants don't match (what I see as) usual kernel style
>   but I'm not sure how to conventionally express something like that

I believe the conventional way to express a custom operation is an
optional method.

For example:

static int exportfs_get_name(struct vfsmount *mnt, struct dentry *dir,
                char *name, struct dentry *child)
{
        const struct export_operations *nop = dir->d_sb->s_export_op;
        struct path path = {.mnt = mnt, .dentry = dir};

        if (nop->get_name)
                return nop->get_name(dir, name, child);
        else
                return get_name(&path, name, child);
}

There are plenty of optional custom inode, file, sb, dentry
operations with default fallback. some examples:

        if (dir_inode->i_op->atomic_open) {
                dentry = atomic_open(nd, dentry, file, open_flag, mode);

        if (!splice && file_out->f_op->copy_file_range) {
                ret = file_out->f_op->copy_file_range(file_in, pos_in,
                                                      file_out, pos_out,
                                                      len, flags);
        } else if (!splice && file_in->f_op->remap_file_range && samesb) {
                ret = file_in->f_op->remap_file_range(file_in, pos_in,

So I think the right model for you to follow is a custom optional
s_export_op->open_by_handle() operation.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 0/3] pidfs: implement file handle support
  2024-11-13 17:55       ` [PATCH v2 0/3] " Erin Shepherd
                           ` (2 preceding siblings ...)
  2024-11-13 17:55         ` [PATCH v2 3/3] pidfs: implement file handle support Erin Shepherd
@ 2024-11-14  7:02         ` Amir Goldstein
  2024-11-14 12:48           ` Erin Shepherd
  3 siblings, 1 reply; 57+ messages in thread
From: Amir Goldstein @ 2024-11-14  7:02 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs

On Wed, Nov 13, 2024 at 7:01 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
> Since the introduction of pidfs, we have had 64-bit process identifiers
> that will not be reused for the entire uptime of the system. This greatly
> facilitates process tracking in userspace.
>
> There are two limitations at present:
>
>  * These identifiers are currently only exposed to processes on 64-bit
>    systems. On 32-bit systems, inode space is also limited to 32 bits and
>    therefore is subject to the same reuse issues.
>  * There is no way to go from one of these unique identifiers to a pid or
>    pidfd.
>
> This patch implements fh_export and fh_to_dentry which enables userspace to
> convert PIDs to and from PID file handles. A process can convert a pidfd into
> a file handle using name_to_handle_at, store it (in memory, on disk, or
> elsewhere) and then convert it back into a pidfd suing open_by_handle_at.
>
> To support us going from a file handle to a pidfd, we have to store a pid
> inside the file handle. To ensure file handles are invariant and can move
> between pid namespaces, we stash a pid from the initial namespace inside
> the file handle.
>
>   (There has been some discussion as to whether or not it is OK to include
>   the PID in the initial pid namespace, but so far there hasn't been any
>   conclusive reason given as to why this would be a bad idea)

IIUC, this is already exposed as st_ino on a 64bit arch?
If that is the case, then there is certainly no new info leak in this patch.

>
> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> ---
> Changes in v2:
> - Permit filesystems to opt out of CAP_DAC_READ_SEARCH
> - Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
> - Squash fh_export & fh_to_dentry into one commit

Not sure why you did that.
It was pretty nice as separate commits if you ask me. Whatever.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 0/3] pidfs: implement file handle support
  2024-11-14  7:02         ` [PATCH v2 0/3] " Amir Goldstein
@ 2024-11-14 12:48           ` Erin Shepherd
  2024-11-14 14:27             ` Christian Brauner
  2024-11-14 16:10             ` [PATCH v2 0/3] pidfs: implement file handle support Amir Goldstein
  0 siblings, 2 replies; 57+ messages in thread
From: Erin Shepherd @ 2024-11-14 12:48 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs

   

On 14/11/2024 08:02, Amir Goldstein wrote:
> On Wed, Nov 13, 2024 at 7:01 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>> Since the introduction of pidfs, we have had 64-bit process identifiers
>> that will not be reused for the entire uptime of the system. This greatly
>> facilitates process tracking in userspace.
>>
>> There are two limitations at present:
>>
>>  * These identifiers are currently only exposed to processes on 64-bit
>>    systems. On 32-bit systems, inode space is also limited to 32 bits and
>>    therefore is subject to the same reuse issues.
>>  * There is no way to go from one of these unique identifiers to a pid or
>>    pidfd.
>>
>> This patch implements fh_export and fh_to_dentry which enables userspace to
>> convert PIDs to and from PID file handles. A process can convert a pidfd into
>> a file handle using name_to_handle_at, store it (in memory, on disk, or
>> elsewhere) and then convert it back into a pidfd suing open_by_handle_at.
>>
>> To support us going from a file handle to a pidfd, we have to store a pid
>> inside the file handle. To ensure file handles are invariant and can move
>> between pid namespaces, we stash a pid from the initial namespace inside
>> the file handle.
>>
>>   (There has been some discussion as to whether or not it is OK to include
>>   the PID in the initial pid namespace, but so far there hasn't been any
>>   conclusive reason given as to why this would be a bad idea)
> IIUC, this is already exposed as st_ino on a 64bit arch?
> If that is the case, then there is certainly no new info leak in this patch.

pid.ino is exposed, but the init-ns pid isn't exposed anywhere to my knowledge.

>> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
>> ---
>> Changes in v2:
>> - Permit filesystems to opt out of CAP_DAC_READ_SEARCH
>> - Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
>> - Squash fh_export & fh_to_dentry into one commit
> Not sure why you did that.
> It was pretty nice as separate commits if you ask me. Whatever.

I can revert that if you prefer. I squashed them because there was some churn
when adding the init-ns-pid necessary to restore them, but I am happy to do
things in two steps.

Do you prefer having the final handle format in the first step, or letting it
evolve into final form over the series?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 0/3] pidfs: implement file handle support
  2024-11-14 12:48           ` Erin Shepherd
@ 2024-11-14 14:27             ` Christian Brauner
  2024-11-28 12:33               ` [PATCH RFC 0/2] pidfs: file handle preliminaries Christian Brauner
  2024-11-14 16:10             ` [PATCH v2 0/3] pidfs: implement file handle support Amir Goldstein
  1 sibling, 1 reply; 57+ messages in thread
From: Christian Brauner @ 2024-11-14 14:27 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Amir Goldstein, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs

On Thu, Nov 14, 2024 at 01:48:02PM +0100, Erin Shepherd wrote:
>    
> 
> On 14/11/2024 08:02, Amir Goldstein wrote:
> > On Wed, Nov 13, 2024 at 7:01 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
> >> Since the introduction of pidfs, we have had 64-bit process identifiers
> >> that will not be reused for the entire uptime of the system. This greatly
> >> facilitates process tracking in userspace.
> >>
> >> There are two limitations at present:
> >>
> >>  * These identifiers are currently only exposed to processes on 64-bit
> >>    systems. On 32-bit systems, inode space is also limited to 32 bits and
> >>    therefore is subject to the same reuse issues.
> >>  * There is no way to go from one of these unique identifiers to a pid or
> >>    pidfd.
> >>
> >> This patch implements fh_export and fh_to_dentry which enables userspace to
> >> convert PIDs to and from PID file handles. A process can convert a pidfd into
> >> a file handle using name_to_handle_at, store it (in memory, on disk, or
> >> elsewhere) and then convert it back into a pidfd suing open_by_handle_at.
> >>
> >> To support us going from a file handle to a pidfd, we have to store a pid
> >> inside the file handle. To ensure file handles are invariant and can move
> >> between pid namespaces, we stash a pid from the initial namespace inside
> >> the file handle.
> >>
> >>   (There has been some discussion as to whether or not it is OK to include
> >>   the PID in the initial pid namespace, but so far there hasn't been any
> >>   conclusive reason given as to why this would be a bad idea)
> > IIUC, this is already exposed as st_ino on a 64bit arch?
> > If that is the case, then there is certainly no new info leak in this patch.
> 
> pid.ino is exposed, but the init-ns pid isn't exposed anywhere to my knowledge.

I see what you mean. That might be an information leak. Not a very
interesting one, I think but I need to think about it.

> 
> >> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> >> ---
> >> Changes in v2:
> >> - Permit filesystems to opt out of CAP_DAC_READ_SEARCH
> >> - Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
> >> - Squash fh_export & fh_to_dentry into one commit
> > Not sure why you did that.
> > It was pretty nice as separate commits if you ask me. Whatever.
> 
> I can revert that if you prefer. I squashed them because there was some churn
> when adding the init-ns-pid necessary to restore them, but I am happy to do
> things in two steps.
> 
> Do you prefer having the final handle format in the first step, or letting it
> evolve into final form over the series?
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH RFC 0/2] pidfs: file handle preliminaries
  2024-11-14 14:27             ` Christian Brauner
@ 2024-11-28 12:33               ` Christian Brauner
  2024-11-28 12:33                 ` [PATCH RFC 1/2] pidfs: rework inode number allocation Christian Brauner
                                   ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Christian Brauner @ 2024-11-28 12:33 UTC (permalink / raw)
  To: Erin Shepherd, Amir Goldstein, Jeff Layton
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs

Hey,

This reworks the inode number allocation for pidfs in order to support
file handles properly.

Recently we received a patchset that aims to enable file handle encoding
and decoding via name_to_handle_at(2) and open_by_handle_at(2).

A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts. The
issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2). This can be used by containers using a separate
pid namespace to learn what the pid number of a given process in the
initial pid namespace is. While this is a weak information leak it could
be used in various exploits and in general is an ugly wart in the
design.

To solve this problem a new way is needed to lookup a struct pid based
on the inode number allocated for that struct pid. The other part is to
remove the custom inode number allocation on 32bit systems that is also
an ugly wart that should go away.

So, a new scheme is used that I was discusssing with Tejun some time
back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
are used for the generation number. This gives a 64 bit inode number
that is unique on both 32 bit and 64 bit. The lower 32 bit number is
recycled slowly and can be used to lookup struct pids.

So after applying the pidfs file handle series at
https://lore.kernel.org/r/20241101135452.19359-1-erin.shepherd@e43.eu on
top of the patches here we should be able to simplify encoding and
decoding to something like:

diff --git a/fs/pidfs.c b/fs/pidfs.c
index e71294d3d607..a38b833a2d38 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -78,7 +78,7 @@ void pidfs_remove_pid(struct pid *pid)
 }

 /* Find a struct pid based on the inode number. */
-static __maybe_unused struct pid *pidfs_ino_get_pid(u64 ino)
+static struct pid *pidfs_ino_get_pid(u64 ino)
 {
 	ino_t pid_ino = pidfs_ino(ino);
 	u32 gen = pidfs_gen(ino);
@@ -475,49 +475,37 @@ static const struct dentry_operations pidfs_dentry_operations = {
 	.d_prune	= stashed_dentry_prune,
 };

-#define PIDFD_FID_LEN 3
-
-struct pidfd_fid {
-	u64 ino;
-	s32 pid;
-} __packed;
-
-static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
+static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
 			   struct inode *parent)
 {
 	struct pid *pid = inode->i_private;
-	struct pidfd_fid *fid = (struct pidfd_fid *)fh;

-	if (*max_len < PIDFD_FID_LEN) {
-		*max_len = PIDFD_FID_LEN;
+	if (*max_len < 2) {
+		*max_len = 2;
 		return FILEID_INVALID;
 	}

-	fid->ino = pid->ino;
-	fid->pid = pid_nr(pid);
-	*max_len = PIDFD_FID_LEN;
+	*max_len = 2;
+	*(u64 *)fh = pid->ino;
 	return FILEID_INO64_GEN;
 }

 static struct dentry *pidfs_fh_to_dentry(struct super_block *sb,
-					 struct fid *gen_fid,
+					 struct fid *fid,
 					 int fh_len, int fh_type)
 {
 	int ret;
 	struct path path;
-	struct pidfd_fid *fid = (struct pidfd_fid *)gen_fid;
 	struct pid *pid;
+	u64 pid_ino;

-	if (fh_type != FILEID_INO64_GEN || fh_len < PIDFD_FID_LEN)
+	if (fh_type != FILEID_INO64_GEN || fh_len < 2)
 		return NULL;

-	scoped_guard(rcu) {
-		pid = find_pid_ns(fid->pid, &init_pid_ns);
-		if (!pid || pid->ino != fid->ino || pid_vnr(pid) == 0)
-			return NULL;
-
-		pid = get_pid(pid);
-	}
+	pid_ino = *(u64 *)fid;
+	pid = pidfs_ino_get_pid(pid_ino);
+	if (!pid)
+		return NULL;

 	ret = path_from_stashed(&pid->stashed, pidfs_mnt, pid, &path);
 	if (ret < 0)

Thanks!
Christian

---
Christian Brauner (2):
      pidfs: rework inode number allocation
      pidfs: remove 32bit inode number handling

 fs/pidfs.c            | 138 +++++++++++++++++++++++++++++++++++---------------
 include/linux/pidfs.h |   2 +
 kernel/pid.c          |  11 ++--
 3 files changed, 103 insertions(+), 48 deletions(-)
---
base-commit: b86545e02e8c22fb89218f29d381fa8e8b91d815
change-id: 20241128-work-pidfs-2bd42c7ea772

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH RFC 1/2] pidfs: rework inode number allocation
  2024-11-28 12:33               ` [PATCH RFC 0/2] pidfs: file handle preliminaries Christian Brauner
@ 2024-11-28 12:33                 ` Christian Brauner
  2024-11-28 17:19                   ` Amir Goldstein
  2024-11-28 12:33                 ` [PATCH RFC 2/2] pidfs: remove 32bit inode number handling Christian Brauner
  2024-11-28 17:06                 ` [PATCH RFC 0/2] pidfs: file handle preliminaries Amir Goldstein
  2 siblings, 1 reply; 57+ messages in thread
From: Christian Brauner @ 2024-11-28 12:33 UTC (permalink / raw)
  To: Erin Shepherd, Amir Goldstein, Jeff Layton
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs

Recently we received a patchset that aims to enable file handle encoding
and decoding via name_to_handle_at(2) and open_by_handle_at(2).

A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts. The
issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2). This can be used by containers using a separate
pid namespace to learn what the pid number of a given process in the
initial pid namespace is. While this is a weak information leak it could
be used in various exploits and in general is an ugly wart in the design.

To solve this problem a new way is needed to lookup a struct pid based
on the inode number allocated for that struct pid. The other part is to
remove the custom inode number allocation on 32bit systems that is also
an ugly wart that should go away.

So, a new scheme is used that I was discusssing with Tejun some time
back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
are used for the generation number. This gives a 64 bit inode number
that is unique on both 32 bit and 64 bit. The lower 32 bit number is
recycled slowly and can be used to lookup struct pids.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c            | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pidfs.h |  2 ++
 kernel/pid.c          | 11 +++---
 3 files changed, 98 insertions(+), 7 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 618abb1fa1b84cf31282c922374e28d60cd49d00..09a0c8ac805301927a94758b3f7d1e513826daf9 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -23,6 +23,88 @@
 #include "internal.h"
 #include "mount.h"
 
+static u32 pidfs_ino_highbits;
+static u32 pidfs_ino_last_ino_lowbits;
+
+static DEFINE_IDR(pidfs_ino_idr);
+
+static inline ino_t pidfs_ino(u64 ino)
+{
+	/* On 32 bit low 32 bits are the inode. */
+	if (sizeof(ino_t) < sizeof(u64))
+		return (u32)ino;
+
+	/* On 64 bit simply return ino. */
+	return ino;
+}
+
+static inline u32 pidfs_gen(u64 ino)
+{
+	/* On 32 bit the generation number are the upper 32 bits. */
+	if (sizeof(ino_t) < sizeof(u64))
+		return ino >> 32;
+
+	/* On 64 bit the generation number is 1. */
+	return 1;
+}
+
+/*
+ * Construct an inode number for struct pid in a way that we can use the
+ * lower 32bit to lookup struct pid independent of any pid numbers that
+ * could be leaked into userspace (e.g., via file handle encoding).
+ */
+int pidfs_add_pid(struct pid *pid)
+{
+	u32 ino_highbits;
+	int ret;
+
+	ret = idr_alloc_cyclic(&pidfs_ino_idr, pid, 1, 0, GFP_ATOMIC);
+	if (ret >= 0 && ret < pidfs_ino_last_ino_lowbits)
+		pidfs_ino_highbits++;
+	ino_highbits = pidfs_ino_highbits;
+	pidfs_ino_last_ino_lowbits = ret;
+	if (ret < 0)
+		return ret;
+
+	pid->ino = (u64)ino_highbits << 32 | ret;
+	pid->stashed = NULL;
+	return 0;
+}
+
+void pidfs_remove_pid(struct pid *pid)
+{
+	idr_remove(&pidfs_ino_idr, (u32)pidfs_ino(pid->ino));
+}
+
+/* Find a struct pid based on the inode number. */
+static __maybe_unused struct pid *pidfs_ino_get_pid(u64 ino)
+{
+	ino_t pid_ino = pidfs_ino(ino);
+	u32 gen = pidfs_gen(ino);
+	struct pid *pid;
+
+	guard(rcu)();
+
+	/* Handle @pid lookup carefully so there's no risk of UAF. */
+	pid = idr_find(&pidfs_ino_idr, (u32)ino);
+	if (!pid)
+		return NULL;
+
+	if (sizeof(ino_t) < sizeof(u64)) {
+		if (gen && pidfs_gen(pid->ino) != gen)
+			pid = NULL;
+	} else {
+		if (pidfs_ino(pid->ino) != pid_ino)
+			pid = NULL;
+	}
+
+	/* Within our pid namespace hierarchy? */
+	if (pid_vnr(pid) == 0)
+		pid = NULL;
+
+	return get_pid(pid);
+}
+
 #ifdef CONFIG_PROC_FS
 /**
  * pidfd_show_fdinfo - print information about a pidfd
@@ -491,6 +573,16 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
 
 void __init pidfs_init(void)
 {
+	/*
+	 * On 32 bit systems the lower 32 bits are the inode number and
+	 * the higher 32 bits are the generation number. The starting
+	 * value for the inode number and the generation number is one.
+	 */
+	if (sizeof(ino_t) < sizeof(u64))
+		pidfs_ino_highbits = 1;
+	else
+		pidfs_ino_highbits = 0;
+
 	pidfs_mnt = kern_mount(&pidfs_type);
 	if (IS_ERR(pidfs_mnt))
 		panic("Failed to mount pidfs pseudo filesystem");
diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
index 75bdf9807802a5d1a9699c99aa42648c2bd34170..2958652bb108b8a2e02128e17317be4545b40a01 100644
--- a/include/linux/pidfs.h
+++ b/include/linux/pidfs.h
@@ -4,5 +4,7 @@
 
 struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
 void __init pidfs_init(void);
+int pidfs_add_pid(struct pid *pid);
+void pidfs_remove_pid(struct pid *pid);
 
 #endif /* _LINUX_PID_FS_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index 115448e89c3e9e664d0d51c8d853e8167ba0540c..9f321c6456d24af705c28f0256ca4de771f5e681 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -64,11 +64,6 @@ int pid_max = PID_MAX_DEFAULT;
 
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
-/*
- * Pseudo filesystems start inode numbering after one. We use Reserved
- * PIDs as a natural offset.
- */
-static u64 pidfs_ino = RESERVED_PIDS;
 
 /*
  * PID-map pages start out as NULL, they get allocated upon
@@ -157,6 +152,7 @@ void free_pid(struct pid *pid)
 		}
 
 		idr_remove(&ns->idr, upid->nr);
+		pidfs_remove_pid(pid);
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
@@ -276,8 +272,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	spin_lock_irq(&pidmap_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
-	pid->stashed = NULL;
-	pid->ino = ++pidfs_ino;
+	retval = pidfs_add_pid(pid);
+	if (retval)
+		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
 		/* Make the PID visible to find_pid_ns. */
 		idr_replace(&upid->ns->idr, pid, upid->nr);

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH RFC 1/2] pidfs: rework inode number allocation
  2024-11-28 12:33                 ` [PATCH RFC 1/2] pidfs: rework inode number allocation Christian Brauner
@ 2024-11-28 17:19                   ` Amir Goldstein
  0 siblings, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-28 17:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Erin Shepherd, Jeff Layton, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs

On Thu, Nov 28, 2024 at 1:34 PM Christian Brauner <brauner@kernel.org> wrote:
>
> Recently we received a patchset that aims to enable file handle encoding
> and decoding via name_to_handle_at(2) and open_by_handle_at(2).
>
> A crucical step in the patch series is how to go from inode number to
> struct pid without leaking information into unprivileged contexts. The
> issue is that in order to find a struct pid the pid number in the
> initial pid namespace must be encoded into the file handle via
> name_to_handle_at(2). This can be used by containers using a separate
> pid namespace to learn what the pid number of a given process in the
> initial pid namespace is. While this is a weak information leak it could
> be used in various exploits and in general is an ugly wart in the design.
>
> To solve this problem a new way is needed to lookup a struct pid based
> on the inode number allocated for that struct pid. The other part is to
> remove the custom inode number allocation on 32bit systems that is also
> an ugly wart that should go away.
>
> So, a new scheme is used that I was discusssing with Tejun some time
> back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
> are used for the generation number. This gives a 64 bit inode number
> that is unique on both 32 bit and 64 bit. The lower 32 bit number is
> recycled slowly and can be used to lookup struct pids.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/pidfs.c            | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pidfs.h |  2 ++
>  kernel/pid.c          | 11 +++---
>  3 files changed, 98 insertions(+), 7 deletions(-)
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 618abb1fa1b84cf31282c922374e28d60cd49d00..09a0c8ac805301927a94758b3f7d1e513826daf9 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -23,6 +23,88 @@
>  #include "internal.h"
>  #include "mount.h"
>
> +static u32 pidfs_ino_highbits;
> +static u32 pidfs_ino_last_ino_lowbits;
> +
> +static DEFINE_IDR(pidfs_ino_idr);
> +
> +static inline ino_t pidfs_ino(u64 ino)
> +{
> +       /* On 32 bit low 32 bits are the inode. */
> +       if (sizeof(ino_t) < sizeof(u64))
> +               return (u32)ino;
> +
> +       /* On 64 bit simply return ino. */
> +       return ino;
> +}
> +
> +static inline u32 pidfs_gen(u64 ino)
> +{
> +       /* On 32 bit the generation number are the upper 32 bits. */
> +       if (sizeof(ino_t) < sizeof(u64))
> +               return ino >> 32;
> +
> +       /* On 64 bit the generation number is 1. */
> +       return 1;
> +}
> +
> +/*
> + * Construct an inode number for struct pid in a way that we can use the
> + * lower 32bit to lookup struct pid independent of any pid numbers that
> + * could be leaked into userspace (e.g., via file handle encoding).
> + */
> +int pidfs_add_pid(struct pid *pid)
> +{
> +       u32 ino_highbits;
> +       int ret;
> +
> +       ret = idr_alloc_cyclic(&pidfs_ino_idr, pid, 1, 0, GFP_ATOMIC);
> +       if (ret >= 0 && ret < pidfs_ino_last_ino_lowbits)
> +               pidfs_ino_highbits++;
> +       ino_highbits = pidfs_ino_highbits;
> +       pidfs_ino_last_ino_lowbits = ret;
> +       if (ret < 0)
> +               return ret;
> +

This code looks "too useful" to be in a random filesystem.
Maybe work a generation counter into idr_alloc_cyclic or
just let it let the caller know that the range has been cycled?
Only if this is not too complicated to do.

> +       pid->ino = (u64)ino_highbits << 32 | ret;
> +       pid->stashed = NULL;
> +       return 0;
> +}
> +
> +void pidfs_remove_pid(struct pid *pid)
> +{
> +       idr_remove(&pidfs_ino_idr, (u32)pidfs_ino(pid->ino));
> +}
> +
> +/* Find a struct pid based on the inode number. */
> +static __maybe_unused struct pid *pidfs_ino_get_pid(u64 ino)
> +{
> +       ino_t pid_ino = pidfs_ino(ino);
> +       u32 gen = pidfs_gen(ino);
> +       struct pid *pid;
> +
> +       guard(rcu)();
> +
> +       /* Handle @pid lookup carefully so there's no risk of UAF. */
> +       pid = idr_find(&pidfs_ino_idr, (u32)ino);
> +       if (!pid)
> +               return NULL;
> +
> +       if (sizeof(ino_t) < sizeof(u64)) {
> +               if (gen && pidfs_gen(pid->ino) != gen)
> +                       pid = NULL;
> +       } else {
> +               if (pidfs_ino(pid->ino) != pid_ino)
> +                       pid = NULL;
> +       }
> +
> +       /* Within our pid namespace hierarchy? */
> +       if (pid_vnr(pid) == 0)
> +               pid = NULL;
> +
> +       return get_pid(pid);
> +}
> +
>  #ifdef CONFIG_PROC_FS
>  /**
>   * pidfd_show_fdinfo - print information about a pidfd
> @@ -491,6 +573,16 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
>
>  void __init pidfs_init(void)
>  {
> +       /*
> +        * On 32 bit systems the lower 32 bits are the inode number and
> +        * the higher 32 bits are the generation number. The starting
> +        * value for the inode number and the generation number is one.
> +        */
> +       if (sizeof(ino_t) < sizeof(u64))
> +               pidfs_ino_highbits = 1;
> +       else
> +               pidfs_ino_highbits = 0;
> +
>         pidfs_mnt = kern_mount(&pidfs_type);
>         if (IS_ERR(pidfs_mnt))
>                 panic("Failed to mount pidfs pseudo filesystem");
> diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
> index 75bdf9807802a5d1a9699c99aa42648c2bd34170..2958652bb108b8a2e02128e17317be4545b40a01 100644
> --- a/include/linux/pidfs.h
> +++ b/include/linux/pidfs.h
> @@ -4,5 +4,7 @@
>
>  struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
>  void __init pidfs_init(void);
> +int pidfs_add_pid(struct pid *pid);
> +void pidfs_remove_pid(struct pid *pid);
>
>  #endif /* _LINUX_PID_FS_H */
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 115448e89c3e9e664d0d51c8d853e8167ba0540c..9f321c6456d24af705c28f0256ca4de771f5e681 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -64,11 +64,6 @@ int pid_max = PID_MAX_DEFAULT;
>
>  int pid_max_min = RESERVED_PIDS + 1;
>  int pid_max_max = PID_MAX_LIMIT;
> -/*
> - * Pseudo filesystems start inode numbering after one. We use Reserved
> - * PIDs as a natural offset.
> - */
> -static u64 pidfs_ino = RESERVED_PIDS;
>
>  /*
>   * PID-map pages start out as NULL, they get allocated upon
> @@ -157,6 +152,7 @@ void free_pid(struct pid *pid)
>                 }
>
>                 idr_remove(&ns->idr, upid->nr);
> +               pidfs_remove_pid(pid);
>         }
>         spin_unlock_irqrestore(&pidmap_lock, flags);
>
> @@ -276,8 +272,9 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>         spin_lock_irq(&pidmap_lock);
>         if (!(ns->pid_allocated & PIDNS_ADDING))
>                 goto out_unlock;
> -       pid->stashed = NULL;
> -       pid->ino = ++pidfs_ino;
> +       retval = pidfs_add_pid(pid);

I hope this does not create a scalability issue compared to the prev method?

> +       if (retval)
> +               goto out_unlock;
>         for ( ; upid >= pid->numbers; --upid) {
>                 /* Make the PID visible to find_pid_ns. */
>                 idr_replace(&upid->ns->idr, pid, upid->nr);
>
> --
> 2.45.2
>

Nice idea!
If there is something wrong with it, then I could not find it ;)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH RFC 2/2] pidfs: remove 32bit inode number handling
  2024-11-28 12:33               ` [PATCH RFC 0/2] pidfs: file handle preliminaries Christian Brauner
  2024-11-28 12:33                 ` [PATCH RFC 1/2] pidfs: rework inode number allocation Christian Brauner
@ 2024-11-28 12:33                 ` Christian Brauner
  2024-11-28 17:06                 ` [PATCH RFC 0/2] pidfs: file handle preliminaries Amir Goldstein
  2 siblings, 0 replies; 57+ messages in thread
From: Christian Brauner @ 2024-11-28 12:33 UTC (permalink / raw)
  To: Erin Shepherd, Amir Goldstein, Jeff Layton
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs

Now that we have a unified inode number handling model remove the custom
ida-based allocation for 32bit.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c | 46 +++++-----------------------------------------
 1 file changed, 5 insertions(+), 41 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 09a0c8ac805301927a94758b3f7d1e513826daf9..334d2fc3f01ee6f219dc3e52309bfe00726885ca 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -400,40 +400,6 @@ struct pid *pidfd_pid(const struct file *file)
 
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
-#if BITS_PER_LONG == 32
-/*
- * Provide a fallback mechanism for 32-bit systems so processes remain
- * reliably comparable by inode number even on those systems.
- */
-static DEFINE_IDA(pidfd_inum_ida);
-
-static int pidfs_inum(struct pid *pid, unsigned long *ino)
-{
-	int ret;
-
-	ret = ida_alloc_range(&pidfd_inum_ida, RESERVED_PIDS + 1,
-			      UINT_MAX, GFP_ATOMIC);
-	if (ret < 0)
-		return -ENOSPC;
-
-	*ino = ret;
-	return 0;
-}
-
-static inline void pidfs_free_inum(unsigned long ino)
-{
-	if (ino > 0)
-		ida_free(&pidfd_inum_ida, ino);
-}
-#else
-static inline int pidfs_inum(struct pid *pid, unsigned long *ino)
-{
-	*ino = pid->ino;
-	return 0;
-}
-#define pidfs_free_inum(ino) ((void)(ino))
-#endif
-
 /*
  * The vfs falls back to simple_setattr() if i_op->setattr() isn't
  * implemented. Let's reject it completely until we have a clean
@@ -485,7 +451,6 @@ static void pidfs_evict_inode(struct inode *inode)
 
 	clear_inode(inode);
 	put_pid(pid);
-	pidfs_free_inum(inode->i_ino);
 }
 
 static const struct super_operations pidfs_sops = {
@@ -511,17 +476,16 @@ static const struct dentry_operations pidfs_dentry_operations = {
 
 static int pidfs_init_inode(struct inode *inode, void *data)
 {
+	struct pid *pid = data;
+
 	inode->i_private = data;
 	inode->i_flags |= S_PRIVATE;
 	inode->i_mode |= S_IRWXU;
 	inode->i_op = &pidfs_inode_operations;
 	inode->i_fop = &pidfs_file_operations;
-	/*
-	 * Inode numbering for pidfs start at RESERVED_PIDS + 1. This
-	 * avoids collisions with the root inode which is 1 for pseudo
-	 * filesystems.
-	 */
-	return pidfs_inum(data, &inode->i_ino);
+	inode->i_ino = pidfs_ino(pid->ino);
+	inode->i_generation = pidfs_gen(pid->ino);
+	return 0;
 }
 
 static void pidfs_put_data(void *data)

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH RFC 0/2] pidfs: file handle preliminaries
  2024-11-28 12:33               ` [PATCH RFC 0/2] pidfs: file handle preliminaries Christian Brauner
  2024-11-28 12:33                 ` [PATCH RFC 1/2] pidfs: rework inode number allocation Christian Brauner
  2024-11-28 12:33                 ` [PATCH RFC 2/2] pidfs: remove 32bit inode number handling Christian Brauner
@ 2024-11-28 17:06                 ` Amir Goldstein
  2 siblings, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-28 17:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Erin Shepherd, Jeff Layton, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs

On Thu, Nov 28, 2024 at 1:34 PM Christian Brauner <brauner@kernel.org> wrote:
>
> Hey,
>
> This reworks the inode number allocation for pidfs in order to support
> file handles properly.
>
> Recently we received a patchset that aims to enable file handle encoding
> and decoding via name_to_handle_at(2) and open_by_handle_at(2).
>
> A crucical step in the patch series is how to go from inode number to
> struct pid without leaking information into unprivileged contexts. The
> issue is that in order to find a struct pid the pid number in the
> initial pid namespace must be encoded into the file handle via
> name_to_handle_at(2). This can be used by containers using a separate
> pid namespace to learn what the pid number of a given process in the
> initial pid namespace is. While this is a weak information leak it could
> be used in various exploits and in general is an ugly wart in the
> design.
>
> To solve this problem a new way is needed to lookup a struct pid based
> on the inode number allocated for that struct pid. The other part is to
> remove the custom inode number allocation on 32bit systems that is also
> an ugly wart that should go away.
>
> So, a new scheme is used that I was discusssing with Tejun some time
> back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
> are used for the generation number. This gives a 64 bit inode number
> that is unique on both 32 bit and 64 bit. The lower 32 bit number is
> recycled slowly and can be used to lookup struct pids.
>
> So after applying the pidfs file handle series at
> https://lore.kernel.org/r/20241101135452.19359-1-erin.shepherd@e43.eu on
> top of the patches here we should be able to simplify encoding and
> decoding to something like:
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index e71294d3d607..a38b833a2d38 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -78,7 +78,7 @@ void pidfs_remove_pid(struct pid *pid)
>  }
>
>  /* Find a struct pid based on the inode number. */
> -static __maybe_unused struct pid *pidfs_ino_get_pid(u64 ino)
> +static struct pid *pidfs_ino_get_pid(u64 ino)
>  {
>         ino_t pid_ino = pidfs_ino(ino);
>         u32 gen = pidfs_gen(ino);
> @@ -475,49 +475,37 @@ static const struct dentry_operations pidfs_dentry_operations = {
>         .d_prune        = stashed_dentry_prune,
>  };
>
> -#define PIDFD_FID_LEN 3
> -
> -struct pidfd_fid {
> -       u64 ino;
> -       s32 pid;
> -} __packed;
> -
> -static int pidfs_encode_fh(struct inode *inode, u32 *fh, int *max_len,
> +static int pidfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
>                            struct inode *parent)
>  {
>         struct pid *pid = inode->i_private;
> -       struct pidfd_fid *fid = (struct pidfd_fid *)fh;
>
> -       if (*max_len < PIDFD_FID_LEN) {
> -               *max_len = PIDFD_FID_LEN;
> +       if (*max_len < 2) {
> +               *max_len = 2;
>                 return FILEID_INVALID;
>         }
>
> -       fid->ino = pid->ino;
> -       fid->pid = pid_nr(pid);
> -       *max_len = PIDFD_FID_LEN;
> +       *max_len = 2;
> +       *(u64 *)fh = pid->ino;
>         return FILEID_INO64_GEN;

Semantic remark:
        /*
         * 64 bit inode number, 32 bit generation number.
         */
        FILEID_INO64_GEN = 0x81,

filesystems are free to abuse the constants and return whatever id they want
(e.g. shmem_encode_fh()), but if you want to play by the rules, this would be
either:
        /*
         * 64 bit unique kernfs id
         */
        FILEID_KERNFS = 0xfe,

or:
        /*
         * 32bit inode number, 32 bit generation number.
         */
        FILEID_INO32_GEN = 1,

which is at least sometimes correct.
or define:
        /*
         * 64 bit inode number.
         */
        FILEID_INO64 = 0x80,


Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 0/3] pidfs: implement file handle support
  2024-11-14 12:48           ` Erin Shepherd
  2024-11-14 14:27             ` Christian Brauner
@ 2024-11-14 16:10             ` Amir Goldstein
  1 sibling, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-14 16:10 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Christian Brauner, Alexander Viro, Jan Kara, Chuck Lever,
	linux-fsdevel, linux-kernel, Jeff Layton, linux-nfs

On Thu, Nov 14, 2024 at 4:51 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
>
>
>
> On 14/11/2024 08:02, Amir Goldstein wrote:
> > On Wed, Nov 13, 2024 at 7:01 PM Erin Shepherd <erin.shepherd@e43.eu> wrote:
> >> Since the introduction of pidfs, we have had 64-bit process identifiers
> >> that will not be reused for the entire uptime of the system. This greatly
> >> facilitates process tracking in userspace.
> >>
> >> There are two limitations at present:
> >>
> >>  * These identifiers are currently only exposed to processes on 64-bit
> >>    systems. On 32-bit systems, inode space is also limited to 32 bits and
> >>    therefore is subject to the same reuse issues.
> >>  * There is no way to go from one of these unique identifiers to a pid or
> >>    pidfd.
> >>
> >> This patch implements fh_export and fh_to_dentry which enables userspace to
> >> convert PIDs to and from PID file handles. A process can convert a pidfd into
> >> a file handle using name_to_handle_at, store it (in memory, on disk, or
> >> elsewhere) and then convert it back into a pidfd suing open_by_handle_at.
> >>
> >> To support us going from a file handle to a pidfd, we have to store a pid
> >> inside the file handle. To ensure file handles are invariant and can move
> >> between pid namespaces, we stash a pid from the initial namespace inside
> >> the file handle.
> >>
> >>   (There has been some discussion as to whether or not it is OK to include
> >>   the PID in the initial pid namespace, but so far there hasn't been any
> >>   conclusive reason given as to why this would be a bad idea)
> > IIUC, this is already exposed as st_ino on a 64bit arch?
> > If that is the case, then there is certainly no new info leak in this patch.
>
> pid.ino is exposed, but the init-ns pid isn't exposed anywhere to my knowledge.
>
> >> Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
> >> ---
> >> Changes in v2:
> >> - Permit filesystems to opt out of CAP_DAC_READ_SEARCH
> >> - Inline find_pid_ns/get_pid logic; remove unnecessary put_pid
> >> - Squash fh_export & fh_to_dentry into one commit
> > Not sure why you did that.
> > It was pretty nice as separate commits if you ask me. Whatever.
>
> I can revert that if you prefer. I squashed them because there was some churn
> when adding the init-ns-pid necessary to restore them, but I am happy to do
> things in two steps.
>
> Do you prefer having the final handle format in the first step, or letting it
> evolve into final form over the series?
>

I don't really mind. Was just wondering. Either way is fine.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-12 13:10 ` [PATCH 0/4] pidfs: implement file handle support Christian Brauner
  2024-11-12 13:57   ` Jeff Layton
@ 2024-11-12 23:03   ` Erin Shepherd
  2024-11-13  0:40     ` Darrick J. Wong
  1 sibling, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-12 23:03 UTC (permalink / raw)
  To: Christian Brauner, Jeff Layton, Amir Goldstein
  Cc: linux-kernel, linux-fsdevel, christian, paul, bluca, Chuck Lever


On 12/11/2024 14:10, Christian Brauner wrote:
> Sorry for the delayed reply (I'm recovering from a lengthy illness.).
No worries on my part, and I hope you're feeling better!
> I like the idea in general. I think this is really useful. A few of my
> thoughts but I need input from Amir and Jeff:
>
> * In the last patch of the series you already implement decoding of
>   pidfd file handles by adding a .fh_to_dentry export_operations method.
>
>   There are a few things to consider because of how open_by_handle_at()
>   works.
>
>   - open_by_handle_at() needs to be restricted so it only creates pidfds
>     from pidfs file handles that resolve to a struct pid that is
>     reachable in the caller's pid namespace. In other words, it should
>     mirror pidfd_open().
>
>     Put another way, open_by_handle_at() must not be usable to open
>     arbitrary pids to prevent a container from constructing a pidfd file
>     handle for a process that lives outside it's pid namespace
>     hierarchy.
>
>     With this restriction in place open_by_handle_at() can be available
>     to let unprivileged processes open pidfd file handles.
>
>     Related to that, I don't think we need to make open_by_handle_at()
>     open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply
>     because any process in the initial pid namespace can open any other
>     process via pidfd_open() anyway because pid namespaces are
>     hierarchical.
>
>     IOW, CAP_DAC_READ_SEARCH must not override the restriction that the
>     provided pidfs file handle must be reachable from the caller's pid
>     namespace.

The pid_vnr(pid) == 0 check catches this case -- we return an error to the
caller if there isn't a pid mapping in the caller's namespace

Perhaps I should have called this out explicitly.

>   - open_by_handle_at() uses may_decode_fh() to determine whether it's
>     possible to decode a file handle as an unprivileged user. The
>     current checks don't make sense for pidfs. Conceptually, I think
>     there don't need to place any restrictions based on global
>     CAP_DAC_READ_SEARCH, owning user namespace of the superblock or
>     mount on pidfs file handles.
>
>     The only restriction that matters is that the requested pidfs file
>     handle is reachable from the caller's pid namespace.

I wonder if this could be handled through an addition to export_operations'
flags member?

>   - A pidfd always has exactly a single inode and a single dentry.
>     There's no aliases.
>
>   - Generally, in my naive opinion, I think that decoding pidfs file
>     handles should be a lot simpler than decoding regular path based
>     file handles. Because there should be no need to verify any
>     ancestors, or reconnect paths. Pidfs also doesn't have directory
>     inodes, only regular inodes. In other words, any dentry is
>     acceptable.
>
>     Essentially, the only thing we need is for exportfs_decode_fh_raw()
>     to verify that the provided pidfs file handle is resolvable in the
>     caller's pid namespace. If so we're done. The challenge is how to
>     nicely plumb this into the code without it sticking out like a sore
>     thumb.

Theoretically you should be able to use PIDFD_SELF as well (assuming that
makes its way into mainline this release :-)) but I am a bit concerned about
potentially polluting the open_by_handle_at logic with pidfd specificities.

>   - Pidfs should not be exportable via NFS. It doesn't make sense.

Hmm, I guess I might have made that possible, though I'm certainly not
familiar enough with the internals of nfsd to be able to test if I've done
so.

I guess probably this case calls for another export_ops flag? Not like we're
short on them

Thanks,
    - Erin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-12 23:03   ` [PATCH 0/4] " Erin Shepherd
@ 2024-11-13  0:40     ` Darrick J. Wong
  2024-11-13 10:17       ` Erin Shepherd
  0 siblings, 1 reply; 57+ messages in thread
From: Darrick J. Wong @ 2024-11-13  0:40 UTC (permalink / raw)
  To: Erin Shepherd
  Cc: Christian Brauner, Jeff Layton, Amir Goldstein, linux-kernel,
	linux-fsdevel, christian, paul, bluca, Chuck Lever

On Wed, Nov 13, 2024 at 12:03:08AM +0100, Erin Shepherd wrote:
> 
> On 12/11/2024 14:10, Christian Brauner wrote:
> > Sorry for the delayed reply (I'm recovering from a lengthy illness.).
> No worries on my part, and I hope you're feeling better!
> > I like the idea in general. I think this is really useful. A few of my
> > thoughts but I need input from Amir and Jeff:
> >
> > * In the last patch of the series you already implement decoding of
> >   pidfd file handles by adding a .fh_to_dentry export_operations method.
> >
> >   There are a few things to consider because of how open_by_handle_at()
> >   works.
> >
> >   - open_by_handle_at() needs to be restricted so it only creates pidfds
> >     from pidfs file handles that resolve to a struct pid that is
> >     reachable in the caller's pid namespace. In other words, it should
> >     mirror pidfd_open().
> >
> >     Put another way, open_by_handle_at() must not be usable to open
> >     arbitrary pids to prevent a container from constructing a pidfd file
> >     handle for a process that lives outside it's pid namespace
> >     hierarchy.
> >
> >     With this restriction in place open_by_handle_at() can be available
> >     to let unprivileged processes open pidfd file handles.
> >
> >     Related to that, I don't think we need to make open_by_handle_at()
> >     open arbitrary pidfd file handles via CAP_DAC_READ_SEARCH. Simply
> >     because any process in the initial pid namespace can open any other
> >     process via pidfd_open() anyway because pid namespaces are
> >     hierarchical.
> >
> >     IOW, CAP_DAC_READ_SEARCH must not override the restriction that the
> >     provided pidfs file handle must be reachable from the caller's pid
> >     namespace.
> 
> The pid_vnr(pid) == 0 check catches this case -- we return an error to the
> caller if there isn't a pid mapping in the caller's namespace
> 
> Perhaps I should have called this out explicitly.
> 
> >   - open_by_handle_at() uses may_decode_fh() to determine whether it's
> >     possible to decode a file handle as an unprivileged user. The
> >     current checks don't make sense for pidfs. Conceptually, I think
> >     there don't need to place any restrictions based on global
> >     CAP_DAC_READ_SEARCH, owning user namespace of the superblock or
> >     mount on pidfs file handles.
> >
> >     The only restriction that matters is that the requested pidfs file
> >     handle is reachable from the caller's pid namespace.
> 
> I wonder if this could be handled through an addition to export_operations'
> flags member?
> 
> >   - A pidfd always has exactly a single inode and a single dentry.
> >     There's no aliases.
> >
> >   - Generally, in my naive opinion, I think that decoding pidfs file
> >     handles should be a lot simpler than decoding regular path based
> >     file handles. Because there should be no need to verify any
> >     ancestors, or reconnect paths. Pidfs also doesn't have directory
> >     inodes, only regular inodes. In other words, any dentry is
> >     acceptable.
> >
> >     Essentially, the only thing we need is for exportfs_decode_fh_raw()
> >     to verify that the provided pidfs file handle is resolvable in the
> >     caller's pid namespace. If so we're done. The challenge is how to
> >     nicely plumb this into the code without it sticking out like a sore
> >     thumb.
> 
> Theoretically you should be able to use PIDFD_SELF as well (assuming that
> makes its way into mainline this release :-)) but I am a bit concerned about
> potentially polluting the open_by_handle_at logic with pidfd specificities.
> 
> >   - Pidfs should not be exportable via NFS. It doesn't make sense.
> 
> Hmm, I guess I might have made that possible, though I'm certainly not
> familiar enough with the internals of nfsd to be able to test if I've done
> so.

AFAIK check_export() in fs/nfsd/export.c spells this it out:

	/* There are two requirements on a filesystem to be exportable.
	 * 1:  We must be able to identify the filesystem from a number.
	 *       either a device number (so FS_REQUIRES_DEV needed)
	 *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
	 * 2:  We must be able to find an inode from a filehandle.
	 *       This means that s_export_op must be set.
	 * 3: We must not currently be on an idmapped mount.
	 */

Granted I've been wrong on account of stale docs before. :$

Though it would be kinda funny if you *could* mess with another
machine's processes over NFS.

--D

> I guess probably this case calls for another export_ops flag? Not like we're
> short on them
> 
> Thanks,
>     - Erin
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-13  0:40     ` Darrick J. Wong
@ 2024-11-13 10:17       ` Erin Shepherd
  2024-11-13 13:29         ` Jeff Layton
  0 siblings, 1 reply; 57+ messages in thread
From: Erin Shepherd @ 2024-11-13 10:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christian Brauner, Jeff Layton, Amir Goldstein, linux-kernel,
	linux-fsdevel, christian, paul, bluca, Chuck Lever

On 13/11/2024 01:40, Darrick J. Wong wrote:
>> Hmm, I guess I might have made that possible, though I'm certainly not
>> familiar enough with the internals of nfsd to be able to test if I've done
>> so.
> AFAIK check_export() in fs/nfsd/export.c spells this it out:
>
> 	/* There are two requirements on a filesystem to be exportable.
> 	 * 1:  We must be able to identify the filesystem from a number.
> 	 *       either a device number (so FS_REQUIRES_DEV needed)
> 	 *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
> 	 * 2:  We must be able to find an inode from a filehandle.
> 	 *       This means that s_export_op must be set.
> 	 * 3: We must not currently be on an idmapped mount.
> 	 */
>
> Granted I've been wrong on account of stale docs before. :$
>
> Though it would be kinda funny if you *could* mess with another
> machine's processes over NFS.
>
> --D

To be clear I'm not familiar enough with the workings of nfsd to tell if
pidfs fails those requirements and therefore wouldn't become exportable as
a result of this patch, though I gather from you're message that we're in the
clear?

Regardless I think my question is: do we think either those requirements could
change in the future, or the properties of pidfs could change in the future,
in ways that could accidentally make the filesystem exportable?

I guess though that the same concern would apply to cgroupfs and it hasn't posed
an issue so far.

- Erin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-13 10:17       ` Erin Shepherd
@ 2024-11-13 13:29         ` Jeff Layton
  2024-11-13 14:41           ` Chuck Lever III
  2024-11-14  6:55           ` Amir Goldstein
  0 siblings, 2 replies; 57+ messages in thread
From: Jeff Layton @ 2024-11-13 13:29 UTC (permalink / raw)
  To: Erin Shepherd, Darrick J. Wong
  Cc: Christian Brauner, Amir Goldstein, linux-kernel, linux-fsdevel,
	christian, paul, bluca, Chuck Lever

On Wed, 2024-11-13 at 11:17 +0100, Erin Shepherd wrote:
> On 13/11/2024 01:40, Darrick J. Wong wrote:
> > > Hmm, I guess I might have made that possible, though I'm certainly not
> > > familiar enough with the internals of nfsd to be able to test if I've done
> > > so.
> > AFAIK check_export() in fs/nfsd/export.c spells this it out:
> > 
> > 	/* There are two requirements on a filesystem to be exportable.
> > 	 * 1:  We must be able to identify the filesystem from a number.
> > 	 *       either a device number (so FS_REQUIRES_DEV needed)
> > 	 *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
> > 	 * 2:  We must be able to find an inode from a filehandle.
> > 	 *       This means that s_export_op must be set.
> > 	 * 3: We must not currently be on an idmapped mount.
> > 	 */
> > 
> > Granted I've been wrong on account of stale docs before. :$
> > 
> > Though it would be kinda funny if you *could* mess with another
> > machine's processes over NFS.
> > 
> > --D
> 
> To be clear I'm not familiar enough with the workings of nfsd to tell if
> pidfs fails those requirements and therefore wouldn't become exportable as
> a result of this patch, though I gather from you're message that we're in the
> clear?
> 
> Regardless I think my question is: do we think either those requirements could
> change in the future, or the properties of pidfs could change in the future,
> in ways that could accidentally make the filesystem exportable?
> 
> I guess though that the same concern would apply to cgroupfs and it hasn't posed
> an issue so far.

We have other filesystems that do this sort of thing (like cgroupfs),
and we don't allow them to be exportable. We'll need to make sure that
that's the case before we merge this, of course, as I forget the
details of how that works.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-13 13:29         ` Jeff Layton
@ 2024-11-13 14:41           ` Chuck Lever III
  2024-11-14 10:39             ` Christian Brauner
  2024-11-14  6:55           ` Amir Goldstein
  1 sibling, 1 reply; 57+ messages in thread
From: Chuck Lever III @ 2024-11-13 14:41 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Erin Shepherd, Darrick J. Wong, Christian Brauner, Amir Goldstein,
	Linux Kernel Mailing List, Linux FS Devel, christian@brauner.io,
	Paul Moore, bluca@debian.org



> On Nov 13, 2024, at 8:29 AM, Jeff Layton <jlayton@kernel.org> wrote:
> 
> On Wed, 2024-11-13 at 11:17 +0100, Erin Shepherd wrote:
>> On 13/11/2024 01:40, Darrick J. Wong wrote:
>>>> Hmm, I guess I might have made that possible, though I'm certainly not
>>>> familiar enough with the internals of nfsd to be able to test if I've done
>>>> so.
>>> AFAIK check_export() in fs/nfsd/export.c spells this it out:
>>> 
>>> /* There are two requirements on a filesystem to be exportable.
>>> * 1:  We must be able to identify the filesystem from a number.
>>> *       either a device number (so FS_REQUIRES_DEV needed)
>>> *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
>>> * 2:  We must be able to find an inode from a filehandle.
>>> *       This means that s_export_op must be set.
>>> * 3: We must not currently be on an idmapped mount.
>>> */
>>> 
>>> Granted I've been wrong on account of stale docs before. :$
>>> 
>>> Though it would be kinda funny if you *could* mess with another
>>> machine's processes over NFS.
>>> 
>>> --D
>> 
>> To be clear I'm not familiar enough with the workings of nfsd to tell if
>> pidfs fails those requirements and therefore wouldn't become exportable as
>> a result of this patch, though I gather from you're message that we're in the
>> clear?
>> 
>> Regardless I think my question is: do we think either those requirements could
>> change in the future, or the properties of pidfs could change in the future,
>> in ways that could accidentally make the filesystem exportable?
>> 
>> I guess though that the same concern would apply to cgroupfs and it hasn't posed
>> an issue so far.
> 
> We have other filesystems that do this sort of thing (like cgroupfs),
> and we don't allow them to be exportable. We'll need to make sure that
> that's the case before we merge this, of course, as I forget the
> details of how that works.

It's far easier to add exportability later than it is
to remove it if we think it was a mistake. I would err
on the side of caution if there isn't an immediate
need/use-case for exposure via NFS.

--
Chuck Lever



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-13 14:41           ` Chuck Lever III
@ 2024-11-14 10:39             ` Christian Brauner
  0 siblings, 0 replies; 57+ messages in thread
From: Christian Brauner @ 2024-11-14 10:39 UTC (permalink / raw)
  To: Chuck Lever III, Darrick J. Wong
  Cc: Jeff Layton, Erin Shepherd, Amir Goldstein,
	Linux Kernel Mailing List, Linux FS Devel, christian@brauner.io,
	Paul Moore, bluca@debian.org

On Wed, Nov 13, 2024 at 02:41:29PM +0000, Chuck Lever III wrote:
> 
> 
> > On Nov 13, 2024, at 8:29 AM, Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Wed, 2024-11-13 at 11:17 +0100, Erin Shepherd wrote:
> >> On 13/11/2024 01:40, Darrick J. Wong wrote:
> >>>> Hmm, I guess I might have made that possible, though I'm certainly not
> >>>> familiar enough with the internals of nfsd to be able to test if I've done
> >>>> so.
> >>> AFAIK check_export() in fs/nfsd/export.c spells this it out:
> >>> 
> >>> /* There are two requirements on a filesystem to be exportable.
> >>> * 1:  We must be able to identify the filesystem from a number.
> >>> *       either a device number (so FS_REQUIRES_DEV needed)
> >>> *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
> >>> * 2:  We must be able to find an inode from a filehandle.
> >>> *       This means that s_export_op must be set.
> >>> * 3: We must not currently be on an idmapped mount.
> >>> */
> >>> 
> >>> Granted I've been wrong on account of stale docs before. :$
> >>> 
> >>> Though it would be kinda funny if you *could* mess with another
> >>> machine's processes over NFS.
> >>> 
> >>> --D
> >> 
> >> To be clear I'm not familiar enough with the workings of nfsd to tell if
> >> pidfs fails those requirements and therefore wouldn't become exportable as
> >> a result of this patch, though I gather from you're message that we're in the
> >> clear?
> >> 
> >> Regardless I think my question is: do we think either those requirements could
> >> change in the future, or the properties of pidfs could change in the future,
> >> in ways that could accidentally make the filesystem exportable?
> >> 
> >> I guess though that the same concern would apply to cgroupfs and it hasn't posed
> >> an issue so far.
> > 
> > We have other filesystems that do this sort of thing (like cgroupfs),
> > and we don't allow them to be exportable. We'll need to make sure that
> > that's the case before we merge this, of course, as I forget the
> > details of how that works.
> 
> It's far easier to add exportability later than it is
> to remove it if we think it was a mistake. I would err
> on the side of caution if there isn't an immediate
> need/use-case for exposure via NFS.

Tbh, the idea itself of exporting pidfs via nfs is a) crazy and b)
pretty interesting. If we could really export pidfs over NFS cleanly
somehow then we would have a filesystem-like representation of a remote
machine's processes. There could be a lot of interesting things in this.
But I would think that this requires some proper massaging of how pidfs
works. But in principle it might be possible.

Again, I'm not saying it's a great idea and we definitely shouldn't do
it now. But it's an interesting thought experiment at least.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/4] pidfs: implement file handle support
  2024-11-13 13:29         ` Jeff Layton
  2024-11-13 14:41           ` Chuck Lever III
@ 2024-11-14  6:55           ` Amir Goldstein
  1 sibling, 0 replies; 57+ messages in thread
From: Amir Goldstein @ 2024-11-14  6:55 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Erin Shepherd, Darrick J. Wong, Christian Brauner, linux-kernel,
	linux-fsdevel, christian, paul, bluca, Chuck Lever

On Wed, Nov 13, 2024 at 2:29 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Wed, 2024-11-13 at 11:17 +0100, Erin Shepherd wrote:
> > On 13/11/2024 01:40, Darrick J. Wong wrote:
> > > > Hmm, I guess I might have made that possible, though I'm certainly not
> > > > familiar enough with the internals of nfsd to be able to test if I've done
> > > > so.
> > > AFAIK check_export() in fs/nfsd/export.c spells this it out:
> > >
> > >     /* There are two requirements on a filesystem to be exportable.
> > >      * 1:  We must be able to identify the filesystem from a number.
> > >      *       either a device number (so FS_REQUIRES_DEV needed)
> > >      *       or an FSID number (so NFSEXP_FSID or ->uuid is needed).
> > >      * 2:  We must be able to find an inode from a filehandle.
> > >      *       This means that s_export_op must be set.
> > >      * 3: We must not currently be on an idmapped mount.
> > >      */
> > >
> > > Granted I've been wrong on account of stale docs before. :$
> > >
> > > Though it would be kinda funny if you *could* mess with another
> > > machine's processes over NFS.
> > >
> > > --D
> >
> > To be clear I'm not familiar enough with the workings of nfsd to tell if
> > pidfs fails those requirements and therefore wouldn't become exportable as
> > a result of this patch, though I gather from you're message that we're in the
> > clear?
> >
> > Regardless I think my question is: do we think either those requirements could
> > change in the future, or the properties of pidfs could change in the future,
> > in ways that could accidentally make the filesystem exportable?
> >
> > I guess though that the same concern would apply to cgroupfs and it hasn't posed
> > an issue so far.
>
> We have other filesystems that do this sort of thing (like cgroupfs),
> and we don't allow them to be exportable. We'll need to make sure that
> that's the case before we merge this, of course, as I forget the
> details of how that works.

TBH, I cannot find how export of cgroups with NFSEXP_FSID
is prevented.

We should probably block nfs export of SB_NOUSER and anyway,
this should be tied to the flag for relaxing CAP_DAC_READ_SEARCH,
because this is a strong indication that it's not a traditional nfs file handle.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2024-11-28 17:19 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-01 13:54 [PATCH 0/4] pidfs: implement file handle support Erin Shepherd
2024-11-01 13:54 ` [PATCH 1/4] pseudofs: add support for export_ops Erin Shepherd
2024-11-12 15:56   ` Amir Goldstein
2024-11-01 13:54 ` [PATCH 2/4] pidfs: implement file handle export support Erin Shepherd
2024-11-12 15:55   ` Amir Goldstein
2024-11-01 13:54 ` [PATCH 3/4] pid: introduce find_get_pid_ns Erin Shepherd
2024-11-12 15:59   ` Amir Goldstein
2024-11-01 13:54 ` [PATCH 4/4] pidfs: implement fh_to_dentry Erin Shepherd
2024-11-12 16:33   ` Amir Goldstein
2024-11-12 23:51   ` Jeff Layton
2024-11-13  8:01     ` Amir Goldstein
2024-11-13 10:11       ` Erin Shepherd
2024-11-13 12:21         ` Christian Brauner
2024-11-13 12:09   ` Christian Brauner
2024-11-13 13:06     ` Erin Shepherd
2024-11-13 13:26       ` Christian Brauner
2024-11-13 13:48         ` Erin Shepherd
2024-11-14 10:29           ` Christian Brauner
2024-11-14 12:21             ` Erin Shepherd
2024-11-12 13:10 ` [PATCH 0/4] pidfs: implement file handle support Christian Brauner
2024-11-12 13:57   ` Jeff Layton
2024-11-12 22:43     ` Erin Shepherd
2024-11-13  0:37       ` Darrick J. Wong
2024-11-13 11:35       ` Christian Brauner
2024-11-13 17:55       ` [PATCH v2 0/3] " Erin Shepherd
2024-11-13 17:55         ` [PATCH v2 1/3] pseudofs: add support for export_ops Erin Shepherd
2024-11-13 17:55         ` [PATCH v2 2/3] exportfs: allow fs to disable CAP_DAC_READ_SEARCH check Erin Shepherd
2024-11-13 22:50           ` kernel test robot
2024-11-14  1:29           ` kernel test robot
2024-11-14  4:37           ` Christoph Hellwig
2024-11-14 12:56             ` Erin Shepherd
2024-11-14  6:37           ` Amir Goldstein
2024-11-14 14:16             ` Christian Brauner
2024-11-13 17:55         ` [PATCH v2 3/3] pidfs: implement file handle support Erin Shepherd
2024-11-14  7:07           ` Amir Goldstein
2024-11-14 12:42             ` Erin Shepherd
2024-11-14 12:52           ` Christian Brauner
2024-11-14 13:13             ` Erin Shepherd
2024-11-14 14:13               ` Christian Brauner
2024-11-14 21:52                 ` Erin Shepherd
2024-11-15  7:50                   ` Amir Goldstein
2024-11-14  7:02         ` [PATCH v2 0/3] " Amir Goldstein
2024-11-14 12:48           ` Erin Shepherd
2024-11-14 14:27             ` Christian Brauner
2024-11-28 12:33               ` [PATCH RFC 0/2] pidfs: file handle preliminaries Christian Brauner
2024-11-28 12:33                 ` [PATCH RFC 1/2] pidfs: rework inode number allocation Christian Brauner
2024-11-28 17:19                   ` Amir Goldstein
2024-11-28 12:33                 ` [PATCH RFC 2/2] pidfs: remove 32bit inode number handling Christian Brauner
2024-11-28 17:06                 ` [PATCH RFC 0/2] pidfs: file handle preliminaries Amir Goldstein
2024-11-14 16:10             ` [PATCH v2 0/3] pidfs: implement file handle support Amir Goldstein
2024-11-12 23:03   ` [PATCH 0/4] " Erin Shepherd
2024-11-13  0:40     ` Darrick J. Wong
2024-11-13 10:17       ` Erin Shepherd
2024-11-13 13:29         ` Jeff Layton
2024-11-13 14:41           ` Chuck Lever III
2024-11-14 10:39             ` Christian Brauner
2024-11-14  6:55           ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox