[PATCH RFC v3 0/2] pidfs: make the effective {g,u}id the owner of the inode

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC v3 0/2] pidfs: make the effective {g,u}id the owner of the inode
@ 2026-02-23 13:20 Christian Brauner
  2026-02-23 13:20 ` [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks Christian Brauner
  2026-02-23 13:20 ` [PATCH RFC v3 2/2] selftests/pidfd: add inode ownership and permission tests Christian Brauner
  0 siblings, 2 replies; 5+ messages in thread
From: Christian Brauner @ 2026-02-23 13:20 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn
  Cc: Kees Cook, Andy Lutomirski, Alexander Viro, Jan Kara,
	linux-kernel, Christian Brauner

Hey,

This adds inode ownership and permission checking to pidfs.

Right now pidfs only supports trusted.* xattrs which require
CAP_SYS_ADMIN so there was never a need for real permission checking.
In order to support user.* xattrs and custom pidfs.* xattrs in the
future we need a permission model for pidfs inodes.

The effective {u,g}id of the target task becomes the owner of the pidfs
inode similar to what procfs does. Ownership is reported dynamically via
getattr since credentials may change due to setuid() and similar
operations. For kernel threads the owner is root, for exited tasks the
credentials saved at exit time via pidfs_exit() are used.

The permission callback checks access in two steps. First it verifies
the caller is either in the same thread group as the target or has
equivalent signal permissions reusing the same uid-based logic as
kill(). Then it performs standard POSIX permission checking via
generic_permission() against the inode's mode bits (S_IRWXU / 0700).

This is intentionally less strict than ptrace_may_access() because pidfs
currently does not allow operating on data that is completely private to
the process such as its mm or file descriptors. Additional checks can be
layered on once that changes.

The second patch adds selftests covering ownership reporting via fstat
and the permission model via user.* xattr operations which trigger
pidfs_permission() through xattr_permission(). The tests exercise live
credential changes, exited tasks with saved exit credentials, same-user
cross-process access, cross-user denial, and kernel thread denial.

Christian

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v3:
- Simplify pidfs_fill_owner() into pidfs_update_owner() writing directly
  to the inode via WRITE_ONCE() instead of using output parameters.
- Drop the separate pidfs_update_inode() helper and the
  security_task_to_inode() call.
- Update pidfs_getattr() to write ownership to the inode via
  pidfs_update_owner() instead of writing directly to stat.
- Update pidfs_permission() to also write ownership to the inode before
  calling generic_permission(), handling kernel threads with -EPERM.
- Drop VFS_WARN_ON_ONCE() for idmap check from pidfs_permission().
- Link to v2: https://patch.msgid.link/20260217-work-pidfs-inode-owner-v2-1-f04b5638315a@kernel.org

Changes in v2:
- Fix an obvious null-deref during PIDFD_STALE (CLONE_PIDFD).
- Link to v1: https://patch.msgid.link/20260216-work-pidfs-inode-owner-v1-1-f8faa6b73983@kernel.org

---
Christian Brauner (2):
      pidfs: add inode ownership and permission checks
      selftests/pidfd: add inode ownership and permission tests

 fs/pidfs.c                                         | 133 +++++++++-
 include/linux/cred.h                               |   2 +
 kernel/signal.c                                    |  19 +-
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../selftests/pidfd/pidfd_inode_owner_test.c       | 289 +++++++++++++++++++++
 6 files changed, 427 insertions(+), 19 deletions(-)
---
base-commit: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
change-id: 20260211-work-pidfs-inode-owner-0ca20de9ef23

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks
  2026-02-23 13:20 [PATCH RFC v3 0/2] pidfs: make the effective {g,u}id the owner of the inode Christian Brauner
@ 2026-02-23 13:20 ` Christian Brauner
  2026-02-24 21:22   ` Jann Horn
  2026-02-23 13:20 ` [PATCH RFC v3 2/2] selftests/pidfd: add inode ownership and permission tests Christian Brauner
  1 sibling, 1 reply; 5+ messages in thread
From: Christian Brauner @ 2026-02-23 13:20 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn
  Cc: Kees Cook, Andy Lutomirski, Alexander Viro, Jan Kara,
	linux-kernel, Christian Brauner

Right now we only support trusted.* xattrs which require CAP_SYS_ADMIN
which doesn't really require any meaningful permission checking. But in
order to support user.* xattrs and custom pidfs.* xattrs in the future
we need permission checking for pidfs inodes. Add baseline permission
checking that can later be extended with additional write-time checks
for specific pidfs.* xattrs.

Make the effective {u,g}id of the task the owner of the pidfs inode
(like procfs does). The ownership is set when the dentry is first
stashed and reported dynamically via getattr since credentials may
change due to setuid() and similar operations. For kernel threads use
root, for exited tasks use the credentials saved at exit time.

The inode's ownership is updated via WRITE_ONCE() from the getattr()
and permission() callbacks. This doesn't serialize against
inode->i_op->setattr() but since pidfs rejects setattr() this isn't
currently an issue. A seqcount-based approach can be used if setattr()
support is added in the future [1].

Save the task's credentials and thread group pid inode number at exit
time so that ownership and permission checks remain functional after
the task has been reaped.

Add a permission callback that checks access in two steps:

 (1) Verify the caller is either in the same thread group as the target
     or has equivalent signal permissions. This reuses the same
     uid-based logic as kill() by extracting may_signal_creds() from
     kill_ok_by_cred() so it can operate on credential pointers
     directly. For exited tasks the check uses the saved exit
     credentials and compares thread group identity.

 (2) Perform standard POSIX permission checking via generic_permission()
     against the inode's ownership and mode bits.

This is intentionally less strict than ptrace_may_access() because pidfs
currently does not allow operating on data that is completely private to
the process such as its mm or file descriptors. Additional checks will
be needed once that changes.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.inode.seqcount [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/pidfs.c           | 133 +++++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/cred.h |   2 +
 kernel/signal.c      |  19 ++++----
 3 files changed, 136 insertions(+), 18 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
index 318253344b5c..16a3cfa84af4 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -14,6 +14,7 @@
 #include <linux/proc_ns.h>
 #include <linux/pseudo_fs.h>
 #include <linux/ptrace.h>
+#include <linux/security.h>
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 #include <linux/ipc_namespace.h>
@@ -46,15 +47,23 @@ enum pidfs_attr_mask_bits {
 	PIDFS_ATTR_BIT_COREDUMP	= 1,
 };
 
+struct pidfs_exit_attr {
+	__u64 cgroupid;
+	__s32 exit_code;
+	const struct cred *exit_cred;
+	u64 exit_tgid_ino;
+};
+
+struct pidfs_coredump_attr {
+	__u32 coredump_mask;
+	__u32 coredump_signal;
+};
+
 struct pidfs_attr {
 	unsigned long attr_mask;
 	struct simple_xattrs *xattrs;
-	struct /* exit info */ {
-		__u64 cgroupid;
-		__s32 exit_code;
-	};
-	__u32 coredump_mask;
-	__u32 coredump_signal;
+	struct pidfs_exit_attr;
+	struct pidfs_coredump_attr;
 };
 
 static struct rhashtable pidfs_ino_ht;
@@ -200,6 +209,7 @@ void pidfs_free_pid(struct pid *pid)
 	if (IS_ERR(attr))
 		return;
 
+	put_cred(attr->exit_cred);
 	xattrs = no_free_ptr(attr->xattrs);
 	if (xattrs)
 		simple_xattrs_free(xattrs, NULL);
@@ -703,12 +713,14 @@ void pidfs_exit(struct task_struct *tsk)
 	 * is put
 	 */
 
-#ifdef CONFIG_CGROUPS
 	rcu_read_lock();
+#ifdef CONFIG_CGROUPS
 	cgrp = task_dfl_cgroup(tsk);
 	attr->cgroupid = cgroup_id(cgrp);
-	rcu_read_unlock();
 #endif
+	attr->exit_cred = get_cred(__task_cred(tsk));
+	rcu_read_unlock();
+	attr->exit_tgid_ino = task_tgid(tsk)->ino;
 	attr->exit_code = tsk->exit_code;
 
 	/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
@@ -741,6 +753,47 @@ void pidfs_coredump(const struct coredump_params *cprm)
 
 static struct vfsmount *pidfs_mnt __ro_after_init;
 
+static void pidfs_update_owner(struct inode *inode)
+{
+	struct pid *pid = inode->i_private;
+	struct task_struct *task;
+	const struct cred *cred;
+	kuid_t kuid;
+	kgid_t kgid;
+
+	VFS_WARN_ON_ONCE(!pid);
+
+	guard(rcu)();
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task) {
+		struct pidfs_attr *attr = READ_ONCE(pid->attr);
+
+		VFS_WARN_ON_ONCE(!attr);
+		/*
+		 * During copy_process() with CLONE_PIDFD the
+		 * task hasn't been attached to the pid yet so
+		 * pid_task() returns NULL and there's no
+		 * exit_cred as the task obviously hasn't
+		 * exited. Use the parent's credentials.
+		 */
+		cred = attr->exit_cred;
+		if (!cred)
+			cred = current_cred();
+		kuid = cred->euid;
+		kgid = cred->egid;
+	} else if (unlikely(task->flags & PF_KTHREAD)) {
+		kuid = GLOBAL_ROOT_UID;
+		kgid = GLOBAL_ROOT_GID;
+	} else {
+		cred = __task_cred(task);
+		kuid = cred->euid;
+		kgid = cred->egid;
+	}
+
+	WRITE_ONCE(inode->i_uid, kuid);
+	WRITE_ONCE(inode->i_gid, kgid);
+}
+
 /*
  * The vfs falls back to simple_setattr() if i_op->setattr() isn't
  * implemented. Let's reject it completely until we have a clean
@@ -756,7 +809,11 @@ static int pidfs_getattr(struct mnt_idmap *idmap, const struct path *path,
 			 struct kstat *stat, u32 request_mask,
 			 unsigned int query_flags)
 {
-	return anon_inode_getattr(idmap, path, stat, request_mask, query_flags);
+	struct inode *inode = d_inode(path->dentry);
+
+	pidfs_update_owner(inode);
+	anon_inode_getattr(idmap, path, stat, request_mask, query_flags);
+	return 0;
 }
 
 static ssize_t pidfs_listxattr(struct dentry *dentry, char *buf, size_t size)
@@ -773,10 +830,64 @@ static ssize_t pidfs_listxattr(struct dentry *dentry, char *buf, size_t size)
 	return simple_xattr_list(inode, xattrs, buf, size);
 }
 
+static int pidfs_permission(struct mnt_idmap *idmap, struct inode *inode,
+			    int mask)
+{
+	struct pid *pid = inode->i_private;
+	struct task_struct *task;
+	const struct cred *cred;
+	u64 pid_tg_ino;
+
+	scoped_guard(rcu) {
+		task = pid_task(pid, PIDTYPE_PID);
+		if (task) {
+			if (unlikely(task->flags & PF_KTHREAD))
+				return -EPERM;
+
+			cred = __task_cred(task);
+			pid_tg_ino = task_tgid(task)->ino;
+		} else {
+			struct pidfs_attr *attr;
+
+			attr = READ_ONCE(pid->attr);
+			VFS_WARN_ON_ONCE(!attr);
+			/*
+			 * During copy_process() with CLONE_PIDFD the
+			 * task hasn't been attached to the pid yet so
+			 * pid_task() returns NULL and there's no
+			 * exit_cred as the task obviously hasn't
+			 * exited. Use the parent's credentials.
+			 */
+			cred = attr->exit_cred;
+			if (!cred)
+				cred = current_cred();
+			pid_tg_ino = attr->exit_tgid_ino;
+		}
+
+		/*
+		 * If the caller and the target are in the same
+		 * thread-group or the caller can signal the target
+		 * we're good.
+		 */
+		if (pid_tg_ino != task_tgid(current)->ino &&
+		    !may_signal_creds(current_cred(), cred))
+			return -EPERM;
+
+		/*
+		 * This is racy but not more racy then what we generally
+		 * do for permission checking.
+		 */
+		WRITE_ONCE(inode->i_uid, cred->euid);
+		WRITE_ONCE(inode->i_gid, cred->egid);
+	}
+	return generic_permission(&nop_mnt_idmap, inode, mask);
+}
+
 static const struct inode_operations pidfs_inode_operations = {
 	.getattr	= pidfs_getattr,
 	.setattr	= pidfs_setattr,
 	.listxattr	= pidfs_listxattr,
+	.permission	= pidfs_permission,
 };
 
 static void pidfs_evict_inode(struct inode *inode)
@@ -983,7 +1094,8 @@ static struct dentry *pidfs_stash_dentry(struct dentry **stashed,
 					 struct dentry *dentry)
 {
 	int ret;
-	struct pid *pid = d_inode(dentry)->i_private;
+	struct inode *inode = d_inode(dentry);
+	struct pid *pid = inode->i_private;
 
 	VFS_WARN_ON_ONCE(stashed != &pid->stashed);
 
@@ -991,6 +1103,7 @@ static struct dentry *pidfs_stash_dentry(struct dentry **stashed,
 	if (ret)
 		return ERR_PTR(ret);
 
+	pidfs_update_owner(inode);
 	return stash_dentry(stashed, dentry);
 }
 
diff --git a/include/linux/cred.h b/include/linux/cred.h
index ed1609d78cd7..d14b29fe9fee 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -168,6 +168,8 @@ extern int set_create_files_as(struct cred *, struct inode *);
 extern int cred_fscmp(const struct cred *, const struct cred *);
 extern void __init cred_init(void);
 extern int set_cred_ucounts(struct cred *);
+bool may_signal_creds(const struct cred *signaler_cred,
+		      const struct cred *signalee_cred);
 
 static inline bool cap_ambient_invariant_ok(const struct cred *cred)
 {
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..e20dabf143c2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -777,19 +777,22 @@ static inline bool si_fromuser(const struct kernel_siginfo *info)
 		(!is_si_special(info) && SI_FROMUSER(info));
 }
 
+bool may_signal_creds(const struct cred *signaler_cred,
+		      const struct cred *signalee_cred)
+{
+	return uid_eq(signaler_cred->euid, signalee_cred->suid) ||
+	       uid_eq(signaler_cred->euid, signalee_cred->uid) ||
+	       uid_eq(signaler_cred->uid, signalee_cred->suid) ||
+	       uid_eq(signaler_cred->uid, signalee_cred->uid) ||
+	       ns_capable(signalee_cred->user_ns, CAP_KILL);
+}
+
 /*
  * called with RCU read lock from check_kill_permission()
  */
 static bool kill_ok_by_cred(struct task_struct *t)
 {
-	const struct cred *cred = current_cred();
-	const struct cred *tcred = __task_cred(t);
-
-	return uid_eq(cred->euid, tcred->suid) ||
-	       uid_eq(cred->euid, tcred->uid) ||
-	       uid_eq(cred->uid, tcred->suid) ||
-	       uid_eq(cred->uid, tcred->uid) ||
-	       ns_capable(tcred->user_ns, CAP_KILL);
+	return may_signal_creds(current_cred(), __task_cred(t));
 }
 
 /*

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH RFC v3 2/2] selftests/pidfd: add inode ownership and permission tests
  2026-02-23 13:20 [PATCH RFC v3 0/2] pidfs: make the effective {g,u}id the owner of the inode Christian Brauner
  2026-02-23 13:20 ` [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks Christian Brauner
@ 2026-02-23 13:20 ` Christian Brauner
  1 sibling, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2026-02-23 13:20 UTC (permalink / raw)
  To: linux-fsdevel, Jann Horn
  Cc: Kees Cook, Andy Lutomirski, Alexander Viro, Jan Kara,
	linux-kernel, Christian Brauner

Test the pidfs inode ownership reporting (via fstat) and the permission
model (via user.* xattr operations that trigger pidfs_permission()):

Ownership tests:
- owner_self: own pidfd reports caller's euid/egid
- owner_child: child pidfd reports correct ownership
- owner_child_changed_euid: ownership tracks live credential changes
- owner_exited_child: ownership persists after exit and reap
- owner_exited_child_changed_euid: exit_cred preserves changed credentials

Permission tests:
- permission_same_user: same-user xattr access succeeds (EOPNOTSUPP)
- permission_different_user_denied: cross-user access denied (EPERM)
- permission_kthread: kernel thread access always denied (EPERM)

The user.* xattr namespace is used to exercise pidfs_permission() from
userspace: xattr_permission() calls inode_permission() for user.* on
S_IFREG inodes, so fgetxattr() returns EOPNOTSUPP when permission is
granted (no handler) and EPERM when denied.

Tests requiring root skip gracefully via SKIP().

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/.gitignore           |   1 +
 tools/testing/selftests/pidfd/Makefile             |   2 +-
 .../selftests/pidfd/pidfd_inode_owner_test.c       | 289 +++++++++++++++++++++
 3 files changed, 291 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 144e7ff65d6a..1981d39fe3dc 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -12,3 +12,4 @@ pidfd_info_test
 pidfd_exec_helper
 pidfd_xattr_test
 pidfd_setattr_test
+pidfd_inode_owner_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 764a8f9ecefa..904c9fd595c1 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -4,7 +4,7 @@ CFLAGS += -g $(KHDR_INCLUDES) $(TOOLS_INCLUDES) -pthread -Wall
 TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
 	pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test \
 	pidfd_file_handle_test pidfd_bind_mount pidfd_info_test \
-	pidfd_xattr_test pidfd_setattr_test
+	pidfd_xattr_test pidfd_setattr_test pidfd_inode_owner_test
 
 TEST_GEN_PROGS_EXTENDED := pidfd_exec_helper
 
diff --git a/tools/testing/selftests/pidfd/pidfd_inode_owner_test.c b/tools/testing/selftests/pidfd/pidfd_inode_owner_test.c
new file mode 100644
index 000000000000..58666b87638b
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_inode_owner_test.c
@@ -0,0 +1,289 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+#include <sys/xattr.h>
+#include <unistd.h>
+
+#include "pidfd.h"
+#include "kselftest_harness.h"
+
+FIXTURE(pidfs_inode_owner)
+{
+	pid_t child_pid;
+	int child_pidfd;
+};
+
+FIXTURE_SETUP(pidfs_inode_owner)
+{
+	int pipe_fds[2];
+	char buf;
+
+	self->child_pid = -1;
+	self->child_pidfd = -1;
+
+	ASSERT_EQ(pipe(pipe_fds), 0);
+
+	self->child_pid = create_child(&self->child_pidfd, 0);
+	ASSERT_GE(self->child_pid, 0);
+
+	if (self->child_pid == 0) {
+		close(pipe_fds[0]);
+		write_nointr(pipe_fds[1], "c", 1);
+		close(pipe_fds[1]);
+		pause();
+		_exit(EXIT_SUCCESS);
+	}
+
+	close(pipe_fds[1]);
+	ASSERT_EQ(read_nointr(pipe_fds[0], &buf, 1), 1);
+	close(pipe_fds[0]);
+}
+
+FIXTURE_TEARDOWN(pidfs_inode_owner)
+{
+	if (self->child_pid > 0) {
+		kill(self->child_pid, SIGKILL);
+		sys_waitid(P_PID, self->child_pid, NULL, WEXITED);
+	}
+	if (self->child_pidfd >= 0)
+		close(self->child_pidfd);
+}
+
+/* Own pidfd reports correct ownership. */
+TEST_F(pidfs_inode_owner, owner_self)
+{
+	int pidfd;
+	struct stat st;
+
+	pidfd = sys_pidfd_open(getpid(), 0);
+	ASSERT_GE(pidfd, 0);
+
+	ASSERT_EQ(fstat(pidfd, &st), 0);
+	EXPECT_EQ(st.st_uid, geteuid());
+	EXPECT_EQ(st.st_gid, getegid());
+
+	close(pidfd);
+}
+
+/* Child pidfd reports correct ownership. */
+TEST_F(pidfs_inode_owner, owner_child)
+{
+	struct stat st;
+
+	ASSERT_EQ(fstat(self->child_pidfd, &st), 0);
+	EXPECT_EQ(st.st_uid, geteuid());
+	EXPECT_EQ(st.st_gid, getegid());
+}
+
+/* Ownership tracks credential changes in a live task. */
+TEST_F(pidfs_inode_owner, owner_child_changed_euid)
+{
+	pid_t pid;
+	int pidfd, pipe_fds[2];
+	struct stat st;
+	char buf;
+
+	if (getuid() != 0)
+		SKIP(return, "Test requires root");
+
+	ASSERT_EQ(pipe(pipe_fds), 0);
+
+	pid = create_child(&pidfd, 0);
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		close(pipe_fds[0]);
+		if (setresgid(65534, 65534, 65534))
+			_exit(PIDFD_ERROR);
+		if (setresuid(65534, 65534, 65534))
+			_exit(PIDFD_ERROR);
+		write_nointr(pipe_fds[1], "c", 1);
+		close(pipe_fds[1]);
+		pause();
+		_exit(EXIT_SUCCESS);
+	}
+
+	close(pipe_fds[1]);
+	ASSERT_EQ(read_nointr(pipe_fds[0], &buf, 1), 1);
+	close(pipe_fds[0]);
+
+	ASSERT_EQ(fstat(pidfd, &st), 0);
+	EXPECT_EQ(st.st_uid, (uid_t)65534);
+	EXPECT_EQ(st.st_gid, (gid_t)65534);
+
+	kill(pid, SIGKILL);
+	sys_waitid(P_PID, pid, NULL, WEXITED);
+	close(pidfd);
+}
+
+/* Ownership persists after the child exits and is reaped. */
+TEST_F(pidfs_inode_owner, owner_exited_child)
+{
+	pid_t pid;
+	int pidfd;
+	struct stat st;
+
+	pid = create_child(&pidfd, 0);
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0)
+		_exit(EXIT_SUCCESS);
+
+	ASSERT_EQ(sys_waitid(P_PID, pid, NULL, WEXITED), 0);
+
+	ASSERT_EQ(fstat(pidfd, &st), 0);
+	EXPECT_EQ(st.st_uid, geteuid());
+	EXPECT_EQ(st.st_gid, getegid());
+
+	close(pidfd);
+}
+
+/* Exit credentials preserve changed credentials. */
+TEST_F(pidfs_inode_owner, owner_exited_child_changed_euid)
+{
+	pid_t pid;
+	int pidfd;
+	struct stat st;
+
+	if (getuid() != 0)
+		SKIP(return, "Test requires root");
+
+	pid = create_child(&pidfd, 0);
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (setresgid(65534, 65534, 65534))
+			_exit(PIDFD_ERROR);
+		if (setresuid(65534, 65534, 65534))
+			_exit(PIDFD_ERROR);
+		_exit(EXIT_SUCCESS);
+	}
+
+	ASSERT_EQ(sys_waitid(P_PID, pid, NULL, WEXITED), 0);
+
+	ASSERT_EQ(fstat(pidfd, &st), 0);
+	EXPECT_EQ(st.st_uid, (uid_t)65534);
+	EXPECT_EQ(st.st_gid, (gid_t)65534);
+
+	close(pidfd);
+}
+
+/* Same-user cross-process permission check succeeds. */
+TEST_F(pidfs_inode_owner, permission_same_user)
+{
+	pid_t pid;
+	int pidfd;
+	pid_t parent_pid = getpid();
+
+	pid = create_child(&pidfd, 0);
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		char buf;
+
+		fd = sys_pidfd_open(parent_pid, 0);
+		if (fd < 0)
+			_exit(PIDFD_ERROR);
+
+		/*
+		 * user.* xattr access triggers pidfs_permission().
+		 * Same user passes may_signal_creds() and
+		 * generic_permission(), so we get EOPNOTSUPP
+		 * (no user.* xattr handler) instead of EPERM.
+		 */
+		if (fgetxattr(fd, "user.test", &buf, sizeof(buf)) < 0 &&
+		    errno == EOPNOTSUPP) {
+			close(fd);
+			_exit(PIDFD_PASS);
+		}
+
+		close(fd);
+		_exit(PIDFD_FAIL);
+	}
+
+	ASSERT_EQ(wait_for_pid(pid), PIDFD_PASS);
+	close(pidfd);
+}
+
+/* Cross-user access is denied without signal permission. */
+TEST_F(pidfs_inode_owner, permission_different_user_denied)
+{
+	pid_t pid;
+	int pidfd;
+
+	if (getuid() != 0)
+		SKIP(return, "Test requires root");
+
+	pid = create_child(&pidfd, 0);
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+		char buf;
+
+		/* Drop to uid/gid 65534 and lose all capabilities. */
+		if (setresgid(65534, 65534, 65534))
+			_exit(PIDFD_ERROR);
+		if (setresuid(65534, 65534, 65534))
+			_exit(PIDFD_ERROR);
+
+		/* Open pidfd for init (uid 0). */
+		fd = sys_pidfd_open(1, 0);
+		if (fd < 0)
+			_exit(PIDFD_ERROR);
+
+		/*
+		 * uid 65534 cannot signal uid 0 (no CAP_KILL),
+		 * so pidfs_permission() denies access.
+		 */
+		if (fgetxattr(fd, "user.test", &buf, sizeof(buf)) < 0 &&
+		    errno == EPERM) {
+			close(fd);
+			_exit(PIDFD_PASS);
+		}
+
+		close(fd);
+		_exit(PIDFD_FAIL);
+	}
+
+	ASSERT_EQ(wait_for_pid(pid), PIDFD_PASS);
+	close(pidfd);
+}
+
+/* Kernel thread access is always denied. */
+TEST_F(pidfs_inode_owner, permission_kthread)
+{
+	int pidfd;
+	struct stat st;
+	char buf;
+
+	/* pid 2 is kthreadd. */
+	pidfd = sys_pidfd_open(2, 0);
+	ASSERT_GE(pidfd, 0);
+
+	/* pidfs_permission() returns EPERM for kernel threads. */
+	ASSERT_LT(fgetxattr(pidfd, "user.test", &buf, sizeof(buf)), 0);
+	EXPECT_EQ(errno, EPERM);
+
+	/* fstat bypasses permission and reports root ownership. */
+	ASSERT_EQ(fstat(pidfd, &st), 0);
+	EXPECT_EQ(st.st_uid, (uid_t)0);
+	EXPECT_EQ(st.st_gid, (gid_t)0);
+
+	close(pidfd);
+}
+
+TEST_HARNESS_MAIN

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks
  2026-02-23 13:20 ` [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks Christian Brauner
@ 2026-02-24 21:22   ` Jann Horn
  2026-02-25 15:32     ` Christian Brauner
  0 siblings, 1 reply; 5+ messages in thread
From: Jann Horn @ 2026-02-24 21:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Kees Cook, Andy Lutomirski, Alexander Viro,
	Jan Kara, linux-kernel

On Mon, Feb 23, 2026 at 2:20 PM Christian Brauner <brauner@kernel.org> wrote:
> Right now we only support trusted.* xattrs which require CAP_SYS_ADMIN
> which doesn't really require any meaningful permission checking. But in
> order to support user.* xattrs and custom pidfs.* xattrs in the future
> we need permission checking for pidfs inodes. Add baseline permission
> checking that can later be extended with additional write-time checks
> for specific pidfs.* xattrs.
>
> Make the effective {u,g}id of the task the owner of the pidfs inode
> (like procfs does). The ownership is set when the dentry is first
> stashed and reported dynamically via getattr since credentials may
> change due to setuid() and similar operations. For kernel threads use
> root, for exited tasks use the credentials saved at exit time.
>
> The inode's ownership is updated via WRITE_ONCE() from the getattr()
> and permission() callbacks. This doesn't serialize against
> inode->i_op->setattr() but since pidfs rejects setattr() this isn't
> currently an issue. A seqcount-based approach can be used if setattr()
> support is added in the future [1].
>
> Save the task's credentials and thread group pid inode number at exit
> time so that ownership and permission checks remain functional after
> the task has been reaped.
>
> Add a permission callback that checks access in two steps:
>
>  (1) Verify the caller is either in the same thread group as the target
>      or has equivalent signal permissions. This reuses the same
>      uid-based logic as kill() by extracting may_signal_creds() from
>      kill_ok_by_cred() so it can operate on credential pointers
>      directly. For exited tasks the check uses the saved exit
>      credentials and compares thread group identity.
>
>  (2) Perform standard POSIX permission checking via generic_permission()
>      against the inode's ownership and mode bits.
>
> This is intentionally less strict than ptrace_may_access() because pidfs
> currently does not allow operating on data that is completely private to
> the process such as its mm or file descriptors. Additional checks will
> be needed once that changes.
>
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.inode.seqcount [1]
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  fs/pidfs.c           | 133 +++++++++++++++++++++++++++++++++++++++++++++++----
>  include/linux/cred.h |   2 +
>  kernel/signal.c      |  19 ++++----
>  3 files changed, 136 insertions(+), 18 deletions(-)
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 318253344b5c..16a3cfa84af4 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
[...]
> +static int pidfs_permission(struct mnt_idmap *idmap, struct inode *inode,
> +                           int mask)
> +{
> +       struct pid *pid = inode->i_private;
> +       struct task_struct *task;
> +       const struct cred *cred;
> +       u64 pid_tg_ino;
> +
> +       scoped_guard(rcu) {
> +               task = pid_task(pid, PIDTYPE_PID);
> +               if (task) {
> +                       if (unlikely(task->flags & PF_KTHREAD))
> +                               return -EPERM;
> +
> +                       cred = __task_cred(task);
> +                       pid_tg_ino = task_tgid(task)->ino;

Can this NULL deref if the task concurrently gets reaped and has
detach_pid() called on it?

> +               } else {
> +                       struct pidfs_attr *attr;
> +
> +                       attr = READ_ONCE(pid->attr);
> +                       VFS_WARN_ON_ONCE(!attr);
> +                       /*
> +                        * During copy_process() with CLONE_PIDFD the
> +                        * task hasn't been attached to the pid yet so
> +                        * pid_task() returns NULL and there's no
> +                        * exit_cred as the task obviously hasn't
> +                        * exited. Use the parent's credentials.
> +                        */
> +                       cred = attr->exit_cred;
> +                       if (!cred)
> +                               cred = current_cred();
> +                       pid_tg_ino = attr->exit_tgid_ino;
> +               }
> +
> +               /*
> +                * If the caller and the target are in the same
> +                * thread-group or the caller can signal the target
> +                * we're good.
> +                */
> +               if (pid_tg_ino != task_tgid(current)->ino &&
> +                   !may_signal_creds(current_cred(), cred))
> +                       return -EPERM;
> +
> +               /*
> +                * This is racy but not more racy then what we generally
> +                * do for permission checking.
> +                */
> +               WRITE_ONCE(inode->i_uid, cred->euid);
> +               WRITE_ONCE(inode->i_gid, cred->egid);

I realize that using ->euid here matches what procfs does in
task_dump_owner(), but it doesn't make sense to me. The EUID is kinda
inherently "subjective" and doesn't really make sense in a context
like this where we're treating the process as an object. See also how,
when sending a signal to a process, kill_ok_by_cred() doesn't care
about the EUID of the target process, because it would be silly to
allow a user to send signals to a fileserver that happens to briefly
call seteuid() to access the filesystem in the name of the user, or
something like that.

The thing that IMO most expresses the objective identity of a process
is the RUID (notably that includes that it stays the same across
setuid execution unless the process explicitly changes RUID).

I think this should be using cred->uid and cred->gid so that the
permission check below makes more sense. That said, if you want to
instead follow the precedent of procfs and rely on the more explicit
permission checks above to actually provide security, I guess that
works too...

> +       }
> +       return generic_permission(&nop_mnt_idmap, inode, mask);
> +}
[...]
>  {
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d65d0fe24bfb..e20dabf143c2 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -777,19 +777,22 @@ static inline bool si_fromuser(const struct kernel_siginfo *info)
>                 (!is_si_special(info) && SI_FROMUSER(info));
>  }
>
> +bool may_signal_creds(const struct cred *signaler_cred,
> +                     const struct cred *signalee_cred)
> +{
> +       return uid_eq(signaler_cred->euid, signalee_cred->suid) ||
> +              uid_eq(signaler_cred->euid, signalee_cred->uid) ||
> +              uid_eq(signaler_cred->uid, signalee_cred->suid) ||
> +              uid_eq(signaler_cred->uid, signalee_cred->uid) ||
> +              ns_capable(signalee_cred->user_ns, CAP_KILL);
> +}

I don't like reusing the signal sending permission checks here - in my
opinion, filesystem operations shouldn't be allowed based on the
caller's *RUID*. They should ideally be using the caller's FSUID.

> +
>  /*
>   * called with RCU read lock from check_kill_permission()
>   */
>  static bool kill_ok_by_cred(struct task_struct *t)
>  {
> -       const struct cred *cred = current_cred();
> -       const struct cred *tcred = __task_cred(t);
> -
> -       return uid_eq(cred->euid, tcred->suid) ||
> -              uid_eq(cred->euid, tcred->uid) ||
> -              uid_eq(cred->uid, tcred->suid) ||
> -              uid_eq(cred->uid, tcred->uid) ||
> -              ns_capable(tcred->user_ns, CAP_KILL);
> +       return may_signal_creds(current_cred(), __task_cred(t));
>  }
>
>  /*
>
> --
> 2.47.3
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks
  2026-02-24 21:22   ` Jann Horn
@ 2026-02-25 15:32     ` Christian Brauner
  0 siblings, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2026-02-25 15:32 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Kees Cook, Andy Lutomirski, Alexander Viro,
	Jan Kara, linux-kernel

> > +static int pidfs_permission(struct mnt_idmap *idmap, struct inode *inode,
> > +                           int mask)
> > +{
> > +       struct pid *pid = inode->i_private;
> > +       struct task_struct *task;
> > +       const struct cred *cred;
> > +       u64 pid_tg_ino;
> > +
> > +       scoped_guard(rcu) {
> > +               task = pid_task(pid, PIDTYPE_PID);
> > +               if (task) {
> > +                       if (unlikely(task->flags & PF_KTHREAD))
> > +                               return -EPERM;
> > +
> > +                       cred = __task_cred(task);
> > +                       pid_tg_ino = task_tgid(task)->ino;
> 
> Can this NULL deref if the task concurrently gets reaped and has
> detach_pid() called on it?

So the thread-group leader task/pid is kept around until all subthreads
have exited. That's what delay_group_leader() does. So two cases:

(1) So if @task wasn't a thread-group leader but a subthread then
    task_tgid(task) cannot be NULL.

(2) If @task is itself a thread-group leader then yes, it's possible via
    __unhash_process().

So yeah, this needs handling.

> > +               } else {
> > +                       struct pidfs_attr *attr;
> > +
> > +                       attr = READ_ONCE(pid->attr);
> > +                       VFS_WARN_ON_ONCE(!attr);
> > +                       /*
> > +                        * During copy_process() with CLONE_PIDFD the
> > +                        * task hasn't been attached to the pid yet so
> > +                        * pid_task() returns NULL and there's no
> > +                        * exit_cred as the task obviously hasn't
> > +                        * exited. Use the parent's credentials.
> > +                        */
> > +                       cred = attr->exit_cred;
> > +                       if (!cred)
> > +                               cred = current_cred();
> > +                       pid_tg_ino = attr->exit_tgid_ino;
> > +               }
> > +
> > +               /*
> > +                * If the caller and the target are in the same
> > +                * thread-group or the caller can signal the target
> > +                * we're good.
> > +                */
> > +               if (pid_tg_ino != task_tgid(current)->ino &&
> > +                   !may_signal_creds(current_cred(), cred))
> > +                       return -EPERM;
> > +
> > +               /*
> > +                * This is racy but not more racy then what we generally
> > +                * do for permission checking.
> > +                */
> > +               WRITE_ONCE(inode->i_uid, cred->euid);
> > +               WRITE_ONCE(inode->i_gid, cred->egid);
> 
> I realize that using ->euid here matches what procfs does in
> task_dump_owner(), but it doesn't make sense to me. The EUID is kinda
> inherently "subjective" and doesn't really make sense in a context
> like this where we're treating the process as an object. See also how,
> when sending a signal to a process, kill_ok_by_cred() doesn't care
> about the EUID of the target process, because it would be silly to
> allow a user to send signals to a fileserver that happens to briefly
> call seteuid() to access the filesystem in the name of the user, or
> something like that.
> 
> The thing that IMO most expresses the objective identity of a process
> is the RUID (notably that includes that it stays the same across
> setuid execution unless the process explicitly changes RUID).
> 
> I think this should be using cred->uid and cred->gid so that the
> permission check below makes more sense. That said, if you want to
> instead follow the precedent of procfs and rely on the more explicit
> permission checks above to actually provide security, I guess that
> works too...
> 
> > +       }
> > +       return generic_permission(&nop_mnt_idmap, inode, mask);
> > +}
> [...]
> >  {
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..e20dabf143c2 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -777,19 +777,22 @@ static inline bool si_fromuser(const struct kernel_siginfo *info)
> >                 (!is_si_special(info) && SI_FROMUSER(info));
> >  }
> >
> > +bool may_signal_creds(const struct cred *signaler_cred,
> > +                     const struct cred *signalee_cred)
> > +{
> > +       return uid_eq(signaler_cred->euid, signalee_cred->suid) ||
> > +              uid_eq(signaler_cred->euid, signalee_cred->uid) ||
> > +              uid_eq(signaler_cred->uid, signalee_cred->suid) ||
> > +              uid_eq(signaler_cred->uid, signalee_cred->uid) ||
> > +              ns_capable(signalee_cred->user_ns, CAP_KILL);
> > +}
> 
> I don't like reusing the signal sending permission checks here - in my
> opinion, filesystem operations shouldn't be allowed based on the
> caller's *RUID*. They should ideally be using the caller's FSUID.

Ok, a combined model sounds great to me based on ruid/rgid of the target
task and the fsuid/fsgid of the caller.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-02-25 15:33 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-23 13:20 [PATCH RFC v3 0/2] pidfs: make the effective {g,u}id the owner of the inode Christian Brauner
2026-02-23 13:20 ` [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks Christian Brauner
2026-02-24 21:22   ` Jann Horn
2026-02-25 15:32     ` Christian Brauner
2026-02-23 13:20 ` [PATCH RFC v3 2/2] selftests/pidfd: add inode ownership and permission tests Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox