* [RFC PATCH 0/3] seccomp: SECCOMP_IOCTL_NOTIF_PIN_ARGS for race-free unotify
@ 2026-05-04 1:12 Cong Wang
From: Cong Wang @ 2026-05-04 1:12 UTC (permalink / raw)
To: Kees Cook, linux-kernel
Cc: Andy Lutomirski, Will Drewry, Christian Brauner, Cong Wang
From: Cong Wang <cwang@multikernel.io>
This RFC introduces SECCOMP_IOCTL_NOTIF_PIN_ARGS, a new ioctl on the
seccomp user-notification listener that lets an unprivileged supervisor
atomically snapshot pointer-arg payloads from a trapped task and bind
those snapshots to the task's resumed syscall body. It closes the
documented TOCTOU race that today makes content-aware policy on
SECCOMP_USER_NOTIF_FLAG_CONTINUE unsafe for unprivileged supervisors.
Posting as RFC because the UAPI shape, the consumption-hook placement,
and the v1 vs v2 cut are all design choices that benefit from review
before a non-RFC submission.
## Motivation
seccomp_unotify(2) lets a supervisor inspect a trapped task's syscall
arguments and either deny, allow, or CONTINUE the syscall. CONTINUE
re-runs the syscall body in the trapped task, which re-fetches every
pointer argument from user memory. A sibling thread or CLONE_VM peer
in the trapped task's address space can mutate that memory between
the supervisor's process_vm_readv() and the kernel's re-read, turning
any policy that examined the argument into a check on already-stale
bytes. The seccomp_unotify(2) man page documents this race explicitly.
There is no race-free workaround for unprivileged supervisors today:
- ptrace and /proc/pid/mem are not available to them;
- process_vm_readv() into a userspace buffer doesn't help, because
the kernel re-reads user memory on CONTINUE regardless;
- SECCOMP_IOCTL_NOTIF_ADDFD only solves the fd-substitution case,
not the content-of-pointer-arg case.
The result is that unprivileged seccomp supervisors -- which are the
target audience of seccomp_unotify(2) in the first place -- cannot
implement content-aware allow policies. They can only deny or
unconditionally allow. Anything that depends on the actual contents
of a path, sockaddr, argv, or write buffer is unsafe.
## Concrete user: Sandlock
Sandlock <https://github.com/multikernel/sandlock> is a process-based
unprivileged sandbox for AI agents, built on seccomp_unotify(2). At
sandbox setup the agent process installs a seccomp filter with
SECCOMP_FILTER_FLAG_NEW_LISTENER and hands the listener fd to a
Sandlock supervisor; the supervisor then drives each filtered
syscall via SECCOMP_IOCTL_NOTIF_RECV and replies with either an
injected errno or SECCOMP_USER_NOTIF_FLAG_CONTINUE. This is what
lets Sandlock confine an AI-agent process (coding agents, tool-using
agents) to the filesystem and network surface its operator
authorized, without root, a container runtime, or virtualization.
AI-agent workloads routinely fork helpers, exec compilers and
language runtimes, and follow user- or model-controlled paths, so
every content-aware policy decision the supervisor makes depends on
the contents of syscall pointer arguments -- "allow open(path) iff
path is below /allowed/", "allow bind(addr) iff addr->sun_path is in
the permitlist", "allow execve(filename, argv) iff argv[0] is one of
N approved binaries". Each of these is implemented as a CONTINUE
reply, and CONTINUE re-fetches the pointer args from the trapped
task's user memory -- which is where the documented TOCTOU race
bites.
Today Sandlock has to either run privileged (defeating the
unprivileged-sandbox model) or accept the TOCTOU race (defeating the
guarantee against an adversarial agent that spawns a sibling thread
to mutate the buffer between the supervisor's inspection and the
kernel's re-read on CONTINUE). PIN_ARGS gives it the missing
primitive without either compromise.
This is a general primitive, not a Sandlock-specific feature, but
having a concrete user with a real and growing threat model (LLM
agents executing untrusted or partially-trusted code in-process)
should help anchor the design discussion.
## Approach
PIN_ARGS is a per-notification ioctl. The supervisor describes which
register slots to snapshot and what shape each one is. The kernel
walks the trapped task's mm under the existing remote-mm primitives
(access_remote_vm, copy_remote_vm_str), copies the bytes into kernel-
owned buffers, and stamps the snapshot onto the trapped task. On
SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the kernel's syscall fetch
points consume from the kernel buffer instead of re-reading user
memory.
Three v1 shapes:
- SECCOMP_PIN_FIXED (sockaddr, single-buffer read/write)
- SECCOMP_PIN_CSTRING (paths)
- SECCOMP_PIN_CSTRING_ARRAY (argv, envp)
Each per-arg copy is bounded by max_bytes; cumulative bytes per
request are capped at a hardcoded 1 MiB. Allocations use
GFP_KERNEL_ACCOUNT so the trapped task's memcg pays the cost.
The pin is one-shot: cleared on the trapped task's next return-to-
userspace via task_work, with fallback paths for task exit, listener
release, and explicit discard (CONTINUE without CONTINUE_PINNED).
The syscall number is captured at pin time and verified at
consumption, so a signal-handler-issued syscall during -ERESTART*
resolution will not consume the pin.
Pin orchestration uses a three-phase lock dance: validate the notif
and snapshot register args under filter->notify_lock, walk the
trapped task's mm without locks, then re-validate and attach the
snapshot. The walker uses primitives the kernel already uses for
arg fetch (copy_remote_vm_str, getname_kernel, copy_string_kernel,
iov_iter_kvec), so consumption sites are minimally invasive.
## Why copy and not page-pinning
Page-level FOLL_PIN doesn't solve content TOCTOU: the trapped task
(or its CLONE_VM peer) is the owner of the mm and can write through
the same mapping. There is no kernel primitive for "freeze the
contents of these user pages." Copying at decision time is the only
way to guarantee the bytes the supervisor inspected equal the bytes
the kernel acts on.
The kernel already does this copy in syscall bodies today --
getname(), copy_strings(), move_addr_to_kernel(), copy_from_iter()
for ITER_UBUF. PIN_ARGS shifts when that copy happens (at supervisor
decision time) and re-points the syscall fetch points at the
snapshot. Net new copies per syscall: zero.
## Why unprivileged
PIN_ARGS is gated by listener-fd possession, which is itself a
capability scoped by file-descriptor ownership and SCM_RIGHTS
passing. The supervisor already has equivalent remote-mm read access
via process_vm_readv() (subject to the same ptrace_may_access
checks). NO_NEW_PRIVS, required for unprivileged seccomp filter
installation, blocks the obvious execve escalation. The DoS surface
is bounded by the 1 MiB per-request cap, the one-shot lifetime, and
at-most-one-pin-per-trapped-task, with memcg accounting on top.
Requiring CAP_SYS_PTRACE would render PIN_ARGS useless for its only
real audience; privileged supervisors already have ptrace and
/proc/pid/mem.
## What's NOT covered in v1
- Vector I/O (readv/writev) -- needs per-iovec pin descriptors;
intentionally deferred to v2.
- Nested-pointer payloads (sendmsg msghdr, futex_waitv) -- same.
- Per-iter consumption hooks beyond getname_flags,
move_addr_to_kernel, copy_strings, and import_ubuf. Other syscall
fetch sites that re-read user memory still race; v1 covers the
four most common cases (path, sockaddr, argv/envp, single-buffer
read/write) which together cover the bulk of practical
unprivileged-sandbox policies.
## Patches
[PATCH 1/3] seccomp: kernel implementation
(UAPI, walker, orchestrator, four consumption hooks,
one-shot lifecycle)
[PATCH 2/3] selftests/seccomp: end-to-end coverage
(10 cases across all three shapes + lifecycle)
[PATCH 3/3] Documentation: seccomp_filter.rst
("Pinned arguments" section)
## Testing
The selftest binary covers all three v1 shapes against real syscalls
(bind, openat, execve, write), plus negative paths (CONTINUE without
PINNED, double pin, mismatched flags) and the lifecycle (post-
syscall clear, SIGKILL teardown). All ten cases pass on x86_64.
Cong Wang (3):
seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU
race
selftests/seccomp: add seccomp_pin_args end-to-end coverage
Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_PIN_ARGS
.../userspace-api/seccomp_filter.rst | 76 ++
MAINTAINERS | 2 +
fs/exec.c | 63 ++
fs/namei.c | 19 +
fs/read_write.c | 8 +-
include/linux/mm.h | 2 +-
include/linux/seccomp.h | 35 +
include/linux/seccomp_types.h | 33 +
include/uapi/linux/seccomp.h | 73 ++
kernel/Makefile | 1 +
kernel/exit.c | 1 +
kernel/fork.c | 5 +
kernel/seccomp.c | 189 +++-
kernel/seccomp_pin.c | 453 +++++++++
kernel/seccomp_pin.h | 109 +++
lib/iov_iter.c | 22 +
mm/memory.c | 4 +-
mm/nommu.c | 4 +-
net/socket.c | 16 +
tools/testing/selftests/seccomp/.gitignore | 1 +
tools/testing/selftests/seccomp/Makefile | 2 +-
.../selftests/seccomp/seccomp_pin_args.c | 857 ++++++++++++++++++
22 files changed, 1961 insertions(+), 14 deletions(-)
create mode 100644 kernel/seccomp_pin.c
create mode 100644 kernel/seccomp_pin.h
create mode 100644 tools/testing/selftests/seccomp/seccomp_pin_args.c
--
2.43.0
* [RFC PATCH 1/3] seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race
From: Cong Wang @ 2026-05-04 1:12 UTC (permalink / raw)
To: Kees Cook, linux-kernel
Cc: Andy Lutomirski, Will Drewry, Christian Brauner, Cong Wang
From: Cong Wang <cwang@multikernel.io>
seccomp_unotify(2) leaves a documented TOCTOU window for unprivileged
supervisors: a sibling thread or CLONE_VM peer can mutate pointer-arg
buffers between the supervisor's process_vm_readv() and the kernel's
re-read on SECCOMP_USER_NOTIF_FLAG_CONTINUE. ptrace() and
/proc/pid/mem are not available to unprivileged supervisors, so today
there is no race-free path for argument-content policy on CONTINUE.
This patch adds SECCOMP_IOCTL_NOTIF_PIN_ARGS, which atomically copies
designated pointer-arg payloads from the trapped task's address space
into kernel-owned buffers and binds those buffers to the task's next
syscall execution. On SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the
syscall-body fetch points consume from the kernel buffer instead of
re-reading user memory; mutations after PIN_ARGS returns have no
effect.
Three v1 shapes are supported: a fixed-size copy (sockaddr, single-
buffer write/read content), a NUL-bounded C string (paths), and a
NULL-terminated array of C strings (argv/envp). Each per-arg
descriptor caps copy size; cumulative bytes per request are capped
at a hardcoded 1 MiB. Pinned-buffer allocations are tagged
GFP_KERNEL_ACCOUNT so the trapped task's memcg pays the cost.
Pin orchestration uses a three-phase lock dance: validate the notif
and snapshot register args under the filter notify lock, walk the
trapped task's mm without locks, then re-validate and attach the
snapshot. The pin is one-shot: a task_work clears it on the next
return-to-userspace after the resumed syscall body completes, with
fallback paths for task exit, listener release, and explicit discard
(CONTINUE without CONTINUE_PINNED). The syscall number is captured at
pin time and verified at consumption so a signal-handler-issued
syscall during -ERESTART* resolution will not consume the pin.
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
MAINTAINERS | 2 +
fs/exec.c | 63 +++++
fs/namei.c | 19 ++
fs/read_write.c | 8 +-
include/linux/mm.h | 2 +-
include/linux/seccomp.h | 35 +++
include/linux/seccomp_types.h | 33 +++
include/uapi/linux/seccomp.h | 73 ++++++
kernel/Makefile | 1 +
kernel/exit.c | 1 +
kernel/fork.c | 5 +
kernel/seccomp.c | 189 +++++++++++++-
kernel/seccomp_pin.c | 453 ++++++++++++++++++++++++++++++++++
kernel/seccomp_pin.h | 109 ++++++++
lib/iov_iter.c | 22 ++
mm/memory.c | 4 +-
mm/nommu.c | 4 +-
net/socket.c | 16 ++
18 files changed, 1026 insertions(+), 13 deletions(-)
create mode 100644 kernel/seccomp_pin.c
create mode 100644 kernel/seccomp_pin.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 882214b0e7db..d7904e8989ca 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -24086,6 +24086,8 @@ F: Documentation/userspace-api/seccomp_filter.rst
F: include/linux/seccomp.h
F: include/uapi/linux/seccomp.h
F: kernel/seccomp.c
+F: kernel/seccomp_pin.c
+F: kernel/seccomp_pin.h
F: tools/testing/selftests/kselftest_harness.h
F: tools/testing/selftests/kselftest_harness/
F: tools/testing/selftests/seccomp/*
diff --git a/fs/exec.c b/fs/exec.c
index ba12b4c466f6..99d4a3daaeeb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -38,6 +38,7 @@
#include <linux/sched/signal.h>
#include <linux/sched/numa_balancing.h>
#include <linux/sched/task.h>
+#include <linux/seccomp.h>
#include <linux/pagemap.h>
#include <linux/perf_event.h>
#include <linux/highmem.h>
@@ -445,6 +446,63 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
* processes's memory to the new process's stack. The call to get_user_pages()
* ensures the destination page is created and not swapped out.
*/
+/*
+ * If a seccomp PIN_ARGS snapshot covers this argv/envp pointer table,
+ * push each pinned string onto the bprm stack directly via
+ * copy_string_kernel(), bypassing the per-string strnlen_user() and
+ * copy_from_user() that would otherwise re-read mutated user memory.
+ *
+ * Returns 0 on success, a negative errno on failure, or +1 if no pin
+ * applied and the caller should run the normal user-memory walk.
+ */
+static int copy_strings_from_pin(struct user_arg_ptr argv,
+ struct linux_binprm *bprm)
+{
+ const struct seccomp_pinned_arg *pin;
+ const u32 *header;
+ const char *strings;
+ u32 count, i;
+ u64 user_argv;
+
+#ifdef CONFIG_COMPAT
+ user_argv = (u64)(uintptr_t)(argv.is_compat ?
+ (const void __user *)argv.ptr.compat :
+ (const void __user *)argv.ptr.native);
+#else
+ user_argv = (u64)(uintptr_t)argv.ptr.native;
+#endif
+ if (!user_argv)
+ return 1;
+
+ pin = seccomp_pin_lookup_current(user_argv);
+ if (!pin || pin->kind != SECCOMP_PIN_CSTRING_ARRAY)
+ return 1;
+
+ header = pin->data;
+ count = header[0];
+ strings = (const char *)pin->data;
+
+ /*
+ * copy_strings() processes argv backwards (highest index first)
+ * because it grows the bprm stack downward. Match that ordering
+ * so the resulting stack layout is identical.
+ */
+ for (i = count; i-- > 0; ) {
+ u32 off = header[1 + i];
+ int ret;
+
+ if (off >= pin->size)
+ return -EINVAL;
+ ret = copy_string_kernel(strings + off, bprm);
+ if (ret < 0)
+ return ret;
+ if (fatal_signal_pending(current))
+ return -ERESTARTNOHAND;
+ cond_resched();
+ }
+ return 0;
+}
+
static int copy_strings(int argc, struct user_arg_ptr argv,
struct linux_binprm *bprm)
{
@@ -453,6 +511,11 @@ static int copy_strings(int argc, struct user_arg_ptr argv,
unsigned long kpos = 0;
int ret;
+ ret = copy_strings_from_pin(argv, bprm);
+ if (ret <= 0)
+ return ret;
+ /* No pin matched; continue with the normal user-memory walk. */
+
while (argc-- > 0) {
const char __user *str;
int len;
diff --git a/fs/namei.c b/fs/namei.c
index c7fac83c9a85..ee86f4c91cae 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -30,6 +30,7 @@
#include <linux/syscalls.h>
#include <linux/mount.h>
#include <linux/audit.h>
+#include <linux/seccomp.h>
#include <linux/capability.h>
#include <linux/file.h>
#include <linux/fcntl.h>
@@ -222,6 +223,24 @@ do_getname(const char __user *filename, int flags, bool incomplete)
struct filename *
getname_flags(const char __user *filename, int flags)
{
+ const struct seccomp_pinned_arg *pin;
+
+ /*
+ * If a seccomp supervisor pinned this path via PIN_ARGS and sent
+ * CONTINUE_PINNED, build the struct filename from the kernel-side
+ * snapshot instead of re-reading user memory. The pinned buffer
+ * is NUL-terminated by copy_remote_vm_str() in the walker, so
+ * getname_kernel() can consume it directly.
+ *
+ * The empty-path-with-LOOKUP_EMPTY policy is handled here because
+ * getname_kernel() does not reject empty strings.
+ */
+ pin = seccomp_pin_lookup_current((u64)(uintptr_t)filename);
+ if (pin && pin->kind == SECCOMP_PIN_CSTRING) {
+ if (pin->size <= 1 && !(flags & LOOKUP_EMPTY))
+ return ERR_PTR(-ENOENT);
+ return getname_kernel(pin->data);
+ }
return do_getname(filename, flags, false);
}
diff --git a/fs/read_write.c b/fs/read_write.c
index 50bff7edc91f..59877e8422a8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -488,7 +488,9 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = (ppos ? *ppos : 0);
- iov_iter_ubuf(&iter, ITER_DEST, buf, len);
+ ret = import_ubuf(ITER_DEST, buf, len, &iter);
+ if (unlikely(ret))
+ return ret;
ret = filp->f_op->read_iter(&kiocb, &iter);
BUG_ON(ret == -EIOCBQUEUED);
@@ -590,7 +592,9 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = (ppos ? *ppos : 0);
- iov_iter_ubuf(&iter, ITER_SOURCE, (void __user *)buf, len);
+ ret = import_ubuf(ITER_SOURCE, (void __user *)buf, len, &iter);
+ if (unlikely(ret))
+ return ret;
ret = filp->f_op->write_iter(&kiocb, &iter);
BUG_ON(ret == -EIOCBQUEUED);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index af23453e9dbd..b0116e8ed407 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3187,7 +3187,7 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
-#ifdef CONFIG_BPF_SYSCALL
+#if defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_SECCOMP_FILTER)
extern int copy_remote_vm_str(struct task_struct *tsk, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
#endif
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9b959972bf4a..fcc369d3dfca 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -75,6 +75,35 @@ static inline int seccomp_mode(struct seccomp *s)
#ifdef CONFIG_SECCOMP_FILTER
extern void seccomp_filter_release(struct task_struct *tsk);
extern void get_seccomp_filter(struct task_struct *tsk);
+extern void seccomp_clear_pinned_args(struct task_struct *tsk);
+
+/**
+ * seccomp_pin_lookup_current - find a live PIN_ARGS snapshot for current().
+ * @user_addr: the userspace address the syscall body is about to read.
+ *
+ * Called from syscall fetch points (getname_flags, copy_strings,
+ * move_addr_to_kernel, import_ubuf). Returns a pinned-arg entry whose
+ * @data / @size the caller may consume in place of re-reading user
+ * memory, or NULL if there is no live snapshot, the current syscall
+ * does not match the one captured at pin time, or no entry matches
+ * @user_addr.
+ *
+ * Safe to call lockless: current owns its seccomp.pinned_args field
+ * once the PIN_ARGS orchestrator has installed it via WRITE_ONCE.
+ */
+const struct seccomp_pinned_arg *seccomp_pin_lookup_current(u64 user_addr);
+
+/**
+ * seccomp_pin_kvec_for - return a stable kvec for the given pin entry.
+ * @pin: a pin returned by seccomp_pin_lookup_current(); must belong
+ * to the current task.
+ *
+ * The returned pointer references kvec storage that outlives the pin
+ * (freed at syscall exit), suitable for iov_iter_kvec() callers whose
+ * iov_iter consumes after the wrapping function returns.
+ */
+struct kvec;
+const struct kvec *seccomp_pin_kvec_for(const struct seccomp_pinned_arg *pin);
#else /* CONFIG_SECCOMP_FILTER */
static inline void seccomp_filter_release(struct task_struct *tsk)
{
@@ -84,6 +113,12 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
{
return;
}
+static inline void seccomp_clear_pinned_args(struct task_struct *tsk) { }
+static inline const struct seccomp_pinned_arg *
+seccomp_pin_lookup_current(u64 user_addr) { return NULL; }
+struct kvec;
+static inline const struct kvec *
+seccomp_pin_kvec_for(const struct seccomp_pinned_arg *pin) { return NULL; }
#endif /* CONFIG_SECCOMP_FILTER */
#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE)
diff --git a/include/linux/seccomp_types.h b/include/linux/seccomp_types.h
index cf0a0355024f..bd3fe17e659a 100644
--- a/include/linux/seccomp_types.h
+++ b/include/linux/seccomp_types.h
@@ -7,6 +7,34 @@
#ifdef CONFIG_SECCOMP
struct seccomp_filter;
+struct seccomp_pinned_args;
+
+#define SECCOMP_PIN_MAX_ARGS 6
+
+/**
+ * struct seccomp_pinned_arg - one kernel-owned snapshot of a user-pointer arg.
+ * @user_addr: the original userspace address (key for lookup at consumption).
+ * @size: bytes actually populated in @data.
+ * @arg_idx: syscall register slot 0..5.
+ * @kind: one of SECCOMP_PIN_*.
+ * @data: kvmalloc'd buffer holding the snapshotted bytes.
+ *
+ * Consumption sites (getname_flags, copy_strings, move_addr_to_kernel,
+ * import_ubuf) inspect @data and @size after a successful
+ * seccomp_pin_lookup_current(). For sites that need a stable kvec
+ * pointer outliving the call (import_ubuf -> vfs_write iter),
+ * seccomp_pin_kvec_for() returns a kvec stored alongside the pin
+ * with matching lifetime.
+ */
+struct seccomp_pinned_arg {
+ u64 user_addr;
+ u32 size;
+ u8 arg_idx;
+ u8 kind;
+ u16 _pad;
+ void *data;
+};
+
/**
* struct seccomp - the state of a seccomp'ed process
*
@@ -18,11 +46,16 @@ struct seccomp_filter;
*
* @filter must only be accessed from the context of current as there
* is no read locking.
+ * @pinned_args: NULL except during a PIN_ARGS window. Owned by the trapped
+ * task itself; populated by SECCOMP_IOCTL_NOTIF_PIN_ARGS, consumed
+ * on CONTINUE_PINNED, freed at syscall exit, listener release, or
+ * task exit. See kernel/seccomp_pin.c.
*/
struct seccomp {
int mode;
atomic_t filter_count;
struct seccomp_filter *filter;
+ struct seccomp_pinned_args *pinned_args;
};
#else
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index dbfc9b37fcae..51cf081cbc5a 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -154,4 +154,77 @@ struct seccomp_notif_addfd {
#define SECCOMP_IOCTL_NOTIF_SET_FLAGS SECCOMP_IOW(4, __u64)
+/*
+ * SECCOMP_IOCTL_NOTIF_PIN_ARGS — atomically snapshot the trapped child's
+ * pointer-arg payloads into kernel buffers, populate the supervisor's
+ * byte buffer, and bind the snapshot to the child for re-execution.
+ *
+ * On NOTIF_SEND with SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the kernel
+ * consumes from the pinned buffers instead of re-reading user memory,
+ * closing the documented TOCTOU race in seccomp_unotify(2).
+ */
+
+/* Shape of a pointer-arg to be pinned. */
+#define SECCOMP_PIN_FIXED 0 /* exactly max_bytes from user_addr */
+#define SECCOMP_PIN_CSTRING 1 /* walk to NUL, capped at max_bytes */
+#define SECCOMP_PIN_CSTRING_ARRAY 2 /* NULL-term array of CSTRINGs */
+#define SECCOMP_PIN_KIND_MAX 2
+
+/* New NOTIF_SEND response flag (paired with CONTINUE). */
+#define SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED (1UL << 1)
+
+/* Bits for seccomp_pin_arg.truncated. */
+#define SECCOMP_PIN_TRUNCATED_BYTES (1U << 0)
+#define SECCOMP_PIN_TRUNCATED_ENTRIES (1U << 1)
+
+/**
+ * struct seccomp_pin_arg - per-arg pin descriptor (in/out).
+ * @arg_idx: syscall register slot (0..5).
+ * @kind: one of SECCOMP_PIN_*.
+ * @max_bytes: hard cap on bytes copied for this arg; kernel may copy less.
+ * @max_entries: hard cap on pointer-table entries (CSTRING_ARRAY only).
+ * @actual_size: bytes the kernel actually populated for this arg (out).
+ * @actual_entries: entries actually walked (CSTRING_ARRAY only, out).
+ * @truncated: bitmask of SECCOMP_PIN_TRUNCATED_* (out).
+ * @user_addr: the userspace address the kernel snapshotted (out, echoed).
+ * @buf_offset: offset into the supervisor's buf where this arg's bytes
+ * begin (out).
+ */
+struct seccomp_pin_arg {
+ /* in */
+ __u8 arg_idx;
+ __u8 kind;
+ __u16 _reserved;
+ __u32 max_bytes;
+ __u32 max_entries;
+ __u32 _reserved2;
+ /* out */
+ __u32 actual_size;
+ __u32 actual_entries;
+ __u32 truncated;
+ __u32 _reserved3;
+ __u64 user_addr;
+ __u64 buf_offset;
+};
+
+/**
+ * struct seccomp_notif_pin_args - PIN_ARGS ioctl payload (in/out).
+ * @id: notification id from NOTIF_RECV.
+ * @nr_args: count of valid entries in @args (1..6).
+ * @buf_size: size in bytes of @buf.
+ * @buf: user pointer to the bulk byte buffer; the kernel writes
+ * copied bytes here, indexed by args[i].buf_offset.
+ * @args: per-arg descriptors; only args[0..nr_args-1] are read/written.
+ */
+struct seccomp_notif_pin_args {
+ __u64 id;
+ __u32 nr_args;
+ __u32 buf_size;
+ __u64 buf;
+ struct seccomp_pin_arg args[6];
+};
+
+#define SECCOMP_IOCTL_NOTIF_PIN_ARGS SECCOMP_IOWR(5, \
+ struct seccomp_notif_pin_args)
+
#endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..7fb35fa1b43a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_HARDLOCKUP_DETECTOR_BUDDY) += watchdog_buddy.o
obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_perf.o
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_pin.o
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
diff --git a/kernel/exit.c b/kernel/exit.c
index 25e9cb6de7e7..5d1c54000405 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -917,6 +917,7 @@ void __noreturn do_exit(long code)
exit_signals(tsk); /* sets PF_EXITING */
seccomp_filter_release(tsk);
+ seccomp_clear_pinned_args(tsk);
acct_update_integrals(tsk);
group_dead = atomic_dec_and_test(&tsk->signal->live);
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f3fdfdb14c7..a5b7dbf21932 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1763,6 +1763,11 @@ static void copy_seccomp(struct task_struct *p)
/* Ref-count the new filter user, and assign it. */
get_seccomp_filter(current);
p->seccomp = current->seccomp;
+ /*
+ * pinned_args is a per-trapped-task transient that belongs to the
+ * outstanding notification on the parent (if any). Don't inherit it.
+ */
+ p->seccomp.pinned_args = NULL;
/*
* Explicitly enable no_new_privs here in case it got set
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 066909393c38..66b7a8e4fcab 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -44,6 +44,8 @@
#include <linux/anon_inodes.h>
#include <linux/lockdep.h>
+#include "seccomp_pin.h"
+
/*
* When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the
* wrong direction flag in the ioctl number. This is the broken one,
@@ -97,6 +99,13 @@ struct seccomp_knotif {
/* outstanding addfd requests */
struct list_head addfd;
+
+ /*
+ * A SECCOMP_IOCTL_NOTIF_PIN_ARGS for this notification is mid-walk
+ * (i.e. inside Phase B's lockless mm scan). Concurrent PIN_ARGS
+ * ioctls for the same id bail with -EBUSY rather than racing.
+ */
+ bool pin_in_progress;
};
/**
@@ -1475,6 +1484,13 @@ static void seccomp_notify_detach(struct seccomp_filter *filter)
knotif->error = -ENOSYS;
knotif->val = 0;
+ /*
+ * Drop any PIN_ARGS snapshot held on the trapped task; the
+ * supervisor that owned this notif fd is gone, so the pin
+ * can never be consumed via CONTINUE_PINNED.
+ */
+ seccomp_clear_pinned_args(knotif->task);
+
/*
* We do not need to wake up any pending addfd messages, as
* the notifier will do that for us, as this just looks
@@ -1498,7 +1514,7 @@ static int seccomp_notify_release(struct inode *inode, struct file *file)
/* must be called with notif_lock held */
static inline struct seccomp_knotif *
-find_notification(struct seccomp_filter *filter, u64 id)
+seccomp_find_notification(struct seccomp_filter *filter, u64 id)
{
struct seccomp_knotif *cur;
@@ -1607,7 +1623,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
* sure it's still around.
*/
mutex_lock(&filter->notify_lock);
- knotif = find_notification(filter, unotif.id);
+ knotif = seccomp_find_notification(filter, unotif.id);
if (knotif) {
/* Reset the process to make sure it's not stuck */
if (should_sleep_killable(filter, knotif))
@@ -1632,18 +1648,27 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
if (copy_from_user(&resp, buf, sizeof(resp)))
return -EFAULT;
- if (resp.flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
+ if (resp.flags & ~(SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED))
return -EINVAL;
if ((resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
(resp.error || resp.val))
return -EINVAL;
+ /*
+ * CONTINUE_PINNED is only valid alongside CONTINUE; without a
+ * matching pin on the trapped task it is a no-op.
+ */
+ if ((resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED) &&
+ !(resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE))
+ return -EINVAL;
+
ret = mutex_lock_interruptible(&filter->notify_lock);
if (ret < 0)
return ret;
- knotif = find_notification(filter, resp.id);
+ knotif = seccomp_find_notification(filter, resp.id);
if (!knotif) {
ret = -ENOENT;
goto out;
@@ -1660,6 +1685,37 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
knotif->error = resp.error;
knotif->val = resp.val;
knotif->flags = resp.flags;
+
+ /*
+ * If CONTINUE_PINNED was set, arm the snapshot so that the
+ * syscall-body fetch points consume from kernel buffers instead of
+ * re-reading user memory. If CONTINUE was set without PINNED, the
+ * supervisor explicitly opted out of the snapshot and we discard
+ * it (re-read from user memory as today).
+ */
+ if (resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) {
+ struct seccomp_pinned_args *kpa =
+ READ_ONCE(knotif->task->seccomp.pinned_args);
+
+ if (kpa && (resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED)) {
+ WRITE_ONCE(kpa->live, true);
+ /*
+ * Schedule a one-shot clear that fires when the
+ * trapped task next returns to user mode (after the
+ * resumed syscall body completes). Failure here
+ * means the task is exiting; cleanup happens via
+ * seccomp_filter_release / do_exit instead.
+ */
+ seccomp_pin_queue_clear(knotif->task);
+ } else if (kpa) {
+ seccomp_clear_pinned_args(knotif->task);
+ }
+ } else if (resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED) {
+ /* Already rejected at the top of this function, but be defensive. */
+ ret = -EINVAL;
+ goto out;
+ }
+
if (filter->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
complete_on_current_cpu(&knotif->ready);
else
@@ -1683,7 +1739,7 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
if (ret < 0)
return ret;
- knotif = find_notification(filter, id);
+ knotif = seccomp_find_notification(filter, id);
if (knotif && knotif->state == SECCOMP_NOTIFY_SENT)
ret = 0;
else
@@ -1751,7 +1807,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
if (ret < 0)
goto out;
- knotif = find_notification(filter, addfd.id);
+ knotif = seccomp_find_notification(filter, addfd.id);
if (!knotif) {
ret = -ENOENT;
goto out_unlock;
@@ -1823,6 +1879,125 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
return ret;
}
+static long seccomp_notif_pin_args(struct seccomp_filter *filter,
+ struct seccomp_notif_pin_args __user *uargs)
+{
+ struct seccomp_notif_pin_args kargs;
+ struct seccomp_pinned_args *kpa = NULL;
+ struct seccomp_knotif *knotif;
+ struct task_struct *task = NULL;
+ void __user *user_buf;
+ u64 args[6];
+ int syscall_nr = 0;
+ int i;
+ long ret;
+
+ if (copy_from_user(&kargs, uargs, sizeof(kargs)))
+ return -EFAULT;
+ if (kargs.nr_args == 0 || kargs.nr_args > SECCOMP_PIN_MAX_ARGS)
+ return -EINVAL;
+ if (kargs.buf_size > SECCOMP_PIN_MAX_TOTAL_BYTES)
+ return -E2BIG;
+
+ /* Validate descriptor inputs before any allocation. */
+ for (i = 0; i < kargs.nr_args; i++) {
+ struct seccomp_pin_arg *d = &kargs.args[i];
+
+ if (d->arg_idx >= 6)
+ return -EINVAL;
+ if (d->kind > SECCOMP_PIN_KIND_MAX)
+ return -EINVAL;
+ if (d->max_bytes == 0)
+ return -EINVAL;
+ if (d->max_bytes > SECCOMP_PIN_MAX_TOTAL_BYTES)
+ return -E2BIG;
+ }
+
+	user_buf = u64_to_user_ptr(kargs.buf);
+ if (kargs.buf_size && !user_buf)
+ return -EINVAL;
+
+ /*
+ * Phase A: validate notif state, snapshot the args we need under
+ * the lock, take task ref, mark pin_in_progress so a concurrent
+ * PIN_ARGS for the same id bails with -EBUSY.
+ */
+ mutex_lock(&filter->notify_lock);
+ knotif = seccomp_find_notification(filter, kargs.id);
+ if (!knotif) {
+ ret = -ENOENT;
+ goto unlock_a;
+ }
+ if (knotif->state != SECCOMP_NOTIFY_SENT) {
+ ret = -EINPROGRESS;
+ goto unlock_a;
+ }
+ if (knotif->task->seccomp.pinned_args) {
+ ret = -EEXIST;
+ goto unlock_a;
+ }
+ if (knotif->pin_in_progress) {
+ ret = -EBUSY;
+ goto unlock_a;
+ }
+ knotif->pin_in_progress = true;
+ memcpy(args, knotif->data->args, sizeof(args));
+ syscall_nr = knotif->data->nr;
+ task = get_task_struct(knotif->task);
+ mutex_unlock(&filter->notify_lock);
+
+ /* Phase B: lockless mm walk + supervisor copy. */
+ ret = seccomp_pin_args_walk(task, &kargs, args, syscall_nr,
+ user_buf, kargs.buf_size, &kpa);
+ if (ret)
+ goto cleanup;
+
+ if (copy_to_user(uargs, &kargs, sizeof(kargs))) {
+ ret = -EFAULT;
+ goto cleanup;
+ }
+
+ /*
+ * Phase C: re-validate (the notif may have been replied to or the
+ * supervisor may have released the listener) and attach the
+ * snapshot.
+ */
+ mutex_lock(&filter->notify_lock);
+ knotif = seccomp_find_notification(filter, kargs.id);
+ if (!knotif || knotif->state != SECCOMP_NOTIFY_SENT) {
+ mutex_unlock(&filter->notify_lock);
+ ret = -ENOENT;
+ goto cleanup;
+ }
+ WRITE_ONCE(task->seccomp.pinned_args, kpa);
+ knotif->pin_in_progress = false;
+ kpa = NULL; /* ownership transferred to task */
+ mutex_unlock(&filter->notify_lock);
+ put_task_struct(task);
+ return 0;
+
+cleanup:
+ /*
+ * Best-effort: clear pin_in_progress so a subsequent PIN_ARGS can
+ * proceed. The notif may already be gone, in which case there is
+ * nothing to clear.
+ */
+ mutex_lock(&filter->notify_lock);
+ knotif = seccomp_find_notification(filter, kargs.id);
+ if (knotif)
+ knotif->pin_in_progress = false;
+ mutex_unlock(&filter->notify_lock);
+
+ seccomp_free_pinned_args(kpa);
+ if (task)
+ put_task_struct(task);
+ return ret;
+
+unlock_a:
+ mutex_unlock(&filter->notify_lock);
+ return ret;
+}
+
static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -1840,6 +2015,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
return seccomp_notify_id_valid(filter, buf);
case SECCOMP_IOCTL_NOTIF_SET_FLAGS:
return seccomp_notify_set_flags(filter, arg);
+ case SECCOMP_IOCTL_NOTIF_PIN_ARGS:
+ return seccomp_notif_pin_args(filter, buf);
}
/* Extensible Argument ioctls */
diff --git a/kernel/seccomp_pin.c b/kernel/seccomp_pin.c
new file mode 100644
index 000000000000..a206fde3d806
--- /dev/null
+++ b/kernel/seccomp_pin.c
@@ -0,0 +1,453 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Pin-args lifecycle and walker for SECCOMP_IOCTL_NOTIF_PIN_ARGS.
+ *
+ * The supervisor calls PIN_ARGS to atomically copy designated pointer-arg
+ * payloads of a trapped child into kernel-owned buffers, then sends
+ * NOTIF_SEND with CONTINUE | CONTINUE_PINNED. The kernel re-executes the
+ * syscall using the pinned bytes instead of re-reading user memory,
+ * closing the documented seccomp_unotify(2) TOCTOU race.
+ *
+ * The lock-and-validate dance lives in kernel/seccomp.c (where
+ * struct seccomp_knotif and filter->notify_lock are defined). This file
+ * owns the per-arg walker (Phase B) and the lifecycle primitives.
+ *
+ * Only SECCOMP_PIN_FIXED is implemented in v1's first cut; CSTRING and
+ * CSTRING_ARRAY arrive in subsequent patches.
+ */
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/task_stack.h>
+#include <linux/seccomp.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/task_work.h>
+#include <linux/uaccess.h>
+
+#include <asm/syscall.h>
+
+#include "seccomp_pin.h"
+
+struct seccomp_pinned_args *seccomp_alloc_pinned_args(u8 nr_args)
+{
+ struct seccomp_pinned_args *kpa;
+
+ if (nr_args == 0 || nr_args > SECCOMP_PIN_MAX_ARGS)
+ return ERR_PTR(-EINVAL);
+
+ kpa = kzalloc_obj(*kpa, GFP_KERNEL_ACCOUNT);
+ if (!kpa)
+ return ERR_PTR(-ENOMEM);
+ kpa->nr_args = nr_args;
+ return kpa;
+}
+
+void seccomp_free_pinned_args(struct seccomp_pinned_args *kpa)
+{
+ int i;
+
+ if (!kpa)
+ return;
+ for (i = 0; i < kpa->nr_args; i++)
+ kvfree(kpa->args[i].data);
+ kfree(kpa);
+}
+
+void seccomp_clear_pinned_args(struct task_struct *task)
+{
+ struct seccomp_pinned_args *kpa;
+
+ /*
+ * Atomically claim ownership of the kpa: this can be called
+ * concurrently from the task's own task_work callback (returning
+ * to userspace after a CONTINUE_PINNED'd syscall), from a
+ * listener-release path on the supervisor side, and from task
+ * exit. Only the xchg winner frees.
+ */
+ kpa = xchg(&task->seccomp.pinned_args, NULL);
+ if (!kpa)
+ return;
+ /*
+ * Cancel any queued post-syscall clear; its callback_head lives
+ * inside @kpa and would otherwise dangle. If task_work_cancel
+ * returns false the callback has already started running on @task,
+ * but it does its work via current->seccomp.pinned_args (already
+ * NULL) so the in-flight callback observes nothing-to-do.
+ */
+ if (kpa->clear_queued)
+ task_work_cancel(task, &kpa->clear_work);
+ seccomp_free_pinned_args(kpa);
+}
+
+/*
+ * task_work callback: runs on the trapped task when it returns to user
+ * mode after the resumed syscall body has completed. The pin is single-
+ * shot; subsequent traps must call PIN_ARGS again.
+ */
+static void seccomp_pin_clear_cb(struct callback_head *cb)
+{
+ seccomp_clear_pinned_args(current);
+}
+
+int seccomp_pin_queue_clear(struct task_struct *task)
+{
+ struct seccomp_pinned_args *kpa = task->seccomp.pinned_args;
+ int ret;
+
+ if (!kpa || kpa->clear_queued)
+ return 0;
+ init_task_work(&kpa->clear_work, seccomp_pin_clear_cb);
+ ret = task_work_add(task, &kpa->clear_work, TWA_RESUME);
+ if (ret == 0)
+ kpa->clear_queued = true;
+ return ret;
+}
+
+/*
+ * Snapshot SECCOMP_PIN_FIXED: copy up to @desc->max_bytes from @user_addr
+ * in the trapped child's mm into a freshly allocated kernel buffer; a
+ * short read is reported via @desc->truncated rather than failing.
+ *
+ * On success, @out is populated and @desc->actual_size / .truncated are
+ * filled. The caller is responsible for chaining the bytes into the
+ * supervisor's bulk buffer.
+ */
+static long pin_one_fixed(struct task_struct *task, u64 user_addr,
+ struct seccomp_pin_arg *desc,
+ struct seccomp_pinned_arg *out)
+{
+ struct mm_struct *mm;
+ void *kbuf;
+ int read;
+
+ kbuf = kvmalloc(desc->max_bytes, GFP_KERNEL_ACCOUNT);
+ if (!kbuf)
+ return -ENOMEM;
+
+ mm = get_task_mm(task);
+ if (!mm) {
+ kvfree(kbuf);
+ return -ESRCH;
+ }
+
+ read = access_remote_vm(mm, user_addr, kbuf, desc->max_bytes, 0);
+ mmput(mm);
+
+ if (read <= 0) {
+ kvfree(kbuf);
+ return read ? read : -EFAULT;
+ }
+
+ out->user_addr = user_addr;
+ out->size = read;
+ out->arg_idx = desc->arg_idx;
+ out->kind = SECCOMP_PIN_FIXED;
+ out->data = kbuf;
+
+ desc->actual_size = read;
+ desc->truncated = (read < desc->max_bytes) ?
+ SECCOMP_PIN_TRUNCATED_BYTES : 0;
+ return 0;
+}
+
+/* Default entry ceiling; mirrors MAX_ARG_STRINGS from <linux/binfmts.h>. */
+#define SECCOMP_PIN_DEFAULT_MAX_ENTRIES 0x7FFFFFFF
+
+/* Snapshot SECCOMP_PIN_CSTRING: NUL-bounded copy from the trapped child's
+ * mm via the existing copy_remote_vm_str() primitive. The result is
+ * always NUL-terminated; truncation is reported when the byte cap was
+ * hit before the source NUL.
+ */
+static long pin_one_cstring(struct task_struct *task, u64 user_addr,
+ struct seccomp_pin_arg *desc,
+ struct seccomp_pinned_arg *out)
+{
+ void *kbuf;
+ int copied;
+
+ kbuf = kvmalloc(desc->max_bytes, GFP_KERNEL_ACCOUNT);
+ if (!kbuf)
+ return -ENOMEM;
+
+ copied = copy_remote_vm_str(task, user_addr, kbuf, desc->max_bytes, 0);
+ if (copied < 0) {
+ kvfree(kbuf);
+ return copied;
+ }
+
+ /*
+ * copy_remote_vm_str() returns bytes not including the trailing NUL,
+ * which it always writes on success. If we filled the buffer all the
+ * way (copied == max_bytes - 1) the source NUL may not have been
+ * reached; flag that as truncation.
+ */
+ out->user_addr = user_addr;
+ out->size = copied + 1; /* include the trailing NUL */
+ out->arg_idx = desc->arg_idx;
+ out->kind = SECCOMP_PIN_CSTRING;
+ out->data = kbuf;
+
+ desc->actual_size = copied + 1;
+ desc->truncated = (copied == desc->max_bytes - 1) ?
+ SECCOMP_PIN_TRUNCATED_BYTES : 0;
+ return 0;
+}
+
+/*
+ * Snapshot SECCOMP_PIN_CSTRING_ARRAY: walk the NULL-terminated pointer
+ * table at @user_addr in the trapped child's mm; for each non-NULL ptr,
+ * copy its NUL-bounded string into a packed kernel buffer. Format:
+ *
+ * [u32 count][u32 offsets[count]][u8 strings[]]
+ *
+ * Each offset is relative to the start of the buffer; the string at
+ * data + offsets[i] is NUL-terminated.
+ *
+ * Caps apply to both the byte total (@desc->max_bytes) and the entry
+ * count (@desc->max_entries; 0 means the default cap). The pointer
+ * table is walked first, before any string copy, and that walk is
+ * bounded by the entry cap, so a hostile child cannot tie up the
+ * kernel with an unterminated table.
+ *
+ * v1: native pointer width only. Compat (32-bit pointer table read by
+ * a native supervisor) is a TODO.
+ */
+static long pin_one_cstring_array(struct task_struct *task, u64 user_addr,
+ struct seccomp_pin_arg *desc,
+ struct seccomp_pinned_arg *out)
+{
+ struct mm_struct *mm;
+ void *kbuf = NULL;
+ u32 max_entries;
+ u32 *header;
+ u32 count = 0;
+ u32 byte_off;
+ u32 truncated = 0;
+ u32 i;
+ long ret;
+
+ max_entries = desc->max_entries ?: SECCOMP_PIN_DEFAULT_MAX_ENTRIES;
+	/*
+	 * Cap entries by what could possibly fit in @desc->max_bytes:
+	 * each entry costs at least 5 bytes, a u32 offset slot plus the
+	 * 1-byte NUL of an empty string (the u32 count header is
+	 * bounds-checked separately below).
+	 */
+	if (max_entries > (desc->max_bytes / 5))
+		max_entries = desc->max_bytes / 5;
+
+ if (desc->max_bytes < sizeof(u32))
+ return -EINVAL;
+
+	/*
+	 * kvzalloc, not kvmalloc: on a truncated walk the header's unused
+	 * offset slots are still copied out to the supervisor, so they
+	 * must not carry stale kernel memory.
+	 */
+	kbuf = kvzalloc(desc->max_bytes, GFP_KERNEL_ACCOUNT);
+	if (!kbuf)
+		return -ENOMEM;
+
+ mm = get_task_mm(task);
+ if (!mm) {
+ ret = -ESRCH;
+ goto err_free;
+ }
+
+ /* Phase 1: count entries by walking the pointer table. */
+ for (i = 0; i < max_entries; i++) {
+ unsigned long ptr;
+ int got;
+
+ got = access_remote_vm(mm, user_addr + i * sizeof(ptr),
+ &ptr, sizeof(ptr), 0);
+ if (got != sizeof(ptr)) {
+ mmput(mm);
+ ret = -EFAULT;
+ goto err_free;
+ }
+ if (ptr == 0)
+ break;
+ count++;
+ }
+ if (i == max_entries) {
+ /* Hit the entry cap before the NULL terminator: still report
+ * what we have, flag truncation.
+ */
+ truncated |= SECCOMP_PIN_TRUNCATED_ENTRIES;
+ }
+
+ /* Header layout fits in max_bytes? */
+ if ((u64)sizeof(u32) + (u64)count * sizeof(u32) > desc->max_bytes) {
+ mmput(mm);
+ ret = -EINVAL;
+ goto err_free;
+ }
+
+ header = kbuf;
+ header[0] = count;
+ byte_off = sizeof(u32) + count * sizeof(u32);
+
+ /* Phase 2: copy each string into the packed area. */
+ for (i = 0; i < count; i++) {
+ unsigned long ptr;
+ u32 remaining;
+ int got, copied;
+
+ if (access_remote_vm(mm, user_addr + i * sizeof(ptr),
+ &ptr, sizeof(ptr), 0) != sizeof(ptr)) {
+ mmput(mm);
+ ret = -EFAULT;
+ goto err_free;
+ }
+ if (byte_off >= desc->max_bytes) {
+ truncated |= SECCOMP_PIN_TRUNCATED_BYTES;
+ count = i;
+ header[0] = count;
+ break;
+ }
+ remaining = desc->max_bytes - byte_off;
+ copied = copy_remote_vm_str(task, ptr,
+ (char *)kbuf + byte_off,
+ remaining, 0);
+ if (copied < 0) {
+ mmput(mm);
+ ret = copied;
+ goto err_free;
+ }
+ header[1 + i] = byte_off;
+ got = copied + 1; /* include the NUL written by helper */
+ if (got >= remaining)
+ truncated |= SECCOMP_PIN_TRUNCATED_BYTES;
+ byte_off += got;
+ }
+ mmput(mm);
+
+ out->user_addr = user_addr;
+ out->size = byte_off;
+ out->arg_idx = desc->arg_idx;
+ out->kind = SECCOMP_PIN_CSTRING_ARRAY;
+ out->data = kbuf;
+
+ desc->actual_size = byte_off;
+ desc->actual_entries = count;
+ desc->truncated = truncated;
+ return 0;
+
+err_free:
+ kvfree(kbuf);
+ return ret;
+}
+
+const struct kvec *seccomp_pin_kvec_for(const struct seccomp_pinned_arg *pin)
+{
+ struct seccomp_pinned_args *kpa;
+ long idx;
+
+ kpa = READ_ONCE(current->seccomp.pinned_args);
+ if (!kpa)
+ return NULL;
+ idx = pin - kpa->args;
+ if (idx < 0 || idx >= kpa->nr_args)
+ return NULL;
+ return &kpa->arg_kvecs[idx];
+}
+
+const struct seccomp_pinned_arg *seccomp_pin_lookup_current(u64 user_addr)
+{
+ struct seccomp_pinned_args *kpa;
+ int i;
+
+ kpa = READ_ONCE(current->seccomp.pinned_args);
+ if (!kpa || !kpa->live)
+ return NULL;
+
+ /*
+ * If the current syscall doesn't match the one snapshotted at pin
+ * time, return NULL so the caller reads user memory. This guards
+ * against a signal handler issuing an unrelated syscall during
+ * -ERESTART* resolution — that syscall has its own user pointers
+ * and must not be served from the pin.
+ */
+ if (kpa->syscall_nr !=
+ syscall_get_nr(current, task_pt_regs(current)))
+ return NULL;
+
+ for (i = 0; i < kpa->nr_args; i++) {
+ if (kpa->args[i].user_addr == user_addr)
+ return &kpa->args[i];
+ }
+ return NULL;
+}
+
+long seccomp_pin_args_walk(struct task_struct *task,
+ struct seccomp_notif_pin_args *kargs,
+ const u64 *args, int syscall_nr,
+ void __user *user_buf, u32 user_buf_size,
+ struct seccomp_pinned_args **out)
+{
+ struct seccomp_pinned_args *kpa;
+ u32 buf_off = 0;
+ int i;
+ long ret;
+
+ kpa = seccomp_alloc_pinned_args(kargs->nr_args);
+ if (IS_ERR(kpa))
+ return PTR_ERR(kpa);
+ kpa->notif_id = kargs->id;
+ kpa->syscall_nr = syscall_nr;
+
+ for (i = 0; i < kargs->nr_args; i++) {
+ struct seccomp_pin_arg *d = &kargs->args[i];
+ u64 user_addr = args[d->arg_idx];
+
+ d->user_addr = user_addr;
+ d->actual_size = 0;
+ d->actual_entries = 0;
+ d->truncated = 0;
+ d->buf_offset = buf_off;
+
+ /* NULL pointers (e.g. execveat with AT_EMPTY_PATH): record
+ * a zero-size pin and move on without faulting.
+ */
+ if (user_addr == 0)
+ continue;
+
+ switch (d->kind) {
+ case SECCOMP_PIN_FIXED:
+ ret = pin_one_fixed(task, user_addr, d, &kpa->args[i]);
+ break;
+ case SECCOMP_PIN_CSTRING:
+ ret = pin_one_cstring(task, user_addr, d, &kpa->args[i]);
+ break;
+ case SECCOMP_PIN_CSTRING_ARRAY:
+ ret = pin_one_cstring_array(task, user_addr, d,
+ &kpa->args[i]);
+ break;
+ default:
+ ret = -EOPNOTSUPP;
+ break;
+ }
+ if (ret < 0)
+ goto err_free;
+
+ /* Stable kvec for iov_iter_kvec consumers (import_ubuf). */
+ kpa->arg_kvecs[i].iov_base = kpa->args[i].data;
+ kpa->arg_kvecs[i].iov_len = kpa->args[i].size;
+
+ if (kpa->args[i].size > user_buf_size - buf_off) {
+ ret = -ENOSPC;
+ goto err_free;
+ }
+ if (copy_to_user(user_buf + buf_off,
+ kpa->args[i].data, kpa->args[i].size)) {
+ ret = -EFAULT;
+ goto err_free;
+ }
+ d->buf_offset = buf_off;
+ buf_off += kpa->args[i].size;
+ }
+
+ *out = kpa;
+ return 0;
+
+err_free:
+ seccomp_free_pinned_args(kpa);
+ return ret;
+}
diff --git a/kernel/seccomp_pin.h b/kernel/seccomp_pin.h
new file mode 100644
index 000000000000..ea699bc09645
--- /dev/null
+++ b/kernel/seccomp_pin.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Internal interfaces for SECCOMP_IOCTL_NOTIF_PIN_ARGS.
+ *
+ * The pin lifecycle and walker live in kernel/seccomp_pin.c to keep
+ * kernel/seccomp.c focused on the existing notify machinery.
+ */
+#ifndef _KERNEL_SECCOMP_PIN_H
+#define _KERNEL_SECCOMP_PIN_H
+
+#include <linux/types.h>
+#include <uapi/linux/seccomp.h>
+
+#include <linux/seccomp_types.h> /* struct seccomp_pinned_arg, SECCOMP_PIN_MAX_ARGS */
+#include <linux/uio.h> /* struct kvec */
+#include <linux/task_work.h> /* struct callback_head */
+
+struct task_struct;
+struct seccomp_filter;
+struct seccomp_knotif;
+struct seccomp_notif_pin_args;
+
+/*
+ * Maximum cumulative bytes a single PIN_ARGS request may snapshot on
+ * behalf of one notification. Defensive bound only — typical pins are
+ * a few KiB (one PATH_MAX path; argv up to MAX_ARG_STRLEN). Hardcoded
+ * rather than a sysctl: there is no legitimate use case for runtime
+ * tuning. Smaller is always reachable via desc->max_bytes; larger
+ * indicates a policy bug.
+ */
+#define SECCOMP_PIN_MAX_TOTAL_BYTES (1UL << 20) /* 1 MiB */
+
+/**
+ * struct seccomp_pinned_args - the per-task pin record.
+ * @notif_id: id of the outstanding notification this pin belongs to.
+ * @syscall_nr: syscall number captured at pin time; consumption checks this
+ * against current to skip pinned data on a mismatched syscall
+ * (e.g. one issued from a signal handler during restart).
+ * @nr_args: number of populated entries in @args.
+ * @live: false during the pin-decision window, set to true on
+ * CONTINUE_PINNED so consumption hooks know to use the snapshot.
+ * @args: per-slot pinned data; only the first @nr_args entries are valid.
+ */
+struct seccomp_pinned_args {
+ u64 notif_id;
+ int syscall_nr;
+ u8 nr_args;
+ bool live;
+ bool clear_queued; /* clear_work has been task_work_add()'d */
+ struct callback_head clear_work;
+ struct seccomp_pinned_arg args[SECCOMP_PIN_MAX_ARGS];
+ /*
+ * Per-arg stable kvec storage. Populated by the walker for kinds
+ * whose consumption hooks build an iov_iter (currently FIXED ->
+ * import_ubuf). The kvec must outlive the iter; this struct lives
+ * until syscall exit, which is after the iter is fully consumed.
+ */
+ struct kvec arg_kvecs[SECCOMP_PIN_MAX_ARGS];
+};
+
+#ifdef CONFIG_SECCOMP_FILTER
+
+struct seccomp_pinned_args *seccomp_alloc_pinned_args(u8 nr_args);
+void seccomp_free_pinned_args(struct seccomp_pinned_args *kpa);
+void seccomp_clear_pinned_args(struct task_struct *task);
+
+/*
+ * Queue a one-shot task_work that will clear @task's pinned_args when
+ * @task next returns to userspace, i.e. after the trapped-and-resumed
+ * syscall body has completed. Called from NOTIF_SEND on CONTINUE_PINNED.
+ */
+int seccomp_pin_queue_clear(struct task_struct *task);
+
+/**
+ * seccomp_pin_args_walk - per-arg snapshot phase (no seccomp locks).
+ * @task: the trapped child whose mm we're reading; caller must hold a
+ * reference (via get_task_struct).
+ * @kargs: in/out ioctl payload; the walker reads .nr_args / .args[i] inputs
+ * and writes back .args[i] outputs (actual_size, truncated, etc.).
+ * @args: syscall register args (knotif->data->args).
+ * @syscall_nr: syscall number captured at notif time.
+ * @user_buf: the supervisor's bulk byte buffer (user pointer).
+ * @user_buf_size: capacity of @user_buf.
+ * @out: on success, *@out is a freshly-allocated kpa with the snapshot;
+ * caller takes ownership and must seccomp_free_pinned_args() if
+ * the attach step fails.
+ *
+ * Return: 0 on success, negative errno on failure.
+ *
+ * Phase B of PIN_ARGS: this runs without seccomp locks held. Phase A (notif
+ * validation) and Phase C (attach) live in kernel/seccomp.c.
+ */
+long seccomp_pin_args_walk(struct task_struct *task,
+ struct seccomp_notif_pin_args *kargs,
+ const u64 *args, int syscall_nr,
+ void __user *user_buf, u32 user_buf_size,
+ struct seccomp_pinned_args **out);
+
+/* seccomp_pin_lookup_current() lives in include/linux/seccomp.h; it is
+ * called from consumption sites outside kernel/seccomp/ (fs/, net/, lib/).
+ */
+
+#else
+
+static inline void seccomp_clear_pinned_args(struct task_struct *task) { }
+
+#endif /* CONFIG_SECCOMP_FILTER */
+
+#endif /* _KERNEL_SECCOMP_PIN_H */
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 243662af1af7..e0b038b54ce9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -9,6 +9,7 @@
#include <linux/vmalloc.h>
#include <linux/splice.h>
#include <linux/compat.h>
+#include <linux/seccomp.h>
#include <linux/scatterlist.h>
#include <linux/instrumented.h>
#include <linux/iov_iter.h>
@@ -1444,8 +1445,29 @@ EXPORT_SYMBOL(import_iovec);
int import_ubuf(int rw, void __user *buf, size_t len, struct iov_iter *i)
{
+ const struct seccomp_pinned_arg *pin;
+ const struct kvec *kvec;
+
if (len > MAX_RW_COUNT)
len = MAX_RW_COUNT;
+
+	/*
+	 * Pinned by a seccomp PIN_ARGS supervisor on this task? Build the
+	 * iov_iter over the kernel snapshot rather than re-reading user
+	 * memory. Only source iterators qualify: a destination iter would
+	 * write syscall output into the kernel copy, and the trapped task
+	 * would never see it. The kvec storage is owned by
+	 * current->seccomp.pinned_args and lives until syscall exit, so it
+	 * outlasts @i's consumption.
+	 */
+	pin = seccomp_pin_lookup_current((u64)(uintptr_t)buf);
+	if (pin && pin->kind == SECCOMP_PIN_FIXED && rw == ITER_SOURCE) {
+ kvec = seccomp_pin_kvec_for(pin);
+ if (kvec) {
+ size_t n = min_t(size_t, len, pin->size);
+
+ iov_iter_kvec(i, rw, kvec, 1, n);
+ return 0;
+ }
+ }
+
if (unlikely(!access_ok(buf, len)))
return -EFAULT;
diff --git a/mm/memory.c b/mm/memory.c
index ea6568571131..766ea403d983 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7168,7 +7168,7 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr,
}
EXPORT_SYMBOL_GPL(access_process_vm);
-#ifdef CONFIG_BPF_SYSCALL
+#if defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_SECCOMP_FILTER)
/*
* Copy a string from another process's address space as given in mm.
* If there is any error return -EFAULT.
@@ -7286,7 +7286,7 @@ int copy_remote_vm_str(struct task_struct *tsk, unsigned long addr,
return ret;
}
EXPORT_SYMBOL_GPL(copy_remote_vm_str);
-#endif /* CONFIG_BPF_SYSCALL */
+#endif /* CONFIG_BPF_SYSCALL || CONFIG_SECCOMP_FILTER */
/*
* Print the name of a VMA.
diff --git a/mm/nommu.c b/mm/nommu.c
index ed3934bc2de4..4c14ed97d661 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1711,7 +1711,7 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in
}
EXPORT_SYMBOL_GPL(access_process_vm);
-#ifdef CONFIG_BPF_SYSCALL
+#if defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_SECCOMP_FILTER)
/*
* Copy a string from another process's address space as given in mm.
* If there is any error return -EFAULT.
@@ -1788,7 +1788,7 @@ int copy_remote_vm_str(struct task_struct *tsk, unsigned long addr,
return ret;
}
EXPORT_SYMBOL_GPL(copy_remote_vm_str);
-#endif /* CONFIG_BPF_SYSCALL */
+#endif /* CONFIG_BPF_SYSCALL || CONFIG_SECCOMP_FILTER */
/**
* nommu_shrink_inode_mappings - Shrink the shared mappings on an inode
diff --git a/net/socket.c b/net/socket.c
index 22a412fdec07..6e3af6114a60 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -82,6 +82,7 @@
#include <linux/compat.h>
#include <linux/kmod.h>
#include <linux/audit.h>
+#include <linux/seccomp.h>
#include <linux/wireless.h>
#include <linux/nsproxy.h>
#include <linux/magic.h>
@@ -248,10 +249,25 @@ static const struct net_proto_family __rcu *net_families[NPROTO] __read_mostly;
int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr_storage *kaddr)
{
+ const struct seccomp_pinned_arg *pin;
+
if (ulen < 0 || ulen > sizeof(struct sockaddr_storage))
return -EINVAL;
if (ulen == 0)
return 0;
+
+ /* If a seccomp supervisor pinned this sockaddr via PIN_ARGS and
+ * sent CONTINUE_PINNED, consume from the kernel snapshot instead
+ * of re-reading user memory. Closes the unotify TOCTOU.
+ */
+	pin = seccomp_pin_lookup_current((u64)(uintptr_t)uaddr);
+	if (pin && pin->kind == SECCOMP_PIN_FIXED) {
+		size_t n = min_t(size_t, (size_t)ulen, pin->size);
+
+		memcpy(kaddr, pin->data, n);
+		/* Zero any tail a shorter pin did not cover so audit and
+		 * protocol code never see stale stack bytes.
+		 */
+		memset((char *)kaddr + n, 0, ulen - n);
+		return audit_sockaddr(ulen, kaddr);
+	}
+
if (copy_from_user(kaddr, uaddr, ulen))
return -EFAULT;
return audit_sockaddr(ulen, kaddr);
--
2.43.0
* [RFC PATCH 2/3] selftests/seccomp: add seccomp_pin_args end-to-end coverage
2026-05-04 1:12 [RFC PATCH 0/3] seccomp: SECCOMP_IOCTL_NOTIF_PIN_ARGS for race-free unotify Cong Wang
2026-05-04 1:12 ` [RFC PATCH 1/3] seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race Cong Wang
@ 2026-05-04 1:12 ` Cong Wang
2026-05-04 1:12 ` [RFC PATCH 3/3] Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_PIN_ARGS Cong Wang
2 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2026-05-04 1:12 UTC (permalink / raw)
To: Kees Cook, linux-kernel
Cc: Andy Lutomirski, Will Drewry, Christian Brauner, Cong Wang
From: Cong Wang <cwang@multikernel.io>
Add a standalone selftest binary for SECCOMP_IOCTL_NOTIF_PIN_ARGS
exercising all three v1 shapes (fixed/cstring/cstring-array) on
real syscalls (bind, openat, execve, write), plus negative paths
(CONTINUE without PINNED, double pin, mismatched flags) and the
single-shot lifecycle (post-syscall clear, SIGKILL teardown).
The tests use MAP_SHARED to mirror the documented CLONE_VM peer
attack: the supervisor pins the trapped child's pointer arg, the
parent mutates the underlying bytes, and the test verifies the
kernel acted on the pinned snapshot rather than the mutation.
The tests live in their own file rather than in seccomp_bpf.c because
the feature is orthogonal to the classic-BPF filter machinery that
file exercises.
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
tools/testing/selftests/seccomp/.gitignore | 1 +
tools/testing/selftests/seccomp/Makefile | 2 +-
.../selftests/seccomp/seccomp_pin_args.c | 857 ++++++++++++++++++
3 files changed, 859 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/seccomp/seccomp_pin_args.c
diff --git a/tools/testing/selftests/seccomp/.gitignore b/tools/testing/selftests/seccomp/.gitignore
index dec678577f9c..0e39a7297b0a 100644
--- a/tools/testing/selftests/seccomp/.gitignore
+++ b/tools/testing/selftests/seccomp/.gitignore
@@ -1,3 +1,4 @@
# SPDX-License-Identifier: GPL-2.0-only
seccomp_bpf
seccomp_benchmark
+seccomp_pin_args
diff --git a/tools/testing/selftests/seccomp/Makefile b/tools/testing/selftests/seccomp/Makefile
index 584fba487037..26abbb3126a5 100644
--- a/tools/testing/selftests/seccomp/Makefile
+++ b/tools/testing/selftests/seccomp/Makefile
@@ -3,5 +3,5 @@ CFLAGS += -Wl,-no-as-needed -Wall $(KHDR_INCLUDES)
LDFLAGS += -lpthread
LDLIBS += -lcap
-TEST_GEN_PROGS := seccomp_bpf seccomp_benchmark
+TEST_GEN_PROGS := seccomp_bpf seccomp_benchmark seccomp_pin_args
include ../lib.mk
diff --git a/tools/testing/selftests/seccomp/seccomp_pin_args.c b/tools/testing/selftests/seccomp/seccomp_pin_args.c
new file mode 100644
index 000000000000..df21bd0781d3
--- /dev/null
+++ b/tools/testing/selftests/seccomp/seccomp_pin_args.c
@@ -0,0 +1,857 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Selftests for SECCOMP_IOCTL_NOTIF_PIN_ARGS — atomic snapshot of
+ * pointer-arg payloads for seccomp_unotify(2) supervisors.
+ *
+ * The motivating attack (see Documentation/userspace-api/seccomp_filter.rst):
+ * an unprivileged supervisor inspects bytes that a sibling thread (or
+ * CLONE_VM peer) mutates between supervisor read and kernel re-read,
+ * defeating any decision the supervisor made on the bytes it saw.
+ *
+ * Each test sets up a USER_NOTIF filter, traps a syscall, calls
+ * PIN_ARGS to atomically copy designated pointer-arg payloads into
+ * kernel buffers, mutates the underlying user memory (simulating a
+ * racy peer), sends NOTIF_SEND with CONTINUE | CONTINUE_PINNED, and
+ * verifies the kernel used the snapshotted bytes rather than the
+ * mutated ones.
+ */
+#include <errno.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "../kselftest_harness.h"
+
+#ifndef SECCOMP_IOCTL_NOTIF_PIN_ARGS
+# error "kernel UAPI lacks SECCOMP_IOCTL_NOTIF_PIN_ARGS"
+#endif
+
+#ifndef ARRAY_SIZE
+# define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
+#endif
+
+/* Install a USER_NOTIF filter that traps the given syscall number and
+ * allows everything else; returns the listener fd.
+ */
+static int install_user_notif_filter(int nr)
+{
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+ offsetof(struct seccomp_data, nr)),
+ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+
+ return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
+}
+
+/*
+ * Helpers shared by the bind-on-shared-sockaddr tests below.
+ * MAP_SHARED gives parent and child the same physical bytes, mirroring
+ * the CLONE_VM peer in the documented attack scenario.
+ */
+struct bind_race {
+ int listener;
+ pid_t child;
+ struct sockaddr_un *shared; /* mmap'd MAP_SHARED, sockaddr_un */
+ char path_a[64]; /* original path (set before fork) */
+ char path_b[64]; /* path the parent mutates to before SEND */
+};
+
+/* Set up filter, mmap, fill path_a; fork the child to bind() against
+ * @shared. On return, the child is trapped in the seccomp wait and the
+ * supervisor (caller) is ready to NOTIF_RECV. Returns 0 on success or
+ * -1 on a setup failure (with errno preserved).
+ */
+static int bind_race_setup(struct bind_race *r)
+{
+ r->listener = -1;
+ r->child = -1;
+ r->shared = MAP_FAILED;
+
+ snprintf(r->path_a, sizeof(r->path_a),
+ "/tmp/seccomp-pin-%d-A", getpid());
+ snprintf(r->path_b, sizeof(r->path_b),
+ "/tmp/seccomp-pin-%d-B", getpid());
+ unlink(r->path_a);
+ unlink(r->path_b);
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
+ return -1;
+
+ r->shared = mmap(NULL, sizeof(*r->shared), PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (r->shared == MAP_FAILED)
+ return -1;
+ memset(r->shared, 0, sizeof(*r->shared));
+ r->shared->sun_family = AF_UNIX;
+ strcpy(r->shared->sun_path, r->path_a);
+
+ r->listener = install_user_notif_filter(__NR_bind);
+ if (r->listener < 0)
+ return -1;
+
+ r->child = fork();
+ if (r->child < 0)
+ return -1;
+ if (r->child == 0) {
+ int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
+
+ if (fd < 0)
+ _exit(10);
+ if (bind(fd, (struct sockaddr *)r->shared,
+ sizeof(*r->shared)) < 0)
+ _exit(11);
+ _exit(0);
+ }
+ return 0;
+}
+
+static void bind_race_teardown(struct bind_race *r)
+{
+	/*
+	 * Close the listener first: that unblocks a child still trapped
+	 * in the notification wait (its syscall fails with ENOSYS), so
+	 * the blocking waitpid() below cannot hang.
+	 */
+	if (r->listener >= 0)
+		close(r->listener);
+	if (r->child > 0)
+		waitpid(r->child, NULL, 0);
+	if (r->shared != MAP_FAILED)
+		munmap(r->shared, sizeof(*r->shared));
+	unlink(r->path_a);
+	unlink(r->path_b);
+}
+
+/* Pin arg 1 (the sockaddr*) of the outstanding bind() notif. On success,
+ * @readback (>= sizeof(sockaddr_un)) holds the snapshotted bytes.
+ */
+static int do_pin_sockaddr(int listener, __u64 id,
+ void *readback, size_t readback_size)
+{
+ struct seccomp_notif_pin_args pinreq;
+
+ memset(&pinreq, 0, sizeof(pinreq));
+ pinreq.id = id;
+ pinreq.nr_args = 1;
+ pinreq.buf_size = readback_size;
+ pinreq.buf = (uintptr_t)readback;
+ pinreq.args[0].arg_idx = 1;
+ pinreq.args[0].kind = SECCOMP_PIN_FIXED;
+ pinreq.args[0].max_bytes = sizeof(struct sockaddr_un);
+
+ return ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq);
+}
+
+/*
+ * Pin a sockaddr the trapped child is about to bind(), mutate the
+ * underlying shared memory, send CONTINUE | CONTINUE_PINNED, and verify
+ * that the kernel binds against the *pinned* path rather than the
+ * mutated one.
+ */
+TEST(pin_args_sockaddr_bind)
+{
+ struct bind_race r;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ char readback[sizeof(struct sockaddr_un)];
+ struct sockaddr_un *seen;
+ struct stat st;
+ int status;
+
+ ASSERT_EQ(0, bind_race_setup(&r));
+
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ EXPECT_EQ(req.data.nr, __NR_bind);
+
+ memset(readback, 0, sizeof(readback));
+ ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id,
+ readback, sizeof(readback)));
+
+ seen = (struct sockaddr_un *)readback;
+ EXPECT_EQ(seen->sun_family, (sa_family_t)AF_UNIX);
+ EXPECT_STREQ(seen->sun_path, r.path_a);
+
+ /* Race: mutate shared memory before SEND. */
+ strcpy(r.shared->sun_path, r.path_b);
+
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(r.child, &status, 0), r.child);
+ r.child = -1;
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ /* Pinned path won. */
+ EXPECT_EQ(stat(r.path_a, &st), 0);
+ EXPECT_EQ(stat(r.path_b, &st), -1);
+ EXPECT_EQ(errno, ENOENT);
+
+ bind_race_teardown(&r);
+}
+
+/*
+ * Negative pair of the above: pin then send CONTINUE *without* PINNED.
+ * The pin must be discarded and the kernel re-read user memory, so the
+ * bind should land at the mutated path (path_b) — the existing
+ * SECCOMP_USER_NOTIF_FLAG_CONTINUE behavior is preserved.
+ */
+TEST(pin_args_continue_without_pinned)
+{
+ struct bind_race r;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ char readback[sizeof(struct sockaddr_un)];
+ struct stat st;
+ int status;
+
+ ASSERT_EQ(0, bind_race_setup(&r));
+
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ EXPECT_EQ(req.data.nr, __NR_bind);
+
+ memset(readback, 0, sizeof(readback));
+ ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id,
+ readback, sizeof(readback)));
+
+ strcpy(r.shared->sun_path, r.path_b);
+
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE; /* no PINNED */
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(r.child, &status, 0), r.child);
+ r.child = -1;
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ /* Pin discarded; mutated path won. */
+ EXPECT_EQ(stat(r.path_a, &st), -1);
+ EXPECT_EQ(errno, ENOENT);
+ EXPECT_EQ(stat(r.path_b, &st), 0);
+
+ bind_race_teardown(&r);
+}
+
+/*
+ * CONTINUE_PINNED without CONTINUE must be rejected with -EINVAL by
+ * NOTIF_SEND (the flag is meaningless in isolation). After the rejection
+ * the supervisor can still send a normal CONTINUE to let the child run.
+ */
+TEST(pin_args_continue_pinned_alone)
+{
+ struct bind_race r;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ char readback[sizeof(struct sockaddr_un)];
+ int status;
+
+ ASSERT_EQ(0, bind_race_setup(&r));
+
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ memset(readback, 0, sizeof(readback));
+ ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id,
+ readback, sizeof(readback)));
+
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED; /* alone — invalid */
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), -1);
+ EXPECT_EQ(errno, EINVAL);
+
+ /* Recover by sending a regular CONTINUE so the child can finish. */
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(r.child, &status, 0), r.child);
+ r.child = -1;
+ EXPECT_EQ(true, WIFEXITED(status));
+
+ bind_race_teardown(&r);
+}
+
+/*
+ * Two PIN_ARGS calls for the same notif id: the second must be rejected
+ * with -EEXIST. The original snapshot stays in effect.
+ */
+TEST(pin_args_double_pin)
+{
+ struct bind_race r;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ char readback[sizeof(struct sockaddr_un)];
+ char readback2[sizeof(struct sockaddr_un)];
+ int status;
+
+ ASSERT_EQ(0, bind_race_setup(&r));
+
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ memset(readback, 0, sizeof(readback));
+ ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id,
+ readback, sizeof(readback)));
+
+ memset(readback2, 0, sizeof(readback2));
+ EXPECT_EQ(do_pin_sockaddr(r.listener, req.id,
+ readback2, sizeof(readback2)), -1);
+ EXPECT_EQ(errno, EEXIST);
+
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(r.child, &status, 0), r.child);
+ r.child = -1;
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ bind_race_teardown(&r);
+}
+
+/*
+ * SECCOMP_PIN_CSTRING: pin the path passed to openat(), mutate the
+ * shared user-memory copy of the path between PIN_ARGS and SEND, and
+ * verify that the kernel opens the *pinned* path rather than the
+ * mutated one.
+ *
+ * Matches the motivating attack against path-based filters: supervisor
+ * blesses /tmp/pin-A; sibling rewrites the path to /tmp/pin-B; the
+ * kernel must still open /tmp/pin-A.
+ */
+TEST(pin_args_openat_cstring)
+{
+ char *shared_path;
+ char path_a[64], path_b[64];
+ struct seccomp_notif_pin_args pinreq;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ char readback[PATH_MAX];
+ int listener, status, fd_a, fd_b;
+ pid_t pid;
+ long ret;
+
+ snprintf(path_a, sizeof(path_a), "/tmp/seccomp-pin-cstr-%d-A", getpid());
+ snprintf(path_b, sizeof(path_b), "/tmp/seccomp-pin-cstr-%d-B", getpid());
+
+ /* Pre-create both targets so openat() succeeds either way; we
+ * verify *which* file got opened, not whether open succeeded.
+ */
+	fd_a = open(path_a, O_CREAT | O_TRUNC | O_WRONLY, 0600);
+	ASSERT_GE(fd_a, 0);
+	ASSERT_EQ(1, write(fd_a, "A", 1));
+	close(fd_a);
+
+	fd_b = open(path_b, O_CREAT | O_TRUNC | O_WRONLY, 0600);
+	ASSERT_GE(fd_b, 0);
+	ASSERT_EQ(1, write(fd_b, "B", 1));
+	close(fd_b);
+
+ ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+ shared_path = mmap(NULL, PATH_MAX, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ ASSERT_NE(MAP_FAILED, shared_path);
+ memset(shared_path, 0, PATH_MAX);
+ strcpy(shared_path, path_a);
+
+ listener = install_user_notif_filter(__NR_openat);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ char buf[2] = {0};
+ int fd = openat(AT_FDCWD, shared_path, O_RDONLY);
+
+ if (fd < 0)
+ _exit(10);
+ if (read(fd, buf, 1) != 1)
+ _exit(11);
+ close(fd);
+ /* Encode which file we read in the exit code. */
+ _exit(buf[0] == 'A' ? 0 : (buf[0] == 'B' ? 1 : 12));
+ }
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ EXPECT_EQ(req.data.nr, __NR_openat);
+
+ memset(&pinreq, 0, sizeof(pinreq));
+ memset(readback, 0, sizeof(readback));
+ pinreq.id = req.id;
+ pinreq.nr_args = 1;
+ pinreq.buf_size = sizeof(readback);
+ pinreq.buf = (uintptr_t)readback;
+ pinreq.args[0].arg_idx = 1; /* openat: pathname */
+ pinreq.args[0].kind = SECCOMP_PIN_CSTRING;
+ pinreq.args[0].max_bytes = PATH_MAX;
+
+ ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq);
+ ASSERT_EQ(ret, 0) {
+ TH_LOG("PIN_ARGS failed: %s", strerror(errno));
+ }
+ EXPECT_STREQ(readback, path_a);
+ EXPECT_EQ(pinreq.args[0].truncated, 0);
+
+ /* Race: mutate the path before SEND. */
+ strcpy(shared_path, path_b);
+
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ /* Child read 'A' if pin won, 'B' if mutation won. */
+ EXPECT_EQ(0, WEXITSTATUS(status)) {
+ TH_LOG("opened %s instead of pinned %s",
+ WEXITSTATUS(status) == 1 ? "path_b" : "?", path_a);
+ }
+
+ unlink(path_a);
+ unlink(path_b);
+ munmap(shared_path, PATH_MAX);
+ close(listener);
+}
+
+/* CSTRING truncation: ask for fewer bytes than the actual path; verify
+ * the truncation flag is set and actual_size == max_bytes.
+ */
+TEST(pin_args_cstring_truncated)
+{
+ char *shared_path;
+ char path[128];
+ struct seccomp_notif_pin_args pinreq;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ char readback[16];
+ int listener, status;
+ pid_t pid;
+
+ snprintf(path, sizeof(path),
+ "/tmp/seccomp-pin-trunc-%d-LONG-PATH-NAME", getpid());
+
+ ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+ shared_path = mmap(NULL, PATH_MAX, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ ASSERT_NE(MAP_FAILED, shared_path);
+ memset(shared_path, 0, PATH_MAX);
+ strcpy(shared_path, path);
+
+ listener = install_user_notif_filter(__NR_openat);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ /* Will fail with ENOENT — we don't care, we just want to
+ * trigger the trap so the supervisor can run PIN_ARGS.
+ */
+ openat(AT_FDCWD, shared_path, O_RDONLY);
+ _exit(0);
+ }
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ memset(&pinreq, 0, sizeof(pinreq));
+ memset(readback, 0, sizeof(readback));
+ pinreq.id = req.id;
+ pinreq.nr_args = 1;
+ pinreq.buf_size = sizeof(readback);
+ pinreq.buf = (uintptr_t)readback;
+ pinreq.args[0].arg_idx = 1;
+ pinreq.args[0].kind = SECCOMP_PIN_CSTRING;
+ pinreq.args[0].max_bytes = sizeof(readback); /* deliberately small */
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq), 0);
+ EXPECT_EQ(pinreq.args[0].truncated, SECCOMP_PIN_TRUNCATED_BYTES);
+ EXPECT_EQ(pinreq.args[0].actual_size, sizeof(readback));
+ /* Buffer is NUL-terminated even when truncated. */
+ EXPECT_EQ(readback[sizeof(readback) - 1], '\0');
+
+ /* Just continue normally so the child completes. */
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+
+ munmap(shared_path, PATH_MAX);
+ close(listener);
+}
+
+/*
+ * SECCOMP_PIN_CSTRING_ARRAY: pin argv at execve(), mutate the argv
+ * pointer table (and the strings it points to) between PIN_ARGS and
+ * SEND, and verify the kernel execs against the *pinned* argv.
+ *
+ * Reproduces the §1 attack from the design doc: the supervisor sees
+ * a blessed argv, a shared peer rewrites argv between supervisor read
+ * and kernel re-read, and without PIN_ARGS the kernel would exec
+ * against the rewritten bytes.
+ */
+TEST(pin_args_execve_argv)
+{
+ char *shared;
+ char *strA, *strB;
+ char **argv_ptrs;
+ struct seccomp_notif_pin_args pinreq;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ char readback[1024];
+ static const char *const envp_a[] = {"CHK=A", NULL};
+ int listener, status;
+ pid_t pid;
+ long ret;
+
+ /*
+ * Set up the argv table and string storage in shared memory so
+ * the supervisor can mutate them between PIN_ARGS and SEND.
+ */
+ shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ ASSERT_NE(MAP_FAILED, shared);
+
+ /* Layout in the shared page:
+ * [argv_ptrs: char*[3]] [strA: 32 bytes] [strB: 32 bytes]
+ */
+ argv_ptrs = (char **)shared;
+ strA = shared + sizeof(char *) * 3;
+ strB = strA + 32;
+ strcpy(strA, "/bin/true");
+ strcpy(strB, "/bin/false");
+	argv_ptrs[0] = strA;	/* repointed at strB before SEND */
+	argv_ptrs[1] = NULL;
+	argv_ptrs[2] = NULL;
+
+ ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+ listener = install_user_notif_filter(__NR_execve);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/*
+		 * execve("/bin/true", {strA, NULL}, {"CHK=A", NULL}).
+		 * Before CONTINUE_PINNED the supervisor repoints the
+		 * argv table at strB and rewrites the strA bytes to
+		 * "/bin/false". With PIN_ARGS working, the kernel
+		 * consumes the pinned filename and argv, so the child
+		 * execs /bin/true and exits 0. Without it, the re-read
+		 * filename is /bin/false and the child exits 1.
+		 * execve() failure exits 99.
+		 */
+		execve(strA, argv_ptrs, (char *const *)envp_a);
+		_exit(99);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+	EXPECT_EQ(req.data.nr, __NR_execve);
+
+	/* Pin filename (CSTRING) and argv (CSTRING_ARRAY). */
+	memset(&pinreq, 0, sizeof(pinreq));
+	memset(readback, 0, sizeof(readback));
+	pinreq.id = req.id;
+	pinreq.nr_args = 2;
+	pinreq.buf_size = sizeof(readback);
+	pinreq.buf = (uintptr_t)readback;
+
+	pinreq.args[0].arg_idx = 0;
+	pinreq.args[0].kind = SECCOMP_PIN_CSTRING;
+	pinreq.args[0].max_bytes = 64;
+
+	pinreq.args[1].arg_idx = 1;
+	pinreq.args[1].kind = SECCOMP_PIN_CSTRING_ARRAY;
+	pinreq.args[1].max_bytes = 512;
+	pinreq.args[1].max_entries = 8;
+
+	ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq);
+	ASSERT_EQ(ret, 0) {
+		TH_LOG("PIN_ARGS failed: %s", strerror(errno));
+	}
+	EXPECT_EQ(pinreq.args[1].actual_entries, 1);
+
+	/* Mutate both the filename bytes and the argv pointer table. */
+	strcpy(strA, "/bin/false");
+	argv_ptrs[0] = strB;
+
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ /* /bin/true exits 0; /bin/false exits 1; execve failure exits 99. */
+ EXPECT_EQ(0, WEXITSTATUS(status)) {
+ TH_LOG("expected /bin/true (pinned) but got exit code %d",
+ WEXITSTATUS(status));
+ }
+
+ munmap(shared, 4096);
+ close(listener);
+}
+
+/*
+ * SECCOMP_PIN_FIXED applied to write(fd, buf, count): pin @buf via
+ * PIN_ARGS, mutate the underlying shared bytes between PIN_ARGS and
+ * SEND, and verify the bytes the kernel actually writes to disk are
+ * the *pinned* ones, not the mutated ones.
+ */
+TEST(pin_args_write_buf)
+{
+ char *shared_buf;
+ char file_path[64];
+ struct seccomp_notif_pin_args pinreq;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ const char *pinned_msg = "PINNED";
+ const char *mutated_msg = "MUTATED";
+ size_t msg_len = strlen(pinned_msg);
+ char readback[16];
+ char file_content[16];
+ int listener, status, file_fd;
+ pid_t pid;
+ long ret;
+
+ snprintf(file_path, sizeof(file_path),
+ "/tmp/seccomp-pin-write-%d", getpid());
+ unlink(file_path);
+
+ file_fd = open(file_path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
+ ASSERT_GE(file_fd, 0);
+
+ ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+ shared_buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ ASSERT_NE(MAP_FAILED, shared_buf);
+ memcpy(shared_buf, pinned_msg, msg_len);
+
+ listener = install_user_notif_filter(__NR_write);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ ssize_t n;
+
+ n = write(file_fd, shared_buf, msg_len);
+ _exit(n == (ssize_t)msg_len ? 0 : 10);
+ }
+
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ EXPECT_EQ(req.data.nr, __NR_write);
+
+ memset(&pinreq, 0, sizeof(pinreq));
+ memset(readback, 0, sizeof(readback));
+ pinreq.id = req.id;
+ pinreq.nr_args = 1;
+ pinreq.buf_size = sizeof(readback);
+ pinreq.buf = (uintptr_t)readback;
+ pinreq.args[0].arg_idx = 1; /* write: buf */
+ pinreq.args[0].kind = SECCOMP_PIN_FIXED;
+ pinreq.args[0].max_bytes = msg_len;
+
+ ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_ARGS, &pinreq);
+ ASSERT_EQ(ret, 0) {
+ TH_LOG("PIN_ARGS failed: %s", strerror(errno));
+ }
+ EXPECT_EQ(pinreq.args[0].actual_size, msg_len);
+ EXPECT_EQ(0, memcmp(readback, pinned_msg, msg_len));
+
+ /* Race: rewrite the buffer the child is about to write. */
+ memcpy(shared_buf, mutated_msg, msg_len);
+
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ close(file_fd);
+ file_fd = open(file_path, O_RDONLY);
+ ASSERT_GE(file_fd, 0);
+ memset(file_content, 0, sizeof(file_content));
+ EXPECT_EQ((ssize_t)msg_len,
+ read(file_fd, file_content, sizeof(file_content)));
+ close(file_fd);
+
+ /* The pinned bytes should be on disk. */
+ EXPECT_EQ(0, memcmp(file_content, pinned_msg, msg_len)) {
+ TH_LOG("file contained '%.*s'; expected '%s'",
+ (int)msg_len, file_content, pinned_msg);
+ }
+
+ unlink(file_path);
+ munmap(shared_buf, 4096);
+ close(listener);
+}
+
+/*
+ * The pin is single-shot: after CONTINUE_PINNED, the subsequent
+ * task_work-driven clear must run before the trapped task issues its
+ * *next* filtered syscall, so a second PIN_ARGS for the new notif id
+ * succeeds (no stale -EEXIST). Validates the post-syscall lifecycle.
+ */
+TEST(pin_args_one_shot)
+{
+ struct sockaddr_un *shared;
+ char path_a[64], path_b[64];
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ char readback[sizeof(struct sockaddr_un)];
+ int listener, status;
+ pid_t pid;
+
+ snprintf(path_a, sizeof(path_a), "/tmp/seccomp-pin-1shot-%d-A", getpid());
+ snprintf(path_b, sizeof(path_b), "/tmp/seccomp-pin-1shot-%d-B", getpid());
+ unlink(path_a);
+ unlink(path_b);
+
+ ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+ shared = mmap(NULL, sizeof(*shared), PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ ASSERT_NE(MAP_FAILED, shared);
+
+ listener = install_user_notif_filter(__NR_bind);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ int fd1, fd2;
+
+ memset(shared, 0, sizeof(*shared));
+ shared->sun_family = AF_UNIX;
+ strcpy(shared->sun_path, path_a);
+ fd1 = socket(AF_UNIX, SOCK_DGRAM, 0);
+ if (fd1 < 0)
+ _exit(10);
+ if (bind(fd1, (struct sockaddr *)shared,
+ sizeof(*shared)) < 0)
+ _exit(11);
+
+ strcpy(shared->sun_path, path_b);
+ fd2 = socket(AF_UNIX, SOCK_DGRAM, 0);
+ if (fd2 < 0)
+ _exit(12);
+ if (bind(fd2, (struct sockaddr *)shared,
+ sizeof(*shared)) < 0)
+ _exit(13);
+ _exit(0);
+ }
+
+ /* First trap: bind(path_a). Pin and CONTINUE_PINNED. */
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ memset(readback, 0, sizeof(readback));
+ ASSERT_EQ(0, do_pin_sockaddr(listener, req.id,
+ readback, sizeof(readback)));
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ /* Second trap: bind(path_b). PIN_ARGS must succeed (no stale pin
+ * from the first trap leaking via -EEXIST).
+ */
+ memset(&req, 0, sizeof(req));
+ memset(&resp, 0, sizeof(resp));
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ EXPECT_EQ(req.data.nr, __NR_bind);
+ memset(readback, 0, sizeof(readback));
+ ASSERT_EQ(0, do_pin_sockaddr(listener, req.id,
+ readback, sizeof(readback))) {
+ TH_LOG("second PIN_ARGS failed (errno=%d %s); pin from prior trap may have leaked",
+ errno, strerror(errno));
+ }
+ resp.id = req.id;
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE |
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ struct stat st;
+
+ EXPECT_EQ(stat(path_a, &st), 0);
+ EXPECT_EQ(stat(path_b, &st), 0);
+
+ unlink(path_a);
+ unlink(path_b);
+ munmap(shared, sizeof(*shared));
+ close(listener);
+}
+
+/* SIGKILL the trapped child while a pin is attached but not yet armed.
+ * The kpa must be freed; supervisor's listener fd must remain healthy.
+ */
+TEST(pin_args_sigkill_child)
+{
+ struct bind_race r;
+ struct seccomp_notif req = {};
+ char readback[sizeof(struct sockaddr_un)];
+ int status;
+
+ ASSERT_EQ(0, bind_race_setup(&r));
+
+ EXPECT_EQ(ioctl(r.listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ memset(readback, 0, sizeof(readback));
+ ASSERT_EQ(0, do_pin_sockaddr(r.listener, req.id,
+ readback, sizeof(readback)));
+
+ /* Pin attached, not armed. Kill the child mid-wait. */
+ kill(r.child, SIGKILL);
+
+ EXPECT_EQ(waitpid(r.child, &status, 0), r.child);
+ r.child = -1;
+ EXPECT_EQ(true, WIFSIGNALED(status));
+ EXPECT_EQ(SIGKILL, WTERMSIG(status));
+
+ /*
+ * Listener fd is still valid. F_GETFD returns the FD flags
+ * (FD_CLOEXEC is set on the listener by seccomp), so the
+ * health-check is "not -1", not "== 0".
+ */
+ EXPECT_NE(-1, fcntl(r.listener, F_GETFD));
+
+ bind_race_teardown(&r);
+}
+
+TEST_HARNESS_MAIN
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [RFC PATCH 3/3] Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_PIN_ARGS
2026-05-04 1:12 [RFC PATCH 0/3] seccomp: SECCOMP_IOCTL_NOTIF_PIN_ARGS for race-free unotify Cong Wang
2026-05-04 1:12 ` [RFC PATCH 1/3] seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race Cong Wang
2026-05-04 1:12 ` [RFC PATCH 2/3] selftests/seccomp: add seccomp_pin_args end-to-end coverage Cong Wang
@ 2026-05-04 1:12 ` Cong Wang
2 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2026-05-04 1:12 UTC (permalink / raw)
To: Kees Cook, linux-kernel
Cc: Andy Lutomirski, Will Drewry, Christian Brauner, Cong Wang
From: Cong Wang <cwang@multikernel.io>
Add a "Pinned arguments" section to the userspace API doc covering
the motivation (closing the documented TOCTOU window for unprivileged
supervisors), the pin/consume flow via SECCOMP_IOCTL_NOTIF_PIN_ARGS
and SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the three v1 shapes
with their per-shape semantics, the single-shot lifecycle, the
syscall_nr mismatch check, and the explicitly-not-covered cases left
for follow-ups (vector I/O, nested-pointer payloads).
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
.../userspace-api/seccomp_filter.rst | 76 +++++++++++++++++++
1 file changed, 76 insertions(+)
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175..8bbbd923c31d 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -289,6 +289,82 @@ above in this document: all arguments being read from the tracee's memory
should be read into the tracer's memory before any policy decisions are made.
This allows for an atomic decision on syscall arguments.
+Pinned arguments
+----------------
+
+For unprivileged supervisors, ``ptrace()``/``/proc/pid/mem`` are not
+available, and reading the tracee's memory via ``process_vm_readv()``
+remains racy: a sibling thread or ``CLONE_VM`` peer can mutate the
+buffer between supervisor read and the kernel's re-read on
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE``. ``SECCOMP_IOCTL_NOTIF_PIN_ARGS``
+closes that race by atomically copying designated pointer-arg payloads
+from the tracee's address space into kernel-owned buffers, and binding
+those buffers to the tracee's next-syscall execution.
+
+The supervisor receives a notification as today, then issues
+``ioctl(SECCOMP_IOCTL_NOTIF_PIN_ARGS, &payload)`` with a
+``struct seccomp_notif_pin_args`` describing which pointer-args to
+snapshot. Each per-arg descriptor names a syscall register slot
+(``arg_idx``, 0..5), one of three shapes (``SECCOMP_PIN_FIXED``,
+``SECCOMP_PIN_CSTRING``, ``SECCOMP_PIN_CSTRING_ARRAY``), and a
+``max_bytes`` cap. The kernel walks the trapped task's mm, copies
+the bytes into kernel buffers, and writes them back to a supervisor-
+provided byte buffer (``buf`` / ``buf_size``) plus per-arg metadata
+(``actual_size``, ``buf_offset``, ``truncated``).
+
+To consume the snapshot on syscall re-execution, the supervisor sends
+``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` with both
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE`` and
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED`` set. The kernel's syscall
+fetch points (``getname_flags``, ``copy_strings``,
+``move_addr_to_kernel``, ``import_ubuf``) check
+``current->seccomp.pinned_args`` and consume from the kernel buffer
+instead of re-reading user memory; mutations to the original buffer
+after ``PIN_ARGS`` returns have no effect.
+
+The pin is single-shot: it is cleared automatically when the trapped
+task next returns to user mode after the resumed syscall body
+completes, when the task exits, when the listener fd is closed, or
+when the supervisor sends ``CONTINUE`` without ``CONTINUE_PINNED``
+(an explicit "I changed my mind" path). Subsequent traps require a
+fresh ``PIN_ARGS`` for the new notification id.
+
+Per-shape semantics:
+
+* ``SECCOMP_PIN_FIXED`` copies exactly ``max_bytes`` from
+ ``args[arg_idx]``. Suitable for ``struct sockaddr`` (``bind``,
+ ``connect``, ``sendto``) and for ``write(fd, buf, count)`` (the
+ supervisor sets ``max_bytes = count`` from
+ ``seccomp_data.args[2]``).
+
+* ``SECCOMP_PIN_CSTRING`` walks to the trailing NUL, capped at
+ ``max_bytes``. The pinned buffer is always NUL-terminated; if the
+ cap was hit before the source NUL, ``truncated`` carries
+ ``SECCOMP_PIN_TRUNCATED_BYTES``. Suitable for paths
+ (``open``/``openat``/``execve`` filename, etc.).
+
+* ``SECCOMP_PIN_CSTRING_ARRAY`` walks a NULL-terminated pointer table
+ at ``args[arg_idx]`` and copies each non-NULL string. Suitable for
+ ``execve``'s argv and envp. Bounded by both ``max_bytes`` and
+ ``max_entries``. Result is packed as
+ ``[u32 count][u32 offsets[count]][u8 strings[]]``.
+
+Both the cumulative ``max_bytes`` across all per-arg descriptors and
+the supervisor-provided ``buf_size`` are capped at 1 MiB; this is a
+hard-coded defensive ceiling, not a tunable.
+
+The kernel records the syscall number at pin time and verifies a
+match at consumption: a signal handler running on the trapped task
+during ``-ERESTART*`` resolution that issues an unrelated syscall
+will not consume the pin.
+
+Cumulative scope of v1: ``SECCOMP_PIN_FIXED`` covers sockaddr and
+single-buffer write content; ``SECCOMP_PIN_CSTRING`` covers paths;
+``SECCOMP_PIN_CSTRING_ARRAY`` covers argv and envp. Vector I/O
+(``readv``/``writev``) and nested-pointer payloads
+(``sendmsg``/``recvmsg`` ``msghdr``, ``futex_waitv``) are not covered
+in v1.
+
Sysctls
=======
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [RFC PATCH 1/3] seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race
2026-05-04 1:12 ` [RFC PATCH 1/3] seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race Cong Wang
@ 2026-05-04 12:51 ` Christian Brauner
2026-05-06 5:00 ` Cong Wang
0 siblings, 1 reply; 6+ messages in thread
From: Christian Brauner @ 2026-05-04 12:51 UTC (permalink / raw)
To: Cong Wang
Cc: Kees Cook, linux-kernel, Andy Lutomirski, Will Drewry,
Christian Brauner, Cong Wang
On Sun, 03 May 2026 18:12:05 -0700, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> diff --git a/fs/namei.c b/fs/namei.c
> index c7fac83c9a85..ee86f4c91cae 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -222,6 +223,24 @@ do_getname(const char __user *filename, int flags, bool incomplete)
> [ ... skip 15 lines ... ]
> + pin = seccomp_pin_lookup_current((u64)(uintptr_t)filename);
> + if (pin && pin->kind == SECCOMP_PIN_CSTRING) {
> + if (pin->size <= 1 && !(flags & LOOKUP_EMPTY))
> + return ERR_PTR(-ENOENT);
> + return getname_kernel(pin->data);
> + }
Sorry, no. That's just not acceptable at all. We're not spraying
"continue from snapshotted state" across the vfs and the kernel in
general. This is just screaming for security issues. Anything that wants
to do something remotely like this needs to come as generic abstraction
where the syscall layer itself doesn't have to care at all about this.
There are just so many corners where you run into issues with this.
--
Christian Brauner <brauner@kernel.org>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH 1/3] seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race
2026-05-04 12:51 ` Christian Brauner
@ 2026-05-06 5:00 ` Cong Wang
0 siblings, 0 replies; 6+ messages in thread
From: Cong Wang @ 2026-05-06 5:00 UTC (permalink / raw)
To: Christian Brauner
Cc: Kees Cook, linux-kernel, Andy Lutomirski, Will Drewry, Cong Wang
On Mon, May 4, 2026 at 8:51 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Sun, 03 May 2026 18:12:05 -0700, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > diff --git a/fs/namei.c b/fs/namei.c
> > index c7fac83c9a85..ee86f4c91cae 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -222,6 +223,24 @@ do_getname(const char __user *filename, int flags, bool incomplete)
> > [ ... skip 15 lines ... ]
> > + pin = seccomp_pin_lookup_current((u64)(uintptr_t)filename);
> > + if (pin && pin->kind == SECCOMP_PIN_CSTRING) {
> > + if (pin->size <= 1 && !(flags & LOOKUP_EMPTY))
> > + return ERR_PTR(-ENOENT);
> > + return getname_kernel(pin->data);
> > + }
>
> Sorry, no. That's just not acceptable at all. We're not spraying
> "continue from snapshotted state" across the vfs and the kernel in
> general. This is just screaming for security issues. Anything that wants
> to do something remotely like this needs to come as generic abstraction
> where the syscall layer itself doesn't have to care at all about this.
> There are just so many corners where you run into issues with this.
You're right. Having every fetch site consult a per-task pin pointer is
exactly the kind of cross-cutting awareness that doesn't scale.
How about the following direction instead?
Reshape the mechanism as a PTRACE_SYSCALL-style redirect, applied at
the notification reply path. The supervisor describes:
struct seccomp_notif_inject {
__u64 id;
__u64 nr;
__u64 args[6];
__u64 buf; /* __user, kernel-input bytes */
__u32 buf_size;
__u32 args_in_buf_mask; /* bit i: args[i] is offset into buf */
};
NOTIF_SEND with a new FLAG_INJECTED applies the redirect. The trapped
task's nr/args registers are set via syscall_set_nr() and
syscall_set_arguments() (the same primitives ptrace uses for syscall
substitution today), and any arg flagged in args_in_buf_mask is
satisfied from a kernel-side buffer rather than from the trapped
task's mm. fs/, net/, mm/, lib/ get zero changes. The whole feature
lives in a new kernel/seccomp_inject.c plus a small dispatcher in
kernel/seccomp.c.
This is intentionally a strict subset of what ptrace can already do
via PTRACE_POKEDATA + PTRACE_SETREGSET. This does not add a
kernel capability; it provides a listener-fd-gated,
syscall-whitelisted, narrower interface to that capability for
unprivileged seccomp_unotify supervisors, where ptrace's privilege
model and per-syscall overhead are not viable. SECCOMP_IOCTL_NOTIF_ADDFD
set the precedent for this kind of narrow listener-fd interface to a
ptrace-overlapping capability.
Thanks,
Cong
^ permalink raw reply [flat|nested] 6+ messages in thread