From: Cong Wang <xiyou.wangcong@gmail.com>
To: Kees Cook <kees@kernel.org>, linux-kernel@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>,
Will Drewry <wad@chromium.org>,
Christian Brauner <brauner@kernel.org>,
Cong Wang <cwang@multikernel.io>
Subject: [RFC PATCH 3/3] Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_PIN_ARGS
Date: Sun, 3 May 2026 18:12:07 -0700
Message-ID: <20260504011207.539408-4-xiyou.wangcong@gmail.com>
In-Reply-To: <20260504011207.539408-1-xiyou.wangcong@gmail.com>
From: Cong Wang <cwang@multikernel.io>
Add a "Pinned arguments" section to the userspace API doc covering
the motivation (closing the documented TOCTOU window for unprivileged
supervisors), the pin/consume flow via SECCOMP_IOCTL_NOTIF_PIN_ARGS
and SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the three v1 shapes
with their per-shape semantics, the single-shot lifecycle, the
syscall_nr mismatch check, and the explicitly-not-covered cases left
for follow-ups (vector I/O, nested-pointer payloads).
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
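A minimal sketch, not part of the patch, of how a supervisor might read
one string out of the packed SECCOMP_PIN_CSTRING_ARRAY result described
in the doc text below. The helper name, the host-endian u32 fields, and
offsets measured from the start of the blob are assumptions; the doc
text only gives the [u32 count][u32 offsets[count]][u8 strings[]]
ordering.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the idx-th string of a packed CSTRING_ARRAY blob, or NULL if the
 * blob is malformed or idx is out of range.
 *
 * ASSUMPTIONS (not stated in the patch): the u32 fields are host-endian,
 * and each offset is a byte offset from the start of the blob.
 */
static const char *pinned_array_get(const uint8_t *blob, size_t len,
				    uint32_t idx)
{
	uint32_t count, off;
	size_t hdr;

	if (len < sizeof(count))
		return NULL;
	memcpy(&count, blob, sizeof(count));
	if (idx >= count)
		return NULL;

	/* The offset table must fit in the blob; divide to avoid overflow. */
	if (count > (len - sizeof(count)) / sizeof(off))
		return NULL;
	hdr = sizeof(count) + (size_t)count * sizeof(off);

	memcpy(&off, blob + sizeof(count) + (size_t)idx * sizeof(off),
	       sizeof(off));

	/* The string must start after the header and inside the blob. */
	if (off < hdr || off >= len)
		return NULL;

	/* Refuse strings that are not NUL-terminated within the blob. */
	if (!memchr(blob + off, '\0', len - off))
		return NULL;

	return (const char *)(blob + off);
}
```

Every bound is checked against the length the supervisor received, so
a short or inconsistent blob yields NULL instead of an out-of-bounds
read.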
.../userspace-api/seccomp_filter.rst | 76 +++++++++++++++++++
1 file changed, 76 insertions(+)
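
Also not part of the patch: a rough pre-flight check a supervisor could
run before issuing the ioctl, based on the arg_idx range (0..5) and the
1 MiB ceilings described in the doc text below. The struct here is a
hypothetical stand-in, not the uapi layout from patch 1/3, and treating
the 1 MiB bound as inclusive is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PIN_LIMIT_BYTES (1u << 20)	/* 1 MiB ceiling from the doc text */

/* Hypothetical stand-in for a per-arg descriptor; NOT the real layout. */
struct pin_desc_sketch {
	uint32_t arg_idx;	/* syscall register slot, 0..5 */
	uint32_t max_bytes;	/* per-arg cap */
};

/*
 * Return true if the request stays within the documented limits:
 * buf_size <= 1 MiB, every arg_idx in 0..5, and the sum of max_bytes
 * across all descriptors <= 1 MiB (inclusive bound is an assumption).
 */
static bool pin_request_within_limits(const struct pin_desc_sketch *d,
				      size_t n, uint64_t buf_size)
{
	uint64_t total = 0;

	if (buf_size > PIN_LIMIT_BYTES)
		return false;

	for (size_t i = 0; i < n; i++) {
		if (d[i].arg_idx > 5)
			return false;
		/* 64-bit accumulator with early exit, so it cannot wrap. */
		total += d[i].max_bytes;
		if (total > PIN_LIMIT_BYTES)
			return false;
	}
	return true;
}
```

The kernel would still be the authority on these limits; a check like
this only lets the supervisor fail early with a clearer error.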
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175..8bbbd923c31d 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -289,6 +289,82 @@ above in this document: all arguments being read from the tracee's memory
should be read into the tracer's memory before any policy decisions are made.
This allows for an atomic decision on syscall arguments.
+Pinned arguments
+----------------
+
+For unprivileged supervisors, ``ptrace()``/``/proc/pid/mem`` are not
+available, and reading the tracee's memory via ``process_vm_readv()``
+remains racy: a sibling thread or ``CLONE_VM`` peer can mutate the
+buffer between the supervisor's read and the kernel's re-read on
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE``. ``SECCOMP_IOCTL_NOTIF_PIN_ARGS``
+closes that race by atomically copying designated pointer-arg payloads
+from the tracee's address space into kernel-owned buffers and binding
+those buffers to the re-execution of the trapped syscall.
+
+The supervisor receives a notification as today, then issues
+``ioctl(SECCOMP_IOCTL_NOTIF_PIN_ARGS, &payload)`` with a
+``struct seccomp_notif_pin_args`` describing which pointer-args to
+snapshot. Each per-arg descriptor names a syscall register slot
+(``arg_idx``, 0..5), one of three shapes (``SECCOMP_PIN_FIXED``,
+``SECCOMP_PIN_CSTRING``, ``SECCOMP_PIN_CSTRING_ARRAY``), and a
+``max_bytes`` cap. The kernel walks the trapped task's mm, copies
+the bytes into kernel buffers, and copies them out to a
+supervisor-provided byte buffer (``buf`` / ``buf_size``) along with
+per-arg metadata (``actual_size``, ``buf_offset``, ``truncated``).
+
+To consume the snapshot on syscall re-execution, the supervisor sends
+``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` with both
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE`` and
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED`` set. The kernel's syscall
+fetch points (``getname_flags``, ``copy_strings``,
+``move_addr_to_kernel``, ``import_ubuf``) check
+``current->seccomp.pinned_args`` and consume from the kernel buffer
+instead of re-reading user memory; mutations to the original buffer
+after ``PIN_ARGS`` returns have no effect.
+
+The pin is single-shot: it is cleared automatically when the trapped
+task next returns to user mode after the resumed syscall body
+completes, when the task exits, when the listener fd is closed, or
+when the supervisor sends ``CONTINUE`` without ``CONTINUE_PINNED``
+(an explicit "I changed my mind" path). Subsequent traps require a
+fresh ``PIN_ARGS`` for the new notification id.
+
+Per-shape semantics:
+
+* ``SECCOMP_PIN_FIXED`` copies exactly ``max_bytes`` bytes from the
+  address in ``args[arg_idx]``. Suitable for ``struct sockaddr``
+  (``bind``, ``connect``, ``sendto``) and for ``write(fd, buf, count)``
+  (the supervisor sets ``max_bytes = count`` from
+  ``seccomp_data.args[2]``).
+
+* ``SECCOMP_PIN_CSTRING`` copies up to the terminating NUL, capped at
+  ``max_bytes``. The pinned buffer is always NUL-terminated; if the
+  cap was hit before the source NUL, ``truncated`` carries
+  ``SECCOMP_PIN_TRUNCATED_BYTES``. Suitable for paths
+  (``open``/``openat``/``execve`` filename, etc.).
+
+* ``SECCOMP_PIN_CSTRING_ARRAY`` walks a NULL-terminated pointer table
+  at ``args[arg_idx]`` and copies each non-NULL string. Suitable for
+  ``execve``'s argv and envp. Bounded by both ``max_bytes`` and
+  ``max_entries``. The result is packed as
+  ``[u32 count][u32 offsets[count]][u8 strings[]]``.
+
+The cumulative ``max_bytes`` across all per-arg descriptors and the
+supervisor-provided ``buf_size`` are each capped at 1 MiB; this is a
+hard-coded defensive ceiling, not a tunable.
+
+The kernel records the syscall number at pin time and verifies a
+match at consumption: a signal handler running on the trapped task
+during ``-ERESTART*`` resolution that issues an unrelated syscall
+will not consume the pin.
+
+Cumulative scope of v1: ``SECCOMP_PIN_FIXED`` covers sockaddr and
+single-buffer write content; ``SECCOMP_PIN_CSTRING`` covers paths;
+``SECCOMP_PIN_CSTRING_ARRAY`` covers argv and envp. Vector I/O
+(``readv``/``writev``) and nested-pointer payloads
+(``sendmsg``/``recvmsg`` ``msghdr``, ``futex_waitv``) are not covered
+in v1.
+
Sysctls
=======
--
2.43.0