* [RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect
@ 2026-06-13  0:15 Cong Wang
  2026-06-13  0:15 ` [RFC PATCH v3 1/3] mm: add __do_mmap() and vm_mmap_seal_remote() Cong Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Cong Wang @ 2026-06-13  0:15 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Kees Cook, linux-kernel, Will Drewry, Christian Brauner

The seccomp user-notification SECCOMP_USER_NOTIF_FLAG_CONTINUE response
carries an inherent TOCTOU: once the supervisor decides to let a syscall
continue, the target (or a CLONE_VM peer) can rewrite the memory behind a
pointer argument before the kernel reads it. This is documented in the
UAPI header and is why the notifier "cannot be used to implement a
security policy" today.

The cooperative way around this is for the target to map a shared memfd
and mseal() it during a trusted setup window, so the supervisor can hand
the kernel an immutable buffer. That window does not exist for the common
fork()+execve() sandbox model, where the supervisor wants to confine an
uncooperative (or legacy) binary it did not write.
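
For reference, the cooperative setup described above can be sketched in
userspace roughly as follows. This is an illustration only, not code from
the series: the helper name make_sealed_view() and the "pin-demo" memfd
name are made up, both mappings live in one process for simplicity, and
raw syscalls are used so the sketch does not depend on libc wrappers. The
fallback value for __NR_mseal is an assumption for older headers. On
kernels without mseal() the view is still created, just not sealed.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif
#ifndef __NR_mseal
#define __NR_mseal 462	/* x86-64 / asm-generic number; assumption for old headers */
#endif

/*
 * Hypothetical helper: put 'data' into a one-page memfd through a
 * temporary writable mapping, then return a read-only MAP_SHARED view of
 * the same memfd and try to mseal() that view.
 *
 * *sealed is set to 1 if mseal() succeeded and 0 otherwise. A sealed view
 * cannot be unmapped again, so it stays mapped until the process exits.
 */
static const char *make_sealed_view(const char *data, int *sealed)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	const char *ro = NULL;
	char *rw;
	int fd;

	*sealed = 0;
	if (strlen(data) >= len)
		return NULL;

	fd = (int)syscall(SYS_memfd_create, "pin-demo", MFD_CLOEXEC);
	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out;

	/* Writable side: stands in for the supervisor's own mapping. */
	rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (rw == MAP_FAILED)
		goto out;
	strcpy(rw, data);	/* length checked above, NUL included */
	munmap(rw, len);

	/* Read-only side: stands in for the mapping the target would seal. */
	ro = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (ro == MAP_FAILED) {
		ro = NULL;
		goto out;
	}
	if (syscall(__NR_mseal, ro, len, 0UL) == 0)
		*sealed = 1;
out:
	close(fd);
	return ro;
}
```

Without the seal, the holder of the read-only view could simply
mprotect() it back to writable, because the memfd itself is opened
read-write; with the seal, that mprotect() is expected to fail with
EPERM.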

This series lets the supervisor close the TOCTOU without any target-side
cooperation:

  - The kernel installs a sealed, read-only, MAP_SHARED mapping of a
    supervisor-owned memfd directly into the trapped task's mm
    (SECCOMP_IOCTL_NOTIF_PIN_INSTALL). The mapping is VM_SEALED at
    creation, so neither the target nor a CLONE_VM peer can unmap,
    remap, mprotect or MAP_FIXED-stomp it. The supervisor writes the
    intended argument data through its own mapping of the same memfd.

  - The supervisor then resumes the syscall with selected argument
    registers rewritten to point into that pin
    (SECCOMP_IOCTL_NOTIF_SEND_REDIRECT). Pointer substitutions are
    validated so the whole access [ptr, ptr+len) lies inside a pin that
    still lives in the target's current mm; original registers are
    restored at syscall exit for ABI compliance.
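
The containment rule for pointer substitutions, "[ptr, ptr+len) lies
inside a pin", can be sketched as an overflow-safe range check. This is a
standalone illustration under assumed names (struct pin_range and
pin_contains() are hypothetical and are not taken from patch 2); it only
shows why the check should avoid computing ptr + len directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one installed pin: [start, start + size). */
struct pin_range {
	uint64_t start;
	uint64_t size;
};

/*
 * Return true only if the whole access [ptr, ptr + len) lies inside the
 * pin. The sum ptr + len is never formed, so a huge len cannot wrap
 * around and sneak past the upper bound.
 */
static bool pin_contains(const struct pin_range *pin, uint64_t ptr, uint64_t len)
{
	uint64_t off;

	if (ptr < pin->start)
		return false;
	off = ptr - pin->start;
	if (off > pin->size)
		return false;
	return len <= pin->size - off;
}
```

A naive "ptr + len <= start + size" comparison would accept a length
close to UINT64_MAX once the addition wraps; the subtraction form above
rejects it.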

Because the data the kernel acts on lives in an immutable pin, the
target can no longer win the race. execve() is handled as a first-class
case: its pathname is copied from the pin before the old mm is torn
down, and the register-restore is skipped once the program image has
been replaced (detected via self_exec_id).
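
The save/restore rule, including the execve() exception, can be modelled
in a few lines of userspace C. This is a simplified model, not the
implementation in patch 2: struct model_task, struct redirect_state and
both functions are hypothetical names, and the only behaviour taken from
the description above is "restore the original registers at syscall exit
unless self_exec_id changed in the meantime".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_ARGS 6

/* Hypothetical stand-in for the bits of task state that matter here. */
struct model_task {
	uint32_t self_exec_id;		/* changes when a new image is exec'd */
	uint64_t args[NR_ARGS];		/* syscall argument registers */
};

/* Hypothetical per-redirect bookkeeping. */
struct redirect_state {
	uint32_t exec_id_at_entry;
	uint8_t mask;			/* bit i set: args[i] was rewritten */
	uint64_t orig[NR_ARGS];
};

/* Save the original arguments and rewrite the ones selected by mask. */
static void redirect_apply(struct model_task *t, struct redirect_state *st,
			   uint8_t mask, const uint64_t *new_args)
{
	st->exec_id_at_entry = t->self_exec_id;
	st->mask = mask;
	for (int i = 0; i < NR_ARGS; i++) {
		st->orig[i] = t->args[i];
		if (mask & (1u << i))
			t->args[i] = new_args[i];
	}
}

/*
 * At syscall exit: put the original values back, unless the program
 * image was replaced, in which case the registers belong to the new
 * image and must be left alone. Returns true if a restore happened.
 */
static bool redirect_exit(struct model_task *t, const struct redirect_state *st)
{
	if (t->self_exec_id != st->exec_id_at_entry)
		return false;
	for (int i = 0; i < NR_ARGS; i++)
		if (st->mask & (1u << i))
			t->args[i] = st->orig[i];
	return true;
}
```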

Patch 1 adds the mm plumbing: __do_mmap(), a variant of do_mmap() that
targets a caller-supplied mm (do_mmap() stays a current->mm wrapper, so
no existing caller changes), and vm_mmap_seal_remote(), a tailored
high-level helper for installing the sealed pin. Patch 2 is the seccomp
ABI and implementation. Patch 3 adds selftests.

Changes since v2:
  v3 is a redesign rather than an incremental revision. v2 added a
  SECCOMP_IOCTL_NOTIF_INJECT ioctl: the supervisor described a
  substitute syscall plus an input buffer, and on CONTINUE the kernel
  copied that buffer in and ran a kernel-side helper for a small
  whitelist of syscalls (openat, bind, write) without re-reading the
  target's memory. That closed the TOCTOU, but required an in-kernel
  reimplementation of every supported syscall and a fixed whitelist,
  and never actually ran the real syscall.

  v3 drops the kernel-side helpers entirely, as suggested by Andy: the
  real syscall now runs, with its pointer arguments redirected into the
  sealed pin.

All four pinned-memfd selftests pass.

---
Cong Wang (3):
  mm: add __do_mmap() and vm_mmap_seal_remote()
  seccomp: add kernel-installed pinned-memfd redirect
  selftests/seccomp: cover non-cooperative pinned-memfd install

 include/linux/mm.h                            |   2 +
 include/linux/seccomp.h                       |   8 +
 include/uapi/linux/seccomp.h                  |  99 ++
 kernel/seccomp.c                              | 366 +++++++
 mm/internal.h                                 |   5 +
 mm/mmap.c                                     |  29 +-
 mm/nommu.c                                    |  12 +-
 mm/util.c                                     |  50 +
 mm/vma.c                                      |  18 +-
 mm/vma.h                                      |   6 +-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 960 ++++++++++++++++++
 11 files changed, 1533 insertions(+), 22 deletions(-)


base-commit: 28608283615e5e7e92ea79c8ea13507f4b5e0cbe
-- 
2.43.0



Thread overview: 13+ messages (newest: 2026-06-24  3:39 UTC)
2026-06-13  0:15 [RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect Cong Wang
2026-06-13  0:15 ` [RFC PATCH v3 1/3] mm: add __do_mmap() and vm_mmap_seal_remote() Cong Wang
2026-06-13  0:15 ` [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect Cong Wang
2026-06-13  4:03   ` Andy Lutomirski
2026-06-20 21:11     ` Cong Wang
2026-06-23 19:02       ` Andy Lutomirski
2026-06-23 19:11         ` Kees Cook
2026-06-23 22:21           ` Andy Lutomirski
2026-06-24  3:38             ` Cong Wang
2026-06-23 23:29           ` Cong Wang
2026-06-23 23:26         ` Cong Wang
2026-06-13  0:15 ` [RFC PATCH v3 3/3] selftests/seccomp: cover non-cooperative pinned-memfd install Cong Wang
2026-06-13 10:07   ` kernel test robot
