[RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect


* [RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect
@ 2026-06-13  0:15 Cong Wang
  2026-06-13  0:15 ` [RFC PATCH v3 1/3] mm: add __do_mmap() and vm_mmap_seal_remote() Cong Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Cong Wang @ 2026-06-13  0:15 UTC
  To: Andy Lutomirski; +Cc: Kees Cook, linux-kernel, Will Drewry, Christian Brauner

The seccomp user-notification SECCOMP_USER_NOTIF_FLAG_CONTINUE response
carries an inherent TOCTOU: once the supervisor decides to let a syscall
continue, the target (or a CLONE_VM peer) can rewrite the memory behind a
pointer argument before the kernel reads it. This is documented in the
UAPI header and is why the notifier "cannot be used to implement a
security policy" today.

The cooperative way around this is for the target to map a shared memfd
and mseal() it during a trusted setup window, so the supervisor can hand
the kernel an immutable buffer. That window does not exist for the common
fork()+execve() sandbox model, where the supervisor wants to confine an
uncooperative (or legacy) binary it did not write.

This series lets the supervisor close the TOCTOU without any target-side
cooperation:

  - The kernel installs a sealed, read-only, MAP_SHARED mapping of a
    supervisor-owned memfd directly into the trapped task's mm
    (SECCOMP_IOCTL_NOTIF_PIN_INSTALL). The mapping is VM_SEALED at
    creation, so neither the target nor a CLONE_VM peer can unmap,
    remap, mprotect or MAP_FIXED-stomp it. The supervisor writes the
    intended argument data through its own mapping of the same memfd.

  - The supervisor then resumes the syscall with selected argument
    registers rewritten to point into that pin
    (SECCOMP_IOCTL_NOTIF_SEND_REDIRECT). Pointer substitutions are
    validated so the whole access [ptr, ptr+len) lies inside a pin that
    still lives in the target's current mm; original registers are
    restored at syscall exit for ABI compliance.
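
As a rough userspace illustration of the first bullet (not code from this
series): one memfd can back a writable view and a read-only view at the
same time, and a write through the former is visible through the latter.
The helper name pin_views_create() and struct pin_views are hypothetical;
the read-only view here is only a stand-in for the sealed pin that the
kernel would install in the target, and no sealing is performed.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo type: two views of the same memfd. */
struct pin_views {
	char *rw;	/* supervisor's own writable mapping */
	const char *ro;	/* stand-in for the read-only pin in the target */
	size_t len;
};

/* Returns 0 on success, -1 on failure. */
static int pin_views_create(struct pin_views *v, size_t len)
{
	int fd = memfd_create("pin-demo", MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	v->rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	v->ro = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mappings keep the memfd alive */
	if (v->rw == MAP_FAILED || v->ro == MAP_FAILED)
		return -1;
	v->len = len;
	return 0;
}
```

A caller would write the intended argument (for example a pathname) through
v.rw and expect a reader of v.ro to observe it without any copy.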

Because the data the kernel acts on lives in an immutable pin, the
target can no longer win the race. execve() is handled as a first-class
case: its pathname is copied from the pin before the old mm is torn
down, and the register-restore is skipped once the program image has
been replaced (detected via self_exec_id).

Patch 1 adds the mm plumbing: __do_mmap(), a variant of do_mmap() that
targets a caller-supplied mm (do_mmap() stays a current->mm wrapper, so
no existing caller changes), and vm_mmap_seal_remote(), a tailored
high-level helper for installing the sealed pin. Patch 2 is the seccomp
ABI and implementation. Patch 3 adds selftests.

Changes since v2:
  v3 is a redesign rather than an incremental revision. v2 added a
  SECCOMP_IOCTL_NOTIF_INJECT ioctl: the supervisor described a
  substitute syscall plus an input buffer, and on CONTINUE the kernel
  copied that buffer in and ran a kernel-side helper for a small
  whitelist of syscalls (openat, bind, write) without re-reading the
  target's memory. That closed the TOCTOU, but required an in-kernel
  reimplementation of every supported syscall and a fixed whitelist,
  and never actually ran the real syscall.

  v3 drops the kernel-side helpers entirely, as suggested by Andy; the
  real syscall now runs, with its pointer arguments redirected into the
  pin.

All four pinned-memfd selftests pass.

---
Cong Wang (3):
  mm: add __do_mmap() and vm_mmap_seal_remote()
  seccomp: add kernel-installed pinned-memfd redirect
  selftests/seccomp: cover non-cooperative pinned-memfd install

 include/linux/mm.h                            |   2 +
 include/linux/seccomp.h                       |   8 +
 include/uapi/linux/seccomp.h                  |  99 ++
 kernel/seccomp.c                              | 366 +++++++
 mm/internal.h                                 |   5 +
 mm/mmap.c                                     |  29 +-
 mm/nommu.c                                    |  12 +-
 mm/util.c                                     |  50 +
 mm/vma.c                                      |  18 +-
 mm/vma.h                                      |   6 +-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 960 ++++++++++++++++++
 11 files changed, 1533 insertions(+), 22 deletions(-)

base-commit: 28608283615e5e7e92ea79c8ea13507f4b5e0cbe
-- 
2.43.0


* [RFC PATCH v3 1/3] mm: add __do_mmap() and vm_mmap_seal_remote()
  2026-06-13  0:15 [RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect Cong Wang
@ 2026-06-13  0:15 ` Cong Wang
  2026-06-13  0:15 ` [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect Cong Wang
  2026-06-13  0:15 ` [RFC PATCH v3 3/3] selftests/seccomp: cover non-cooperative pinned-memfd install Cong Wang
  2 siblings, 0 replies; 10+ messages in thread
From: Cong Wang @ 2026-06-13  0:15 UTC
  To: Andy Lutomirski; +Cc: Kees Cook, linux-kernel, Will Drewry, Christian Brauner

Add __do_mmap(), a variant of do_mmap() that installs the mapping into
a caller-supplied mm rather than current->mm. do_mmap() becomes a thin
wrapper that passes current->mm, so all existing callers and the public
do_mmap() signature are unchanged; the same split is applied in the
nommu do_mmap(). mmap_region()/__mmap_region() gain an mm argument
(their sole caller is __do_mmap()) so the target mm flows down to where
the VMA is inserted. __do_mmap() is mm-internal, declared in
mm/internal.h.

On top of that, add vm_mmap_seal_remote() in mm/util.c, a high-level
entry point that installs a mapping into a caller-specified mm. The
intended consumer is seccomp_unotify, where an unprivileged supervisor
needs to install a sealed pinned memfd region in a supervised task's
address space without target-side cooperation (the existing mseal-based
pinned-memfd flow only works if the target installs its own mmap+mseal
during a trusted setup window, which is unavailable for fork+execve
sandbox wrappers).
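
For reference, vm_mmap_seal_remote() documents @addr and @len as
page-aligned, and the seccomp consumer in patch 2 rejects a zero size,
misaligned values and a wrapping range before calling it. A minimal
userspace sketch of those checks follows; pin_request_ok() and
DEMO_PAGE_SIZE are illustrative names, and the 4 KiB page size is an
assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/* Userspace mirror of the argument checks done before the pin install. */
static bool pin_request_ok(uint64_t target_addr, uint64_t size)
{
	if (size == 0)
		return false;
	if (target_addr % DEMO_PAGE_SIZE || size % DEMO_PAGE_SIZE)
		return false;
	if (target_addr + size < target_addr)	/* range wraps around */
		return false;
	return true;
}
```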

The LSM and fsnotify permission hooks (security_mmap_file(),
fsnotify_mmap_perm()) run against current, i.e. the supervisor
installing the mapping, not the target mm's owner. This matches the
supervisor-installs-into-target mental model and parallels
pidfd_getfd()'s cross-task fd install.

Cross-task authorization is left to the caller; this primitive
performs no ptrace_may_access check. The seccomp consumer gates
on listener-fd ownership.

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 include/linux/mm.h |  2 ++
 mm/internal.h      |  5 +++++
 mm/mmap.c          | 29 ++++++++++++++++++---------
 mm/nommu.c         | 12 ++++++++++-
 mm/util.c          | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/vma.c           | 18 ++++++++---------
 mm/vma.h           |  6 +++---
 7 files changed, 100 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fc2acedf0b76..dd14a32f76d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4118,6 +4118,8 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
 	struct list_head *uf);
+unsigned long vm_mmap_seal_remote(struct mm_struct *mm, struct file *file,
+	unsigned long addr, unsigned long len, unsigned long pgoff);
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool unlock);
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..897c4e08e0b1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1437,6 +1437,11 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long,
         unsigned long, unsigned long);
 
+unsigned long __do_mmap(struct mm_struct *mm, struct file *file,
+	unsigned long addr, unsigned long len, unsigned long prot,
+	unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
+	unsigned long *populate, struct list_head *uf);
+
 extern void set_pageblock_order(void);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36462..c9e437effd9c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -277,7 +277,7 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
 }
 
 /**
- * do_mmap() - Perform a userland memory mapping into the current process
+ * __do_mmap() - Perform a userland memory mapping into @mm's
  * address space of length @len with protection bits @prot, mmap flags @flags
  * (from which VMA flags will be inferred), and any additional VMA flags to
  * apply @vm_flags. If this is a file-backed mapping then the file is specified
@@ -307,8 +307,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  * start of a VMA, rather only the start of a valid mapped range of length
  * @len bytes, rounded down to the nearest page size.
  *
- * The caller must write-lock current->mm->mmap_lock.
+ * The caller must write-lock @mm->mmap_lock. do_mmap() is the common
+ * wrapper that targets current->mm.
  *
+ * @mm: The mm_struct to install the mapping into. The caller must hold a
+ * reference and write-lock its mmap_lock.
  * @file: An optional struct file pointer describing the file which is to be
  * mapped, if a file-backed mapping.
  * @addr: If non-zero, hints at (or if @flags has MAP_FIXED set, specifies) the
@@ -333,13 +336,12 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  * Returns: Either an error, or the address at which the requested mapping has
  * been performed.
  */
-unsigned long do_mmap(struct file *file, unsigned long addr,
-			unsigned long len, unsigned long prot,
-			unsigned long flags, vm_flags_t vm_flags,
-			unsigned long pgoff, unsigned long *populate,
-			struct list_head *uf)
+unsigned long __do_mmap(struct mm_struct *mm, struct file *file,
+			unsigned long addr, unsigned long len,
+			unsigned long prot, unsigned long flags,
+			vm_flags_t vm_flags, unsigned long pgoff,
+			unsigned long *populate, struct list_head *uf)
 {
-	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
 	*populate = 0;
@@ -557,7 +559,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(mm, file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -565,6 +567,15 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	return addr;
 }
 
+unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len,
+		      unsigned long prot, unsigned long flags,
+		      vm_flags_t vm_flags, unsigned long pgoff,
+		      unsigned long *populate, struct list_head *uf)
+{
+	return __do_mmap(current->mm, file, addr, len, prot, flags,
+			     vm_flags, pgoff, populate, uf);
+}
+
 unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 			      unsigned long prot, unsigned long flags,
 			      unsigned long fd, unsigned long pgoff)
diff --git a/mm/nommu.c b/mm/nommu.c
index ed3934bc2de4..7f2136129c72 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1009,7 +1009,8 @@ static int do_mmap_private(struct vm_area_struct *vma,
 /*
  * handle mapping creation for uClinux
  */
-unsigned long do_mmap(struct file *file,
+unsigned long __do_mmap(struct mm_struct *mm,
+			struct file *file,
 			unsigned long addr,
 			unsigned long len,
 			unsigned long prot,
@@ -1246,6 +1247,15 @@ unsigned long do_mmap(struct file *file,
 	return -ENOMEM;
 }
 
+unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len,
+		      unsigned long prot, unsigned long flags,
+		      vm_flags_t vm_flags, unsigned long pgoff,
+		      unsigned long *populate, struct list_head *uf)
+{
+	return __do_mmap(current->mm, file, addr, len, prot, flags,
+			     vm_flags, pgoff, populate, uf);
+}
+
 unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 			      unsigned long prot, unsigned long flags,
 			      unsigned long fd, unsigned long pgoff)
diff --git a/mm/util.c b/mm/util.c
index 3cc949a0b7ed..ecc2087f744a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -588,6 +588,56 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	return ret;
 }
 
+/**
+ * vm_mmap_seal_remote - install a sealed PROT_READ MAP_SHARED file mapping
+ * into @mm, without target-side cooperation.
+ * @mm: Target mm; caller holds a reference (e.g. get_task_mm()).
+ * @file: Backing file.
+ * @addr: Page-aligned address (MAP_FIXED_NOREPLACE: -EEXIST if occupied).
+ * @len: Length in bytes (page-aligned).
+ * @pgoff: Page offset into @file.
+ *
+ * The VMA is created VM_SEALED, so it is immediately immutable against the
+ * target mm's owner and its CLONE_VM peers. LSM/fsnotify hooks run against
+ * %current; cross-task authorization is the caller's responsibility (no
+ * ptrace_may_access check).
+ *
+ * Returns the mapped address on success, or a negative errno.
+ */
+unsigned long vm_mmap_seal_remote(struct mm_struct *mm, struct file *file,
+	unsigned long addr, unsigned long len, unsigned long pgoff)
+{
+	const unsigned long prot = PROT_READ;
+	const unsigned long flags = MAP_SHARED | MAP_FIXED_NOREPLACE;
+	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
+	unsigned long ret;
+	unsigned long populate;
+	LIST_HEAD(uf);
+
+	if (WARN_ON_ONCE(!mm))
+		return -EINVAL;
+	if (!VM_SEALED)		/* sealing unavailable (e.g. !CONFIG_64BIT) */
+		return -EOPNOTSUPP;
+
+	ret = security_mmap_file(file, prot, flags);
+	if (!ret)
+		ret = fsnotify_mmap_perm(file, prot, off, len);
+	if (!ret) {
+		if (mmap_write_lock_killable(mm))
+			return -EINTR;
+		ret = __do_mmap(mm, file, addr, len, prot, flags, VM_SEALED,
+				pgoff, &populate, &uf);
+		mmap_write_unlock(mm);
+		userfaultfd_unmap_complete(mm, &uf);
+		/*
+		 * Do not mm_populate() against a foreign mm; the target task
+		 * will fault pages in on first access.
+		 */
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vm_mmap_seal_remote);
+
 /*
  * Perform a userland memory mapping into the current process address space. See
  * the comment for do_mmap() for more details on this operation in general.
diff --git a/mm/vma.c b/mm/vma.c
index d90791b00a7b..fdd14349f719 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2729,11 +2729,10 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
 	return false;
 }
 
-static unsigned long __mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vma_flags_t vma_flags,
+static unsigned long __mmap_region(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, vma_flags_t vma_flags,
 		unsigned long pgoff, struct list_head *uf)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	bool have_mmap_prepare = file && file->f_op->mmap_prepare;
 	VMA_ITERATOR(vmi, mm, addr);
@@ -2827,15 +2826,16 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
  * Returns: Either an error, or the address at which the requested mapping has
  * been performed.
  */
-unsigned long mmap_region(struct file *file, unsigned long addr,
-			  unsigned long len, vm_flags_t vm_flags,
-			  unsigned long pgoff, struct list_head *uf)
+unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+			  unsigned long addr, unsigned long len,
+			  vm_flags_t vm_flags, unsigned long pgoff,
+			  struct list_head *uf)
 {
 	unsigned long ret;
 	bool writable_file_mapping = false;
 	const vma_flags_t vma_flags = legacy_to_vma_flags(vm_flags);
 
-	mmap_assert_write_locked(current->mm);
+	mmap_assert_write_locked(mm);
 
 	/* Check to see if MDWE is applicable. */
 	if (map_deny_write_exec(&vma_flags, &vma_flags))
@@ -2854,13 +2854,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		writable_file_mapping = true;
 	}
 
-	ret = __mmap_region(file, addr, len, vma_flags, pgoff, uf);
+	ret = __mmap_region(mm, file, addr, len, vma_flags, pgoff, uf);
 
 	/* Clear our write mapping regardless of error. */
 	if (writable_file_mapping)
 		mapping_unmap_writable(file->f_mapping);
 
-	validate_mm(current->mm);
+	validate_mm(mm);
 	return ret;
 }
 
diff --git a/mm/vma.h b/mm/vma.h
index 8e4b61a7304c..4f5222ad2e9d 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -459,9 +459,9 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
 int mm_take_all_locks(struct mm_struct *mm);
 void mm_drop_all_locks(struct mm_struct *mm);
 
-unsigned long mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf);
+unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+		unsigned long addr, unsigned long len, vm_flags_t vm_flags,
+		unsigned long pgoff, struct list_head *uf);
 
 int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
 		 unsigned long addr, unsigned long request,
-- 
2.43.0



* [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect
  2026-06-13  0:15 [RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect Cong Wang
  2026-06-13  0:15 ` [RFC PATCH v3 1/3] mm: add __do_mmap() and vm_mmap_seal_remote() Cong Wang
@ 2026-06-13  0:15 ` Cong Wang
  2026-06-13  4:03   ` Andy Lutomirski
  2026-06-13  0:15 ` [RFC PATCH v3 3/3] selftests/seccomp: cover non-cooperative pinned-memfd install Cong Wang
  2 siblings, 1 reply; 10+ messages in thread
From: Cong Wang @ 2026-06-13  0:15 UTC
  To: Andy Lutomirski; +Cc: Kees Cook, linux-kernel, Will Drewry, Christian Brauner

Introduce non-cooperative pinned-memfd: the kernel installs a
sealed PROT_READ MAP_SHARED mapping of the supervisor's memfd
directly into the trapped task's mm, without any target-side
cooperation. This is what makes the feature work for fork+execve
sandbox wrappers (Sandlock CLI, Firejail, Bubblewrap-style)
where the target has no trusted post-exec window to install
its own mappings.

Two new ioctls are introduced:

  SECCOMP_IOCTL_NOTIF_PIN_INSTALL

    The supervisor names an active notification id, a memfd it
    owns, and a target address+size. The kernel takes the trapped
    task's mm via get_task_mm() and calls vm_mmap_seal_remote(),
    which maps the memfd with MAP_SHARED | MAP_FIXED_NOREPLACE,
    PROT_READ, and VM_SEALED. On success the VMA is installed in
    the target's mm, immediately sealed against munmap/mremap/
    mprotect/MAP_FIXED-stomp from the target itself and any
    CLONE_VM peer. The range is recorded on the listener filter
    for SEND_REDIRECT validation.

  SECCOMP_IOCTL_NOTIF_SEND_REDIRECT

    Resumes the trapped syscall (like FLAG_CONTINUE) with
    arg-register substitution. The supervisor supplies an
    args_mask (which arg registers to replace), a ptr_mask
    (which of those are pointers, validated to fall inside an
    installed pin) and replacement values. The kernel saves
    the trapped task's original arg registers into a small
    heap record, writes substituted values via
    syscall_set_arguments(), and queues a task_work callback
    that fires at user-mode return after the syscall completes
    to restore the original registers. This preserves the
    caller-saved arg-register ABI invariant for callers that
    expected register contents to survive across the syscall
    (compilers under LTO, inline-asm syscall wrappers, anything
    that doesn't strictly follow psABI).
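
The mask rules and the save/substitute/restore sequence can be sketched
in plain C as below. This is a userspace model only: struct
demo_resp_redirect is a local copy of the layout proposed in this patch,
the demo_/redirect_ helper names are hypothetical, and rejecting mask
bits above the six argument slots is an assumption that the description
above does not spell out. Pin containment of each pointer is a separate
check and is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_REDIRECT_ARGS 6

/* Local copy of the proposed layout; not taken from a released header. */
struct demo_resp_redirect {
	uint64_t id;
	uint32_t flags;
	uint32_t args_mask;
	uint32_t ptr_mask;
	uint32_t _pad;
	uint64_t args[DEMO_REDIRECT_ARGS];
	uint64_t ptr_len[DEMO_REDIRECT_ARGS];
};

/* Mask invariants as documented for the response structure. */
static bool redirect_masks_ok(const struct demo_resp_redirect *r)
{
	const uint32_t valid = (1U << DEMO_REDIRECT_ARGS) - 1;

	/* Assumption: bits beyond the six argument slots are invalid. */
	if (r->_pad || (r->args_mask & ~valid))
		return false;
	if (r->ptr_mask & ~r->args_mask)	/* must be a subset */
		return false;
	for (int i = 0; i < DEMO_REDIRECT_ARGS; i++) {
		bool is_ptr = r->ptr_mask & (1U << i);

		if (!is_ptr && r->ptr_len[i])
			return false;
		if (is_ptr && (!r->ptr_len[i] ||
			       r->args[i] + r->ptr_len[i] < r->args[i]))
			return false;
	}
	return true;
}

/* Save every original register, then write the selected replacements. */
static void redirect_apply(const struct demo_resp_redirect *r,
			   uint64_t regs[DEMO_REDIRECT_ARGS],
			   uint64_t saved[DEMO_REDIRECT_ARGS])
{
	for (int i = 0; i < DEMO_REDIRECT_ARGS; i++) {
		saved[i] = regs[i];
		if (r->args_mask & (1U << i))
			regs[i] = r->args[i];
	}
}

/* Put back the original value of each register that was substituted. */
static void redirect_restore(uint32_t args_mask,
			     uint64_t regs[DEMO_REDIRECT_ARGS],
			     const uint64_t saved[DEMO_REDIRECT_ARGS])
{
	for (int i = 0; i < DEMO_REDIRECT_ARGS; i++)
		if (args_mask & (1U << i))
			regs[i] = saved[i];
}
```

In the patch itself the restore step runs from a task_work callback and
is skipped after a successful execve; the model above leaves both out.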

The kernel-side capability is identical to what the trapped
task would have done with its own (peer-uncorrupted) arguments.
No per-syscall kernel-mode entrypoints are added; the
substituted syscall runs in the trapped task's context against
sealed pages whose contents are supervisor-controlled.

Pin lifetime: the VMA lives in the target's mm and follows that
mm's lifetime (the seal protects it from unmap by the target).
The bookkeeping reference to the backing memfd is released when
the listener filter is freed. If the target calls execve(), the
new mm is fresh and the supervisor's bookkeeping range becomes
stale (no VMA exists at that address). A future revision could
explicitly invalidate stale ranges on execve; current code
simply fails subsequent SEND_REDIRECTs against the stale range,
because validation only accepts a pin whose recorded mm is the
trapped task's current mm.
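
The stale-range behaviour follows from how seccomp_pin_check() in this
patch matches a pointer against the recorded pins: a pin only counts if
it was installed in the mm the trapped task is using now. A small
userspace model of that check is shown below; struct demo_pin and
demo_pin_check() are hypothetical names, and mm_id is just an opaque
integer standing in for the mm_struct pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a recorded pin; mm_id models the mm pointer. */
struct demo_pin {
	uint64_t addr;
	uint64_t size;
	uintptr_t mm_id;
};

/* True if [ptr, ptr + len) lies inside one pin installed in cur_mm. */
static bool demo_pin_check(const struct demo_pin *pins, size_t n,
			   uintptr_t cur_mm, uint64_t ptr, uint64_t len)
{
	uint64_t end = ptr + len;

	if (!len || end < ptr)
		return false;
	for (size_t i = 0; i < n; i++) {
		if (pins[i].mm_id != cur_mm)
			continue;	/* stale: installed in a different mm */
		if (ptr >= pins[i].addr && end <= pins[i].addr + pins[i].size)
			return true;
	}
	return false;
}
```

With a single pin recorded for mm 1, the same pointer is accepted while
the task still runs on mm 1 and rejected once it has moved to mm 2.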

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 include/linux/seccomp.h      |   8 +
 include/uapi/linux/seccomp.h |  99 ++++++++++
 kernel/seccomp.c             | 366 +++++++++++++++++++++++++++++++++++
 3 files changed, 473 insertions(+)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9b959972bf4a..f4f733601b8c 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -16,6 +16,14 @@
 #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24
 #define SECCOMP_NOTIFY_ADDFD_SIZE_LATEST SECCOMP_NOTIFY_ADDFD_SIZE_VER0
 
+/* sizeof() the first published struct seccomp_notif_pin_install */
+#define SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0 32
+#define SECCOMP_NOTIFY_PIN_INSTALL_SIZE_LATEST SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0
+
+/* sizeof() the first published struct seccomp_notif_resp_redirect */
+#define SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0 120
+#define SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_LATEST SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0
+
 #ifdef CONFIG_SECCOMP
 
 #include <linux/thread_info.h>
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index dbfc9b37fcae..4d769b634f0b 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -137,6 +137,37 @@ struct seccomp_notif_addfd {
 	__u32 newfd_flags;
 };
 
+/**
+ * struct seccomp_notif_pin_install - have the kernel install a sealed
+ * MAP_SHARED PROT_READ mapping of @memfd into the trapped task's mm
+ * at @target_addr, and record the range as a valid redirect target
+ * for SECCOMP_IOCTL_NOTIF_SEND_REDIRECT.
+ *
+ * The supervisor owns @memfd. The kernel installs the mapping into
+ * the trapped task's address space without target-side cooperation
+ * (the target need not mmap or mseal anything itself). The mapping
+ * is marked VM_SEALED at install time, so the target and any
+ * CLONE_VM peer cannot munmap, mremap, mprotect, or MAP_FIXED-stomp
+ * it. The supervisor retains write access via its own mapping of
+ * the same memfd in its own mm.
+ *
+ * @id: The ID of an active seccomp notification on this listener,
+ *      identifying the trapped task whose mm receives the pin.
+ * @flags: Reserved, must be 0.
+ * @memfd: Supervisor-side fd for the backing memfd.
+ * @target_addr: Address in the trapped task's mm to install at.
+ *               Must be page-aligned. MAP_FIXED_NOREPLACE semantics: no other
+ *               mapping may exist in [@target_addr, @target_addr + @size).
+ * @size: Size of the pin in bytes. Must be page-aligned.
+ */
+struct seccomp_notif_pin_install {
+	__u64 id;
+	__u32 flags;
+	__u32 memfd;
+	__u64 target_addr;
+	__u64 size;
+};
+
 #define SECCOMP_IOC_MAGIC		'!'
 #define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
 #define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
@@ -154,4 +185,72 @@ struct seccomp_notif_addfd {
 
 #define SECCOMP_IOCTL_NOTIF_SET_FLAGS	SECCOMP_IOW(4, __u64)
 
+/* Valid flags for struct seccomp_notif_resp_redirect. */
+#define SECCOMP_REDIRECT_FLAG_CONTINUE (1UL << 0)
+
+/*
+ * Number of syscall argument registers a redirect response may
+ * substitute (matches struct seccomp_data::args[]).
+ */
+#define SECCOMP_REDIRECT_ARGS 6
+
+/**
+ * struct seccomp_notif_resp_redirect - resume the trapped syscall with
+ * substituted arg-register values pointing into a previously installed
+ * pinned-memfd region.
+ *
+ * Like SECCOMP_USER_NOTIF_FLAG_CONTINUE the syscall actually runs, but
+ * the kernel first rewrites the arg registers per @args_mask. Each
+ * pointer substitution (@ptr_mask) is validated so the whole access
+ * [args[i], args[i] + ptr_len[i]) lies inside one pin still mapped in
+ * the trapped task's current mm. Original registers are saved and
+ * restored at syscall exit for ABI compliance - except after a
+ * successful execve, whose new register file is left untouched (the
+ * redirect still applies, as execve copies the pathname from the
+ * immutable pin before the old mm is gone, closing that TOCTOU too).
+ *
+ * @id: The ID of the seccomp notification this response consumes.
+ * @flags: SECCOMP_REDIRECT_FLAG_*. CONTINUE must be set.
+ * @args_mask: Bit i set means args[i] replaces the trapped task's
+ *             corresponding arg register before the syscall runs.
+ * @ptr_mask: Subset of @args_mask. Bit i set means args[i] is a
+ *            pointer and the access [args[i], args[i] + ptr_len[i])
+ *            is validated to lie entirely inside a single installed
+ *            pin. Scalar replacements (in args_mask but not ptr_mask)
+ *            are written verbatim.
+ * @_pad: Reserved, must be 0.
+ * @args: Replacement values for the arg registers.
+ * @ptr_len: For each bit set in @ptr_mask, ptr_len[i] is the byte
+ *           length of the access starting at args[i]; it must be
+ *           non-zero and args[i] + ptr_len[i] must not overflow.
+ *           For every i whose bit is clear in @ptr_mask, ptr_len[i]
+ *           must be 0.
+ */
+struct seccomp_notif_resp_redirect {
+	__u64 id;
+	__u32 flags;
+	__u32 args_mask;
+	__u32 ptr_mask;
+	__u32 _pad;
+	__u64 args[SECCOMP_REDIRECT_ARGS];
+	__u64 ptr_len[SECCOMP_REDIRECT_ARGS];
+};
+
+/*
+ * Install a sealed memfd-backed pin in the trapped task's mm without
+ * target-side cooperation. The supervisor owns the backing memfd;
+ * the kernel installs the mapping and marks it VM_SEALED.
+ */
+#define SECCOMP_IOCTL_NOTIF_PIN_INSTALL	SECCOMP_IOW(5, \
+						struct seccomp_notif_pin_install)
+
+/*
+ * Resume the trapped syscall with substituted arg-register values
+ * pointing into an installed pin. The kernel saves and restores the
+ * original registers at syscall exit so the caller observes ABI-
+ * correct register preservation.
+ */
+#define SECCOMP_IOCTL_NOTIF_SEND_REDIRECT	SECCOMP_IOW(6, \
+						struct seccomp_notif_resp_redirect)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 066909393c38..51b929a6fa6a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -43,6 +43,12 @@
 #include <linux/uaccess.h>
 #include <linux/anon_inodes.h>
 #include <linux/lockdep.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmap_lock.h>
+#include <linux/sched/mm.h>
+#include <linux/task_work.h>
+#include <uapi/asm-generic/mman-common.h>
 
 /*
  * When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the
@@ -232,6 +238,27 @@ struct seccomp_filter {
 	struct notification *notif;
 	struct mutex notify_lock;
 	wait_queue_head_t wqh;
+	struct list_head pins;
+};
+
+/*
+ * A sealed MAP_SHARED PROT_READ mapping of @memfd_file installed at
+ * @target_addr in @mm by SECCOMP_IOCTL_NOTIF_PIN_INSTALL, used as a
+ * redirect target for SECCOMP_IOCTL_NOTIF_SEND_REDIRECT.
+ *
+ * @mm is held via mmgrab() (mm_count, not mm_users) purely for a stable
+ * identity compare; it does not keep the address space alive, so execve
+ * tears the mapping down and switches to a fresh mm. A redirect is only
+ * honoured while the trapped task's current mm equals @mm: since
+ * VM_SEALED makes the mapping immutable for the life of that mm, an
+ * identity match proves the range is still the read-only memfd backing.
+ */
+struct seccomp_pin_range {
+	struct list_head list;
+	unsigned long target_addr;
+	size_t size;
+	struct file *memfd_file;
+	struct mm_struct *mm;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -521,9 +548,44 @@ static inline pid_t seccomp_can_sync_threads(void)
 	return 0;
 }
 
+static void seccomp_pin_range_free(struct seccomp_pin_range *pin)
+{
+	fput(pin->memfd_file);
+	mmdrop(pin->mm);
+	kfree(pin);
+}
+
+static void seccomp_pin_evict(struct seccomp_filter *filter,
+			      unsigned long target_addr)
+{
+	struct seccomp_pin_range *old, *tmp;
+
+	lockdep_assert_held(&filter->notify_lock);
+	list_for_each_entry_safe(old, tmp, &filter->pins, list) {
+		if (old->target_addr == target_addr) {
+			list_del(&old->list);
+			seccomp_pin_range_free(old);
+		}
+	}
+}
+
 static inline void seccomp_filter_free(struct seccomp_filter *filter)
 {
+	struct seccomp_pin_range *pin, *tmp;
+
 	if (filter) {
+		/*
+		 * No other reference to @filter exists at this point. The
+		 * pinned-memfd VMAs themselves live in the trapped task's mm
+		 * and follow that mm's lifetime (the seal protects them from
+		 * unmap by the target); here we only need to drop our
+		 * bookkeeping references: the backing memfd file and the
+		 * mmgrab() identity reference on the install-time mm.
+		 */
+		list_for_each_entry_safe(pin, tmp, &filter->pins, list) {
+			list_del(&pin->list);
+			seccomp_pin_range_free(pin);
+		}
 		bpf_prog_destroy(filter->prog);
 		kfree(filter);
 	}
@@ -708,6 +770,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	refcount_set(&sfilter->refs, 1);
 	refcount_set(&sfilter->users, 1);
 	init_waitqueue_head(&sfilter->wqh);
+	INIT_LIST_HEAD(&sfilter->pins);
 
 	return sfilter;
 }
@@ -1823,6 +1886,303 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
 	return ret;
 }
 
+/*
+ * Install a sealed PROT_READ MAP_SHARED mapping of @memfd_file at
+ * @target_addr in @target's mm, then record the range on @filter for
+ * later SEND_REDIRECT pointer validation. Steals one reference on
+ * @memfd_file on success (held until filter free); on failure, the
+ * caller is responsible for releasing it.
+ */
+static long seccomp_install_pin(struct seccomp_filter *filter,
+				struct task_struct *target,
+				struct file *memfd_file,
+				unsigned long target_addr, size_t size)
+{
+	struct seccomp_pin_range *range;
+	struct mm_struct *mm;
+	unsigned long ret;
+
+	mm = get_task_mm(target);
+	if (!mm)
+		return -ESRCH;
+
+	range = kmalloc_obj(*range, GFP_KERNEL_ACCOUNT);
+	if (!range) {
+		mmput(mm);
+		return -ENOMEM;
+	}
+
+	/*
+	 * Install a read-only, sealed, MAP_FIXED_NOREPLACE mapping: an
+	 * existing mapping at @target_addr yields -EEXIST rather than being
+	 * silently clobbered.
+	 */
+	ret = vm_mmap_seal_remote(mm, memfd_file, target_addr, size, 0);
+	if (IS_ERR_VALUE(ret)) {
+		mmput(mm);
+		kfree(range);
+		return (long)ret;
+	}
+	if (ret != target_addr) {
+		mmput(mm);
+		kfree(range);
+		return -ENOMEM;
+	}
+
+	mmgrab(mm);
+	mmput(mm); /* For get_task_mm() */
+
+	range->target_addr = target_addr;
+	range->size = size;
+	range->memfd_file = memfd_file;
+	range->mm = mm;
+
+	mutex_lock(&filter->notify_lock);
+	seccomp_pin_evict(filter, target_addr);
+	list_add(&range->list, &filter->pins);
+	mutex_unlock(&filter->notify_lock);
+	return 0;
+}
+
+static long seccomp_notify_pin_install(struct seccomp_filter *filter,
+				       struct seccomp_notif_pin_install __user *upin,
+				       unsigned int size)
+{
+	struct seccomp_notif_pin_install pin;
+	struct seccomp_knotif *knotif;
+	struct task_struct *target;
+	struct file *memfd_file;
+	long ret;
+
+	BUILD_BUG_ON(sizeof(pin) < SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0);
+	BUILD_BUG_ON(sizeof(pin) != SECCOMP_NOTIFY_PIN_INSTALL_SIZE_LATEST);
+
+	if (size < SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0 || size >= PAGE_SIZE)
+		return -EINVAL;
+
+	ret = copy_struct_from_user(&pin, sizeof(pin), upin, size);
+	if (ret)
+		return ret;
+
+	if (pin.flags)
+		return -EINVAL;
+	if (!pin.size || !IS_ALIGNED(pin.target_addr, PAGE_SIZE) ||
+	    !IS_ALIGNED(pin.size, PAGE_SIZE))
+		return -EINVAL;
+	if (pin.target_addr + pin.size < pin.target_addr)
+		return -EINVAL;
+
+	memfd_file = fget(pin.memfd);
+	if (!memfd_file)
+		return -EBADF;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		goto out_fput;
+
+	knotif = find_notification(filter, pin.id);
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto out_unlock;
+	}
+
+	target = knotif->task;
+	get_task_struct(target);
+	mutex_unlock(&filter->notify_lock);
+
+	ret = seccomp_install_pin(filter, target, memfd_file,
+				  pin.target_addr, pin.size);
+	put_task_struct(target);
+
+	if (ret < 0)
+		goto out_fput;
+	return 0;
+
+out_unlock:
+	mutex_unlock(&filter->notify_lock);
+out_fput:
+	fput(memfd_file);
+	return ret;
+}
+
+static bool seccomp_pin_check(struct seccomp_filter *filter,
+			      struct mm_struct *target_mm, u64 ptr, u64 len)
+{
+	struct seccomp_pin_range *pin;
+	u64 end;
+
+	lockdep_assert_held(&filter->notify_lock);
+
+	if (!len)
+		return false;
+	end = ptr + len;
+	if (end < ptr)
+		return false;
+
+	list_for_each_entry(pin, &filter->pins, list) {
+		if (pin->mm != target_mm)
+			continue;
+		if (ptr >= pin->target_addr &&
+		    end <= pin->target_addr + pin->size)
+			return true;
+	}
+	return false;
+}
+
+struct seccomp_redirect_restore {
+	struct callback_head twork;
+	unsigned long orig_args[SECCOMP_REDIRECT_ARGS];
+	u32 args_mask;
+	u64 self_exec_id;	/* snapshot to detect an intervening execve */
+};
+
+static void seccomp_redirect_restore_cb(struct callback_head *cb)
+{
+	struct seccomp_redirect_restore *r =
+		container_of(cb, struct seccomp_redirect_restore, twork);
+	unsigned long args[SECCOMP_REDIRECT_ARGS];
+	int i;
+
+	if (READ_ONCE(current->self_exec_id) != r->self_exec_id) {
+		kfree(r);
+		return;
+	}
+
+	syscall_get_arguments(current, current_pt_regs(), args);
+	for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++)
+		if (r->args_mask & (1U << i))
+			args[i] = r->orig_args[i];
+	syscall_set_arguments(current, current_pt_regs(), args);
+	kfree(r);
+}
+
+static long seccomp_notify_send_redirect(struct seccomp_filter *filter,
+					 struct seccomp_notif_resp_redirect __user *uresp,
+					 unsigned int size)
+{
+	struct seccomp_notif_resp_redirect resp;
+	struct seccomp_knotif *knotif;
+	struct seccomp_redirect_restore *restore;
+	struct pt_regs *target_regs;
+	unsigned long args[SECCOMP_REDIRECT_ARGS];
+	long ret;
+	int i;
+
+	BUILD_BUG_ON(sizeof(resp) < SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0);
+	BUILD_BUG_ON(sizeof(resp) != SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_LATEST);
+
+	if (size < SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0 || size >= PAGE_SIZE)
+		return -EINVAL;
+
+	ret = copy_struct_from_user(&resp, sizeof(resp), uresp, size);
+	if (ret)
+		return ret;
+
+	if (!(resp.flags & SECCOMP_REDIRECT_FLAG_CONTINUE))
+		return -EINVAL;
+	if (resp.flags & ~SECCOMP_REDIRECT_FLAG_CONTINUE)
+		return -EINVAL;
+	if (resp.args_mask & ~((1U << SECCOMP_REDIRECT_ARGS) - 1))
+		return -EINVAL;
+	if (resp.ptr_mask & ~resp.args_mask)
+		return -EINVAL;
+	if (resp._pad)
+		return -EINVAL;
+	if (!resp.args_mask)
+		return -EINVAL;
+
+	for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++) {
+		if (resp.ptr_mask & (1U << i)) {
+			if (!resp.ptr_len[i])
+				return -EINVAL;
+		} else if (resp.ptr_len[i]) {
+			return -EINVAL;
+		}
+	}
+
+	restore = kzalloc_obj(*restore, GFP_KERNEL_ACCOUNT);
+	if (!restore)
+		return -ENOMEM;
+	init_task_work(&restore->twork, seccomp_redirect_restore_cb);
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0) {
+		kfree(restore);
+		return ret;
+	}
+
+	knotif = find_notification(filter, resp.id);
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out_unlock_free;
+	}
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto out_unlock_free;
+	}
+
+	for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++) {
+		if (!(resp.ptr_mask & (1U << i)))
+			continue;
+		if (!seccomp_pin_check(filter, knotif->task->mm,
+				       resp.args[i], resp.ptr_len[i])) {
+			ret = -EFAULT;
+			goto out_unlock_free;
+		}
+	}
+
+	/*
+	 * Save original pt_regs args (target is parked in
+	 * seccomp_do_user_notification, so its pt_regs is stable) and
+	 * write substituted values. The trapped task's task_work fires
+	 * at user-mode return, restoring originals for ABI compliance.
+	 */
+	target_regs = task_pt_regs(knotif->task);
+	syscall_get_arguments(knotif->task, target_regs, args);
+	for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++)
+		restore->orig_args[i] = args[i];
+	restore->args_mask = resp.args_mask;
+	restore->self_exec_id = READ_ONCE(knotif->task->self_exec_id);
+
+	for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++)
+		if (resp.args_mask & (1U << i))
+			args[i] = resp.args[i];
+	syscall_set_arguments(knotif->task, target_regs, args);
+
+	ret = task_work_add(knotif->task, &restore->twork, TWA_RESUME);
+	if (ret) {
+		for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++)
+			args[i] = restore->orig_args[i];
+		syscall_set_arguments(knotif->task, target_regs, args);
+		goto out_unlock_free;
+	}
+
+	/*
+	 * Mark REPLIED with FLAG_CONTINUE so the wait-loop exit path
+	 * runs the syscall normally.
+	 */
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = 0;
+	knotif->val = 0;
+	knotif->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+	if (filter->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
+		complete_on_current_cpu(&knotif->ready);
+	else
+		complete(&knotif->ready);
+
+	mutex_unlock(&filter->notify_lock);
+	return 0;
+
+out_unlock_free:
+	mutex_unlock(&filter->notify_lock);
+	kfree(restore);
+	return ret;
+}
+
 static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 				 unsigned long arg)
 {
@@ -1847,6 +2207,12 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
 	switch (EA_IOCTL(cmd)) {
 	case EA_IOCTL(SECCOMP_IOCTL_NOTIF_ADDFD):
 		return seccomp_notify_addfd(filter, buf, _IOC_SIZE(cmd));
+	case EA_IOCTL(SECCOMP_IOCTL_NOTIF_PIN_INSTALL):
+		return seccomp_notify_pin_install(filter, buf,
+						  _IOC_SIZE(cmd));
+	case EA_IOCTL(SECCOMP_IOCTL_NOTIF_SEND_REDIRECT):
+		return seccomp_notify_send_redirect(filter, buf,
+						    _IOC_SIZE(cmd));
 	default:
 		return -EINVAL;
 	}
-- 
2.43.0



* [RFC PATCH v3 3/3] selftests/seccomp: cover non-cooperative pinned-memfd install
  2026-06-13  0:15 [RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect Cong Wang
  2026-06-13  0:15 ` [RFC PATCH v3 1/3] mm: add __do_mmap() and vm_mmap_seal_remote() Cong Wang
  2026-06-13  0:15 ` [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect Cong Wang
@ 2026-06-13  0:15 ` Cong Wang
  2026-06-13 10:07   ` kernel test robot
  2 siblings, 1 reply; 10+ messages in thread
From: Cong Wang @ 2026-06-13  0:15 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Kees Cook, linux-kernel, Will Drewry, Christian Brauner

Exercises SECCOMP_IOCTL_NOTIF_PIN_INSTALL: a forked child that does
no setup of its own traps on openat; the supervisor calls
PIN_INSTALL and the kernel installs a sealed PROT_READ MAP_SHARED
mapping of the supervisor's memfd into the child's mm via
vm_mmap_seal_remote(). The supervisor then SEND_REDIRECTs the
openat path argument to point into the freshly installed pin, and
the child's syscall reads the supervisor-controlled safe path from
the sealed pages.

This is the same kernel path a fork+execve sandbox supervisor
(Sandlock CLI, Firejail, Bubblewrap-style) would use to install
a pin in the new image's fresh post-exec mm. Also covers the
rejection paths: a redirect whose pointer starts at the first byte
past the installed pin, or whose extent runs past the end of the pin,
fails with EFAULT.
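
For reviewers who want to reason about the rejection cases without
booting a test kernel, the containment rule can be sketched in plain
userspace C. This is an illustration only, not the kernel code: the
names pin_range, pin_range_contains and pin_range_selfcheck are
hypothetical, and the per-mm match that seccomp_pin_check() performs is
left out. The base address and size are the ones the selftest uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one installed pin. */
struct pin_range {
	uint64_t target_addr;
	uint64_t size;
};

/*
 * Return true only if [ptr, ptr + len) lies entirely inside one pin.
 * Zero-length and wrapping ranges are rejected, mirroring the checks
 * in seccomp_pin_check() (minus the mm comparison).
 */
static bool pin_range_contains(const struct pin_range *pins, size_t npins,
			       uint64_t ptr, uint64_t len)
{
	uint64_t end = ptr + len;
	size_t i;

	if (!len || end < ptr)
		return false;
	for (i = 0; i < npins; i++) {
		if (ptr >= pins[i].target_addr &&
		    end <= pins[i].target_addr + pins[i].size)
			return true;
	}
	return false;
}

/* The accept and reject cases, using the selftest's base and size. */
static void pin_range_selfcheck(void)
{
	const struct pin_range pin = { 0x70000000ULL, 4096 };
	const uint64_t path_len = sizeof("/dev/null");	/* includes the NUL */

	/* Happy path: the string sits at the start of the pin. */
	assert(pin_range_contains(&pin, 1, pin.target_addr, path_len));
	/* Pointer starts at the first byte past the pin. */
	assert(!pin_range_contains(&pin, 1, pin.target_addr + pin.size,
				   path_len));
	/* Base is inside the pin but the extent runs past its end. */
	assert(!pin_range_contains(&pin, 1, pin.target_addr, pin.size + 1));
	/* Zero length and address wrap-around are both refused. */
	assert(!pin_range_contains(&pin, 1, pin.target_addr, 0));
	assert(!pin_range_contains(&pin, 1, UINT64_MAX, 2));
}
```

Calling pin_range_selfcheck() from a scratch main() runs all five
cases; in the kernel the failing ones surface to the supervisor as
EFAULT from SECCOMP_IOCTL_NOTIF_SEND_REDIRECT.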

Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@multikernel.io>
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 960 ++++++++++++++++++
 1 file changed, 960 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 358b6c65e120..763a8629770e 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -295,6 +295,33 @@ struct seccomp_notif_addfd_big {
 #define PTRACE_EVENTMSG_SYSCALL_EXIT	2
 #endif
 
+#ifndef SECCOMP_IOCTL_NOTIF_PIN_INSTALL
+struct seccomp_notif_pin_install {
+	__u64 id;
+	__u32 flags;
+	__u32 memfd;
+	__u64 target_addr;
+	__u64 size;
+};
+#define SECCOMP_IOCTL_NOTIF_PIN_INSTALL	SECCOMP_IOW(5, \
+						struct seccomp_notif_pin_install)
+#endif
+
+#ifndef SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
+#define SECCOMP_REDIRECT_FLAG_CONTINUE (1UL << 0)
+struct seccomp_notif_resp_redirect {
+	__u64 id;
+	__u32 flags;
+	__u32 args_mask;
+	__u32 ptr_mask;
+	__u32 _pad;
+	__u64 args[6];
+	__u64 ptr_len[6];
+};
+#define SECCOMP_IOCTL_NOTIF_SEND_REDIRECT	SECCOMP_IOW(6, \
+						struct seccomp_notif_resp_redirect)
+#endif
+
 #ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
 #define SECCOMP_USER_NOTIF_FLAG_CONTINUE 0x00000001
 #endif
@@ -4368,6 +4395,939 @@ TEST(user_notification_addfd_rlimit)
 	close(memfd);
 }
 
+/*
+ * Non-cooperative pinned-memfd: kernel installs a sealed PROT_READ
+ * MAP_SHARED mapping of the supervisor's memfd directly into the
+ * trapped task's mm. The target runs no mmap or mseal code itself —
+ * this exercises the same kernel path that a fork+execve sandbox
+ * supervisor would use to install a pin in the new image's fresh
+ * post-exec mm.
+ *
+ * Target child does nothing but call openat() on a bait path. The
+ * supervisor catches the trap, calls PIN_INSTALL (kernel does the
+ * mmap + seal in target's mm via vm_mmap_seal_remote()), writes a
+ * safe path into its own memfd view, and SEND_REDIRECTs args[1]
+ * into the freshly installed pin. The child's openat resumes,
+ * reads from the sealed pin, and returns an fd to the safe path.
+ */
+TEST(user_notification_pinned_memfd_remote)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, memfd;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_pin_install pin = {};
+	struct seccomp_notif_resp_redirect redir = {};
+	char *sup_view;
+	const unsigned long TGT_PIN_BASE = 0x70000000UL;
+	const size_t PIN_SIZE = 4096;
+	const char *safe_path = "/dev/null";
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	memfd = memfd_create("pinned-remote", MFD_ALLOW_SEALING);
+	ASSERT_GE(memfd, 0);
+	ASSERT_EQ(0, ftruncate(memfd, PIN_SIZE));
+	ASSERT_EQ(0, fcntl(memfd, F_ADD_SEALS,
+			   F_SEAL_SHRINK | F_SEAL_GROW));
+
+	sup_view = mmap(NULL, PIN_SIZE, PROT_READ | PROT_WRITE,
+			MAP_SHARED, memfd, 0);
+	ASSERT_NE(MAP_FAILED, sup_view);
+	memcpy(sup_view, safe_path, strlen(safe_path) + 1);
+
+	listener = user_notif_syscall(__NR_openat,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+
+		/*
+		 * Target performs no setup. Just trap on openat. Kernel
+		 * (driven by the supervisor) will install the pin in this
+		 * process's mm at TGT_PIN_BASE behind our back, and our
+		 * openat will be redirected to read from there.
+		 */
+		fd = syscall(__NR_openat, AT_FDCWD,
+			     "/this/should/never/be/touched", O_RDONLY, 0);
+		if (fd < 0)
+			_exit(11);
+		_exit(0);
+	}
+
+	ASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req));
+	EXPECT_EQ(req.data.nr, __NR_openat);
+
+	pin.id = req.id;
+	pin.memfd = memfd;
+	pin.target_addr = TGT_PIN_BASE;
+	pin.size = PIN_SIZE;
+	EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL, &pin)) {
+		if (errno == EINVAL) {
+			SKIP(goto cleanup,
+			     "Kernel does not support pinned-memfd remote install");
+		}
+		TH_LOG("PIN_INSTALL failed: errno=%d", errno);
+	}
+
+	/* Reject: redirect outside any installed pin. */
+	redir.id = req.id;
+	redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+	redir.args_mask = 1U << 1;
+	redir.ptr_mask = 1U << 1;
+	redir.ptr_len[1] = strlen(safe_path) + 1;
+	redir.args[1] = TGT_PIN_BASE + PIN_SIZE;	/* one byte past */
+	EXPECT_EQ(-1, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+			    &redir));
+	EXPECT_EQ(EFAULT, errno);
+
+	/* Reject: base is inside the pin but the extent runs past its end. */
+	redir.args[1] = TGT_PIN_BASE;
+	redir.ptr_len[1] = PIN_SIZE + 1;
+	EXPECT_EQ(-1, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+			    &redir));
+	EXPECT_EQ(EFAULT, errno);
+
+	/* Happy path: redirect into the kernel-installed pin. */
+	redir.args[1] = TGT_PIN_BASE;
+	redir.ptr_len[1] = strlen(safe_path) + 1;
+	EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+			   &redir));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status)) {
+		TH_LOG("child exit %d (11=openat fail)", WEXITSTATUS(status));
+	}
+
+cleanup:
+	munmap(sup_view, PIN_SIZE);
+	close(memfd);
+	close(listener);
+}
+
+/*
+ * Helper for the execve test: read up to @max bytes of a NUL-terminated
+ * string from @pid's mm at @addr into @out. Returns the length read
+ * (excluding the NUL), or -1 on failure or no NUL.
+ */
+static ssize_t read_remote_string(pid_t pid, unsigned long addr,
+				  char *out, size_t max)
+{
+	struct iovec local = { .iov_base = out, .iov_len = max };
+	struct iovec remote = { .iov_base = (void *)addr, .iov_len = max };
+	ssize_t n;
+	size_t i;
+
+	n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
+	if (n <= 0)
+		return -1;
+	for (i = 0; i < (size_t)n; i++)
+		if (out[i] == '\0')
+			return (ssize_t)i;
+	return -1;
+}
+
+/*
+ * Send a file descriptor over a connected UNIX socket via SCM_RIGHTS.
+ * Used by the execve_scm test so the target child can hand its
+ * SECCOMP_FILTER_FLAG_NEW_LISTENER fd to the supervising parent
+ * without the parent having to inherit the seccomp filter itself.
+ */
+static int send_fd(int sock, int fd)
+{
+	char cbuf[CMSG_SPACE(sizeof(int))] = {};
+	char data = 'x';
+	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
+	struct msghdr msg = {
+		.msg_iov = &iov, .msg_iovlen = 1,
+		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
+	};
+	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
+	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
+}
+
+static int recv_fd(int sock)
+{
+	char cbuf[CMSG_SPACE(sizeof(int))] = {};
+	char data;
+	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
+	struct msghdr msg = {
+		.msg_iov = &iov, .msg_iovlen = 1,
+		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
+	};
+	struct cmsghdr *cmsg;
+	int fd;
+
+	if (recvmsg(sock, &msg, 0) < 0)
+		return -1;
+	cmsg = CMSG_FIRSTHDR(&msg);
+	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
+	    cmsg->cmsg_type != SCM_RIGHTS ||
+	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
+		return -1;
+	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
+	return fd;
+}
+
+struct addr_range {
+	unsigned long start, end;
+};
+
+/*
+ * Parse /proc/<pid>/maps looking for the dynamic linker's executable
+ * mapping (glibc ld-linux-*.so, musl ld-musl-*.so, etc.). The trapped
+ * task's instruction_pointer falling in this range identifies a
+ * loader-bootstrap syscall (race-free, kernel-truth) so the supervisor
+ * can auto-allow it without inspecting argument content via the racy
+ * process_vm_readv path.
+ *
+ * Requires the supervisor not to be subject to the seccomp filter
+ * itself -- fopen() internally calls openat(). The execve_scm test
+ * structure (child installs filter, sends listener fd to parent via
+ * SCM_RIGHTS) satisfies that.
+ *
+ * Returns 0 on success with @out populated, -1 if not found.
+ */
+static int find_loader_text_range(pid_t pid, struct addr_range *out)
+{
+	char maps_path[64];
+	char line[512];
+	FILE *f;
+	int found = 0;
+
+	snprintf(maps_path, sizeof(maps_path), "/proc/%d/maps", pid);
+	f = fopen(maps_path, "r");
+	if (!f)
+		return -1;
+
+	while (fgets(line, sizeof(line), f)) {
+		unsigned long start, end;
+		char perms[8];
+		char *path;
+
+		if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) != 3)
+			continue;
+		if (!strchr(perms, 'x'))
+			continue;
+		path = strchr(line, '/');
+		if (!path)
+			continue;
+		/*
+		 * Match common dynamic-linker basenames: ld-linux-*.so
+		 * (glibc), ld-musl-*.so (musl), ld-*.so (older glibc).
+		 */
+		if (strstr(path, "/ld-") || strstr(path, "/ld.so")) {
+			out->start = start;
+			out->end = end;
+			found = 1;
+			break;
+		}
+	}
+	fclose(f);
+	return found ? 0 : -1;
+}
+
+/*
+ * Non-cooperative pinned-memfd across actual execve, with both
+ * post-execve PIN_INSTALL AND post-execve redirect exercised.
+ *
+ *   Phase 1: child does a pre-execve openat; supervisor PIN_INSTALLs
+ *            and SEND_REDIRECTs (same as the basic test).
+ *
+ *   Phase 2: child execve's /bin/cat with a sentinel bait path as
+ *            its argument. The pre-execve pin VMA dies with the old
+ *            mm; the listener filter's bookkeeping becomes stale.
+ *
+ *   Phase 3: ld.so does openat for libc etc. on startup, then cat's
+ *            main() does openat on the bait path. Supervisor reads
+ *            each trapped path via process_vm_readv; library paths
+ *            get SEND_CONTINUE (let the loader do its real work);
+ *            the bait path triggers a fresh PIN_INSTALL (exercises
+ *            idempotent replace in the post-execve mm) followed by
+ *            SEND_REDIRECT pointing args[1] into the pin (which the
+ *            supervisor has primed with "/dev/null"). cat then reads
+ *            zero bytes from /dev/null, writes nothing, exits 0.
+ *
+ * Verifies:
+ *   - Pre-execve redirect works.
+ *   - Post-execve PIN_INSTALL succeeds in the freshly-replaced mm
+ *     (idempotent replace of the stale phase-1 bookkeeping).
+ *   - Post-execve SEND_REDIRECT actually substitutes a path the new
+ *     image then reads — proves the full redirect mechanism works
+ *     across an mm replacement, not just the install side.
+ *   - cat exits 0 (substituted /dev/null read succeeded).
+ */
+TEST(user_notification_pinned_memfd_execve)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, memfd;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_pin_install pin = {};
+	struct seccomp_notif_resp_redirect redir = {};
+	struct seccomp_notif_resp cont_resp = {};
+	char *sup_view;
+	const unsigned long TGT_PIN_BASE = 0x70000000UL;
+	const size_t PIN_SIZE = 4096;
+	const char *safe_path = "/dev/null";
+	const char *bait = "/seccomp_pinned_memfd_test_bait";
+	bool post_exec_install_ok = false;
+	bool post_exec_redirect_done = false;
+	int phase = 0;	/* 0=pre-execve, 1=post-execve */
+	int trap_count = 0;
+	const int trap_limit = 200;
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	if (access("/bin/cat", X_OK) != 0)
+		SKIP(return, "/bin/cat not present");
+
+	memfd = memfd_create("pin-execve", MFD_ALLOW_SEALING);
+	ASSERT_GE(memfd, 0);
+	ASSERT_EQ(0, ftruncate(memfd, PIN_SIZE));
+	ASSERT_EQ(0, fcntl(memfd, F_ADD_SEALS,
+			   F_SEAL_SHRINK | F_SEAL_GROW));
+
+	sup_view = mmap(NULL, PIN_SIZE, PROT_READ | PROT_WRITE,
+			MAP_SHARED, memfd, 0);
+	ASSERT_NE(MAP_FAILED, sup_view);
+	memcpy(sup_view, safe_path, strlen(safe_path) + 1);
+
+	listener = user_notif_syscall(__NR_openat,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int fd;
+
+		/* Pre-execve trap. */
+		fd = syscall(__NR_openat, AT_FDCWD,
+			     "/this/should/never/be/touched", O_RDONLY, 0);
+		if (fd < 0)
+			_exit(11);
+
+		/*
+		 * Replace mm via execve. The pre-execve pin VMA is
+		 * destroyed along with the old mm; the listener filter's
+		 * bookkeeping range becomes stale until the supervisor
+		 * reinstalls. cat will openat the bait path post-execve,
+		 * which the supervisor redirects to /dev/null.
+		 */
+		execl("/bin/cat", "cat", bait, (char *)NULL);
+		_exit(12);
+	}
+
+	for (;;) {
+		struct pollfd pfd = { .fd = listener, .events = POLLIN };
+		int pret = poll(&pfd, 1, 500);
+		pid_t reaped;
+
+		if (pret < 0)
+			break;
+		if (pret == 0 || !(pfd.revents & POLLIN)) {
+			/*
+			 * Timeout or listener-side hangup (no more notifs
+			 * will arrive — child has exited or detached). Use
+			 * waitpid to confirm and exit the loop cleanly;
+			 * issuing NOTIF_RECV unconditionally here would
+			 * block forever when there's nothing to receive.
+			 */
+			reaped = waitpid(pid, &status, WNOHANG);
+			if (reaped == pid)
+				break;
+			if (pfd.revents & (POLLHUP | POLLERR))
+				break;
+			continue;
+		}
+
+		memset(&req, 0, sizeof(req));
+		if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0) {
+			TH_LOG("NOTIF_RECV failed: errno=%d", errno);
+			break;
+		}
+		if (++trap_count > trap_limit) {
+			TH_LOG("trap_limit (%d) exceeded -- aborting loop",
+			       trap_limit);
+			break;
+		}
+
+		if (phase == 0) {
+			/* Pre-execve openat: install pin + redirect. */
+			pin.id = req.id;
+			pin.memfd = memfd;
+			pin.target_addr = TGT_PIN_BASE;
+			pin.size = PIN_SIZE;
+			if (ioctl(listener,
+				  SECCOMP_IOCTL_NOTIF_PIN_INSTALL,
+				  &pin) != 0) {
+				TH_LOG("pre-exec PIN_INSTALL failed: errno=%d",
+				       errno);
+				if (errno == EINVAL)
+					SKIP(goto cleanup,
+					     "Kernel lacks pinned-memfd remote");
+				goto cleanup;
+			}
+
+			memset(&redir, 0, sizeof(redir));
+			redir.id = req.id;
+			redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+			redir.args_mask = 1U << 1;
+			redir.ptr_mask = 1U << 1;
+			redir.ptr_len[1] = strlen(safe_path) + 1;
+			redir.args[1] = TGT_PIN_BASE;
+			if (ioctl(listener,
+				  SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+				  &redir) != 0) {
+				TH_LOG("pre-exec SEND_REDIRECT failed: errno=%d",
+				       errno);
+				goto cleanup;
+			}
+			phase = 1;
+		} else {
+			char path[PATH_MAX];
+			ssize_t n;
+
+			/*
+			 * Loader-vs-program distinction: this test uses the
+			 * openat path argument (read racily via
+			 * process_vm_readv) as a cheap heuristic. The
+			 * race-free pattern is to use
+			 * req.data.instruction_pointer + /proc/<pid>/maps,
+			 * but that requires the supervisor to call openat
+			 * (to fopen the maps file) which would trap on its
+			 * own seccomp filter in this single-process test
+			 * setup. A follow-on test restructures with
+			 * SCM_RIGHTS-based listener-fd passing so the
+			 * supervisor doesn't have the filter and can use
+			 * the proper IP-based pattern.
+			 */
+			n = read_remote_string(req.pid, req.data.args[1],
+					       path, sizeof(path));
+			if (n >= 0 && !strcmp(path, bait)) {
+				int e;
+
+				pin.id = req.id;
+				pin.memfd = memfd;
+				pin.target_addr = TGT_PIN_BASE;
+				pin.size = PIN_SIZE;
+				if (ioctl(listener,
+					  SECCOMP_IOCTL_NOTIF_PIN_INSTALL,
+					  &pin) == 0) {
+					post_exec_install_ok = true;
+				} else {
+					e = errno;
+					TH_LOG("post-exec PIN_INSTALL failed: errno=%d",
+					       e);
+					/*
+					 * Recover: send a normal
+					 * CONTINUE so the child
+					 * isn't left blocked.
+					 */
+					memset(&cont_resp, 0,
+					       sizeof(cont_resp));
+					cont_resp.id = req.id;
+					cont_resp.flags =
+						SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+					ioctl(listener,
+					      SECCOMP_IOCTL_NOTIF_SEND,
+					      &cont_resp);
+					continue;
+				}
+
+				memset(&redir, 0, sizeof(redir));
+				redir.id = req.id;
+				redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+				redir.args_mask = 1U << 1;
+				redir.ptr_mask = 1U << 1;
+				redir.ptr_len[1] = strlen(safe_path) + 1;
+				redir.args[1] = TGT_PIN_BASE;
+				if (ioctl(listener,
+					  SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+					  &redir) == 0) {
+					post_exec_redirect_done = true;
+				} else {
+					TH_LOG("post-exec SEND_REDIRECT failed: errno=%d",
+					       errno);
+					memset(&cont_resp, 0,
+					       sizeof(cont_resp));
+					cont_resp.id = req.id;
+					cont_resp.flags =
+						SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+					ioctl(listener,
+					      SECCOMP_IOCTL_NOTIF_SEND,
+					      &cont_resp);
+				}
+			} else {
+				/*
+				 * Non-bait openat (ld.so loading libc,
+				 * libc opening locale data, etc.).
+				 * Let through unmodified.
+				 */
+				memset(&cont_resp, 0,
+				       sizeof(cont_resp));
+				cont_resp.id = req.id;
+				cont_resp.flags =
+					SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+				ioctl(listener,
+				      SECCOMP_IOCTL_NOTIF_SEND,
+				      &cont_resp);
+			}
+		}
+	}
+
+	/*
+	 * Ensure the child isn't blocked on an unanswered trap when we
+	 * fall out of the loop (trap_limit hit, listener hangup, etc.).
+	 * If it has already exited, waitpid() returns nonzero and kill is skipped.
+	 */
+	if (waitpid(pid, &status, WNOHANG) == 0) {
+		kill(pid, SIGKILL);
+		waitpid(pid, &status, 0);
+	}
+	EXPECT_EQ(true, post_exec_install_ok) {
+		TH_LOG("PIN_INSTALL never succeeded in post-execve mm");
+	}
+	EXPECT_EQ(true, post_exec_redirect_done) {
+		TH_LOG("expected to see and redirect cat's openat of bait path");
+	}
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status)) {
+		TH_LOG("child exit %d (11=pre-exec openat fail, 12=execve fail)",
+		       WEXITSTATUS(status));
+	}
+
+cleanup:
+	munmap(sup_view, PIN_SIZE);
+	close(memfd);
+	close(listener);
+}
+
+/*
+ * Same flow as user_notification_pinned_memfd_execve but with the
+ * proper supervisor-isolation pattern: the child (target) installs
+ * the seccomp filter on itself and sends its listener fd to the
+ * parent (supervisor) via SCM_RIGHTS over a socketpair. The parent
+ * therefore does not carry the seccomp filter and can freely call
+ * openat() -- which is what makes the race-free, kernel-truth
+ * loader detection (req.data.instruction_pointer +
+ * /proc/<pid>/maps) actually usable.
+ */
+TEST(user_notification_pinned_memfd_execve_scm)
+{
+	pid_t pid;
+	int status, listener, memfd, sv[2];
+	struct seccomp_notif req = {};
+	struct seccomp_notif_pin_install pin = {};
+	struct seccomp_notif_resp_redirect redir = {};
+	struct seccomp_notif_resp cont_resp = {};
+	char *sup_view;
+	const unsigned long TGT_PIN_BASE = 0x70000000UL;
+	const size_t PIN_SIZE = 4096;
+	const char *safe_path = "/dev/null";
+	const char *bait = "/seccomp_pinned_memfd_test_bait_scm";
+	bool post_exec_install_ok = false;
+	bool post_exec_redirect_done = false;
+	bool loader_known = false;
+	bool loader_check_attempted = false;
+	struct addr_range loader_range = {};
+	int phase = 0;
+	int trap_count = 0;
+	const int trap_limit = 200;
+
+	if (access("/bin/cat", X_OK) != 0)
+		SKIP(return, "/bin/cat not present");
+
+	memfd = memfd_create("pin-execve-scm", MFD_ALLOW_SEALING);
+	ASSERT_GE(memfd, 0);
+	ASSERT_EQ(0, ftruncate(memfd, PIN_SIZE));
+	ASSERT_EQ(0, fcntl(memfd, F_ADD_SEALS,
+			   F_SEAL_SHRINK | F_SEAL_GROW));
+
+	sup_view = mmap(NULL, PIN_SIZE, PROT_READ | PROT_WRITE,
+			MAP_SHARED, memfd, 0);
+	ASSERT_NE(MAP_FAILED, sup_view);
+	memcpy(sup_view, safe_path, strlen(safe_path) + 1);
+
+	ASSERT_EQ(0, socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv));
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		struct sock_filter filter[] = {
+			BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+				 offsetof(struct seccomp_data, nr)),
+			BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat,
+				 0, 1),
+			BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
+			BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
+		};
+		struct sock_fprog prog = {
+			.len = (unsigned short)ARRAY_SIZE(filter),
+			.filter = filter,
+		};
+		int my_listener;
+		int fd;
+
+		close(sv[0]);
+		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
+			_exit(20);
+		my_listener = seccomp(SECCOMP_SET_MODE_FILTER,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER,
+				      &prog);
+		if (my_listener < 0)
+			_exit(21);
+		if (send_fd(sv[1], my_listener) < 0)
+			_exit(22);
+		close(my_listener);
+		close(sv[1]);
+
+		/* Pre-execve trap. */
+		fd = syscall(__NR_openat, AT_FDCWD,
+			     "/this/should/never/be/touched", O_RDONLY, 0);
+		if (fd < 0)
+			_exit(11);
+
+		execl("/bin/cat", "cat", bait, (char *)NULL);
+		_exit(12);
+	}
+
+	close(sv[1]);
+	listener = recv_fd(sv[0]);
+	close(sv[0]);
+	ASSERT_GE(listener, 0);
+
+	/*
+	 * Parent has the listener fd and does NOT have the seccomp
+	 * filter. fopen(/proc/<pid>/maps) below works without
+	 * deadlocking on the parent's own openat.
+	 */
+	for (;;) {
+		struct pollfd pfd = { .fd = listener, .events = POLLIN };
+		int pret = poll(&pfd, 1, 500);
+		pid_t reaped;
+		bool ip_in_loader;
+
+		if (pret < 0)
+			break;
+		if (pret == 0 || !(pfd.revents & POLLIN)) {
+			reaped = waitpid(pid, &status, WNOHANG);
+			if (reaped == pid)
+				break;
+			if (pfd.revents & (POLLHUP | POLLERR))
+				break;
+			continue;
+		}
+
+		memset(&req, 0, sizeof(req));
+		if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0) {
+			TH_LOG("NOTIF_RECV failed: errno=%d", errno);
+			break;
+		}
+		if (++trap_count > trap_limit) {
+			TH_LOG("trap_limit (%d) exceeded", trap_limit);
+			break;
+		}
+
+		if (phase == 0) {
+			pin.id = req.id;
+			pin.memfd = memfd;
+			pin.target_addr = TGT_PIN_BASE;
+			pin.size = PIN_SIZE;
+			if (ioctl(listener,
+				  SECCOMP_IOCTL_NOTIF_PIN_INSTALL,
+				  &pin) != 0) {
+				TH_LOG("pre-exec PIN_INSTALL failed: errno=%d",
+				       errno);
+				if (errno == EINVAL)
+					SKIP(goto cleanup_scm,
+					     "Kernel lacks pinned-memfd remote");
+				goto cleanup_scm;
+			}
+
+			memset(&redir, 0, sizeof(redir));
+			redir.id = req.id;
+			redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+			redir.args_mask = 1U << 1;
+			redir.ptr_mask = 1U << 1;
+			redir.ptr_len[1] = strlen(safe_path) + 1;
+			redir.args[1] = TGT_PIN_BASE;
+			if (ioctl(listener,
+				  SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+				  &redir) != 0) {
+				TH_LOG("pre-exec SEND_REDIRECT failed: errno=%d",
+				       errno);
+				goto cleanup_scm;
+			}
+			phase = 1;
+			continue;
+		}
+
+		/*
+		 * Post-execve. Lazily resolve the loader range. The
+		 * supervisor's own openat (fopen on /proc/<pid>/maps)
+		 * doesn't trap because the filter lives on the child,
+		 * not on us.
+		 */
+		if (!loader_known && !loader_check_attempted) {
+			if (find_loader_text_range(req.pid,
+						   &loader_range) == 0)
+				loader_known = true;
+			loader_check_attempted = true;
+		}
+
+		ip_in_loader = loader_known &&
+			req.data.instruction_pointer >= loader_range.start &&
+			req.data.instruction_pointer <  loader_range.end;
+
+		if (ip_in_loader) {
+			memset(&cont_resp, 0, sizeof(cont_resp));
+			cont_resp.id = req.id;
+			cont_resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+			ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &cont_resp);
+			continue;
+		}
+
+		/* Program code: inspect the path to identify the bait. */
+		{
+			char path[PATH_MAX];
+			ssize_t n;
+
+			n = read_remote_string(req.pid, req.data.args[1],
+					       path, sizeof(path));
+			if (n < 0 || strcmp(path, bait) != 0) {
+				memset(&cont_resp, 0, sizeof(cont_resp));
+				cont_resp.id = req.id;
+				cont_resp.flags =
+					SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+				ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,
+				      &cont_resp);
+				continue;
+			}
+
+			pin.id = req.id;
+			pin.memfd = memfd;
+			pin.target_addr = TGT_PIN_BASE;
+			pin.size = PIN_SIZE;
+			if (ioctl(listener,
+				  SECCOMP_IOCTL_NOTIF_PIN_INSTALL,
+				  &pin) == 0) {
+				post_exec_install_ok = true;
+			} else {
+				TH_LOG("post-exec PIN_INSTALL failed: errno=%d",
+				       errno);
+				memset(&cont_resp, 0, sizeof(cont_resp));
+				cont_resp.id = req.id;
+				cont_resp.flags =
+					SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+				ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,
+				      &cont_resp);
+				continue;
+			}
+
+			memset(&redir, 0, sizeof(redir));
+			redir.id = req.id;
+			redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+			redir.args_mask = 1U << 1;
+			redir.ptr_mask = 1U << 1;
+			redir.ptr_len[1] = strlen(safe_path) + 1;
+			redir.args[1] = TGT_PIN_BASE;
+			if (ioctl(listener,
+				  SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+				  &redir) == 0) {
+				post_exec_redirect_done = true;
+			} else {
+				TH_LOG("post-exec SEND_REDIRECT failed: errno=%d",
+				       errno);
+				memset(&cont_resp, 0, sizeof(cont_resp));
+				cont_resp.id = req.id;
+				cont_resp.flags =
+					SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+				ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,
+				      &cont_resp);
+			}
+		}
+	}
+
+	if (waitpid(pid, &status, WNOHANG) == 0) {
+		kill(pid, SIGKILL);
+		waitpid(pid, &status, 0);
+	}
+	EXPECT_EQ(true, loader_known) {
+		TH_LOG("find_loader_text_range never resolved");
+	}
+	EXPECT_EQ(true, post_exec_install_ok);
+	EXPECT_EQ(true, post_exec_redirect_done);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+cleanup_scm:
+	munmap(sup_view, PIN_SIZE);
+	close(memfd);
+	close(listener);
+}
+
+#ifdef __x86_64__
+/*
+ * Load-bearing ABI check: after SEND_REDIRECT, the trapped task's
+ * redirected arg register must be restored to its original value
+ * before user-mode code resumes. The kernel's restore mechanism
+ * (task_work_add(TWA_RESUME) -> seccomp_redirect_restore_cb) is
+ * what guarantees this; without a test the property is just an
+ * assertion. Bypass libc's syscall() wrapper (which caller-saves
+ * arg values and would mask a restore bug) and capture the actual
+ * arg register immediately after the SYSCALL instruction.
+ *
+ * The child issues openat with RSI = sentinel_path. The supervisor
+ * SEND_REDIRECTs args[1] (RSI) to point into the pin. The kernel:
+ *   - saves the original RSI into the knotif
+ *   - writes the pin address into RSI via syscall_set_arguments()
+ *   - runs the syscall (kernel reads path from the pin)
+ *   - on syscall_exit_to_user_mode, fires task_work which calls
+ *     syscall_set_arguments() again with the saved original
+ *   - returns to user mode
+ *
+ * If task_work fires correctly, the child observes RSI == sentinel.
+ * If broken, RSI holds the pin address (the redirected value the
+ * kernel left in pt_regs).
+ */
+TEST(user_notification_pinned_memfd_abi)
+{
+	pid_t pid;
+	long ret;
+	int status, listener, memfd;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_pin_install pin = {};
+	struct seccomp_notif_resp_redirect redir = {};
+	char *sup_view;
+	const unsigned long TGT_PIN_BASE = 0x70000000UL;
+	const size_t PIN_SIZE = 4096;
+	const char *safe_path = "/dev/null";
+	/*
+	 * The "sentinel" is a real string the child can also pass as
+	 * the openat path. Its address is captured pre-syscall as RSI;
+	 * post-syscall RSI must equal the same address.
+	 */
+	static const char sentinel_path[] = "/seccomp_abi_sentinel";
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	memfd = memfd_create("pin-abi", MFD_ALLOW_SEALING);
+	ASSERT_GE(memfd, 0);
+	ASSERT_EQ(0, ftruncate(memfd, PIN_SIZE));
+	ASSERT_EQ(0, fcntl(memfd, F_ADD_SEALS,
+			   F_SEAL_SHRINK | F_SEAL_GROW));
+
+	sup_view = mmap(NULL, PIN_SIZE, PROT_READ | PROT_WRITE,
+			MAP_SHARED, memfd, 0);
+	ASSERT_NE(MAP_FAILED, sup_view);
+	memcpy(sup_view, safe_path, strlen(safe_path) + 1);
+
+	listener = user_notif_syscall(__NR_openat,
+				      SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	ASSERT_GE(listener, 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		register long r10_val asm("r10") = 0;
+		unsigned long rsi_after;
+		long fd;
+
+		asm volatile(
+			"syscall\n\t"
+			"mov %%rsi, %[after]"
+			: "=a"(fd), [after] "=&r"(rsi_after)
+			: "0"((long)__NR_openat),
+			  "D"((long)AT_FDCWD),
+			  "S"((unsigned long)sentinel_path),
+			  "d"((long)O_RDONLY),
+			  "r"(r10_val)
+			: "rcx", "r11", "memory"
+		);
+
+		if (fd < 0)
+			_exit(11);
+		/*
+		 * Load-bearing check: RSI immediately post-SYSCALL must
+		 * still be the sentinel pointer the child passed in. The
+		 * kernel's REDIRECT-then-restore mechanism is the only
+		 * thing that guarantees this; a broken restore would leave
+		 * the pin address in RSI.
+		 */
+		if (rsi_after != (unsigned long)sentinel_path)
+			_exit(12);
+		_exit(0);
+	}
+
+	ASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req));
+	EXPECT_EQ(req.data.nr, __NR_openat);
+	EXPECT_EQ(req.data.args[1], (unsigned long)sentinel_path);
+
+	pin.id = req.id;
+	pin.memfd = memfd;
+	pin.target_addr = TGT_PIN_BASE;
+	pin.size = PIN_SIZE;
+	EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL, &pin)) {
+		if (errno == EINVAL)
+			SKIP(goto cleanup,
+			     "Kernel lacks pinned-memfd remote install");
+	}
+
+	redir.id = req.id;
+	redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+	redir.args_mask = 1U << 1;
+	redir.ptr_mask = 1U << 1;
+	redir.ptr_len[1] = strlen(safe_path) + 1;
+	redir.args[1] = TGT_PIN_BASE;
+	EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+			   &redir));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status)) {
+		switch (WEXITSTATUS(status)) {
+		case 11:
+			TH_LOG("child exit 11: openat returned -errno");
+			break;
+		case 12:
+			TH_LOG("child exit 12: ABI violation -- RSI not restored after redirect");
+			break;
+		default:
+			TH_LOG("child exit %d (unexpected)", WEXITSTATUS(status));
+		}
+	}
+
+cleanup:
+	munmap(sup_view, PIN_SIZE);
+	close(memfd);
+	close(listener);
+}
+#endif /* __x86_64__ */
+
 #ifndef SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP
 #define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL << 0)
 #define SECCOMP_IOCTL_NOTIF_SET_FLAGS  SECCOMP_IOW(4, __u64)
-- 
2.43.0



* Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect
  2026-06-13  0:15 ` [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect Cong Wang
@ 2026-06-13  4:03   ` Andy Lutomirski
  2026-06-20 21:11     ` Cong Wang
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2026-06-13  4:03 UTC (permalink / raw)
  To: Cong Wang; +Cc: Kees Cook, linux-kernel, Will Drewry, Christian Brauner

On Fri, Jun 12, 2026 at 5:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Two new ioctls are introduced:
>
>   SECCOMP_IOCTL_NOTIF_PIN_INSTALL
>
>     Supervisor names an active notification id, a memfd it owns,
>     and a target address+size. Kernel grabs the trapped task's
>     mm via get_task_mm(), calls vm_mmap_pgoff_to_mm() with
>     MAP_FIXED | MAP_SHARED, PROT_READ, and VM_SEALED in
>     extra_vm_flags. On success the VMA is installed in the
>     target's mm, immediately sealed against munmap/mremap/
>     mprotect/MAP_FIXED-stomp from the target itself and any
>     CLONE_VM peer. The range is recorded on the listener filter
>     for SEND_REDIRECT validation.
>

I haven't read the code, but I think this at least conceptually makes
a decent amount of sense.  But...

>   SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
>
>     Resumes the trapped syscall (like FLAG_CONTINUE) with
>     arg-register substitution. The supervisor supplies an
>     args_mask (which arg registers to replace), a ptr_mask
>     (which of those are pointers, validated to fall inside an
>     installed pin) and replacement values. The kernel saves
>     the trapped task's original arg registers into a small
>     heap record, writes substituted values via
>     syscall_set_arguments(), and queues a task_work callback
>     that fires at user-mode return after the syscall completes
>     to restore the original registers. This preserves the
>     caller-saved arg-register ABI invariant for callers that
>     expected register contents to survive across the syscall
>     (compilers under LTO, inline-asm syscall wrappers, anything
>     that doesn't strictly follow psABI).

Here there be dragons, and I kind of alluded to some of those dragons
in my recent message about STRICT, but let's be more thorough.

I'm going to totally ignore the implementation for now (which I think
has a memory leak, but whatever -- this is solvable, at least in
principle).  Conceptually, SEND_REDIRECT is handling a seccomp action
by doing a syscall that may be different from the originally requested
syscall.  And we have a whole host of potential issues, some related
to security and some related to functionality.

Let's do the functionality ones first: what happens if a signal
happens?  In the simplest cases (signal completely ignored, task
killed (there's the memory leak), or -EINTR), I think we're mostly
okay.  But in the case where the syscall needs to restart or, worse,
use one of the fancy restart techniques, what should happen?  I think
that even defining semantics is somewhat nontrivial, and I'm a bit
concerned that the user notifier would need to actually be aware of
signals.  Yuck.

Now security: right now we have this rule:

/*
 * All BPF programs must return a 32-bit value.
 * The bottom 16-bits are for optional return data.
 * The upper 16-bits are ordered from least permissive values to most,
 * as a signed value (so 0x8000000 is negative).
 *
 * The ordering ensures that a min_t() over composed return values always
 * selects the least permissive choice.
 */
#define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
#define SECCOMP_RET_KILL_THREAD  0x00000000U /* kill the thread */
#define SECCOMP_RET_KILL         SECCOMP_RET_KILL_THREAD
#define SECCOMP_RET_TRAP         0x00030000U /* disallow and force a SIGSYS */
#define SECCOMP_RET_ERRNO        0x00050000U /* returns an errno */
#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
#define SECCOMP_RET_TRACE        0x7ff00000U /* pass to a tracer or disallow */
#define SECCOMP_RET_LOG          0x7ffc0000U /* allow after logging */
#define SECCOMP_RET_ALLOW        0x7fff0000U /* allow */

This has always bothered me.  In the absence of USER_NOTIF and TRACE,
fine, I guess -- we're choosing the least permissive, and this doesn't
seem too crazy.  But if we do anything fancy (like this patch series),
I think this becomes wrong.  (And I kind of think I said something
along these lines many years ago.)
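
To make the ordering rule concrete, here is a minimal userspace sketch
(not kernel code).  The constants are copied from the header excerpt
above; compose_min() and action_only() are hypothetical names used only
for illustration, and the signed cast assumes the usual two's-complement
conversion that gcc and clang provide.

```c
#include <stdint.h>

/* Values copied from the header excerpt quoted above. */
#define RET_KILL_PROCESS 0x80000000U
#define RET_KILL_THREAD  0x00000000U
#define RET_TRAP         0x00030000U
#define RET_ERRNO        0x00050000U
#define RET_USER_NOTIF   0x7fc00000U
#define RET_TRACE        0x7ff00000U
#define RET_LOG          0x7ffc0000U
#define RET_ALLOW        0x7fff0000U

/* The upper 16 bits select the action; the lower 16 carry data. */
#define ACTION_MASK      0xffff0000U

/* Reinterpret the action bits as signed, so KILL_PROCESS sorts lowest. */
static int32_t action_only(uint32_t ret)
{
	return (int32_t)(ret & ACTION_MASK);
}

/*
 * Hypothetical helper: fold the return values of n stacked filters
 * into one result by keeping the least permissive (smallest) action.
 */
uint32_t compose_min(const uint32_t *rets, int n)
{
	uint32_t ret = RET_ALLOW;

	for (int i = 0; i < n; i++)
		if (action_only(rets[i]) < action_only(ret))
			ret = rets[i];
	return ret;
}
```

Feeding it { ALLOW, USER_NOTIF, ERRNO } selects ERRNO, and
{ KILL_THREAD, KILL_PROCESS } selects KILL_PROCESS.  Note that every
filter votes on the same syscall number; the ordering by itself says
nothing about a syscall that is rewritten after the vote.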

Before this series (in current kernels), one can do syscall emulation
using SECCOMP_RET_TRACE, and it kind of works as long as no filter in
the stack tries to block the original syscall *or* the rewritten
syscall, because syscalls issued by using ptrace to redirect the
traced process go through seccomp again.  It's a total mess, it can't
handle complex cases, but it's at least approximately secure.

With this series, I think it's all busted.  Suppose I make a container
and block everyone's favorite unshare, generating SECCOMP_RET_ERRNO.
(This is the default "docker" (actually moby I think) policy.)  Then,
inside the container, I write a program that installs a filter that
sends syscall 12345 to USER_NOTIF.  Then I fork and my child does
syscall 12345.  I handle USER_NOTIF by using the new redirect feature
to redirect to unshare().  And unshare() gets called.

IMHO what *should* happen is that we actually keep track of where we
are in the seccomp filter stack.  We start from the innermost filter
(most recently applied) and start running the filters.  And then we do
something that actually makes sense based on the result.  For example:

KILL: Kill it.  Do not run more filters.  (I suppose we could see if
an outer filter promotes from KILL_THREAD to KILL_PROCESS, but this
doesn't seem helpful.)

TRAP:  Generate the signal.  Do *not* run more filters.  Sure, this
can allow a contained program to generate a SIGSYS instead of getting
killed if it tries some blocked syscall that the outer filter wants to
KILL.  So what?  I actually think this is better behavior -- the
combination of the program and inner filter is not actually doing the
syscall.

ERRNO: Same deal -- replace the syscall with a return of the specified
value.  Don't call more filters.

TRACE: Similar.

ALLOW: Call the next filter.

USER_NOTIF: Stop calling filters and remember where we are in the
filter chain.  Call out to the user notifier *associated with this
filter*.  When the user notifier responds, if the notifier asks for a
redirect or to resume the syscall, then continue calling filters *on
the new syscall*.

Looking at my example above, the effect would be that the inner filter
gets a notifier event for syscall 12345 and redirects to unshare.
Then the outer filter sees unshare.  It can ERRNO to cause unshare to
return an error, or it can do its own USER_NOTIF to do something fancy
with unshare, or it can KILL, etc.
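
A rough userspace model of that walk, purely as a sketch: every name
here (struct model_filter, walk_filters(), container_example()) is
made up for illustration, filters are plain function pointers rather
than BPF programs, and 272 is the x86_64 number for unshare.  A
notifier returns the syscall number to resume with, or a negative value
if it supplied the result itself.

```c
#include <stddef.h>

enum act { ACT_KILL, ACT_TRAP, ACT_ERRNO, ACT_TRACE, ACT_USER_NOTIF, ACT_ALLOW };

struct model_filter {
	enum act (*decide)(int nr);   /* stand-in for the BPF program */
	int (*notify)(int nr);        /* stand-in for this filter's notifier */
	struct model_filter *prev;    /* next filter outward */
};

struct verdict {
	enum act action;
	int nr;                       /* syscall that runs if action is ACT_ALLOW */
};

/* Walk from the innermost filter outward, one filter at a time. */
struct verdict walk_filters(struct model_filter *innermost, int nr)
{
	for (struct model_filter *f = innermost; f; f = f->prev) {
		enum act a = f->decide(nr);

		if (a == ACT_ALLOW)
			continue;                 /* ask the next filter out */
		if (a == ACT_USER_NOTIF) {
			int new_nr = f->notify(nr);

			if (new_nr < 0)           /* notifier handled it itself */
				return (struct verdict){ ACT_USER_NOTIF, nr };
			nr = new_nr;              /* outer filters see the new syscall */
			continue;
		}
		return (struct verdict){ a, nr }; /* KILL/TRAP/ERRNO/TRACE stop here */
	}
	return (struct verdict){ ACT_ALLOW, nr };
}

#define NR_MADE_UP 12345
#define NR_UNSHARE 272                    /* x86_64 */

static enum act inner_decide(int nr)
{
	return nr == NR_MADE_UP ? ACT_USER_NOTIF : ACT_ALLOW;
}

static int inner_notify(int nr)
{
	(void)nr;
	return NR_UNSHARE;                    /* redirect to unshare */
}

static enum act outer_decide(int nr)
{
	return nr == NR_UNSHARE ? ACT_ERRNO : ACT_ALLOW;
}

/* The container example: outer filter blocks unshare, inner redirects to it. */
struct verdict container_example(int nr)
{
	struct model_filter outer = { outer_decide, NULL, NULL };
	struct model_filter inner = { inner_decide, inner_notify, &outer };

	return walk_filters(&inner, nr);
}
```

In this model container_example(12345) ends in ACT_ERRNO with nr set to
unshare: the outer filter gets to judge the redirected syscall rather
than the made-up one.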

This may be enough of a scary departure that we will want each filter
to opt in to the new behavior for filters applied later.  Or maybe
everyone can get comfortable enough with it to just switch over.  Or
maybe there's another solution.  Or maybe someone can try to convince
me that the existing behavior makes sense if syscalls can be
redirected (maybe call the whole chain on the redirected syscall? Even
defining that gets a little messy.)

--Andy


* Re: [RFC PATCH v3 3/3] selftests/seccomp: cover non-cooperative pinned-memfd install
  2026-06-13  0:15 ` [RFC PATCH v3 3/3] selftests/seccomp: cover non-cooperative pinned-memfd install Cong Wang
@ 2026-06-13 10:07   ` kernel test robot
  0 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2026-06-13 10:07 UTC (permalink / raw)
  To: Cong Wang; +Cc: llvm, oe-kbuild-all

Hi Cong,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on 28608283615e5e7e92ea79c8ea13507f4b5e0cbe]

url:    https://github.com/intel-lab-lkp/linux/commits/Cong-Wang/mm-add-__do_mmap-and-vm_mmap_seal_remote/20260613-081932
base:   28608283615e5e7e92ea79c8ea13507f4b5e0cbe
patch link:    https://lore.kernel.org/r/20260613001533.314739-4-xiyou.wangcong%40gmail.com
patch subject: [RFC PATCH v3 3/3] selftests/seccomp: cover non-cooperative pinned-memfd install
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20260613/202606131238.sI0qu5os-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project f43d6834093b19baf79beda8c0337ab020ac5f17)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260613/202606131238.sI0qu5os-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606131238.sI0qu5os-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: mm/vma.c:2832 function parameter 'mm' not described in 'mmap_region'

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect
  2026-06-13  4:03   ` Andy Lutomirski
@ 2026-06-20 21:11     ` Cong Wang
  2026-06-23 19:02       ` Andy Lutomirski
  0 siblings, 1 reply; 10+ messages in thread
From: Cong Wang @ 2026-06-20 21:11 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Kees Cook, linux-kernel, Will Drewry, Christian Brauner

On Fri, Jun 12, 2026 at 9:03 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Fri, Jun 12, 2026 at 5:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Two new ioctls are introduced:
> >
> >   SECCOMP_IOCTL_NOTIF_PIN_INSTALL
> >
> >     Supervisor names an active notification id, a memfd it owns,
> >     and a target address+size. Kernel grabs the trapped task's
> >     mm via get_task_mm(), calls vm_mmap_pgoff_to_mm() with
> >     MAP_FIXED | MAP_SHARED, PROT_READ, and VM_SEALED in
> >     extra_vm_flags. On success the VMA is installed in the
> >     target's mm, immediately sealed against munmap/mremap/
> >     mprotect/MAP_FIXED-stomp from the target itself and any
> >     CLONE_VM peer. The range is recorded on the listener filter
> >     for SEND_REDIRECT validation.
> >
>
> I haven't read the code, but I think this at least conceptually makes
> a decent amount of sense.  But...
>
> >   SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
> >
> >     Resumes the trapped syscall (like FLAG_CONTINUE) with
> >     arg-register substitution. The supervisor supplies an
> >     args_mask (which arg registers to replace), a ptr_mask
> >     (which of those are pointers, validated to fall inside an
> >     installed pin) and replacement values. The kernel saves
> >     the trapped task's original arg registers into a small
> >     heap record, writes substituted values via
> >     syscall_set_arguments(), and queues a task_work callback
> >     that fires at user-mode return after the syscall completes
> >     to restore the original registers. This preserves the
> >     caller-saved arg-register ABI invariant for callers that
> >     expected register contents to survive across the syscall
> >     (compilers under LTO, inline-asm syscall wrappers, anything
> >     that doesn't strictly follow psABI).
>
> Here there be dragons, and I kind of alluded to some of those dragons
> in my recent message about STRICT, but let's be more thorough.
>
> I'm going to totally ignore the implementation for now (which I think
> has a memory leak, but whatever -- this is solvable, at least in

Yes, let's get the design correct before digging into any detail.

> principle).  Conceptually, SEND_REDIRECT is handling a seccomp action
> by doing a syscall that may be different from the originally requested
> syscall.  And we have a whole host of potential issues, some related
> to security and some related to functionality.
>
> Let's do the functionality ones first: what happens if a signal
> happens?  In the simplest cases (signal completely ignored, task
> killed (there's the memory leak), or -EINTR), I think we're mostly
> okay.  But in the case where the syscall needs to restart or, worse,
> use one of the fancy restart techniques, what should happen?  I think
> that even defining semantics is somewhat nontrivial, and I'm a bit
> concerned that the user notifier would need to actually be aware of
> signals.  Yuck.

Good catch! I completely missed the signal case. How about restoring
at syscall-exit, before signal/restart processing?

>
> Now security: right now we have this rule:
>
> /*
>  * All BPF programs must return a 32-bit value.
>  * The bottom 16-bits are for optional return data.
>  * The upper 16-bits are ordered from least permissive values to most,
>  * as a signed value (so 0x8000000 is negative).
>  *
>  * The ordering ensures that a min_t() over composed return values always
>  * selects the least permissive choice.
>  */
> #define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
> #define SECCOMP_RET_KILL_THREAD  0x00000000U /* kill the thread */
> #define SECCOMP_RET_KILL         SECCOMP_RET_KILL_THREAD
> #define SECCOMP_RET_TRAP         0x00030000U /* disallow and force a SIGSYS */
> #define SECCOMP_RET_ERRNO        0x00050000U /* returns an errno */
> #define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> #define SECCOMP_RET_TRACE        0x7ff00000U /* pass to a tracer or disallow */
> #define SECCOMP_RET_LOG          0x7ffc0000U /* allow after logging */
> #define SECCOMP_RET_ALLOW        0x7fff0000U /* allow */
>
> This has always bothered me.  In the absence of USER_NOTIF and TRACE,
> fine, I guess -- we're choosing the least permissive, and this doesn't
> seem too crazy.  But if we do anything fancy (like this patch series),
> I think this becomes wrong.  (And I kind of think I said something
> along these lines many years ago.)
>
> Before this series (in current kernels), one can do syscall emulation
> using SECCOMP_RET_TRACE, and it kind of works as long as no filter in
> the stack tries to block the original syscall *or* the rewritten
> syscall, because syscalls issued by using ptrace to redirect the
> traced process go through seccomp again.  It's a total mess, it can't
> handle complex cases, but it's at least approximately secure.
>
> With this series, I think it's all busted.  Suppose I make a container
> and block everyone's favorite unshare, generating SECCOMP_RET_ERRNO.
> (This is the default "docker" (actually moby I think) policy.)  Then,
> inside the container, I write a program that installs a filter that
> sends syscall 12345 to USER_NOTIF.  Then I fork and my child does
> syscall 12345.  I handle USER_NOTIF by using the new redirect feature
> to redirect to unshare().  And unshare() gets called.
>
> IMHO what *should* happen is that we actually keep track of where we
> are in the seccomp filter stack.  We start from the innermost filter
> (most recently applied) and start running the filters.  And then we do
> something that actually makes sense based on the result.  For example:
>
> KILL: Kill it.  Do not run more filters.  (I suppose we could see if
> an outer filter promotes from KILL_THREAD to KILL_PROCESS, but this
> doesn't seem helpful.)
> TRAP:  Generate the signal.  Do *not* run more filters.  Sure, this
> can allow a contained program to generate a SIGSYS instead of getting
> killed if it tries some blocked syscall that the outer filter wants to
> KILL.  So what?  I actually think this is better behavior -- the
> combination of the program and inner filter is not actually doing the
> syscall.
>
> ERRNO: Same deal -- replace the syscall with a return of the specified
> value.  Don't call more filters.
>
> TRACE: Similar.
>
> ALLOW: Call the next filter.
>
> USER_NOTIF: Stop calling filters and remember where we are in the
> filter chain.  Call out to the user notifier *associated with this
> filter*.  When the user notifier responds, if the notifier asks for a
> redirect or to resume the syscall, then continue calling filters *on
> the new syscall*.
>
> Looking at my example above, the effect would be that the inner filter
> gets a notifier event for syscall 12345 and redirects to unshare.
> Then the outer filter sees unshare.  It can ERROR to cause unshare to
> return an error, or it can do its own USER_NOTIF to do something fancy
> with unshare, or it can KILL, etc.
>
>
> This may be enough of a scary departure that we will want each filter
> to opt in to the new behavior for filters applied later.  Or maybe
> everyone can get comfortable enough with it to just switch over.  Or
> maybe there's another solution.  Or maybe someone can try to convince
> me that the existing behavior makes sense if syscalls can be
> redirected (maybe call the whole chain on the redirected syscall? Even
> defining that gets a little messy.)

Thanks for the detailed analysis!

How about keeping min and re-running only the outer suffix after a redirect?

I think this is the safest option. I agree that your suggestion of removing
min is more elegant, but it also risks breaking existing filtering logic.

Below is a code sketch of what I propose:

  static u32 seccomp_run_filters_from(const struct seccomp_data *sd,
                                      struct seccomp_filter *start,
                                      struct seccomp_filter **match)
  {
      u32 ret = SECCOMP_RET_ALLOW;

      for (struct seccomp_filter *f = start; f; f = f->prev) {
          u32 cur = bpf_prog_run_pin_on_cpu(f->prog, sd);

          /* signed compare: keep the least permissive action so far */
          if (ACTION_ONLY(cur) < ACTION_ONLY(ret)) {
              ret = cur;
              *match = f;
          }
      }
      return ret;
  }

  /* __seccomp_filter takes a `start` (innermost on the first call) */
  case SECCOMP_RET_USER_NOTIF:
      if (seccomp_do_user_notification(this_syscall, match, &sd))
          goto skip;

      /* syscall may be rewritten now; re-vote with the OUTER filters only */
      this_syscall = syscall_get_nr(current, current_pt_regs());
      if (this_syscall < 0)
          return 0;
      return __seccomp_filter(this_syscall, match->prev /* start outward */);
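
For illustration only, the same idea as a small userspace model.  This
is a sketch under assumptions, not the proposed kernel code: filters
are function pointers instead of BPF programs, and struct model_filter,
run_from(), filter_syscall() and suffix_example() are hypothetical
names.  272 is the x86_64 number for unshare.

```c
#include <stddef.h>
#include <stdint.h>

#define RET_ERRNO      0x00050000U
#define RET_USER_NOTIF 0x7fc00000U
#define RET_ALLOW      0x7fff0000U
#define ACTION_ONLY(r) ((int32_t)((r) & 0xffff0000U))

struct model_filter {
	uint32_t (*prog)(int nr);     /* stand-in for the BPF program */
	int (*notify)(int nr);        /* returns the (possibly rewritten) nr */
	struct model_filter *prev;    /* next filter outward */
};

/* min over the filters from `start` outward; *match is the winner. */
static uint32_t run_from(struct model_filter *start, int nr,
			 struct model_filter **match)
{
	uint32_t ret = RET_ALLOW;

	for (struct model_filter *f = start; f; f = f->prev) {
		uint32_t cur = f->prog(nr);

		if (ACTION_ONLY(cur) < ACTION_ONLY(ret)) {
			ret = cur;
			*match = f;
		}
	}
	return ret;
}

/* On USER_NOTIF, let the notifier rewrite nr, then re-vote outward only. */
uint32_t filter_syscall(struct model_filter *start, int *nr)
{
	struct model_filter *match = NULL;
	uint32_t ret = run_from(start, *nr, &match);

	if (ACTION_ONLY(ret) != ACTION_ONLY(RET_USER_NOTIF))
		return ret;
	*nr = match->notify(*nr);
	return filter_syscall(match->prev, nr);
}

static uint32_t inner_prog(int nr)
{
	return nr == 12345 ? RET_USER_NOTIF : RET_ALLOW;
}

static int inner_notify(int nr)
{
	(void)nr;
	return 272;                   /* redirect to unshare (x86_64) */
}

static uint32_t outer_prog(int nr)
{
	return nr == 272 ? (RET_ERRNO | 1) : RET_ALLOW;   /* EPERM */
}

/* Andy's container example: outer blocks unshare, inner redirects to it. */
uint32_t suffix_example(int *nr)
{
	struct model_filter outer = { outer_prog, NULL, NULL };
	struct model_filter inner = { inner_prog, inner_notify, &outer };

	return filter_syscall(&inner, nr);
}
```

Starting from nr = 12345, the first pass picks USER_NOTIF from the inner
filter, the notifier rewrites nr to unshare, and the re-vote over the
outer suffix returns ERRNO.  Unlike a strict inner-to-outer walk, the
outer filter here also votes on the original syscall in the first pass.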


Please let me know your preference.

Thanks!
Cong


* Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect
  2026-06-20 21:11     ` Cong Wang
@ 2026-06-23 19:02       ` Andy Lutomirski
  2026-06-23 19:11         ` Kees Cook
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Lutomirski @ 2026-06-23 19:02 UTC (permalink / raw)
  To: Cong Wang; +Cc: Kees Cook, linux-kernel, Will Drewry, Christian Brauner

On Sat, Jun 20, 2026 at 2:12 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Fri, Jun 12, 2026 at 9:03 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > On Fri, Jun 12, 2026 at 5:16 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > Two new ioctls are introduced:
> > >
> > >   SECCOMP_IOCTL_NOTIF_PIN_INSTALL
> > >
> > >     Supervisor names an active notification id, a memfd it owns,
> > >     and a target address+size. Kernel grabs the trapped task's
> > >     mm via get_task_mm(), calls vm_mmap_pgoff_to_mm() with
> > >     MAP_FIXED | MAP_SHARED, PROT_READ, and VM_SEALED in
> > >     extra_vm_flags. On success the VMA is installed in the
> > >     target's mm, immediately sealed against munmap/mremap/
> > >     mprotect/MAP_FIXED-stomp from the target itself and any
> > >     CLONE_VM peer. The range is recorded on the listener filter
> > >     for SEND_REDIRECT validation.
> > >
> >
> > I haven't read the code, but I think this at least conceptually makes
> > a decent amount of sense.  But...
> >
> > >   SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
> > >
> > >     Resumes the trapped syscall (like FLAG_CONTINUE) with
> > >     arg-register substitution. The supervisor supplies an
> > >     args_mask (which arg registers to replace), a ptr_mask
> > >     (which of those are pointers, validated to fall inside an
> > >     installed pin) and replacement values. The kernel saves
> > >     the trapped task's original arg registers into a small
> > >     heap record, writes substituted values via
> > >     syscall_set_arguments(), and queues a task_work callback
> > >     that fires at user-mode return after the syscall completes
> > >     to restore the original registers. This preserves the
> > >     caller-saved arg-register ABI invariant for callers that
> > >     expected register contents to survive across the syscall
> > >     (compilers under LTO, inline-asm syscall wrappers, anything
> > >     that doesn't strictly follow psABI).
> >
> > Here there be dragons, and I kind of alluded to some of those dragons
> > in my recent message about STRICT, but let's be more thorough.
> >
> > I'm going to totally ignore the implementation for now (which I think
> > has a memory leak, but whatever -- this is solvable, at least in
>
> Yes, let's get the design correct before digging into any detail.
>
> > principle).  Conceptually, SEND_REDIRECT is handling a seccomp action
> > by doing a syscall that may be different from the originally requested
> > syscall.  And we have a whole host of potential issues, some related
> > to security and some related to functionality.
> >
> > Let's do the functionality ones first: what happens if a signal
> > happens?  In the simplest cases (signal completely ignored, task
> > killed (there's the memory leak), or -EINTR), I think we're mostly
> > okay.  But in the case where the syscall needs to restart or, worse,
> > use one of the fancy restart techniques, what should happen?  I think
> > that even defining semantics is somewhat nontrivial, and I'm a bit
> > concerned that the user notifier would need to actually be aware of
> > signals.  Yuck.
>
> Good catch! I completely missed the signal case, how about restoring
> at syscall-exit, before signal/restart processing?

I think that handling this nicely is extremely complex.

One could imagine a fictional universe where Linux works like this:

void user_asked_for_a_syscall(nr, args, etc)
{
    do preprocessing;

    handle seccomp;

    ret = actually do the syscall;

    if (ret says a signal happened) {
      do the horrid magical signal fixup;
    }
}

But Linux doesn't, and cannot quite, work this way.  On a syscall
entry, first there's seccomp.  Then, after seccomp *returns* (and even
the function that called it returns!), we make it to the actual
syscall.  Then we make it to later code that notices a signal and
deals with restart, and x86 even has two copies of this (handle_signal
and arch_do_signal_or_restart, both in the same file), and they work
differently.

Now, on the bright side, the actual semantics are sort of all in the
syscall itself and its return value, along with (x86-specific) orig_ax
(see the orig_ax = -1 in restore_sigcontext).  So maybe one could
rearrange the syscall code so that seccomp can actually see the return
value.  And this might actually be an excellent idea, although it
would need to be done with quite a bit of care.

And this whole mess is kind of necessary: it's possible for user code
to do a syscall, get a signal, and have a handler for that signal.  So
the kernel will rewrite the return state so it returns to the handler.
Then the handler returns, via sigreturn, and sigreturn needs to be
able to *resume a potentially interrupted syscall*.  So the kernel
needs to set up the stack so that this happens.  There is no actual
guarantee that any of this matches up correctly -- what if the user
does usermode threading and resumes on a different thread?
Regrettably, the kernel has the restart_block mechanism and does keep
some limited state, and this sucks and has nasty corner cases, and I
really don't think we want to expose this to seccomp.


One solution is to declare that, for now, we will only allow one user
notifier in the stack or at least only one that declares its intention
to use the redirect feature.  Even with this, the fact that we kind of
need to fix up registers after a redirected syscall is a mess, but at
least *that* mess can be fixed in a sort of one-deep sense by making
sure that we fix it after precisely the one syscall we issued (which
is roughly what your patch does).

>
> >
> > Now security: right now we have this rule:
> >
> > /*
> >  * All BPF programs must return a 32-bit value.
> >  * The bottom 16-bits are for optional return data.
> >  * The upper 16-bits are ordered from least permissive values to most,
> >  * as a signed value (so 0x8000000 is negative).
> >  *
> >  * The ordering ensures that a min_t() over composed return values always
> >  * selects the least permissive choice.
> >  */
> > #define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
> > #define SECCOMP_RET_KILL_THREAD  0x00000000U /* kill the thread */
> > #define SECCOMP_RET_KILL         SECCOMP_RET_KILL_THREAD
> > #define SECCOMP_RET_TRAP         0x00030000U /* disallow and force a SIGSYS */
> > #define SECCOMP_RET_ERRNO        0x00050000U /* returns an errno */
> > #define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> > #define SECCOMP_RET_TRACE        0x7ff00000U /* pass to a tracer or disallow */
> > #define SECCOMP_RET_LOG          0x7ffc0000U /* allow after logging */
> > #define SECCOMP_RET_ALLOW        0x7fff0000U /* allow */
> >
> > This has always bothered me.  In the absence of USER_NOTIF and TRACE,
> > fine, I guess -- we're choosing the least permissive, and this doesn't
> > seem too crazy.  But if we do anything fancy (like this patch series),
> > I think this becomes wrong.  (And I kind of think I said something
> > along these lines many years ago.)
> >
> > Before this series (in current kernels), one can do syscall emulation
> > using SECCOMP_RET_TRACE, and it kind of works as long as no filter in
> > the stack tries to block the original syscall *or* the rewritten
> > syscall, because syscalls issued by using ptrace to redirect the
> > traced process go through seccomp again.  It's a total mess, it can't
> > handle complex cases, but it's at least approximately secure.
> >
> > With this series, I think it's all busted.  Suppose I make a container
> > and block everyone's favorite unshare, generating SECCOMP_RET_ERRNO.
> > (This is the default "docker" (actually moby I think) policy.)  Then,
> > inside the container, I write a program that installs a filter that
> > sends syscall 12345 to USER_NOTIF.  Then I fork and my child does
> > syscall 12345.  I handle USER_NOTIF by using the new redirect feature
> > to redirect to unshare().  And unshare() gets called.
> >
> > IMHO what *should* happen is that we actually keep track of where we
> > are in the seccomp filter stack.  We start from the innermost filter
> > (most recently applied) and start running the filters.  And then we do
> > something that actually makes sense based on the result.  For example:
> >
> > KILL: Kill it.  Do not run more filters.  (I suppose we could see if
> > an outer filter promotes from KILL_THREAD to KILL_PROCESS, but this
> > doesn't seem helpful.)
> > TRAP:  Generate the signal.  Do *not* run more filters.  Sure, this
> > can allow a contained program to generate a SIGSYS instead of getting
> > killed if it tries some blocked syscall that the outer filter wants to
> > KILL.  So what?  I actually think this is better behavior -- the
> > combination of the program and inner filter is not actually doing the
> > syscall.
> >
> > ERRNO: Same deal -- replace the syscall with a return of the specified
> > value.  Don't call more filters.
> >
> > TRACE: Similar.
> >
> > ALLOW: Call the next filter.
> >
> > USER_NOTIF: Stop calling filters and remember where we are in the
> > filter chain.  Call out to the user notifier *associated with this
> > filter*.  When the user notifier responds, if the notifier asks for a
> > redirect or to resume the syscall, then continue calling filters *on
> > the new syscall*.
> >
> > Looking at my example above, the effect would be that the inner filter
> > gets a notifier event for syscall 12345 and redirects to unshare.
> > Then the outer filter sees unshare.  It can ERRNO to cause unshare to
> > return an error, or it can do its own USER_NOTIF to do something fancy
> > with unshare, or it can KILL, etc.
> >
> >
> > This may be enough of a scary departure that we will want each filter
> > to opt in to the new behavior for filters applied later.  Or maybe
> > everyone can get comfortable enough with it to just switch over.  Or
> > maybe there's another solution.  Or maybe someone can try to convince
> > me that the existing behavior makes sense if syscalls can be
> > redirected (maybe call the whole chain on the redirected syscall? Even
> > defining that gets a little messy.)
>
> Thanks for the detailed analysis!
>
> How about keeping min and re-running only the outer suffix after a redirect?
>
> I think this is the safest option. I agree your suggestion of removing min
> is more elegant, but it also risks breaking existing filtering logic.

I'm really not convinced that the min is needed to preserve any useful
behavior.  But Kees is very conservative about these things, with good
reason.
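
For reference, here is a minimal userspace sketch of the "min" rule being
discussed. The action values are copied from the header quoted above;
action_only() and compose_min() are hypothetical helper names used only for
illustration, not the kernel's actual functions. The point is that the
action bits are compared as a signed 32-bit value, which is why
KILL_PROCESS (0x80000000) sorts below everything else.

```c
#include <stdint.h>

/* Action values as quoted from the UAPI header above. */
#define SECCOMP_RET_KILL_PROCESS 0x80000000U
#define SECCOMP_RET_KILL_THREAD  0x00000000U
#define SECCOMP_RET_ERRNO        0x00050000U
#define SECCOMP_RET_USER_NOTIF   0x7fc00000U
#define SECCOMP_RET_ALLOW        0x7fff0000U
#define SECCOMP_RET_ACTION_FULL  0xffff0000U

/* Hypothetical helper: action bits reinterpreted as a signed value. */
static int32_t action_only(uint32_t ret)
{
	return (int32_t)(ret & SECCOMP_RET_ACTION_FULL);
}

/*
 * Hypothetical helper: combine the return values of every filter in the
 * stack and keep the least permissive one. An empty stack allows.
 */
static uint32_t compose_min(const uint32_t *rets, int n)
{
	uint32_t best = SECCOMP_RET_ALLOW;

	for (int i = 0; i < n; i++)
		if (action_only(rets[i]) < action_only(best))
			best = rets[i];
	return best;
}
```

Under this rule a USER_NOTIF from one filter loses to an ERRNO from any
other filter, but only for the syscall that was actually evaluated. Nothing
in the rule says what to do with a syscall that is substituted afterwards.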

I'm also not sure what happens if we do a redirect and then discover
that the outer rules trigger a user notifier.  We need *some*
semantics in this case.
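
To make the question concrete, here is a rough userspace model of the
positional walk (innermost filter first, stop at the first non-ALLOW result,
remember where we stopped). Everything in it is an assumption made for
illustration: the simplified action enum, struct walk, run_from(), the two
example filters and EXAMPLE_NR_UNSHARE (272 is the x86-64 number) are not
taken from the patch or from the kernel.

```c
#include <stddef.h>

/* Simplified action set, for illustration only. */
enum action { ACT_ALLOW, ACT_ERRNO, ACT_KILL, ACT_NOTIF };

typedef enum action (*filter_fn)(int nr);

/* Result of a walk: the deciding action and the index to resume from. */
struct walk {
	enum action act;
	size_t next;
};

#define EXAMPLE_NR_UNSHARE 272	/* x86-64 value, used only as an example */

/* Inner (most recently installed) filter: notify on syscall 12345. */
static enum action inner_filter(int nr)
{
	return nr == 12345 ? ACT_NOTIF : ACT_ALLOW;
}

/* Outer (container) filter: fail unshare with an errno. */
static enum action outer_filter(int nr)
{
	return nr == EXAMPLE_NR_UNSHARE ? ACT_ERRNO : ACT_ALLOW;
}

/*
 * filters[0] is the innermost filter. Run filters[start..n) on nr and
 * stop at the first one that does not return ACT_ALLOW.
 */
static struct walk run_from(const filter_fn *filters, size_t n,
			    size_t start, int nr)
{
	struct walk w = { ACT_ALLOW, n };

	for (size_t i = start; i < n; i++) {
		enum action a = filters[i](nr);

		if (a != ACT_ALLOW) {
			w.act = a;
			w.next = i + 1;
			break;
		}
	}
	return w;
}
```

With the stack { inner_filter, outer_filter }, run_from(stack, 2, 0, 12345)
stops at the inner filter with ACT_NOTIF and next == 1. Resuming with
run_from(stack, 2, 1, EXAMPLE_NR_UNSHARE) lets the outer filter see the
redirected syscall and return ACT_ERRNO. If the resumed walk returns
ACT_NOTIF instead, the model simply reports it along with the next index;
what the kernel should do at that point is exactly the open question.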

--Andy


* Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect
  2026-06-23 19:02       ` Andy Lutomirski
@ 2026-06-23 19:11         ` Kees Cook
  2026-06-23 22:21           ` Andy Lutomirski
  0 siblings, 1 reply; 10+ messages in thread
From: Kees Cook @ 2026-06-23 19:11 UTC (permalink / raw)
  To: Cong Wang; +Cc: Andy Lutomirski, linux-kernel, Will Drewry, Christian Brauner

On Tue, Jun 23, 2026 at 12:02:32PM -0700, Andy Lutomirski wrote:
> I'm really not convinced that the min is needed to preserve any useful
> behavior.  But Kees is very conservative about these things, with good
> reason.

What is going to use this feature? I'd rather not try to have a USER_NOTIF
security boundary since there are so many corner cases.

-- 
Kees Cook


* Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect
  2026-06-23 19:11         ` Kees Cook
@ 2026-06-23 22:21           ` Andy Lutomirski
  0 siblings, 0 replies; 10+ messages in thread
From: Andy Lutomirski @ 2026-06-23 22:21 UTC (permalink / raw)
  To: Kees Cook; +Cc: Cong Wang, linux-kernel, Will Drewry, Christian Brauner

On Tue, Jun 23, 2026 at 12:11 PM Kees Cook <kees@kernel.org> wrote:
>
> On Tue, Jun 23, 2026 at 12:02:32PM -0700, Andy Lutomirski wrote:
> > I'm really not convinced that the min is needed to preserve any useful
> > behavior.  But Kees is very conservative about these things, with good
> > reason.
>
> What is going to use this feature? I'd rather not try to have a USER_NOTIF
> security boundary since there are so many corner cases.

This whole redirect-to-pinned-memory mechanism for starters -- it's
extremely useful I think.

But maybe that particular usecase could be a lot narrower than I'm
making it out to be.  In this particular patch, the redirect is for a
single syscall.  We could say that the user cannot redirect to a
different syscall nr (mostly to make it easier to reason about the
code), and this patch already has the (rather limiting) property that
one can only redirect to a *single* syscall.  So the controller
process doesn't get full control over the target process.  And maybe
we just declare signal handling to be out of scope, in the sense that
syscalls with complex signal handling (e.g. nanosleep, although I
don't really know why you would want to redirect the pointer argument
to nanosleep) are unlikely to be handled correctly after a redirect,
and the user should avoid redirecting those syscalls.

I don't think USER_NOTIF should become more of a security boundary
than it already is, although this redirect feature is very much
security-sensitive if it gets used.

The implementation of the patch is a bit sad -- there is no technical
reason that the code couldn't actually issue the syscall and avoid all
the heap and task-work crud.  The kernel isn't really structured like
that, but I don't think it would be a big departure.

--Andy


Thread overview: 10+ messages
2026-06-13  0:15 [RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect Cong Wang
2026-06-13  0:15 ` [RFC PATCH v3 1/3] mm: add __do_mmap() and vm_mmap_seal_remote() Cong Wang
2026-06-13  0:15 ` [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect Cong Wang
2026-06-13  4:03   ` Andy Lutomirski
2026-06-20 21:11     ` Cong Wang
2026-06-23 19:02       ` Andy Lutomirski
2026-06-23 19:11         ` Kees Cook
2026-06-23 22:21           ` Andy Lutomirski
2026-06-13  0:15 ` [RFC PATCH v3 3/3] selftests/seccomp: cover non-cooperative pinned-memfd install Cong Wang
2026-06-13 10:07   ` kernel test robot
