From: Cong Wang
To: Kees Cook, linux-kernel@vger.kernel.org
Cc: Andy Lutomirski, Will Drewry, Christian Brauner, Cong Wang
Subject: [RFC PATCH 0/3] seccomp: SECCOMP_IOCTL_NOTIF_PIN_ARGS for race-free unotify
Date: Sun, 3 May 2026 18:12:04 -0700
Message-ID: <20260504011207.539408-1-xiyou.wangcong@gmail.com>

This RFC introduces SECCOMP_IOCTL_NOTIF_PIN_ARGS, a new ioctl on the
seccomp user-notification listener that lets an unprivileged supervisor
atomically snapshot pointer-arg payloads from a trapped task and bind
those snapshots to the task's resumed syscall body. It closes the
documented TOCTOU race that today makes content-aware policy on
SECCOMP_USER_NOTIF_FLAG_CONTINUE unsafe for unprivileged supervisors.

Posting as RFC because the UAPI shape, the consumption-hook placement,
and the v1 vs v2 cut are all design choices that benefit from review
before a non-RFC submission.

## Motivation

seccomp_unotify(2) lets a supervisor inspect a trapped task's syscall
arguments and either deny, allow, or CONTINUE the syscall. CONTINUE
re-runs the syscall body in the trapped task, which re-fetches every
pointer argument from user memory. A sibling thread or CLONE_VM peer in
the trapped task's address space can mutate that memory between the
supervisor's process_vm_readv() and the kernel's re-read, turning any
policy that examined the argument into a check on already-stale bytes.
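
The window described above can be modeled in plain userspace C. This is
only an illustrative, single-threaded sketch: policy_allows(),
continue_is_safe(), and PATH_BUF are hypothetical names, `task_mem`
stands in for the trapped task's memory, and the sibling thread's write
is replayed in-line at the point where the race would land.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PATH_BUF 64

/* Hypothetical policy: allow only paths under /allowed/. */
static bool policy_allows(const char *path)
{
	return strncmp(path, "/allowed/", 9) == 0;
}

/*
 * Model one CONTINUE round trip. `racer`, if non-NULL, is the string a
 * sibling thread writes into the buffer after the supervisor's check.
 * Returns true if the path the "kernel" finally acts on still passes
 * the policy (or if the request was denied up front).
 */
static bool continue_is_safe(char task_mem[PATH_BUF], const char *racer)
{
	char seen[PATH_BUF];

	assert(task_mem != NULL);

	/* Supervisor: copy out (as process_vm_readv() would) and check. */
	memcpy(seen, task_mem, PATH_BUF);
	seen[PATH_BUF - 1] = '\0';
	if (!policy_allows(seen))
		return true;	/* denied, nothing left to race */

	/* Sibling thread: rewrite the buffer inside the window. */
	if (racer) {
		strncpy(task_mem, racer, PATH_BUF - 1);
		task_mem[PATH_BUF - 1] = '\0';
	}

	/* Kernel on CONTINUE: re-fetch from task memory, not from `seen`. */
	return policy_allows(task_mem);
}
```

With no racer the check and the re-fetch agree; with a racer that writes
"/etc/shadow" the supervisor has approved one string and the syscall body
sees another.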
The seccomp_unotify(2) man page documents this race explicitly. There
is no race-free workaround for unprivileged supervisors today:

  - ptrace and /proc/pid/mem are not available to them;
  - process_vm_readv into a userspace buffer doesn't help, because the
    kernel will re-read user memory regardless on CONTINUE;
  - SECCOMP_IOCTL_NOTIF_ADDFD only solves the fd-substitution case,
    not the content-of-pointer-arg case.

The result is that unprivileged seccomp supervisors -- which are the
target audience of seccomp_unotify(2) in the first place -- cannot
implement content-aware allow policies. They can only deny or
unconditionally allow. Anything that depends on the actual contents of
a path, sockaddr, argv, or write buffer is unsafe.

## Concrete user: Sandlock

Sandlock is a process-based unprivileged sandbox for AI agents, built
on seccomp_unotify(2). At sandbox setup the agent process installs a
seccomp filter with SECCOMP_FILTER_FLAG_NEW_LISTENER and hands the
listener fd to a Sandlock supervisor; the supervisor then drives each
filtered syscall via SECCOMP_IOCTL_NOTIF_RECV and replies with either
an injected errno or SECCOMP_USER_NOTIF_FLAG_CONTINUE.

This is what lets Sandlock confine an AI-agent process (coding agents,
tool-using agents) to the filesystem and network surface its operator
authorized, without root, a container runtime, or virtualization.
AI-agent workloads routinely fork helpers, exec compilers and language
runtimes, and follow user- or model-controlled paths, so every
content-aware policy decision the supervisor makes depends on the
contents of syscall pointer arguments -- "allow open(path) iff path is
below /allowed/", "allow bind(addr) iff addr->sun_path is in the
permitlist", "allow execve(filename, argv) iff argv[0] is one of N
approved binaries". Each of these is implemented as a CONTINUE reply,
and CONTINUE re-fetches the pointer args from the trapped task's user
memory -- which is where the documented TOCTOU race bites.
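
For concreteness, a content-aware decision of the "allow open(path) iff
path is below /allowed/" kind might look roughly like the sketch below.
It is not Sandlock's code: path_below() and sandbox_reply() are
hypothetical helpers, the path check is purely lexical (it ignores
symlinks), and `path` is assumed to be the string the supervisor already
copied out of the trapped task. Only struct seccomp_notif, struct
seccomp_notif_resp, and SECCOMP_USER_NOTIF_FLAG_CONTINUE come from the
existing UAPI header.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <linux/seccomp.h>

/* Lexical check: path starts with root (which ends in '/') and has no ".." component. */
static bool path_below(const char *path, const char *root)
{
	size_t n = strlen(root);
	const char *p;

	if (n == 0 || root[n - 1] != '/' || strncmp(path, root, n) != 0)
		return false;

	for (p = path + n; *p; ) {
		size_t len = strcspn(p, "/");	/* length of this component */

		if (len == 2 && p[0] == '.' && p[1] == '.')
			return false;
		p += len;
		while (*p == '/')
			p++;
	}
	return true;
}

/* Fill in the reply for one notification: CONTINUE or an injected errno. */
static void sandbox_reply(const struct seccomp_notif *req, const char *path,
			  struct seccomp_notif_resp *resp)
{
	assert(req && path && resp);

	memset(resp, 0, sizeof(*resp));
	resp->id = req->id;
	if (path_below(path, "/allowed/"))
		resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE; /* error/val stay 0 */
	else
		resp->error = -EACCES;	/* negative errno is injected */
}
```

The reply would then be sent with SECCOMP_IOCTL_NOTIF_SEND on the
listener fd. Note that the CONTINUE branch is exactly the one exposed to
the race: the bytes in `path` are the supervisor's copy, not what the
resumed syscall will read.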
Today Sandlock has to either run privileged (defeating the
unprivileged-sandbox model) or accept the TOCTOU race (defeating the
guarantee against an adversarial agent that spawns a sibling thread to
mutate the buffer between the supervisor's inspection and the kernel's
re-read on CONTINUE). PIN_ARGS gives it the missing primitive without
either compromise.

This is a general primitive, not a Sandlock-specific feature, but
having a concrete user with a real and growing threat model (LLM agents
executing untrusted or partially-trusted code in-process) should help
anchor the design discussion.

## Approach

PIN_ARGS is a per-notification ioctl. The supervisor describes which
register slots to snapshot and what shape each one is. The kernel walks
the trapped task's mm under the existing remote-mm primitives
(access_remote_vm, copy_remote_vm_str), copies the bytes into
kernel-owned buffers, and stamps the snapshot onto the trapped task. On
SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the kernel's syscall fetch
points consume from the kernel buffer instead of re-reading user memory.

Three v1 shapes:

  - SECCOMP_PIN_FIXED (sockaddr, single-buffer read/write)
  - SECCOMP_PIN_CSTRING (paths)
  - SECCOMP_PIN_CSTRING_ARRAY (argv, envp)

Each per-arg copy is bounded by max_bytes; total cumulative bytes per
request are bounded at a hardcoded 1 MiB. Allocations use
GFP_KERNEL_ACCOUNT so the trapped task's memcg pays the cost.

The pin is one-shot: cleared on the trapped task's next
return-to-userspace via task_work, with fallback paths for task exit,
listener release, and explicit discard (CONTINUE without
CONTINUE_PINNED). The syscall number is captured at pin time and
verified at consumption, so a signal-handler-issued syscall during
-ERESTART* resolution will not consume the pin.

Pin orchestration uses a three-phase lock dance: validate the notif and
snapshot register args under filter->notify_lock, walk the trapped
task's mm without locks, then re-validate and attach the snapshot.
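
The bounding and one-shot rules above can be modeled in userspace as
follows. This is a rough sketch, not the patch's data structures:
struct pin_model, pin_snapshot(), pin_consume(), pin_discard(), and
PIN_TOTAL_CAP are all hypothetical names, malloc() stands in for the
accounted kernel allocation, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PIN_TOTAL_CAP ((size_t)1 << 20)	/* 1 MiB cumulative cap per request */

/* Hypothetical model of one pinned argument. */
struct pin_model {
	int nr;		/* syscall number captured at pin time */
	size_t len;	/* bytes snapshotted */
	char *buf;	/* stand-in for the kernel-owned copy */
	bool armed;	/* one-shot: cleared on consume or discard */
};

/* Copy `len` bytes of `src` into the pin; returns 0 on success, -1 on refusal. */
static int pin_snapshot(struct pin_model *pin, int nr, const void *src,
			size_t len, size_t max_bytes, size_t *total)
{
	assert(pin && src && total);

	if (pin->armed)
		return -1;	/* at most one pin at a time */
	if (len > max_bytes || *total > PIN_TOTAL_CAP ||
	    len > PIN_TOTAL_CAP - *total)
		return -1;	/* per-arg or cumulative bound exceeded */

	pin->buf = malloc(len ? len : 1);
	if (!pin->buf)
		return -1;
	memcpy(pin->buf, src, len);
	pin->nr = nr;
	pin->len = len;
	pin->armed = true;
	*total += len;
	return 0;
}

/* Hand out the snapshot once, and only to the syscall it was pinned for. */
static char *pin_consume(struct pin_model *pin, int nr, size_t *len)
{
	char *buf;

	assert(pin && len);

	if (!pin->armed || pin->nr != nr)
		return NULL;	/* a mismatched syscall leaves the pin in place */
	buf = pin->buf;
	*len = pin->len;
	pin->buf = NULL;
	pin->armed = false;
	return buf;		/* caller frees */
}

/* Explicit discard, e.g. a CONTINUE without CONTINUE_PINNED. */
static void pin_discard(struct pin_model *pin)
{
	assert(pin);

	free(pin->buf);
	pin->buf = NULL;
	pin->armed = false;
}
```

In this model a write to the source buffer after pin_snapshot() has no
effect on what pin_consume() returns, a consume with a different syscall
number returns NULL without disarming the pin, and a second consume
returns NULL.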
The walker uses primitives the kernel already uses for arg fetch
(copy_remote_vm_str, getname_kernel, copy_string_kernel,
iov_iter_kvec), so consumption sites are minimally invasive.

## Why copy and not page-pinning

Page-level FOLL_PIN doesn't solve content TOCTOU: the trapped task (or
its CLONE_VM peer) is the owner of the mm and can write through the
same mapping. There is no kernel primitive for "freeze the contents of
these user pages." Copying at decision time is the only way to
guarantee the bytes the supervisor inspected equal the bytes the kernel
acts on.

The kernel already does this copy in syscall bodies today -- getname(),
copy_strings(), move_addr_to_kernel(), copy_from_iter() for ITER_UBUF.
PIN_ARGS shifts when that copy happens (at supervisor decision time)
and re-points the syscall fetch points at the snapshot. Net new copies
per syscall: zero.

## Why unprivileged

PIN_ARGS is gated by listener-fd possession, which is itself a
capability scoped by file-descriptor ownership and SCM_RIGHTS passing.
The supervisor already has equivalent remote-mm read access via
process_vm_readv() (subject to the same ptrace_may_access checks).
NO_NEW_PRIVS, required for unprivileged seccomp filter installation,
blocks the obvious execve escalation. The DoS surface is bounded by the
1 MiB per-request cap, the one-shot lifetime, and
at-most-one-pin-per-trapped-task, with memcg accounting on top.

Requiring CAP_SYS_PTRACE would render PIN_ARGS useless for its only
real audience; privileged supervisors already have ptrace and
/proc/pid/mem.

## What's NOT covered in v1

  - Vector I/O (readv/writev) -- needs per-iovec pin descriptors,
    intentional v2.
  - Nested-pointer payloads (sendmsg msghdr, futex_waitv) -- same.
  - Per-iter consumption hooks beyond getname_flags,
    move_addr_to_kernel, copy_strings, and import_ubuf.
    Other syscall fetch sites that re-read user memory still race; v1
    covers the four most common cases (path, sockaddr, argv/envp,
    single-buffer read/write) which together cover the bulk of
    practical unprivileged-sandbox policies.

## Patches

  [PATCH 1/3] seccomp: kernel implementation (UAPI, walker,
              orchestrator, four consumption hooks, one-shot lifecycle)
  [PATCH 2/3] selftests/seccomp: end-to-end coverage (10 cases across
              all three shapes + lifecycle)
  [PATCH 3/3] Documentation: seccomp_filter.rst ("Pinned arguments"
              section)

## Testing

The selftest binary covers all three v1 shapes against real syscalls
(bind, openat, execve, write), plus negative paths (CONTINUE without
PINNED, double pin, mismatched flags) and the lifecycle (post-syscall
clear, SIGKILL teardown). All ten cases pass on x86_64.

Cong Wang (3):
  seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race
  selftests/seccomp: add seccomp_pin_args end-to-end coverage
  Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_PIN_ARGS

 .../userspace-api/seccomp_filter.rst        |  76 ++
 MAINTAINERS                                 |   2 +
 fs/exec.c                                   |  63 ++
 fs/namei.c                                  |  19 +
 fs/read_write.c                             |   8 +-
 include/linux/mm.h                          |   2 +-
 include/linux/seccomp.h                     |  35 +
 include/linux/seccomp_types.h               |  33 +
 include/uapi/linux/seccomp.h                |  73 ++
 kernel/Makefile                             |   1 +
 kernel/exit.c                               |   1 +
 kernel/fork.c                               |   5 +
 kernel/seccomp.c                            | 189 +++-
 kernel/seccomp_pin.c                        | 453 +++++++++
 kernel/seccomp_pin.h                        | 109 +++
 lib/iov_iter.c                              |  22 +
 mm/memory.c                                 |   4 +-
 mm/nommu.c                                  |   4 +-
 net/socket.c                                |  16 +
 tools/testing/selftests/seccomp/.gitignore  |   1 +
 tools/testing/selftests/seccomp/Makefile    |   2 +-
 .../selftests/seccomp/seccomp_pin_args.c    | 857 ++++++++++++++++++
 22 files changed, 1961 insertions(+), 14 deletions(-)
 create mode 100644 kernel/seccomp_pin.c
 create mode 100644 kernel/seccomp_pin.h
 create mode 100644 tools/testing/selftests/seccomp/seccomp_pin_args.c

-- 
2.43.0