From: Cong Wang
To: Kees Cook, linux-kernel@vger.kernel.org
Cc: Andy Lutomirski, Will Drewry, Christian Brauner, Cong Wang
Subject: [RFC PATCH 3/3] Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_PIN_ARGS
Date: Sun, 3 May 2026 18:12:07 -0700
Message-ID: <20260504011207.539408-4-xiyou.wangcong@gmail.com>
In-Reply-To: <20260504011207.539408-1-xiyou.wangcong@gmail.com>
References: <20260504011207.539408-1-xiyou.wangcong@gmail.com>

From: Cong Wang

Add a "Pinned arguments" section to the userspace API doc covering the
motivation (closing the documented TOCTOU window for unprivileged
supervisors), the pin/consume flow via SECCOMP_IOCTL_NOTIF_PIN_ARGS and
SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the three v1 shapes with their
per-shape semantics, the single-shot lifecycle, the syscall_nr mismatch
check, and the explicitly-not-covered cases left for follow-ups (vector
I/O, nested-pointer payloads).

Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Cong Wang
---
 .../userspace-api/seccomp_filter.rst          | 76 +++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175..8bbbd923c31d 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -289,6 +289,82 @@ above in this document: all arguments being read from the tracee's memory
 should be read into the tracer's memory before any policy decisions are made.
 This allows for an atomic decision on syscall arguments.
 
+Pinned arguments
+----------------
+
+For unprivileged supervisors, ``ptrace()`` and ``/proc/pid/mem`` are not
+available, and reading the tracee's memory via ``process_vm_readv()``
+remains racy: a sibling thread or ``CLONE_VM`` peer can mutate the
+buffer between the supervisor's read and the kernel's re-read on
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE``. ``SECCOMP_IOCTL_NOTIF_PIN_ARGS``
+closes that race by atomically copying designated pointer-arg payloads
+from the tracee's address space into kernel-owned buffers, and binding
+those buffers to the tracee's next-syscall execution.
+
+The supervisor receives a notification as today, then issues
+``ioctl(SECCOMP_IOCTL_NOTIF_PIN_ARGS, &payload)`` with a
+``struct seccomp_notif_pin_args`` describing which pointer-args to
+snapshot. Each per-arg descriptor names a syscall register slot
+(``arg_idx``, 0..5), one of three shapes (``SECCOMP_PIN_FIXED``,
+``SECCOMP_PIN_CSTRING``, ``SECCOMP_PIN_CSTRING_ARRAY``), and a
+``max_bytes`` cap. The kernel walks the trapped task's mm, copies
+the bytes into kernel buffers, and writes them back to a
+supervisor-provided byte buffer (``buf`` / ``buf_size``) plus per-arg
+metadata (``actual_size``, ``buf_offset``, ``truncated``).
+
+To consume the snapshot on syscall re-execution, the supervisor sends
+``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` with both
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE`` and
+``SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED`` set. The kernel's syscall
+fetch points (``getname_flags``, ``copy_strings``,
+``move_addr_to_kernel``, ``import_ubuf``) check
+``current->seccomp.pinned_args`` and consume from the kernel buffer
+instead of re-reading user memory; mutations to the original buffer
+after ``PIN_ARGS`` returns have no effect.
+
+The pin is single-shot: it is cleared automatically when the trapped
+task next returns to user mode after the resumed syscall body
+completes, when the task exits, when the listener fd is closed, or
+when the supervisor sends ``CONTINUE`` without ``CONTINUE_PINNED``
+(an explicit "I changed my mind" path). Subsequent traps require a
+fresh ``PIN_ARGS`` for the new notification id.
+
+Per-shape semantics:
+
+* ``SECCOMP_PIN_FIXED`` copies exactly ``max_bytes`` from
+  ``args[arg_idx]``. Suitable for ``struct sockaddr`` (``bind``,
+  ``connect``, ``sendto``) and for ``write(fd, buf, count)`` (the
+  supervisor sets ``max_bytes = count`` from
+  ``seccomp_data.args[2]``).
+
+* ``SECCOMP_PIN_CSTRING`` walks to the trailing NUL, capped at
+  ``max_bytes``. The pinned buffer is always NUL-terminated; if the
+  cap was hit before the source NUL, ``truncated`` carries
+  ``SECCOMP_PIN_TRUNCATED_BYTES``. Suitable for paths
+  (``open``/``openat``/``execve`` filename, etc.).
+
+* ``SECCOMP_PIN_CSTRING_ARRAY`` walks a NULL-terminated pointer table
+  at ``args[arg_idx]`` and copies each non-NULL string. Suitable for
+  ``execve``'s argv and envp. The copy is bounded by both
+  ``max_bytes`` and ``max_entries``. The result is packed as
+  ``[u32 count][u32 offsets[count]][u8 strings[]]``.
+
+The cumulative ``max_bytes`` across all per-arg descriptors and
+the supervisor-provided ``buf_size`` are each capped at 1 MiB; this
+is a hard-coded defensive ceiling, not a tunable.
+
+The kernel records the syscall number at pin time and verifies a
+match at consumption: a signal handler running on the trapped task
+during ``-ERESTART*`` resolution that issues an unrelated syscall
+will not consume the pin.
+
+Cumulative scope of v1: ``SECCOMP_PIN_FIXED`` covers sockaddr and
+single-buffer write content; ``SECCOMP_PIN_CSTRING`` covers paths;
+``SECCOMP_PIN_CSTRING_ARRAY`` covers argv and envp. Vector I/O
+(``readv``/``writev``) and nested-pointer payloads
+(``sendmsg``/``recvmsg`` ``msghdr``, ``futex_waitv``) are not covered
+in v1.
+
 Sysctls
 =======
 
-- 
2.43.0
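
For illustration only, and not part of the patch: a minimal C sketch of
how a supervisor might index the packed SECCOMP_PIN_CSTRING_ARRAY result
described above, [u32 count][u32 offsets[count]][u8 strings[]]. The
helper names (pinned_array_count, pinned_array_get) are hypothetical.
The sketch also assumes native-endian u32 fields and offsets counted
from the first byte of the packed region; the text above specifies
neither, so both should be checked against the uapi definitions from the
earlier patches in this series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helpers, not part of the patch. They index a packed
 * SECCOMP_PIN_CSTRING_ARRAY result laid out as
 *     [u32 count][u32 offsets[count]][u8 strings[]]
 * ASSUMPTIONS: u32 fields are native-endian, and each offset is counted
 * from the first byte of the packed region (the count word).
 */

/* Return the entry count, or -1 if the header does not fit in len bytes. */
static int64_t pinned_array_count(const uint8_t *buf, size_t len)
{
	uint32_t count;

	if (len < sizeof(count))
		return -1;
	memcpy(&count, buf, sizeof(count));	/* buf may be unaligned */
	/* The offsets table must fit in what remains after the count word. */
	if (count > (len - sizeof(count)) / sizeof(uint32_t))
		return -1;
	return (int64_t)count;
}

/* Return string idx, or NULL if idx or its offset is out of bounds. */
static const char *pinned_array_get(const uint8_t *buf, size_t len,
				    uint32_t idx)
{
	int64_t count = pinned_array_count(buf, len);
	size_t hdr;
	uint32_t off;

	if (count < 0 || idx >= (uint64_t)count)
		return NULL;
	/* Header is the count word plus one u32 offset per entry. */
	hdr = sizeof(uint32_t) * ((size_t)count + 1);
	memcpy(&off, buf + sizeof(uint32_t) * ((size_t)idx + 1), sizeof(off));
	if (off < hdr || off >= len)
		return NULL;
	/* Refuse a string that is not NUL-terminated inside the buffer. */
	if (memchr(buf + off, '\0', len - off) == NULL)
		return NULL;
	return (const char *)(buf + off);
}
```

A supervisor would call pinned_array_get() on the slice of its buf that
the per-arg buf_offset and actual_size metadata describe, and make its
policy decision on the returned strings. Every offset is bounds-checked
here so that a short or malformed region yields NULL rather than an
out-of-bounds read in the supervisor.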