* [PATCH v3 0/7] seccomp: support nested listeners
@ 2025-12-11 12:46 Alexander Mikhalitsyn
2025-12-11 12:46 ` [PATCH v3 1/7] seccomp: remove unused argument from seccomp_do_user_notification Alexander Mikhalitsyn
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-11 12:46 UTC (permalink / raw)
To: kees
Cc: linux-doc, linux-kernel, linux-kselftest, bpf, Andy Lutomirski,
Will Drewry, Jonathan Corbet, Shuah Khan, Aleksa Sarai,
Tycho Andersen, Andrei Vagin, Christian Brauner,
Stéphane Graber
Dear friends,
this patch series adds support for nested seccomp listeners. It allows container
runtimes and other sandboxing software to install seccomp listeners on top of
existing ones, which is useful for nested LXC containers and other similar use-cases.
Expecting potential discussions around this patch series, I'm going to present a talk
at LPC 2025 about the design and implementation details of this feature [1].
Git tree (based on for-next/seccomp):
v3: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners.v3
current: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners
Changelog for version 3:
- almost completely rewritten (no static array on the stack, no nesting limit)
- more testcases
Changelog for version 2:
- add some explanatory comments
- add RWB tags from Tycho Andersen (thanks, Tycho! ;) )
- CC-ed Aleksa as he might be interested in this stuff too
Links to previous versions:
v2: https://lore.kernel.org/all/20251202115200.110646-1-aleksandr.mikhalitsyn@canonical.com
tree: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners.v2
v1: https://lore.kernel.org/all/20251201122406.105045-1-aleksandr.mikhalitsyn@canonical.com
tree: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners.v1
Link: https://lpc.events/event/19/contributions/2241/ [1]
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Alexander Mikhalitsyn (7):
seccomp: remove unused argument from seccomp_do_user_notification
seccomp: use bitfields for boolean flags on seccomp_filter struct
seccomp: keep track of seccomp filters with closed listeners
seccomp: mark first listener in the tree
seccomp: handle multiple listeners case
seccomp: allow nested listeners
tools/testing/selftests/seccomp: test nested listeners
.../userspace-api/seccomp_filter.rst | 6 +
include/linux/seccomp.h | 3 +-
include/uapi/linux/seccomp.h | 13 +-
kernel/seccomp.c | 129 +++++++-
tools/include/uapi/linux/seccomp.h | 13 +-
tools/testing/selftests/seccomp/seccomp_bpf.c | 303 ++++++++++++++++++
6 files changed, 438 insertions(+), 29 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 13+ messages in thread* [PATCH v3 1/7] seccomp: remove unused argument from seccomp_do_user_notification 2025-12-11 12:46 [PATCH v3 0/7] seccomp: support nested listeners Alexander Mikhalitsyn @ 2025-12-11 12:46 ` Alexander Mikhalitsyn 2025-12-11 12:46 ` [PATCH v3 4/7] seccomp: mark first listener in the tree Alexander Mikhalitsyn 2025-12-11 12:46 ` [PATCH v3 6/7] seccomp: allow nested listeners Alexander Mikhalitsyn 2 siblings, 0 replies; 13+ messages in thread From: Alexander Mikhalitsyn @ 2025-12-11 12:46 UTC (permalink / raw) To: kees Cc: linux-doc, linux-kernel, Andy Lutomirski, Will Drewry, Jonathan Corbet, Shuah Khan, Aleksa Sarai, Tycho Andersen, Andrei Vagin, Christian Brauner, Stéphane Graber, Tycho Andersen, Alexander Mikhalitsyn Remove unused this_syscall argument from seccomp_do_user_notification() and add kdoc for it. No functional change intended. Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Kees Cook <kees@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Tycho Andersen <tycho@tycho.pizza> Cc: Andrei Vagin <avagin@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Stéphane Graber <stgraber@stgraber.org> Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> --- kernel/seccomp.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 25f62867a16d..08476fc0c65b 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1160,8 +1160,18 @@ static bool should_sleep_killable(struct seccomp_filter *match, return match->wait_killable_recv && n->state >= SECCOMP_NOTIFY_SENT; } -static int seccomp_do_user_notification(int this_syscall, - struct seccomp_filter *match, +/** + * seccomp_do_user_notification - sends seccomp notification to the userspace + * listener and waits for a reply. + * @match: seccomp filter we are notifying + * @sd: seccomp data (syscall_nr, args, etc) to be passed to the userspace listener + * + * Returns + * - -1 on success if userspace provided a reply for the syscall, + * - -1 on interrupted wait, + * - 0 on success if userspace requested to continue the syscall + */ +static int seccomp_do_user_notification(struct seccomp_filter *match, const struct seccomp_data *sd) { int err; @@ -1335,7 +1345,7 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace) return 0; case SECCOMP_RET_USER_NOTIF: - if (seccomp_do_user_notification(this_syscall, match, &sd)) + if (seccomp_do_user_notification(match, &sd)) goto skip; return 0; -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v3 4/7] seccomp: mark first listener in the tree 2025-12-11 12:46 [PATCH v3 0/7] seccomp: support nested listeners Alexander Mikhalitsyn 2025-12-11 12:46 ` [PATCH v3 1/7] seccomp: remove unused argument from seccomp_do_user_notification Alexander Mikhalitsyn @ 2025-12-11 12:46 ` Alexander Mikhalitsyn 2026-01-21 12:22 ` Aleksa Sarai 2025-12-11 12:46 ` [PATCH v3 6/7] seccomp: allow nested listeners Alexander Mikhalitsyn 2 siblings, 1 reply; 13+ messages in thread From: Alexander Mikhalitsyn @ 2025-12-11 12:46 UTC (permalink / raw) To: kees Cc: linux-doc, linux-kernel, Andy Lutomirski, Will Drewry, Jonathan Corbet, Shuah Khan, Aleksa Sarai, Tycho Andersen, Andrei Vagin, Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn Let's note if listener was a first one installed in the seccomp filters tree. We will need this information to retain old quirk behavior (as before seccomp nesting introduced). Also, rename has_duplicate_listener() to check_duplicate_listener(), cause now this function is not read-only, but also modifies a state of a new_child seccomp_filter. No functional change intended at this point. Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Kees Cook <kees@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Tycho Andersen <tycho@tycho.pizza> Cc: Andrei Vagin <avagin@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Stéphane Graber <stgraber@stgraber.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> --- kernel/seccomp.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 89ae81f06743..1a139f9ef39b 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -205,6 +205,7 @@ static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @wait_killable_recv: Put notifying process in killable state once the * notification is received by the userspace listener. + * @first_listener: true if this is the first seccomp listener installed in the tree. * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate * @notif: the struct that holds all notification related information @@ -226,6 +227,7 @@ struct seccomp_filter { refcount_t users; bool log : 1; bool wait_killable_recv : 1; + bool first_listener : 1; struct action_cache cache; struct seccomp_filter *prev; struct bpf_prog *prog; @@ -1939,7 +1941,7 @@ static struct file *init_listener(struct seccomp_filter *filter) * Note that @new_child is not hooked up to its parent at this point yet, so * we use current->seccomp.filter. */ -static bool has_duplicate_listener(struct seccomp_filter *new_child) +static bool check_duplicate_listener(struct seccomp_filter *new_child) { struct seccomp_filter *cur; @@ -1953,6 +1955,8 @@ static bool has_duplicate_listener(struct seccomp_filter *new_child) return true; } + /* Mark first listener in the tree. */ + new_child->first_listener = true; return false; } @@ -2035,7 +2039,7 @@ static long seccomp_set_mode_filter(unsigned int flags, if (!seccomp_may_assign_mode(seccomp_mode)) goto out; - if (has_duplicate_listener(prepared)) { + if (check_duplicate_listener(prepared)) { ret = -EBUSY; goto out; } -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v3 4/7] seccomp: mark first listener in the tree 2025-12-11 12:46 ` [PATCH v3 4/7] seccomp: mark first listener in the tree Alexander Mikhalitsyn @ 2026-01-21 12:22 ` Aleksa Sarai 2026-01-28 19:05 ` Alexander Mikhalitsyn 0 siblings, 1 reply; 13+ messages in thread From: Aleksa Sarai @ 2026-01-21 12:22 UTC (permalink / raw) To: Alexander Mikhalitsyn Cc: kees, linux-doc, linux-kernel, Andy Lutomirski, Will Drewry, Jonathan Corbet, Shuah Khan, Tycho Andersen, Andrei Vagin, Christian Brauner, Stéphane Graber [-- Attachment #1: Type: text/plain, Size: 691 bytes --] On 2025-12-11, Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> wrote: > Let's note if listener was a first one installed in the seccomp > filters tree. We will need this information to retain old > quirk behavior (as before seccomp nesting introduced). > > Also, rename has_duplicate_listener() to check_duplicate_listener(), > cause now this function is not read-only, but also modifies a state > of a new_child seccomp_filter. > > No functional change intended at this point. Ah sorry, I didn't notice the date of the mails -- this was sent before the LPC discussion! I'll wait for the v4 before reviewing further. -- Aleksa Sarai https://www.cyphar.com/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 265 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 4/7] seccomp: mark first listener in the tree 2026-01-21 12:22 ` Aleksa Sarai @ 2026-01-28 19:05 ` Alexander Mikhalitsyn 2026-01-28 22:32 ` Kees Cook 0 siblings, 1 reply; 13+ messages in thread From: Alexander Mikhalitsyn @ 2026-01-28 19:05 UTC (permalink / raw) To: Aleksa Sarai, Alexander Mikhalitsyn Cc: kees, linux-doc, linux-kernel, Andy Lutomirski, Will Drewry, Jonathan Corbet, Shuah Khan, Tycho Andersen, Andrei Vagin, Christian Brauner, Stéphane Graber On Wed, 2026-01-21 at 13:22 +0100, Aleksa Sarai wrote: > On 2025-12-11, Alexander Mikhalitsyn > <aleksandr.mikhalitsyn@canonical.com> wrote: > > Let's note if listener was a first one installed in the seccomp > > filters tree. We will need this information to retain old > > quirk behavior (as before seccomp nesting introduced). > > > > Also, rename has_duplicate_listener() to > > check_duplicate_listener(), > > cause now this function is not read-only, but also modifies a state > > of a new_child seccomp_filter. > > > > No functional change intended at this point. > > Ah sorry, I didn't notice the date of the mails -- this was sent > before > the LPC discussion! I'll wait for the v4 before reviewing further. Hi Aleksa, Yeah, I'm thinking about preparing a separate patches to address a quirky seccomp behavior we discussed during LPC and then resend this series. Kind regards, Alex ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 4/7] seccomp: mark first listener in the tree 2026-01-28 19:05 ` Alexander Mikhalitsyn @ 2026-01-28 22:32 ` Kees Cook 0 siblings, 0 replies; 13+ messages in thread From: Kees Cook @ 2026-01-28 22:32 UTC (permalink / raw) To: Alexander Mikhalitsyn Cc: Aleksa Sarai, linux-doc, linux-kernel, Andy Lutomirski, Will Drewry, Jonathan Corbet, Shuah Khan, Tycho Andersen, Andrei Vagin, Christian Brauner, Stéphane Graber On Wed, Jan 28, 2026 at 08:05:25PM +0100, Alexander Mikhalitsyn wrote: > Yeah, I'm thinking about preparing a separate patches to address > a quirky seccomp behavior we discussed during LPC and then resend this > series. Yeah, I'd love to see it as a distinct change. :) -- Kees Cook ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v3 6/7] seccomp: allow nested listeners 2025-12-11 12:46 [PATCH v3 0/7] seccomp: support nested listeners Alexander Mikhalitsyn 2025-12-11 12:46 ` [PATCH v3 1/7] seccomp: remove unused argument from seccomp_do_user_notification Alexander Mikhalitsyn 2025-12-11 12:46 ` [PATCH v3 4/7] seccomp: mark first listener in the tree Alexander Mikhalitsyn @ 2025-12-11 12:46 ` Alexander Mikhalitsyn 2025-12-12 13:57 ` Andy Lutomirski 2026-01-21 7:51 ` Andrei Vagin 2 siblings, 2 replies; 13+ messages in thread From: Alexander Mikhalitsyn @ 2025-12-11 12:46 UTC (permalink / raw) To: kees Cc: linux-doc, linux-kernel, bpf, Andy Lutomirski, Will Drewry, Jonathan Corbet, Shuah Khan, Aleksa Sarai, Tycho Andersen, Andrei Vagin, Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn, Alexander Mikhalitsyn Now everything is ready to get rid of "only one listener per tree" limitation. Let's introduce a new uAPI flag SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly allow nested listeners when installing a listener. Note, that to install n-th listener, this flag must be set on all the listeners up the tree. Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: bpf@vger.kernel.org Cc: Kees Cook <kees@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Tycho Andersen <tycho@tycho.pizza> Cc: Andrei Vagin <avagin@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Stéphane Graber <stgraber@stgraber.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> --- .../userspace-api/seccomp_filter.rst | 6 +++++ include/linux/seccomp.h | 3 ++- include/uapi/linux/seccomp.h | 13 ++++++----- kernel/seccomp.c | 22 +++++++++++++++---- tools/include/uapi/linux/seccomp.h | 13 ++++++----- 5 files changed, 40 insertions(+), 17 deletions(-) diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst index cff0fa7f3175..b9633ab1ed47 100644 --- a/Documentation/userspace-api/seccomp_filter.rst +++ b/Documentation/userspace-api/seccomp_filter.rst @@ -210,6 +210,12 @@ notifications from both tasks will appear on the same filter fd. Reads and writes to/from a filter fd are also synchronized, so a filter fd can safely have many readers. +By default, only one listener within seccomp filters tree is allowed. On attempt +to add a new listener when one already exists in the filter tree, the +``seccomp()`` call will fail with ``-EBUSY``. To allow multiple listeners, the +``SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS`` flag can be passed in addition to +the ``SECCOMP_FILTER_FLAG_NEW_LISTENER`` flag. + The interface for a seccomp notification fd consists of two structures: .. code-block:: c diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 9b959972bf4a..9b060946019d 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -10,7 +10,8 @@ SECCOMP_FILTER_FLAG_SPEC_ALLOW | \ SECCOMP_FILTER_FLAG_NEW_LISTENER | \ SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \ - SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) + SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV | \ + SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS) /* sizeof() the first published struct seccomp_notif_addfd */ #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24 diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index dbfc9b37fcae..de78d8e7a70b 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -18,13 +18,14 @@ #define SECCOMP_GET_NOTIF_SIZES 3 /* Valid flags for SECCOMP_SET_MODE_FILTER */ -#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) -#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) -#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) -#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) +#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) +#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) +#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) /* Received notifications wait in killable state (only respond to fatal signals) */ -#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) +#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6) /* * All BPF programs must return a 32-bit value. diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 51d0d8adaffb..7667f443ff6c 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -206,6 +206,7 @@ static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) * @wait_killable_recv: Put notifying process in killable state once the * notification is received by the userspace listener. * @first_listener: true if this is the first seccomp listener installed in the tree. + * @allow_nested_listeners: Allow nested seccomp listeners. * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate * @notif: the struct that holds all notification related information @@ -228,6 +229,7 @@ struct seccomp_filter { bool log : 1; bool wait_killable_recv : 1; bool first_listener : 1; + bool allow_nested_listeners : 1; struct action_cache cache; struct seccomp_filter *prev; struct bpf_prog *prog; @@ -956,6 +958,10 @@ static long seccomp_attach_filter(unsigned int flags, if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) filter->wait_killable_recv = true; + /* Set nested listeners allow flag, if present. */ + if (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS) + filter->allow_nested_listeners = true; + /* * If there is an existing filter, make it the prev and don't drop its * task reference. @@ -1997,7 +2003,8 @@ static struct file *init_listener(struct seccomp_filter *filter) } /* - * Does @new_child have a listener while an ancestor also has a listener? + * Does @new_child have a listener while an ancestor also has a listener + * and hasn't allowed nesting? * If so, we'll want to reject this filter. * This only has to be tested for the current process, even in the TSYNC case, * because TSYNC installs @child with the same parent on all threads. @@ -2015,7 +2022,12 @@ static bool check_duplicate_listener(struct seccomp_filter *new_child) return false; for (cur = current->seccomp.filter; cur; cur = cur->prev) { if (!IS_ERR_OR_NULL(cur->notif)) - return true; + /* + * We don't need to go up further, because if there is a + * listener with nesting allowed, then all the listeners + * up the tree have allowed nesting as well. + */ + return !cur->allow_nested_listeners; } /* Mark first listener in the tree. */ @@ -2062,10 +2074,12 @@ static long seccomp_set_mode_filter(unsigned int flags, return -EINVAL; /* - * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't make sense + * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT and + * SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS flags don't make sense * without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag. */ - if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) && + if (((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) || + (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)) && ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0)) return -EINVAL; diff --git a/tools/include/uapi/linux/seccomp.h b/tools/include/uapi/linux/seccomp.h index dbfc9b37fcae..de78d8e7a70b 100644 --- a/tools/include/uapi/linux/seccomp.h +++ b/tools/include/uapi/linux/seccomp.h @@ -18,13 +18,14 @@ #define SECCOMP_GET_NOTIF_SIZES 3 /* Valid flags for SECCOMP_SET_MODE_FILTER */ -#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) -#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) -#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) -#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) +#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) +#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) +#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) /* Received notifications wait in killable state (only respond to fatal signals) */ -#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) +#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6) /* * All BPF programs must return a 32-bit value. -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v3 6/7] seccomp: allow nested listeners 2025-12-11 12:46 ` [PATCH v3 6/7] seccomp: allow nested listeners Alexander Mikhalitsyn @ 2025-12-12 13:57 ` Andy Lutomirski 2026-01-28 19:10 ` Alexander Mikhalitsyn 2026-01-21 7:51 ` Andrei Vagin 1 sibling, 1 reply; 13+ messages in thread From: Andy Lutomirski @ 2025-12-12 13:57 UTC (permalink / raw) To: Alexander Mikhalitsyn Cc: kees, linux-doc, linux-kernel, bpf, Will Drewry, Jonathan Corbet, Shuah Khan, Aleksa Sarai, Tycho Andersen, Andrei Vagin, Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn On Thu, Dec 11, 2025 at 8:47 PM Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> wrote: > > Now everything is ready to get rid of "only one listener per tree" > limitation. > > Let's introduce a new uAPI flag > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly > allow nested listeners when installing a listener. > > Note, that to install n-th listener, this flag must be set on all > the listeners up the tree. > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst > index cff0fa7f3175..b9633ab1ed47 100644 > --- a/Documentation/userspace-api/seccomp_filter.rst > +++ b/Documentation/userspace-api/seccomp_filter.rst > @@ -210,6 +210,12 @@ notifications from both tasks will appear on the same filter fd. Reads and > writes to/from a filter fd are also synchronized, so a filter fd can safely > have many readers. > > +By default, only one listener within seccomp filters tree is allowed. On attempt > +to add a new listener when one already exists in the filter tree, the > +``seccomp()`` call will fail with ``-EBUSY``. To allow multiple listeners, the > +``SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS`` flag can be passed in addition to > +the ``SECCOMP_FILTER_FLAG_NEW_LISTENER`` flag. > + I read this, and I contemplated: does this mean that this permits additional filters (added later, nested inside) to have listeners or does it permit applying a listener when there already is one? I thought it was surely it's the former, but I had to read the code to confirm that. Maybe clarify the text? (Yes, I realize it's also in the commit message, but that's not a great place to hide this info.) > The interface for a seccomp notification fd consists of two structures: > > .. code-block:: c > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > index 9b959972bf4a..9b060946019d 100644 > --- a/include/linux/seccomp.h > +++ b/include/linux/seccomp.h > @@ -10,7 +10,8 @@ > SECCOMP_FILTER_FLAG_SPEC_ALLOW | \ > SECCOMP_FILTER_FLAG_NEW_LISTENER | \ > SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \ > - SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) > + SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV | \ > + SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS) > > /* sizeof() the first published struct seccomp_notif_addfd */ > #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24 > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h > index dbfc9b37fcae..de78d8e7a70b 100644 > --- a/include/uapi/linux/seccomp.h > +++ b/include/uapi/linux/seccomp.h > @@ -18,13 +18,14 @@ > #define SECCOMP_GET_NOTIF_SIZES 3 > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > -#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > -#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > -#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > -#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) > +#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > +#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > +#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) > /* Received notifications wait in killable state (only respond to fatal signals) */ > -#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > +#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6) > > /* > * All BPF programs must return a 32-bit value. > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 51d0d8adaffb..7667f443ff6c 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -206,6 +206,7 @@ static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) > * @wait_killable_recv: Put notifying process in killable state once the > * notification is received by the userspace listener. > * @first_listener: true if this is the first seccomp listener installed in the tree. > + * @allow_nested_listeners: Allow nested seccomp listeners. > * @prev: points to a previously installed, or inherited, filter > * @prog: the BPF program to evaluate > * @notif: the struct that holds all notification related information > @@ -228,6 +229,7 @@ struct seccomp_filter { > bool log : 1; > bool wait_killable_recv : 1; > bool first_listener : 1; > + bool allow_nested_listeners : 1; > struct action_cache cache; > struct seccomp_filter *prev; > struct bpf_prog *prog; > @@ -956,6 +958,10 @@ static long seccomp_attach_filter(unsigned int flags, > if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) > filter->wait_killable_recv = true; > > + /* Set nested listeners allow flag, if present. */ > + if (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS) > + filter->allow_nested_listeners = true; > + > /* > * If there is an existing filter, make it the prev and don't drop its > * task reference. > @@ -1997,7 +2003,8 @@ static struct file *init_listener(struct seccomp_filter *filter) > } > > /* > - * Does @new_child have a listener while an ancestor also has a listener? > + * Does @new_child have a listener while an ancestor also has a listener > + * and hasn't allowed nesting? > * If so, we'll want to reject this filter. > * This only has to be tested for the current process, even in the TSYNC case, > * because TSYNC installs @child with the same parent on all threads. > @@ -2015,7 +2022,12 @@ static bool check_duplicate_listener(struct seccomp_filter *new_child) > return false; > for (cur = current->seccomp.filter; cur; cur = cur->prev) { > if (!IS_ERR_OR_NULL(cur->notif)) > - return true; > + /* > + * We don't need to go up further, because if there is a > + * listener with nesting allowed, then all the listeners > + * up the tree have allowed nesting as well. > + */ > + return !cur->allow_nested_listeners; > } > > /* Mark first listener in the tree. */ > @@ -2062,10 +2074,12 @@ static long seccomp_set_mode_filter(unsigned int flags, > return -EINVAL; > > /* > - * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't make sense > + * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT and > + * SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS flags don't make sense > * without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag. > */ > - if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) && > + if (((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) || > + (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)) && > ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0)) > return -EINVAL; > > diff --git a/tools/include/uapi/linux/seccomp.h b/tools/include/uapi/linux/seccomp.h > index dbfc9b37fcae..de78d8e7a70b 100644 > --- a/tools/include/uapi/linux/seccomp.h > +++ b/tools/include/uapi/linux/seccomp.h > @@ -18,13 +18,14 @@ > #define SECCOMP_GET_NOTIF_SIZES 3 > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > -#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > -#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > -#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > -#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) > +#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > +#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > +#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) > /* Received notifications wait in killable state (only respond to fatal signals) */ > -#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > +#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6) > > /* > * All BPF programs must return a 32-bit value. > -- > 2.43.0 > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 6/7] seccomp: allow nested listeners 2025-12-12 13:57 ` Andy Lutomirski @ 2026-01-28 19:10 ` Alexander Mikhalitsyn 0 siblings, 0 replies; 13+ messages in thread From: Alexander Mikhalitsyn @ 2026-01-28 19:10 UTC (permalink / raw) To: Andy Lutomirski, Alexander Mikhalitsyn Cc: kees, linux-doc, linux-kernel, bpf, Will Drewry, Jonathan Corbet, Shuah Khan, Aleksa Sarai, Tycho Andersen, Andrei Vagin, Christian Brauner, Stéphane Graber On Fri, 2025-12-12 at 21:57 +0800, Andy Lutomirski wrote: > On Thu, Dec 11, 2025 at 8:47 PM Alexander Mikhalitsyn > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > Now everything is ready to get rid of "only one listener per tree" > > limitation. > > > > Let's introduce a new uAPI flag > > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may > > explicitly > > allow nested listeners when installing a listener. > > > > Note, that to install n-th listener, this flag must be set on all > > the listeners up the tree. > > > > diff --git a/Documentation/userspace-api/seccomp_filter.rst > > b/Documentation/userspace-api/seccomp_filter.rst > > index cff0fa7f3175..b9633ab1ed47 100644 > > --- a/Documentation/userspace-api/seccomp_filter.rst > > +++ b/Documentation/userspace-api/seccomp_filter.rst > > @@ -210,6 +210,12 @@ notifications from both tasks will appear on > > the same filter fd. Reads and > > writes to/from a filter fd are also synchronized, so a filter fd > > can safely > > have many readers. > > > > +By default, only one listener within seccomp filters tree is > > allowed. On attempt > > +to add a new listener when one already exists in the filter tree, > > the > > +``seccomp()`` call will fail with ``-EBUSY``. To allow multiple > > listeners, the > > +``SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS`` flag can be passed > > in addition to > > +the ``SECCOMP_FILTER_FLAG_NEW_LISTENER`` flag. > > + Hi Andy, thank you for looking into this! > > I read this, and I contemplated: does this mean that this permits > additional filters (added later, nested inside) to have listeners or > does it permit applying a listener when there already is one? I > thought it was surely it's the former, but I had to read the code to > confirm that. > > Maybe clarify the text? Sure, sorry about that! I'll fix that in the next version. I'm going to do some massive rework on this one, because during LPC [1] we've made a conclusion that we gonna fix something in seccomp behavior we have right now and then this series can go on top. [1] https://www.youtube.com/watch?v=-pSeoN68hP8 Kind regards, Alex > > (Yes, I realize it's also in the commit message, but that's not a > great place to hide this info.) > > > > The interface for a seccomp notification fd consists of two > > structures: > > > > .. code-block:: c > > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > > index 9b959972bf4a..9b060946019d 100644 > > --- a/include/linux/seccomp.h > > +++ b/include/linux/seccomp.h > > @@ -10,7 +10,8 @@ > > > > SECCOMP_FILTER_FLAG_SPEC_ALLOW | \ > > > > SECCOMP_FILTER_FLAG_NEW_LISTENER | \ > > > > SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \ > > - > > SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) > > + > > SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV | \ > > + > > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS) > > > > /* sizeof() the first published struct seccomp_notif_addfd */ > > #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24 > > diff --git a/include/uapi/linux/seccomp.h > > b/include/uapi/linux/seccomp.h > > index dbfc9b37fcae..de78d8e7a70b 100644 > > --- a/include/uapi/linux/seccomp.h > > +++ b/include/uapi/linux/seccomp.h > > @@ -18,13 +18,14 @@ > > #define SECCOMP_GET_NOTIF_SIZES 3 > > > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > > -#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > > -#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) > > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > > -#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > > -#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) > > +#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > > +#define SECCOMP_FILTER_FLAG_LOG > > (1UL << 1) > > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > > +#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH > > (1UL << 4) > > /* Received notifications wait in killable state (only respond to > > fatal signals) */ > > -#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > > +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > > +#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6) > > > > /* > > * All BPF programs must return a 32-bit value. > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > > index 51d0d8adaffb..7667f443ff6c 100644 > > --- a/kernel/seccomp.c > > +++ b/kernel/seccomp.c > > @@ -206,6 +206,7 @@ static inline void seccomp_cache_prepare(struct > > seccomp_filter *sfilter) > > * @wait_killable_recv: Put notifying process in killable state > > once the > > * notification is received by the userspace > > listener. > > * @first_listener: true if this is the first seccomp listener > > installed in the tree. > > + * @allow_nested_listeners: Allow nested seccomp listeners. > > * @prev: points to a previously installed, or inherited, filter > > * @prog: the BPF program to evaluate > > * @notif: the struct that holds all notification related > > information > > @@ -228,6 +229,7 @@ struct seccomp_filter { > > bool log : 1; > > bool wait_killable_recv : 1; > > bool first_listener : 1; > > + bool allow_nested_listeners : 1; > > struct action_cache cache; > > struct seccomp_filter *prev; > > struct bpf_prog *prog; > > @@ -956,6 +958,10 @@ static long seccomp_attach_filter(unsigned int > > flags, > > if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) > > filter->wait_killable_recv = true; > > > > + /* Set nested listeners allow flag, if present. */ > > + if (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS) > > + filter->allow_nested_listeners = true; > > + > > /* > > * If there is an existing filter, make it the prev and > > don't drop its > > * task reference. > > @@ -1997,7 +2003,8 @@ static struct file *init_listener(struct > > seccomp_filter *filter) > > } > > > > /* > > - * Does @new_child have a listener while an ancestor also has a > > listener? > > + * Does @new_child have a listener while an ancestor also has a > > listener > > + * and hasn't allowed nesting? > > * If so, we'll want to reject this filter. > > * This only has to be tested for the current process, even in the > > TSYNC case, > > * because TSYNC installs @child with the same parent on all > > threads. > > @@ -2015,7 +2022,12 @@ static bool check_duplicate_listener(struct > > seccomp_filter *new_child) > > return false; > > for (cur = current->seccomp.filter; cur; cur = cur->prev) { > > if (!IS_ERR_OR_NULL(cur->notif)) > > - return true; > > + /* > > + * We don't need to go up further, because > > if there is a > > + * listener with nesting allowed, then all > > the listeners > > + * up the tree have allowed nesting as > > well. > > + */ > > + return !cur->allow_nested_listeners; > > } > > > > /* Mark first listener in the tree. */ > > @@ -2062,10 +2074,12 @@ static long > > seccomp_set_mode_filter(unsigned int flags, > > return -EINVAL; > > > > /* > > - * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't > > make sense > > + * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT and > > + * SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS flags don't > > make sense > > * without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag. > > */ > > - if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) && > > + if (((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) || > > + (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)) > > && > > ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0)) > > return -EINVAL; > > > > diff --git a/tools/include/uapi/linux/seccomp.h > > b/tools/include/uapi/linux/seccomp.h > > index dbfc9b37fcae..de78d8e7a70b 100644 > > --- a/tools/include/uapi/linux/seccomp.h > > +++ b/tools/include/uapi/linux/seccomp.h > > @@ -18,13 +18,14 @@ > > #define SECCOMP_GET_NOTIF_SIZES 3 > > > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > > -#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > > -#define SECCOMP_FILTER_FLAG_LOG (1UL << 1) > > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > > -#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > > -#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) > > +#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > > +#define SECCOMP_FILTER_FLAG_LOG > > (1UL << 1) > > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) > > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) > > +#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH > > (1UL << 4) > > /* Received notifications wait in killable state (only respond to > > fatal signals) */ > > -#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > > +#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5) > > +#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6) > > > > /* > > * All BPF programs must return a 32-bit value. > > -- > > 2.43.0 > > > > > -- > Andy Lutomirski > AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 6/7] seccomp: allow nested listeners 2025-12-11 12:46 ` [PATCH v3 6/7] seccomp: allow nested listeners Alexander Mikhalitsyn 2025-12-12 13:57 ` Andy Lutomirski @ 2026-01-21 7:51 ` Andrei Vagin 2026-01-21 15:43 ` Aleksa Sarai 1 sibling, 1 reply; 13+ messages in thread From: Andrei Vagin @ 2026-01-21 7:51 UTC (permalink / raw) To: Alexander Mikhalitsyn Cc: kees, linux-doc, linux-kernel, bpf, Andy Lutomirski, Will Drewry, Jonathan Corbet, Shuah Khan, Aleksa Sarai, Tycho Andersen, Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn On Thu, Dec 11, 2025 at 4:46 AM Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> wrote: > > Now everything is ready to get rid of "only one listener per tree" > limitation. > > Let's introduce a new uAPI flag > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly > allow nested listeners when installing a listener. I am not sure we really need SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS. If nested listeners are completely functional, why would we want to implicitly allow or disallow someone from using them? Actually, even the current behavior of SECCOMP_RET_USER_NOTIF looks a bit illogical. I think the following behavior would be more expected: instead of running all filters and picking the most restrictive result, the kernel should execute them one by one (most recent fist). If a filter returns USER_NOTIF, the kernel pauses immediately to let the listener handle the call. If that listener then issues "CONTINUE", the kernel resumes by running the remaining older filters in the chain. Thanks, Andrei ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 6/7] seccomp: allow nested listeners 2026-01-21 7:51 ` Andrei Vagin @ 2026-01-21 15:43 ` Aleksa Sarai 2026-01-21 17:59 ` Andy Lutomirski 0 siblings, 1 reply; 13+ messages in thread From: Aleksa Sarai @ 2026-01-21 15:43 UTC (permalink / raw) To: Andrei Vagin Cc: Alexander Mikhalitsyn, kees, linux-doc, linux-kernel, bpf, Andy Lutomirski, Will Drewry, Jonathan Corbet, Shuah Khan, Tycho Andersen, Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn [-- Attachment #1: Type: text/plain, Size: 2039 bytes --] On 2026-01-20, Andrei Vagin <avagin@gmail.com> wrote: > On Thu, Dec 11, 2025 at 4:46 AM Alexander Mikhalitsyn > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > Now everything is ready to get rid of "only one listener per tree" > > limitation. > > > > Let's introduce a new uAPI flag > > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly > > allow nested listeners when installing a listener. > > I am not sure we really need SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS. > If nested listeners are completely functional, why would we want to > implicitly allow or disallow someone from using them? It can be quite easy to deadlock a process using seccomp-notify (even in the single-notifier case) so especially in the case of container managers I can see the argument for wanting this to be an opt-in thing once container runtimes have verified their notifier won't break nesting. Then again, you can also use seccomp to block SECCOMP_FILTER_FLAG_NEW_LISTENER directly, so you don't really need a separate flag to allow nested listeners (unless I'm missing something)? That would make it opt-out but presumably filters that allow seccomp already use an allow-list for flags. > Actually, even the current behavior of SECCOMP_RET_USER_NOTIF looks a > bit illogical. I think the following behavior would be more expected: > instead of running all filters and picking the most restrictive result, > the kernel should execute them one by one (most recent fist). If a filter > returns USER_NOTIF, the kernel pauses immediately to let the listener > handle the call. If that listener then issues "CONTINUE", the kernel > resumes by running the remaining older filters in the chain. I guess there is a philosophical argument that earlier filters are "more trusted" but the seccomp security model has always been that the strictest filter return wins and I don't really see a strong argument for deviating from that for USER_NOTIF. -- Aleksa Sarai https://www.cyphar.com/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 265 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 6/7] seccomp: allow nested listeners 2026-01-21 15:43 ` Aleksa Sarai @ 2026-01-21 17:59 ` Andy Lutomirski 2026-01-23 6:26 ` Andrei Vagin 0 siblings, 1 reply; 13+ messages in thread From: Andy Lutomirski @ 2026-01-21 17:59 UTC (permalink / raw) To: Aleksa Sarai Cc: Andrei Vagin, Alexander Mikhalitsyn, kees, linux-doc, linux-kernel, bpf, Will Drewry, Jonathan Corbet, Shuah Khan, Tycho Andersen, Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn On Wed, Jan 21, 2026 at 7:43 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2026-01-20, Andrei Vagin <avagin@gmail.com> wrote: > > On Thu, Dec 11, 2025 at 4:46 AM Alexander Mikhalitsyn > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > Now everything is ready to get rid of "only one listener per tree" > > > limitation. > > > > > > Let's introduce a new uAPI flag > > > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly > > > allow nested listeners when installing a listener. > > > > I am not sure we really need SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS. > > If nested listeners are completely functional, why would we want to > > implicitly allow or disallow someone from using them? > > It can be quite easy to deadlock a process using seccomp-notify (even > in the single-notifier case) so especially in the case of container > managers I can see the argument for wanting this to be an opt-in thing > once container runtimes have verified their notifier won't break > nesting. Is the deadlock such that a process and its manager can deadlock in a way that's hard to kill? Or is there some problem that could adversely affect an outer manager? It would be nice for these features to be automatic instead of opt in. (I just wasted half an hour yesterday removing use of unshare(CLONE_FILES) from a program that didn't run under a container manager that, for some reason, thought that was a sensitive syscall.) --Andy > > > Actually, even the current behavior of SECCOMP_RET_USER_NOTIF looks a > > bit illogical. I think the following behavior would be more expected: > > instead of running all filters and picking the most restrictive result, > > the kernel should execute them one by one (most recent fist). If a filter > > returns USER_NOTIF, the kernel pauses immediately to let the listener > > handle the call. If that listener then issues "CONTINUE", the kernel > > resumes by running the remaining older filters in the chain. > > I guess there is a philosophical argument that earlier filters are "more > trusted" but the seccomp security model has always been that the > strictest filter return wins and I don't really see a strong argument > for deviating from that for USER_NOTIF. > I don't know if I agree with that philosophy. I would think the best philosophy is that, when filters are nested, the innermost filter + filtered task combination acts as a unit that is filtered by the outer filter. Without notifiers and without filters that overwrite errno, I think strictest-wins is a decent approximation -- the choices are kill or allow, although one might quibble about the various forms of "kill". With SECCOMP_RET_ERRNO, I would argue that the behavior would be superior if we just stopped processing filters after an inner filter returned SECCOMP_RET_ERROR. After all, the effect is to do no syscall at all, and having a process that didn't do a syscall get killed because it tried a bad syscall is kind of weird. With notifiers, this is all rather more complex. Notifiers can emulate syscalls, and having an outer notifier somehow process the syscall that was replaced by an inner notifier seems rather weird. Or suppose that an outer filter wants to prevent some operation, but an inner system wants to emulate it in a way that doesn't do the offending syscall, why not allow it? So I'd argue for considering changing the behavior for everything, maybe optionally? I'm not really sure where TRACE fits in. --Andy -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 6/7] seccomp: allow nested listeners 2026-01-21 17:59 ` Andy Lutomirski @ 2026-01-23 6:26 ` Andrei Vagin 0 siblings, 0 replies; 13+ messages in thread From: Andrei Vagin @ 2026-01-23 6:26 UTC (permalink / raw) To: Aleksa Sarai Cc: Alexander Mikhalitsyn, kees, linux-doc, linux-kernel, bpf, Will Drewry, Jonathan Corbet, Shuah Khan, Tycho Andersen, Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn, Andy Lutomirski On Wed, Jan 21, 2026 at 9:59 AM Andy Lutomirski <luto@amacapital.net> wrote: > > On Wed, Jan 21, 2026 at 7:43 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > On 2026-01-20, Andrei Vagin <avagin@gmail.com> wrote: > > > On Thu, Dec 11, 2025 at 4:46 AM Alexander Mikhalitsyn > > > <aleksandr.mikhalitsyn@canonical.com> wrote: > > > > > > > > Now everything is ready to get rid of "only one listener per tree" > > > > limitation. > > > > > > > > Let's introduce a new uAPI flag > > > > SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly > > > > allow nested listeners when installing a listener. > > > > > > I am not sure we really need SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS. > > > If nested listeners are completely functional, why would we want to > > > implicitly allow or disallow someone from using them? > > > > It can be quite easy to deadlock a process using seccomp-notify (even > > in the single-notifier case) so especially in the case of container > > managers I can see the argument for wanting this to be an opt-in thing > > once container runtimes have verified their notifier won't break > > nesting. > > Is the deadlock such that a process and its manager can deadlock in a > way that's hard to kill? Or is there some problem that could > adversely affect an outer manager? It would be nice for these > features to be automatic instead of opt in. Both a process and its manager can always be killed with SIGKILL. I’m not sure I follow the specific deadlock Aleksa is referring to here. In my view, an outer manager should not care about any syscalls that processes are calling and intercepting. The outer manager must be triggered only when a syscall is going to be executed "natively". This kind of overlaps with the second part... BTW: If a user wants to prevent the usage of seccomp notify, they can always install a seccomp filter that rejects the seccomp syscall called with SECCOMP_FILTER_FLAG_NEW_LISTENER. > > (I just wasted half an hour yesterday removing use of > unshare(CLONE_FILES) from a program that didn't run under a container > manager that, for some reason, thought that was a sensitive syscall.) > > --Andy > > > > > > Actually, even the current behavior of SECCOMP_RET_USER_NOTIF looks a > > > bit illogical. I think the following behavior would be more expected: > > > instead of running all filters and picking the most restrictive result, > > > the kernel should execute them one by one (most recent fist). If a filter > > > returns USER_NOTIF, the kernel pauses immediately to let the listener > > > handle the call. If that listener then issues "CONTINUE", the kernel > > > resumes by running the remaining older filters in the chain. > > > > I guess there is a philosophical argument that earlier filters are "more > > trusted" but the seccomp security model has always been that the > > strictest filter return wins and I don't really see a strong argument > > for deviating from that for USER_NOTIF. > > > > I don't know if I agree with that philosophy. I would think the best > philosophy is that, when filters are nested, the innermost filter + > filtered task combination acts as a unit that is filtered by the outer > filter. > > Without notifiers and without filters that overwrite errno, I think > strictest-wins is a decent approximation -- the choices are kill or > allow, although one might quibble about the various forms of "kill". > > With SECCOMP_RET_ERRNO, I would argue that the behavior would be > superior if we just stopped processing filters after an inner filter > returned SECCOMP_RET_ERROR. After all, the effect is to do no syscall > at all, and having a process that didn't do a syscall get killed > because it tried a bad syscall is kind of weird. > > With notifiers, this is all rather more complex. Notifiers can > emulate syscalls, and having an outer notifier somehow process the > syscall that was replaced by an inner notifier seems rather weird. Or > suppose that an outer filter wants to prevent some operation, but an > inner system wants to emulate it in a way that doesn't do the > offending syscall, why not allow it? > > So I'd argue for considering changing the behavior for everything, > maybe optionally? I'm not really sure where TRACE fits in. > gVisor (a user-mode kernel similar to User-Mode Linux) is a real-world example that is impacted by the current seccomp behavior. The gVisor systrap platform uses seccomp to intercept guest syscalls so they can be handled by the Sentry (the gVisor kernel). All guest syscalls are managed by the Sentry and are never executed natively. Thanks, Andrei ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2026-01-28 22:32 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-12-11 12:46 [PATCH v3 0/7] seccomp: support nested listeners Alexander Mikhalitsyn 2025-12-11 12:46 ` [PATCH v3 1/7] seccomp: remove unused argument from seccomp_do_user_notification Alexander Mikhalitsyn 2025-12-11 12:46 ` [PATCH v3 4/7] seccomp: mark first listener in the tree Alexander Mikhalitsyn 2026-01-21 12:22 ` Aleksa Sarai 2026-01-28 19:05 ` Alexander Mikhalitsyn 2026-01-28 22:32 ` Kees Cook 2025-12-11 12:46 ` [PATCH v3 6/7] seccomp: allow nested listeners Alexander Mikhalitsyn 2025-12-12 13:57 ` Andy Lutomirski 2026-01-28 19:10 ` Alexander Mikhalitsyn 2026-01-21 7:51 ` Andrei Vagin 2026-01-21 15:43 ` Aleksa Sarai 2026-01-21 17:59 ` Andy Lutomirski 2026-01-23 6:26 ` Andrei Vagin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox