* [PATCH v1 0/6] seccomp: support nested listeners
@ 2025-12-01 12:23 Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 3/6] seccomp: limit number of listeners in seccomp tree Alexander Mikhalitsyn
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-01 12:23 UTC (permalink / raw)
To: kees
Cc: linux-doc, linux-kernel, linux-kselftest, bpf, Andy Lutomirski,
Will Drewry, Jonathan Corbet, Shuah Khan, Tycho Andersen,
Andrei Vagin, Christian Brauner, Stéphane Graber
Dear friends,
this patch series adds support for nested seccomp listeners. It allows container
runtimes and other sandboxing software to install seccomp listeners on top of
existing ones, which is useful for nested LXC containers and other similar use-cases.
I decided to go with conservative approach and limit the maximum number of nested listeners
to 8 per seccomp filter chain (MAX_LISTENERS_PER_PATH). This is done to avoid dynamic memory
allocations in the very hot __seccomp_filter() function, where we use a preallocated static
array on the stack to track matched listeners. 8 nested listeners should be enough for
almost any practical scenarios.
Expecting potential discussions around this patch series, I'm going to present a talk
at LPC 2025 about the design and implementation details of this feature [1].
Git tree (based on for-next/seccomp):
v1: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners.v1
current: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners
Link: https://lpc.events/event/19/contributions/2241/ [1]
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Alexander Mikhalitsyn (6):
seccomp: remove unused argument from seccomp_do_user_notification
seccomp: prepare seccomp_run_filters() to support more than one
listener
seccomp: limit number of listeners in seccomp tree
seccomp: handle multiple listeners case
seccomp: relax has_duplicate_listeners check
tools/testing/selftests/seccomp: test nested listeners
.../userspace-api/seccomp_filter.rst | 6 +
include/linux/seccomp.h | 3 +-
include/uapi/linux/seccomp.h | 13 +-
kernel/seccomp.c | 99 +++++++++--
tools/include/uapi/linux/seccomp.h | 13 +-
tools/testing/selftests/seccomp/seccomp_bpf.c | 162 ++++++++++++++++++
6 files changed, 269 insertions(+), 27 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v1 3/6] seccomp: limit number of listeners in seccomp tree
2025-12-01 12:23 [PATCH v1 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
@ 2025-12-01 12:24 ` Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 4/6] seccomp: handle multiple listeners case Alexander Mikhalitsyn
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-01 12:24 UTC (permalink / raw)
To: kees
Cc: linux-kernel, bpf, Andy Lutomirski, Will Drewry, Jonathan Corbet,
Shuah Khan, Tycho Andersen, Andrei Vagin, Christian Brauner,
Stéphane Graber, Alexander Mikhalitsyn
We need to limit number of listeners in seccomp tree to
MAX_LISTENERS_PER_PATH, because we don't want to use dynamic
memory allocations in a very hot __seccomp_filter() function
and we use preallocated static array on the stack.
Also, let's return ELOOP to userspace if it attempts to install
more than MAX_LISTENERS_PER_PATH listeners, instead of ENOMEM as
we do when userspace hits the limit of cBPF instructions.
This will make uAPI a bit more convenient.
Notice, that has_duplicate_listener() check is still in place, so this
change is a preparational.
Cc: linux-kernel@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
kernel/seccomp.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index c9a1062a53bd..ded3f6a6430b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -931,17 +931,25 @@ static long seccomp_attach_filter(unsigned int flags,
struct seccomp_filter *filter)
{
unsigned long total_insns;
+ unsigned char total_listeners;
struct seccomp_filter *walker;
assert_spin_locked(¤t->sighand->siglock);
- /* Validate resulting filter length. */
+ /* Validate resulting filter length and number of nested listeners. */
total_insns = filter->prog->len;
- for (walker = current->seccomp.filter; walker; walker = walker->prev)
+ total_listeners = filter->notif ? 1 : 0;
+ for (walker = current->seccomp.filter; walker; walker = walker->prev) {
total_insns += walker->prog->len + 4; /* 4 instr penalty */
+ total_listeners += walker->notif ? 1 : 0;
+ }
+
if (total_insns > MAX_INSNS_PER_PATH)
return -ENOMEM;
+ if (total_listeners > MAX_LISTENERS_PER_PATH)
+ return -ELOOP;
+
/* If thread sync has been requested, check that it is possible. */
if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
int ret;
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v1 4/6] seccomp: handle multiple listeners case
2025-12-01 12:23 [PATCH v1 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 3/6] seccomp: limit number of listeners in seccomp tree Alexander Mikhalitsyn
@ 2025-12-01 12:24 ` Alexander Mikhalitsyn
2025-12-01 14:38 ` Tycho Andersen
2025-12-01 12:24 ` [PATCH v1 5/6] seccomp: relax has_duplicate_listeners check Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 6/6] tools/testing/selftests/seccomp: test nested listeners Alexander Mikhalitsyn
3 siblings, 1 reply; 7+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-01 12:24 UTC (permalink / raw)
To: kees
Cc: linux-kernel, bpf, Andy Lutomirski, Will Drewry, Jonathan Corbet,
Shuah Khan, Tycho Andersen, Andrei Vagin, Christian Brauner,
Stéphane Graber, Alexander Mikhalitsyn
If we have more than one listener in the tree and lower listener
wants us to continue syscall (SECCOMP_USER_NOTIF_FLAG_CONTINUE)
we must consult with upper listeners first, otherwise it is a
clear seccomp restrictions bypass scenario.
Cc: linux-kernel@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
kernel/seccomp.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ded3f6a6430b..ad733f849e0f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -450,6 +450,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
ret = cur_ret;
matches->n = 1;
matches->filters[0] = f;
+ } else if ((ACTION_ONLY(cur_ret) == ACTION_ONLY(ret)) &&
+ ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) {
+ matches->filters[matches->n++] = f;
}
}
return ret;
@@ -1362,8 +1365,17 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)
return 0;
case SECCOMP_RET_USER_NOTIF:
- if (seccomp_do_user_notification(match, &sd))
- goto skip;
+ for (unsigned char i = 0; i < matches.n; i++) {
+ match = matches.filters[i];
+ /*
+ * If userspace wants us to skip this syscall, do so.
+ * But if userspace wants to continue syscall, we
+ * must consult with the upper-level filters listeners
+ * and act accordingly.
+ */
+ if (seccomp_do_user_notification(match, &sd))
+ goto skip;
+ }
return 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v1 5/6] seccomp: relax has_duplicate_listeners check
2025-12-01 12:23 [PATCH v1 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 3/6] seccomp: limit number of listeners in seccomp tree Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 4/6] seccomp: handle multiple listeners case Alexander Mikhalitsyn
@ 2025-12-01 12:24 ` Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 6/6] tools/testing/selftests/seccomp: test nested listeners Alexander Mikhalitsyn
3 siblings, 0 replies; 7+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-01 12:24 UTC (permalink / raw)
To: kees
Cc: linux-doc, linux-kernel, bpf, Andy Lutomirski, Will Drewry,
Jonathan Corbet, Shuah Khan, Tycho Andersen, Andrei Vagin,
Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn,
Alexander Mikhalitsyn
Now everything is ready to get rid of "only one listener per tree"
limitation.
Let's introduce a new uAPI flag
SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly
allow nested listeners when installing a listener.
Note, that to install n-th listener, this flag must be set on all
the listeners up the tree.
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
.../userspace-api/seccomp_filter.rst | 6 +++++
include/linux/seccomp.h | 3 ++-
include/uapi/linux/seccomp.h | 13 ++++++-----
kernel/seccomp.c | 22 +++++++++++++++----
tools/include/uapi/linux/seccomp.h | 13 ++++++-----
5 files changed, 40 insertions(+), 17 deletions(-)
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175..b9633ab1ed47 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -210,6 +210,12 @@ notifications from both tasks will appear on the same filter fd. Reads and
writes to/from a filter fd are also synchronized, so a filter fd can safely
have many readers.
+By default, only one listener within seccomp filters tree is allowed. On attempt
+to add a new listener when one already exists in the filter tree, the
+``seccomp()`` call will fail with ``-EBUSY``. To allow multiple listeners, the
+``SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS`` flag can be passed in addition to
+the ``SECCOMP_FILTER_FLAG_NEW_LISTENER`` flag.
+
The interface for a seccomp notification fd consists of two structures:
.. code-block:: c
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9b959972bf4a..9b060946019d 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -10,7 +10,8 @@
SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
SECCOMP_FILTER_FLAG_NEW_LISTENER | \
SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \
- SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
+ SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV | \
+ SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)
/* sizeof() the first published struct seccomp_notif_addfd */
#define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index dbfc9b37fcae..de78d8e7a70b 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -18,13 +18,14 @@
#define SECCOMP_GET_NOTIF_SIZES 3
/* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG (1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
-#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
-#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
+#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG (1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
+#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
/* Received notifications wait in killable state (only respond to fatal signals) */
-#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
+#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
+#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6)
/*
* All BPF programs must return a 32-bit value.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ad733f849e0f..348e10d403b1 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -205,6 +205,7 @@ static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
* @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
* @wait_killable_recv: Put notifying process in killable state once the
* notification is received by the userspace listener.
+ * @allow_nested_listeners: Allow nested seccomp listeners.
* @prev: points to a previously installed, or inherited, filter
* @prog: the BPF program to evaluate
* @notif: the struct that holds all notification related information
@@ -226,6 +227,7 @@ struct seccomp_filter {
refcount_t users;
bool log;
bool wait_killable_recv;
+ bool allow_nested_listeners;
struct action_cache cache;
struct seccomp_filter *prev;
struct bpf_prog *prog;
@@ -974,6 +976,10 @@ static long seccomp_attach_filter(unsigned int flags,
if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
filter->wait_killable_recv = true;
+ /* Set nested listeners allow flag, if present. */
+ if (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)
+ filter->allow_nested_listeners = true;
+
/*
* If there is an existing filter, make it the prev and don't drop its
* task reference.
@@ -1955,7 +1961,8 @@ static struct file *init_listener(struct seccomp_filter *filter)
}
/*
- * Does @new_child have a listener while an ancestor also has a listener?
+ * Does @new_child have a listener while an ancestor also has a listener
+ * and hasn't allowed nesting?
* If so, we'll want to reject this filter.
* This only has to be tested for the current process, even in the TSYNC case,
* because TSYNC installs @child with the same parent on all threads.
@@ -1973,7 +1980,12 @@ static bool has_duplicate_listener(struct seccomp_filter *new_child)
return false;
for (cur = current->seccomp.filter; cur; cur = cur->prev) {
if (cur->notif)
- return true;
+ /*
+ * We don't need to go up further, because if there is a
+ * listener with nesting allowed, then all the listeners
+ * up the tree have allowed nesting as well.
+ */
+ return !cur->allow_nested_listeners;
}
return false;
@@ -2018,10 +2030,12 @@ static long seccomp_set_mode_filter(unsigned int flags,
return -EINVAL;
/*
- * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't make sense
+ * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT and
+ * SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS flags don't make sense
* without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
*/
- if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) &&
+ if (((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) ||
+ (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)) &&
((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))
return -EINVAL;
diff --git a/tools/include/uapi/linux/seccomp.h b/tools/include/uapi/linux/seccomp.h
index dbfc9b37fcae..de78d8e7a70b 100644
--- a/tools/include/uapi/linux/seccomp.h
+++ b/tools/include/uapi/linux/seccomp.h
@@ -18,13 +18,14 @@
#define SECCOMP_GET_NOTIF_SIZES 3
/* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG (1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
-#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
-#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
+#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG (1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
+#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
/* Received notifications wait in killable state (only respond to fatal signals) */
-#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
+#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
+#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6)
/*
* All BPF programs must return a 32-bit value.
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v1 6/6] tools/testing/selftests/seccomp: test nested listeners
2025-12-01 12:23 [PATCH v1 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
` (2 preceding siblings ...)
2025-12-01 12:24 ` [PATCH v1 5/6] seccomp: relax has_duplicate_listeners check Alexander Mikhalitsyn
@ 2025-12-01 12:24 ` Alexander Mikhalitsyn
3 siblings, 0 replies; 7+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-01 12:24 UTC (permalink / raw)
To: kees
Cc: linux-kernel, linux-kselftest, bpf, Andy Lutomirski, Will Drewry,
Jonathan Corbet, Shuah Khan, Tycho Andersen, Andrei Vagin,
Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn
Add some basic tests for nested listeners.
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 162 ++++++++++++++++++
1 file changed, 162 insertions(+)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index fc4910d35342..0bf02d04fe15 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -293,6 +293,10 @@ struct seccomp_notif_addfd_big {
#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
#endif
+#ifndef SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS
+#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6)
+#endif
+
#ifndef seccomp
int seccomp(unsigned int op, unsigned int flags, void *args)
{
@@ -4408,6 +4412,164 @@ TEST(user_notification_sync)
ASSERT_EQ(status, 0);
}
+/* from kernel/seccomp.c */
+#define MAX_LISTENERS_PER_PATH 8
+
+TEST(user_notification_nested_limits)
+{
+ pid_t pid;
+ long ret;
+ int i, status, listeners[MAX_LISTENERS_PER_PATH];
+
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ /* Install 6 levels of listeners and allow nesting. */
+ for (i = 0; i < 6; i++) {
+ listeners[i] = user_notif_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS);
+ ASSERT_GE(listeners[i], 0);
+
+ /* Add some no-op filters for grins. */
+ EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+ }
+
+ /* Check behavior when nesting is not allowed. */
+ pid = fork();
+ ASSERT_GE(pid, 0);
+ if (pid == 0) {
+ /* Install a next listener in the chain without nesting allowed. */
+ listeners[6] = user_notif_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ if (listeners[6] < 0)
+ exit(1);
+
+ /* Add some no-op filters for grins. */
+ ret = seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog);
+ if (ret != 0)
+ exit(2);
+
+ ret = user_notif_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ /* Installing a next listener in the chain should result in EBUSY. */
+ exit((ret >= 0 || errno != EBUSY) ? 3 : 0);
+ }
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ /* Install more filters with listeners to reach nesting levels limit. */
+ for (; i < MAX_LISTENERS_PER_PATH; i++) {
+ listeners[i] = user_notif_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS);
+ ASSERT_GE(listeners[i], 0);
+
+ /* Add some no-op filters for grins. */
+ EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+ }
+
+ /* Installing a next listener in the chain should result in ELOOP. */
+ EXPECT_EQ(user_notif_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER),
+ -1);
+ EXPECT_EQ(errno, ELOOP);
+}
+
+TEST(user_notification_nested)
+{
+ pid_t pid;
+ long ret;
+ int i, status, listeners[6];
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ /* Install 6 levels of listeners and allow nesting. */
+ for (i = 0; i < 6; i++) {
+ listeners[i] = user_notif_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS);
+ ASSERT_GE(listeners[i], 0);
+
+ /* Add some no-op filters for grins. */
+ EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+ }
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ ret = syscall(__NR_getppid);
+ exit(ret != (USER_NOTIF_MAGIC-3));
+ }
+
+ /*
+ * We want to have the following picture:
+ *
+ * | Listener level (i) | Listener decision |
+ * |--------------------|-------------------|
+ * | 0 | WHATEVER |
+ * | 1 | WHATEVER |
+ * | 2 | WHATEVER |
+ * | 3 | RETURN | <-- stop here
+ * | 4 | CONTINUE SYSCALL |
+ * | 5 | CONTINUE SYSCALL | <- start here (current->seccomp.filter)
+ *
+ * First listener who receives a notification is level 5, then 4,
+ * then we expect to stop on level 3 and return from syscall with
+ * (USER_NOTIF_MAGIC - 3) return value.
+ */
+ for (i = 6 - 1; i >= 3; i--) {
+ memset(&req, 0, sizeof(req));
+ EXPECT_EQ(ioctl(listeners[i], SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+ EXPECT_EQ(req.pid, pid);
+ EXPECT_EQ(req.data.nr, __NR_getppid);
+
+ memset(&resp, 0, sizeof(resp));
+ resp.id = req.id;
+
+ if (i == 5 || i == 4) {
+ resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ } else {
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC - i;
+ }
+
+ EXPECT_EQ(ioctl(listeners[i], SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+ }
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+ for (i = 0; i < 6; i++)
+ close(listeners[i]);
+}
/* Make sure PTRACE_O_SUSPEND_SECCOMP requires CAP_SYS_ADMIN. */
FIXTURE(O_SUSPEND_SECCOMP) {
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v1 4/6] seccomp: handle multiple listeners case
2025-12-01 12:24 ` [PATCH v1 4/6] seccomp: handle multiple listeners case Alexander Mikhalitsyn
@ 2025-12-01 14:38 ` Tycho Andersen
2025-12-02 12:01 ` Aleksandr Mikhalitsyn
0 siblings, 1 reply; 7+ messages in thread
From: Tycho Andersen @ 2025-12-01 14:38 UTC (permalink / raw)
To: Alexander Mikhalitsyn
Cc: kees, linux-kernel, bpf, Andy Lutomirski, Will Drewry,
Jonathan Corbet, Shuah Khan, Andrei Vagin, Christian Brauner,
Stéphane Graber
On Mon, Dec 01, 2025 at 01:24:01PM +0100, Alexander Mikhalitsyn wrote:
> If we have more than one listener in the tree and lower listener
> wants us to continue syscall (SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> we must consult with upper listeners first, otherwise it is a
> clear seccomp restrictions bypass scenario.
>
> Cc: linux-kernel@vger.kernel.org
> Cc: bpf@vger.kernel.org
> Cc: Kees Cook <kees@kernel.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Will Drewry <wad@chromium.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Tycho Andersen <tycho@tycho.pizza>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Stéphane Graber <stgraber@stgraber.org>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> ---
> kernel/seccomp.c | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ded3f6a6430b..ad733f849e0f 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -450,6 +450,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
> ret = cur_ret;
> matches->n = 1;
> matches->filters[0] = f;
> + } else if ((ACTION_ONLY(cur_ret) == ACTION_ONLY(ret)) &&
> + ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) {
> + matches->filters[matches->n++] = f;
> }
> }
> return ret;
> @@ -1362,8 +1365,17 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)
> return 0;
>
> case SECCOMP_RET_USER_NOTIF:
> - if (seccomp_do_user_notification(match, &sd))
> - goto skip;
> + for (unsigned char i = 0; i < matches.n; i++) {
> + match = matches.filters[i];
> + /*
> + * If userspace wants us to skip this syscall, do so.
> + * But if userspace wants to continue syscall, we
> + * must consult with the upper-level filters listeners
> + * and act accordingly.
This looks reasonable to me, pending whatever the outcome is of your
discussion of plumber's (I won't be there), feel free to add:
Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
I did have to think a bit about why matches.filters would be
guaranteed to have a user notification for this filter, but it's
because of your == check above in seccomp_run_filters(). Maybe worth
noting that here.
Tycho
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v1 4/6] seccomp: handle multiple listeners case
2025-12-01 14:38 ` Tycho Andersen
@ 2025-12-02 12:01 ` Aleksandr Mikhalitsyn
0 siblings, 0 replies; 7+ messages in thread
From: Aleksandr Mikhalitsyn @ 2025-12-02 12:01 UTC (permalink / raw)
To: Tycho Andersen
Cc: kees, linux-kernel, bpf, Andy Lutomirski, Will Drewry,
Jonathan Corbet, Shuah Khan, Andrei Vagin, Christian Brauner,
Stéphane Graber
On Mon, Dec 1, 2025 at 3:39 PM Tycho Andersen <tycho@kernel.org> wrote:
>
> On Mon, Dec 01, 2025 at 01:24:01PM +0100, Alexander Mikhalitsyn wrote:
> > If we have more than one listener in the tree and lower listener
> > wants us to continue syscall (SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> > we must consult with upper listeners first, otherwise it is a
> > clear seccomp restrictions bypass scenario.
> >
> > Cc: linux-kernel@vger.kernel.org
> > Cc: bpf@vger.kernel.org
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: Andy Lutomirski <luto@amacapital.net>
> > Cc: Will Drewry <wad@chromium.org>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: Tycho Andersen <tycho@tycho.pizza>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Stéphane Graber <stgraber@stgraber.org>
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> > ---
> > kernel/seccomp.c | 16 ++++++++++++++--
> > 1 file changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index ded3f6a6430b..ad733f849e0f 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -450,6 +450,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
> > ret = cur_ret;
> > matches->n = 1;
> > matches->filters[0] = f;
> > + } else if ((ACTION_ONLY(cur_ret) == ACTION_ONLY(ret)) &&
> > + ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) {
> > + matches->filters[matches->n++] = f;
> > }
> > }
> > return ret;
> > @@ -1362,8 +1365,17 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)
> > return 0;
> >
> > case SECCOMP_RET_USER_NOTIF:
> > - if (seccomp_do_user_notification(match, &sd))
> > - goto skip;
> > + for (unsigned char i = 0; i < matches.n; i++) {
> > + match = matches.filters[i];
> > + /*
> > + * If userspace wants us to skip this syscall, do so.
> > + * But if userspace wants to continue syscall, we
> > + * must consult with the upper-level filters listeners
> > + * and act accordingly.
>
Hi Tycho,
> This looks reasonable to me, pending whatever the outcome is of your
> discussion of plumber's (I won't be there), feel free to add:
off:
I will be there only virtually too, btw.
Hope we can make it on FOSDEM 2026 ;)
>
> Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
Added in -v2. Thanks ;)
>
> I did have to think a bit about why matches.filters would be
> guaranteed to have a user notification for this filter, but it's
> because of your == check above in seccomp_run_filters(). Maybe worth
> noting that here.
i agree absolutely, I've added some more explanatory comments to ensure that
we are all on the same page and further code readers too.
>
Kind regards,
Alex
> Tycho
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-12-02 12:01 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-01 12:23 [PATCH v1 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 3/6] seccomp: limit number of listeners in seccomp tree Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 4/6] seccomp: handle multiple listeners case Alexander Mikhalitsyn
2025-12-01 14:38 ` Tycho Andersen
2025-12-02 12:01 ` Aleksandr Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 5/6] seccomp: relax has_duplicate_listeners check Alexander Mikhalitsyn
2025-12-01 12:24 ` [PATCH v1 6/6] tools/testing/selftests/seccomp: test nested listeners Alexander Mikhalitsyn
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox