BPF List
 help / color / mirror / Atom feed
* [PATCH v2 0/6] seccomp: support nested listeners
@ 2025-12-02 11:51 Alexander Mikhalitsyn
  2025-12-02 11:51 ` [PATCH v2 3/6] seccomp: limit number of listeners in seccomp tree Alexander Mikhalitsyn
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-02 11:51 UTC (permalink / raw)
  To: kees
  Cc: linux-doc, linux-kernel, linux-kselftest, bpf, Andy Lutomirski,
	Will Drewry, Jonathan Corbet, Shuah Khan, Aleksa Sarai,
	Tycho Andersen, Andrei Vagin, Christian Brauner,
	Stéphane Graber

Dear friends,

this patch series adds support for nested seccomp listeners. It allows container
runtimes and other sandboxing software to install seccomp listeners on top of
existing ones, which is useful for nested LXC containers and other similar use-cases.

I decided to go with conservative approach and limit the maximum number of nested listeners
to 8 per seccomp filter chain (MAX_LISTENERS_PER_PATH). This is done to avoid dynamic memory
allocations in the very hot __seccomp_filter() function, where we use a preallocated static
array on the stack to track matched listeners. 8 nested listeners should be enough for
almost any practical scenarios.

Expecting potential discussions around this patch series, I'm going to present a talk
at LPC 2025 about the design and implementation details of this feature [1].

Git tree (based on for-next/seccomp):
v2: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners.v2
current: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners

Changelog for version 2:
- add some explanatory comments
- add RWB tags from Tycho Andersen (thanks, Tycho! ;) )
- CC-ed Aleksa as he might be interested in this stuff too

Links to previous versions:
v1: https://lore.kernel.org/all/20251201122406.105045-1-aleksandr.mikhalitsyn@canonical.com
tree: https://github.com/mihalicyn/linux/commits/seccomp.mult.listeners.v1

Link: https://lpc.events/event/19/contributions/2241/ [1]

Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>


Alexander Mikhalitsyn (6):
  seccomp: remove unused argument from seccomp_do_user_notification
  seccomp: prepare seccomp_run_filters() to support more than one
    listener
  seccomp: limit number of listeners in seccomp tree
  seccomp: handle multiple listeners case
  seccomp: relax has_duplicate_listeners check
  tools/testing/selftests/seccomp: test nested listeners

 .../userspace-api/seccomp_filter.rst          |   6 +
 include/linux/seccomp.h                       |   3 +-
 include/uapi/linux/seccomp.h                  |  13 +-
 kernel/seccomp.c                              | 116 +++++++++++--
 tools/include/uapi/linux/seccomp.h            |  13 +-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 162 ++++++++++++++++++
 6 files changed, 286 insertions(+), 27 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 3/6] seccomp: limit number of listeners in seccomp tree
  2025-12-02 11:51 [PATCH v2 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
@ 2025-12-02 11:51 ` Alexander Mikhalitsyn
  2025-12-02 11:51 ` [PATCH v2 4/6] seccomp: handle multiple listeners case Alexander Mikhalitsyn
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-02 11:51 UTC (permalink / raw)
  To: kees
  Cc: linux-kernel, bpf, Andy Lutomirski, Will Drewry, Jonathan Corbet,
	Shuah Khan, Aleksa Sarai, Tycho Andersen, Andrei Vagin,
	Christian Brauner, Stéphane Graber, Alexander Mikhalitsyn

We need to limit number of listeners in seccomp tree to
MAX_LISTENERS_PER_PATH, because we don't want to use dynamic
memory allocations in a very hot __seccomp_filter() function
and we use preallocated static array on the stack.

Also, let's return ELOOP to userspace if it attempts to install
more than MAX_LISTENERS_PER_PATH listeners, instead of ENOMEM as
we do when userspace hits the limit of cBPF instructions.
This will make uAPI a bit more convenient.

Notice, that has_duplicate_listener() check is still in place, so this
change is a preparational.

Cc: linux-kernel@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 kernel/seccomp.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index c9a1062a53bd..ded3f6a6430b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -931,17 +931,25 @@ static long seccomp_attach_filter(unsigned int flags,
 				  struct seccomp_filter *filter)
 {
 	unsigned long total_insns;
+	unsigned char total_listeners;
 	struct seccomp_filter *walker;
 
 	assert_spin_locked(&current->sighand->siglock);
 
-	/* Validate resulting filter length. */
+	/* Validate resulting filter length and number of nested listeners. */
 	total_insns = filter->prog->len;
-	for (walker = current->seccomp.filter; walker; walker = walker->prev)
+	total_listeners = filter->notif ? 1 : 0;
+	for (walker = current->seccomp.filter; walker; walker = walker->prev) {
 		total_insns += walker->prog->len + 4;  /* 4 instr penalty */
+		total_listeners += walker->notif ? 1 : 0;
+	}
+
 	if (total_insns > MAX_INSNS_PER_PATH)
 		return -ENOMEM;
 
+	if (total_listeners > MAX_LISTENERS_PER_PATH)
+		return -ELOOP;
+
 	/* If thread sync has been requested, check that it is possible. */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
 		int ret;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 4/6] seccomp: handle multiple listeners case
  2025-12-02 11:51 [PATCH v2 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
  2025-12-02 11:51 ` [PATCH v2 3/6] seccomp: limit number of listeners in seccomp tree Alexander Mikhalitsyn
@ 2025-12-02 11:51 ` Alexander Mikhalitsyn
  2025-12-03 15:29   ` Aleksandr Mikhalitsyn
  2025-12-02 11:51 ` [PATCH v2 5/6] seccomp: relax has_duplicate_listeners check Alexander Mikhalitsyn
  2025-12-02 11:51 ` [PATCH v2 6/6] tools/testing/selftests/seccomp: test nested listeners Alexander Mikhalitsyn
  3 siblings, 1 reply; 8+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-02 11:51 UTC (permalink / raw)
  To: kees
  Cc: linux-kernel, bpf, Andy Lutomirski, Will Drewry, Jonathan Corbet,
	Shuah Khan, Aleksa Sarai, Tycho Andersen, Andrei Vagin,
	Christian Brauner, Stéphane Graber, Tycho Andersen,
	Alexander Mikhalitsyn

If we have more than one listener in the tree and lower listener
wants us to continue syscall (SECCOMP_USER_NOTIF_FLAG_CONTINUE)
we must consult with upper listeners first, otherwise it is a
clear seccomp restrictions bypass scenario.

Cc: linux-kernel@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 kernel/seccomp.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ded3f6a6430b..262390451ff1 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -448,8 +448,21 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 
 		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
 			ret = cur_ret;
+			/*
+			 * No matter what we had before in matches->filters[],
+			 * we need to overwrite it, because current action is more
+			 * restrictive than any previous one.
+			 */
 			matches->n = 1;
 			matches->filters[0] = f;
+		} else if ((ACTION_ONLY(cur_ret) == ACTION_ONLY(ret)) &&
+			    ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) {
+			/*
+			 * For multiple SECCOMP_RET_USER_NOTIF results, we need to
+			 * track all filters that resulted in the same action, because
+			 * we might need to notify a few of them to get a final decision.
+			 */
+			matches->filters[matches->n++] = f;
 		}
 	}
 	return ret;
@@ -1362,8 +1375,24 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)
 		return 0;
 
 	case SECCOMP_RET_USER_NOTIF:
-		if (seccomp_do_user_notification(match, &sd))
-			goto skip;
+		for (unsigned char i = 0; i < matches.n; i++) {
+			match = matches.filters[i];
+			/*
+			 * If userspace wants us to skip this syscall, do so.
+			 * But if userspace wants to continue syscall, we
+			 * must consult with the upper-level filters listeners
+			 * and act accordingly.
+			 *
+			 * Note, that if there are multiple filters returned
+			 * SECCOMP_RET_USER_NOTIF, and final result is
+			 * SECCOMP_RET_USER_NOTIF too, then seccomp_run_filters()
+			 * has populated matches.filters[] array with all of them
+			 * in order from the lowest-level (closest to a
+			 * current->seccomp.filter) to the highest-level.
+			 */
+			if (seccomp_do_user_notification(match, &sd))
+				goto skip;
+		}
 
 		return 0;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 5/6] seccomp: relax has_duplicate_listeners check
  2025-12-02 11:51 [PATCH v2 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
  2025-12-02 11:51 ` [PATCH v2 3/6] seccomp: limit number of listeners in seccomp tree Alexander Mikhalitsyn
  2025-12-02 11:51 ` [PATCH v2 4/6] seccomp: handle multiple listeners case Alexander Mikhalitsyn
@ 2025-12-02 11:51 ` Alexander Mikhalitsyn
  2025-12-02 11:51 ` [PATCH v2 6/6] tools/testing/selftests/seccomp: test nested listeners Alexander Mikhalitsyn
  3 siblings, 0 replies; 8+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-02 11:51 UTC (permalink / raw)
  To: kees
  Cc: linux-doc, linux-kernel, bpf, Andy Lutomirski, Will Drewry,
	Jonathan Corbet, Shuah Khan, Aleksa Sarai, Tycho Andersen,
	Andrei Vagin, Christian Brauner, Stéphane Graber,
	Alexander Mikhalitsyn, Alexander Mikhalitsyn

Now everything is ready to get rid of "only one listener per tree"
limitation.

Let's introduce a new uAPI flag
SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS, so userspace may explicitly
allow nested listeners when installing a listener.

Note, that to install n-th listener, this flag must be set on all
the listeners up the tree.

Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 .../userspace-api/seccomp_filter.rst          |  6 +++++
 include/linux/seccomp.h                       |  3 ++-
 include/uapi/linux/seccomp.h                  | 13 ++++++-----
 kernel/seccomp.c                              | 22 +++++++++++++++----
 tools/include/uapi/linux/seccomp.h            | 13 ++++++-----
 5 files changed, 40 insertions(+), 17 deletions(-)

diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175..b9633ab1ed47 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -210,6 +210,12 @@ notifications from both tasks will appear on the same filter fd. Reads and
 writes to/from a filter fd are also synchronized, so a filter fd can safely
 have many readers.
 
+By default, only one listener within seccomp filters tree is allowed. On attempt
+to add a new listener when one already exists in the filter tree, the
+``seccomp()`` call will fail with ``-EBUSY``. To allow multiple listeners, the
+``SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS`` flag can be passed in addition to
+the ``SECCOMP_FILTER_FLAG_NEW_LISTENER`` flag.
+
 The interface for a seccomp notification fd consists of two structures:
 
 .. code-block:: c
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9b959972bf4a..9b060946019d 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -10,7 +10,8 @@
 					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
 					 SECCOMP_FILTER_FLAG_NEW_LISTENER | \
 					 SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \
-					 SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
+					 SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV | \
+					 SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)
 
 /* sizeof() the first published struct seccomp_notif_addfd */
 #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index dbfc9b37fcae..de78d8e7a70b 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -18,13 +18,14 @@
 #define SECCOMP_GET_NOTIF_SIZES		3
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
-#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
-#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH		(1UL << 4)
+#define SECCOMP_FILTER_FLAG_TSYNC			(1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG				(1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW			(1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER		(1UL << 3)
+#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH			(1UL << 4)
 /* Received notifications wait in killable state (only respond to fatal signals) */
-#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV	(1UL << 5)
+#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV		(1UL << 5)
+#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS	(1UL << 6)
 
 /*
  * All BPF programs must return a 32-bit value.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 262390451ff1..a59276afc5b4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -205,6 +205,7 @@ static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
  * @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
  * @wait_killable_recv: Put notifying process in killable state once the
  *			notification is received by the userspace listener.
+ * @allow_nested_listeners: Allow nested seccomp listeners.
  * @prev: points to a previously installed, or inherited, filter
  * @prog: the BPF program to evaluate
  * @notif: the struct that holds all notification related information
@@ -226,6 +227,7 @@ struct seccomp_filter {
 	refcount_t users;
 	bool log;
 	bool wait_killable_recv;
+	bool allow_nested_listeners;
 	struct action_cache cache;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
@@ -984,6 +986,10 @@ static long seccomp_attach_filter(unsigned int flags,
 	if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
 		filter->wait_killable_recv = true;
 
+	/* Set nested listeners allow flag, if present. */
+	if (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)
+		filter->allow_nested_listeners = true;
+
 	/*
 	 * If there is an existing filter, make it the prev and don't drop its
 	 * task reference.
@@ -1972,7 +1978,8 @@ static struct file *init_listener(struct seccomp_filter *filter)
 }
 
 /*
- * Does @new_child have a listener while an ancestor also has a listener?
+ * Does @new_child have a listener while an ancestor also has a listener
+ * and hasn't allowed nesting?
  * If so, we'll want to reject this filter.
  * This only has to be tested for the current process, even in the TSYNC case,
  * because TSYNC installs @child with the same parent on all threads.
@@ -1990,7 +1997,12 @@ static bool has_duplicate_listener(struct seccomp_filter *new_child)
 		return false;
 	for (cur = current->seccomp.filter; cur; cur = cur->prev) {
 		if (cur->notif)
-			return true;
+			/*
+			 * We don't need to go up further, because if there is a
+			 * listener with nesting allowed, then all the listeners
+			 * up the tree have allowed nesting as well.
+			 */
+			return !cur->allow_nested_listeners;
 	}
 
 	return false;
@@ -2035,10 +2047,12 @@ static long seccomp_set_mode_filter(unsigned int flags,
 		return -EINVAL;
 
 	/*
-	 * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT flag doesn't make sense
+	 * The SECCOMP_FILTER_FLAG_WAIT_KILLABLE_SENT and
+	 * SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS flags don't make sense
 	 * without the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
 	 */
-	if ((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) &&
+	if (((flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) ||
+	     (flags & SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS)) &&
 	    ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))
 		return -EINVAL;
 
diff --git a/tools/include/uapi/linux/seccomp.h b/tools/include/uapi/linux/seccomp.h
index dbfc9b37fcae..de78d8e7a70b 100644
--- a/tools/include/uapi/linux/seccomp.h
+++ b/tools/include/uapi/linux/seccomp.h
@@ -18,13 +18,14 @@
 #define SECCOMP_GET_NOTIF_SIZES		3
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
-#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
-#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH		(1UL << 4)
+#define SECCOMP_FILTER_FLAG_TSYNC			(1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG				(1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW			(1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER		(1UL << 3)
+#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH			(1UL << 4)
 /* Received notifications wait in killable state (only respond to fatal signals) */
-#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV	(1UL << 5)
+#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV		(1UL << 5)
+#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS	(1UL << 6)
 
 /*
  * All BPF programs must return a 32-bit value.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 6/6] tools/testing/selftests/seccomp: test nested listeners
  2025-12-02 11:51 [PATCH v2 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
                   ` (2 preceding siblings ...)
  2025-12-02 11:51 ` [PATCH v2 5/6] seccomp: relax has_duplicate_listeners check Alexander Mikhalitsyn
@ 2025-12-02 11:51 ` Alexander Mikhalitsyn
  3 siblings, 0 replies; 8+ messages in thread
From: Alexander Mikhalitsyn @ 2025-12-02 11:51 UTC (permalink / raw)
  To: kees
  Cc: linux-kernel, linux-kselftest, bpf, Andy Lutomirski, Will Drewry,
	Jonathan Corbet, Shuah Khan, Aleksa Sarai, Tycho Andersen,
	Andrei Vagin, Christian Brauner, Stéphane Graber,
	Alexander Mikhalitsyn

Add some basic tests for nested listeners.

Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: Kees Cook <kees@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Tycho Andersen <tycho@tycho.pizza>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Stéphane Graber <stgraber@stgraber.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 162 ++++++++++++++++++
 1 file changed, 162 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index fc4910d35342..0bf02d04fe15 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -293,6 +293,10 @@ struct seccomp_notif_addfd_big {
 #define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS
+#define SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS (1UL << 6)
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -4408,6 +4412,164 @@ TEST(user_notification_sync)
 	ASSERT_EQ(status, 0);
 }
 
+/* from kernel/seccomp.c */
+#define MAX_LISTENERS_PER_PATH 8
+
+TEST(user_notification_nested_limits)
+{
+	pid_t pid;
+	long ret;
+	int i, status, listeners[MAX_LISTENERS_PER_PATH];
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	/* Install 6 levels of listeners and allow nesting. */
+	for (i = 0; i < 6; i++) {
+		listeners[i] = user_notif_syscall(__NR_getppid,
+						  SECCOMP_FILTER_FLAG_NEW_LISTENER |
+						  SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS);
+		ASSERT_GE(listeners[i], 0);
+
+		/* Add some no-op filters for grins. */
+		EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	}
+
+	/* Check behavior when nesting is not allowed. */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0) {
+		/* Install a next listener in the chain without nesting allowed. */
+		listeners[6] = user_notif_syscall(__NR_getppid,
+						 SECCOMP_FILTER_FLAG_NEW_LISTENER);
+		if (listeners[6] < 0)
+			exit(1);
+
+		/* Add some no-op filters for grins. */
+		ret = seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog);
+		if (ret != 0)
+			exit(2);
+
+		ret = user_notif_syscall(__NR_getppid,
+					 SECCOMP_FILTER_FLAG_NEW_LISTENER);
+		/* Installing a next listener in the chain should result in EBUSY. */
+		exit((ret >= 0 || errno != EBUSY) ? 3 : 0);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Install more filters with listeners to reach nesting levels limit. */
+	for (; i < MAX_LISTENERS_PER_PATH; i++) {
+		listeners[i] = user_notif_syscall(__NR_getppid,
+						  SECCOMP_FILTER_FLAG_NEW_LISTENER |
+						  SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS);
+		ASSERT_GE(listeners[i], 0);
+
+		/* Add some no-op filters for grins. */
+		EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	}
+
+	/* Installing a next listener in the chain should result in ELOOP. */
+	EXPECT_EQ(user_notif_syscall(__NR_getppid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, ELOOP);
+}
+
+TEST(user_notification_nested)
+{
+	pid_t pid;
+	long ret;
+	int i, status, listeners[6];
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	/* Install 6 levels of listeners and allow nesting. */
+	for (i = 0; i < 6; i++) {
+		listeners[i] = user_notif_syscall(__NR_getppid,
+						  SECCOMP_FILTER_FLAG_NEW_LISTENER |
+						  SECCOMP_FILTER_FLAG_ALLOW_NESTED_LISTENERS);
+		ASSERT_GE(listeners[i], 0);
+
+		/* Add some no-op filters for grins. */
+		EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	}
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getppid);
+		exit(ret != (USER_NOTIF_MAGIC-3));
+	}
+
+	/*
+	 * We want to have the following picture:
+	 *
+	 * | Listener level (i) | Listener decision |
+	 * |--------------------|-------------------|
+	 * |	     0		|      WHATEVER     |
+	 * |	     1		|      WHATEVER     |
+	 * |	     2		|      WHATEVER     |
+	 * |	     3		|       RETURN      | <-- stop here
+	 * |	     4		|  CONTINUE SYSCALL |
+	 * |	     5		|  CONTINUE SYSCALL | <- start here (current->seccomp.filter)
+	 *
+	 * First listener who receives a notification is level 5, then 4,
+	 * then we expect to stop on level 3 and return from syscall with
+	 * (USER_NOTIF_MAGIC - 3) return value.
+	 */
+	for (i = 6 - 1; i >= 3; i--) {
+		memset(&req, 0, sizeof(req));
+		EXPECT_EQ(ioctl(listeners[i], SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+		EXPECT_EQ(req.pid, pid);
+		EXPECT_EQ(req.data.nr,  __NR_getppid);
+
+		memset(&resp, 0, sizeof(resp));
+		resp.id = req.id;
+
+		if (i == 5 || i == 4) {
+			resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+		} else {
+			resp.error = 0;
+			resp.val = USER_NOTIF_MAGIC - i;
+		}
+
+		EXPECT_EQ(ioctl(listeners[i], SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	for (i = 0; i < 6; i++)
+		close(listeners[i]);
+}
 
 /* Make sure PTRACE_O_SUSPEND_SECCOMP requires CAP_SYS_ADMIN. */
 FIXTURE(O_SUSPEND_SECCOMP) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 4/6] seccomp: handle multiple listeners case
  2025-12-02 11:51 ` [PATCH v2 4/6] seccomp: handle multiple listeners case Alexander Mikhalitsyn
@ 2025-12-03 15:29   ` Aleksandr Mikhalitsyn
  2025-12-04 15:18     ` Tycho Andersen
  0 siblings, 1 reply; 8+ messages in thread
From: Aleksandr Mikhalitsyn @ 2025-12-03 15:29 UTC (permalink / raw)
  To: kees
  Cc: linux-kernel, bpf, Andy Lutomirski, Will Drewry, Jonathan Corbet,
	Shuah Khan, Aleksa Sarai, Tycho Andersen, Andrei Vagin,
	Christian Brauner, Stéphane Graber, Tycho Andersen

On Tue, Dec 2, 2025 at 12:52 PM Alexander Mikhalitsyn
<aleksandr.mikhalitsyn@canonical.com> wrote:
>
> If we have more than one listener in the tree and lower listener
> wants us to continue syscall (SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> we must consult with upper listeners first, otherwise it is a
> clear seccomp restrictions bypass scenario.
>
> Cc: linux-kernel@vger.kernel.org
> Cc: bpf@vger.kernel.org
> Cc: Kees Cook <kees@kernel.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Will Drewry <wad@chromium.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Tycho Andersen <tycho@tycho.pizza>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Stéphane Graber <stgraber@stgraber.org>
> Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> ---
>  kernel/seccomp.c | 33 +++++++++++++++++++++++++++++++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ded3f6a6430b..262390451ff1 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -448,8 +448,21 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
>
>                 if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
>                         ret = cur_ret;
> +                       /*
> +                        * No matter what we had before in matches->filters[],
> +                        * we need to overwrite it, because current action is more
> +                        * restrictive than any previous one.
> +                        */
>                         matches->n = 1;
>                         matches->filters[0] = f;
> +               } else if ((ACTION_ONLY(cur_ret) == ACTION_ONLY(ret)) &&
> +                           ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) {

My bad. We also have to check f->notif in there like that:

                } else if ((ACTION_ONLY(cur_ret) == ACTION_ONLY(ret)) &&
-                           ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) {
+                          (ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) &&
+                          f->notif) {
                        /*

After Kees's comment I have some idea about how to potentially get rid
of matches->filters static
array. I'll try to rework this.

> +                       /*
> +                        * For multiple SECCOMP_RET_USER_NOTIF results, we need to
> +                        * track all filters that resulted in the same action, because
> +                        * we might need to notify a few of them to get a final decision.
> +                        */
> +                       matches->filters[matches->n++] = f;
>                 }
>         }
>         return ret;
> @@ -1362,8 +1375,24 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)
>                 return 0;
>
>         case SECCOMP_RET_USER_NOTIF:
> -               if (seccomp_do_user_notification(match, &sd))
> -                       goto skip;
> +               for (unsigned char i = 0; i < matches.n; i++) {
> +                       match = matches.filters[i];
> +                       /*
> +                        * If userspace wants us to skip this syscall, do so.
> +                        * But if userspace wants to continue syscall, we
> +                        * must consult with the upper-level filters listeners
> +                        * and act accordingly.
> +                        *
> +                        * Note, that if there are multiple filters returned
> +                        * SECCOMP_RET_USER_NOTIF, and final result is
> +                        * SECCOMP_RET_USER_NOTIF too, then seccomp_run_filters()
> +                        * has populated matches.filters[] array with all of them
> +                        * in order from the lowest-level (closest to a
> +                        * current->seccomp.filter) to the highest-level.
> +                        */
> +                       if (seccomp_do_user_notification(match, &sd))
> +                               goto skip;
> +               }
>
>                 return 0;
>
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 4/6] seccomp: handle multiple listeners case
  2025-12-03 15:29   ` Aleksandr Mikhalitsyn
@ 2025-12-04 15:18     ` Tycho Andersen
  2025-12-10 23:18       ` Aleksandr Mikhalitsyn
  0 siblings, 1 reply; 8+ messages in thread
From: Tycho Andersen @ 2025-12-04 15:18 UTC (permalink / raw)
  To: Aleksandr Mikhalitsyn
  Cc: kees, linux-kernel, bpf, Andy Lutomirski, Will Drewry,
	Jonathan Corbet, Shuah Khan, Aleksa Sarai, Andrei Vagin,
	Christian Brauner, Stéphane Graber

On Wed, Dec 03, 2025 at 04:29:49PM +0100, Aleksandr Mikhalitsyn wrote:
> On Tue, Dec 2, 2025 at 12:52 PM Alexander Mikhalitsyn
> <aleksandr.mikhalitsyn@canonical.com> wrote:
> >
> > If we have more than one listener in the tree and lower listener
> > wants us to continue syscall (SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> > we must consult with upper listeners first, otherwise it is a
> > clear seccomp restrictions bypass scenario.
> >
> > Cc: linux-kernel@vger.kernel.org
> > Cc: bpf@vger.kernel.org
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: Andy Lutomirski <luto@amacapital.net>
> > Cc: Will Drewry <wad@chromium.org>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > Cc: Tycho Andersen <tycho@tycho.pizza>
> > Cc: Andrei Vagin <avagin@gmail.com>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Stéphane Graber <stgraber@stgraber.org>
> > Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
> > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> > ---
> >  kernel/seccomp.c | 33 +++++++++++++++++++++++++++++++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index ded3f6a6430b..262390451ff1 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -448,8 +448,21 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
> >
> >                 if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
> >                         ret = cur_ret;
> > +                       /*
> > +                        * No matter what we had before in matches->filters[],
> > +                        * we need to overwrite it, because current action is more
> > +                        * restrictive than any previous one.
> > +                        */
> >                         matches->n = 1;
> >                         matches->filters[0] = f;
> > +               } else if ((ACTION_ONLY(cur_ret) == ACTION_ONLY(ret)) &&
> > +                           ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) {
> 
> My bad. We also have to check f->notif in there like that:

For my own education: why is that? Shouldn't
seccomp_do_user_notification() be smart enough to catch this case (and
indeed, there is a TOCTOU if you do it here?)?

Thanks,

Tycho

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 4/6] seccomp: handle multiple listeners case
  2025-12-04 15:18     ` Tycho Andersen
@ 2025-12-10 23:18       ` Aleksandr Mikhalitsyn
  0 siblings, 0 replies; 8+ messages in thread
From: Aleksandr Mikhalitsyn @ 2025-12-10 23:18 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: kees, linux-kernel, bpf, Andy Lutomirski, Will Drewry,
	Jonathan Corbet, Shuah Khan, Aleksa Sarai, Andrei Vagin,
	Christian Brauner, Stéphane Graber

On Thu, Dec 4, 2025 at 4:18 PM Tycho Andersen <tycho@kernel.org> wrote:
>
> On Wed, Dec 03, 2025 at 04:29:49PM +0100, Aleksandr Mikhalitsyn wrote:
> > On Tue, Dec 2, 2025 at 12:52 PM Alexander Mikhalitsyn
> > <aleksandr.mikhalitsyn@canonical.com> wrote:
> > >
> > > If we have more than one listener in the tree and lower listener
> > > wants us to continue syscall (SECCOMP_USER_NOTIF_FLAG_CONTINUE)
> > > we must consult with upper listeners first, otherwise it is a
> > > clear seccomp restrictions bypass scenario.
> > >
> > > Cc: linux-kernel@vger.kernel.org
> > > Cc: bpf@vger.kernel.org
> > > Cc: Kees Cook <kees@kernel.org>
> > > Cc: Andy Lutomirski <luto@amacapital.net>
> > > Cc: Will Drewry <wad@chromium.org>
> > > Cc: Jonathan Corbet <corbet@lwn.net>
> > > Cc: Shuah Khan <shuah@kernel.org>
> > > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > > Cc: Tycho Andersen <tycho@tycho.pizza>
> > > Cc: Andrei Vagin <avagin@gmail.com>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Stéphane Graber <stgraber@stgraber.org>
> > > Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
> > > Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> > > ---
> > >  kernel/seccomp.c | 33 +++++++++++++++++++++++++++++++--
> > >  1 file changed, 31 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > index ded3f6a6430b..262390451ff1 100644
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -448,8 +448,21 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
> > >
> > >                 if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
> > >                         ret = cur_ret;
> > > +                       /*
> > > +                        * No matter what we had before in matches->filters[],
> > > +                        * we need to overwrite it, because current action is more
> > > +                        * restrictive than any previous one.
> > > +                        */
> > >                         matches->n = 1;
> > >                         matches->filters[0] = f;
> > > +               } else if ((ACTION_ONLY(cur_ret) == ACTION_ONLY(ret)) &&
> > > +                           ACTION_ONLY(cur_ret) == SECCOMP_RET_USER_NOTIF) {
> >
> > My bad. We also have to check f->notif in there like that:
>

Hi Tycho,

sorry for the delay with a reply.

> For my own education: why is that? Shouldn't
> seccomp_do_user_notification() be smart enough to catch this case (and
> indeed, there is a TOCTOU if you do it here?)?

seccomp_do_user_notification() is smart enough to handle the case when
a listener file descriptor was closed,
but a tricky part here is that SECCOMP_RET_USER_NOTIF can be (legally)
returned by the seccomp filter
program even when there was no listener at all.

Then, as nothing prevents you from loading a program like:
    BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF)
with
     seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog) // << no
SECCOMP_FILTER_FLAG_NEW_LISTENER flag

we can easily do OOB write in matches->filters[] array, because our
limitation with 8 elements only works for those who
set the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.

Actually, I decided to get rid of this array in the next version of patchset.

Hope to virtually meet you at LPC soon! ;)

Kind regards,
Alex

>
> Thanks,
>
> Tycho

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-12-10 23:18 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-02 11:51 [PATCH v2 0/6] seccomp: support nested listeners Alexander Mikhalitsyn
2025-12-02 11:51 ` [PATCH v2 3/6] seccomp: limit number of listeners in seccomp tree Alexander Mikhalitsyn
2025-12-02 11:51 ` [PATCH v2 4/6] seccomp: handle multiple listeners case Alexander Mikhalitsyn
2025-12-03 15:29   ` Aleksandr Mikhalitsyn
2025-12-04 15:18     ` Tycho Andersen
2025-12-10 23:18       ` Aleksandr Mikhalitsyn
2025-12-02 11:51 ` [PATCH v2 5/6] seccomp: relax has_duplicate_listeners check Alexander Mikhalitsyn
2025-12-02 11:51 ` [PATCH v2 6/6] tools/testing/selftests/seccomp: test nested listeners Alexander Mikhalitsyn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox