linux-security-module.vger.kernel.org archive mirror
* [RFC 0/2] landlock: Multithreaded policy enforcement
@ 2025-02-21 18:44 Günther Noack
  2025-02-21 18:44 ` [RFC 1/2] landlock: Multithreading support for landlock_restrict_self() Günther Noack
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Günther Noack @ 2025-02-21 18:44 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Günther Noack, Paul Moore, Konstantin Meskhidze, Jann Horn

Hello!

This patch set adds the LANDLOCK_RESTRICT_SELF_TSYNC flag to
landlock_restrict_self().  With this flag, the passed Landlock ruleset
will not only be applied to the calling thread, but to all threads
which belong to the same process.
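
For illustration, the intended usage from user space would look roughly
like this (a minimal sketch using raw syscall(2) invocations; headers
and error handling omitted):

  struct landlock_ruleset_attr attr = {
          .handled_access_fs = LANDLOCK_ACCESS_FS_WRITE_FILE,
  };
  int ruleset_fd = syscall(__NR_landlock_create_ruleset,
                           &attr, sizeof(attr), 0);

  prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

  /* With the new flag, this restricts *all* threads of the
   * process, not just the calling one. */
  syscall(__NR_landlock_restrict_self, ruleset_fd,
          LANDLOCK_RESTRICT_SELF_TSYNC);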

I am sending this intentionally early.  At this point, I am mostly
looking for high-level comments to check whether the general approach
is feasible at all and whether I am using appropriate concurrency
mechanisms for it.

Motivation
==========

TL;DR: The libpsx/nptl(7) signal hack which we use in user space for
multi-threaded Landlock enforcement is incompatible with Landlock's
signal scoping support.  Landlock can restrict the use of signals
across Landlock domains, but we need signals ourselves in user space
in ways that are no longer permitted under these restrictions.

Enabling Landlock proves to be difficult in processes that are already
multi-threaded at the time of enforcement:

* Enforcement in only one thread is usually a mistake because threads
  do not normally have proper security boundaries between them.

* Also, multithreading is unavoidable in some circumstances, such as
  when using Landlock from a Go program.  Go programs are already
  multithreaded by the time they enter func main().

So far, the approach in Go[1] was to use libpsx[2].  This library
implements the mechanism described in nptl(7) [3]: It keeps track of
all threads with a linker hack and then makes all threads do the same
syscall by registering a signal handler for them and invoking it.
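
(Much simplified, and not libpsx's actual code, the user-space trick
boils down to a signal handler that performs the syscall, plus
signalling every tracked thread; the variable and function names below
are made up:)

  static int shared_ruleset_fd;          /* set up before signalling */
  static pthread_t tracked[128];         /* filled via the linker hack */
  static int num_tracked;

  static void apply_ruleset(int sig)
  {
          /* Runs in the context of each signalled thread. */
          syscall(__NR_landlock_restrict_self, shared_ruleset_fd, 0);
  }

  static void restrict_all_threads_in_userspace(void)
  {
          struct sigaction sa = { .sa_handler = apply_ruleset };
          int i;

          sigaction(SIGRTMIN, &sa, NULL);
          for (i = 0; i < num_tracked; i++)
                  pthread_kill(tracked[i], SIGRTMIN);
          apply_ruleset(SIGRTMIN);  /* ...and the calling thread itself */
          /* A real implementation also waits for all handlers to run. */
  }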

With commit 54a6e6bbf3be ("landlock: Add signal scoping"), Landlock
gained the ability to restrict the use of signals across different
Landlock domains.

Landlock's signal scoping support is incompatible with the libpsx
approach of enabling Landlock:

(1) With libpsx, although all threads enforce the same ruleset object,
    they technically do the operation separately and end up in
    distinct Landlock domains.  When the ruleset uses
    LANDLOCK_SCOPE_SIGNAL, it becomes impossible to use cross-thread
    signals in that process.

(2) Cross-thread signals are themselves needed to enforce further
    nested Landlock domains across multiple threads.  So nested
    Landlock policies become impossible there.

In addition to Landlock itself, cross-thread signals are also needed
for other seemingly-harmless API calls like setuid(2) [4] and for the
use of libcap (co-developed with libpsx), which share the same
problem: the underlying syscall only applies to the calling thread.

Implementation approach
=======================

The implementation is inspired by Jann Horn's earlier patch set for
cred_transfer() [5] as well as by the libpsx approach described in [2]
and [3].  In some way, it is a libpsx-like implementation in the
kernel:

When the flag is provided, the calling thread adds a task_work to all
threads to be executed through a pseudo-signal and makes them run it.
The logic is in the function restrict_one_thread().  The threads
execute the following phases in lock step (with a barrier
synchronization in between):

* Preparation Phase
  * Check that the thread meets all prerequisites
  * Create a new struct cred with prepare_creds()
  * Manipulate the struct cred as needed for policy enforcement
  * Write back possible errors to a shared memory location
* BARRIER: Wait for all threads to catch up to this point
* Commit Phase
  * Check that none of the threads had an error.
  * Invoke commit_creds() (or abort_creds() in case of error)

None of the functions used in the commit phase can return errors, so
by the time each thread decides to commit, we already know that the
commit will succeed across all threads.

At the end, the calling thread waits for all task_work to finish
before returning from the syscall.

(The calling thread itself obviously does not schedule a task_work for
itself, but executes the same restrict_one_thread() logic inline.)
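
(Condensed from restrict_one_thread() in the first patch, the
per-thread logic looks like this:)

  cred = landlock_prepare_creds(ctx->old_dom, ctx->new_dom);
  if (IS_ERR(cred))
          atomic_set(&ctx->preparation_error, PTR_ERR(cred));

  /* Barrier: wait until all threads are done preparing. */
  if (atomic_dec_return(&ctx->num_preparing) == 0)
          complete_all(&ctx->all_prepared);
  wait_for_completion(&ctx->all_prepared);

  /* Commit or abort; neither of these can fail any more. */
  if (atomic_read(&ctx->preparation_error)) {
          if (!IS_ERR(cred))
                  abort_creds(cred);
  } else {
          commit_creds(cred);
  }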

Open questions
==============

The patch set is in early stages, and there are some open questions
for which I am seeking feedback:

Conceptual open questions
~~~~~~~~~~~~~~~~~~~~~~~~~

* To what extent should we expect the threads to be in a known state?

  The single-threaded variant currently requires the
  PR_SET_NO_NEW_PRIVS flag or CAP_SYS_ADMIN as prerequisite for
  enforcing a Landlock ruleset.

  When we are in the multi-threaded variant, there are two main approaches:

  a) Expect same state on all threads: The simplest implementation
     would be that we expect all threads to also have
     PR_SET_NO_NEW_PRIVS or CAP_SYS_ADMIN, and to already be part of
     the same Landlock domain.  Otherwise, the operation is aborted on
     all threads.
  
  b) Pull all threads towards the same state: The 'synchronization'
     option would be that we implicitly establish the same
     PR_SET_NO_NEW_PRIVS and Landlock domain configuration on all
     threads.

     Weird case: If the calling thread has CAP_SYS_ADMIN but not
     PR_SET_NO_NEW_PRIVS, does this mean that a Landlock domain can be
     enabled on a thread without PR_SET_NO_NEW_PRIVS?  (We surely
     should not implicitly grant CAP_SYS_ADMIN to another thread?)
  
  Solutions in between these two extremes might also be possible.
  Depending on the approach, we might want to give the flag a name
  other than ..._TSYNC.  (That name is just carried over from the
  similarly named SECCOMP_FILTER_FLAG_TSYNC.)

Implementation open questions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Are there better synchronization mechanisms for this purpose than
  task_work with barriers?

* Need to double check deadlock scenarios.

* Need to double check that we are not racing with a concurrent thread
  creation.

* Use the multi-threaded code for the single-threaded variant as well?

  At the current stage, I left the single-threaded code intact and
  copied some of its logic into the multi-threaded variant.  It is
  technically possible to unify this, if we are OK with the
  single-threaded code using atomic operations and struct completion all by
  itself (or by interleaving many if-checks, but not sure whether
  that's worth it).

* Does landlock_restrict_self() need to be interruptible?

  (For reference, Jann's patch [5] used
  wait_for_completion_interruptible())


[1] https://github.com/landlock-lsm/go-landlock
[2] https://sites.google.com/site/fullycapable/who-ordered-libpsx
[3] https://man.gnoack.org/7/nptl
[4] https://man.gnoack.org/2/setuid#VERSIONS
[5] https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/

Günther Noack (2):
  landlock: Multithreading support for landlock_restrict_self()
  landlock: selftests for LANDLOCK_RESTRICT_SELF_TSYNC

 include/uapi/linux/landlock.h                 |  10 +
 security/landlock/syscalls.c                  | 232 +++++++++++++++++-
 tools/testing/selftests/landlock/tsync_test.c |  94 +++++++
 3 files changed, 331 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/landlock/tsync_test.c


base-commit: 69e858e0b8b2ea07759e995aa383e8780d9d140c
-- 
2.48.1



* [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-02-21 18:44 [RFC 0/2] landlock: Multithreaded policy enforcement Günther Noack
@ 2025-02-21 18:44 ` Günther Noack
  2025-02-27 20:53   ` Mickaël Salaün
  2025-02-21 18:44 ` [RFC 2/2] landlock: selftests for LANDLOCK_RESTRICT_SELF_TSYNC Günther Noack
  2025-02-27 20:52 ` [RFC 0/2] landlock: Multithreaded policy enforcement Mickaël Salaün
  2 siblings, 1 reply; 18+ messages in thread
From: Günther Noack @ 2025-02-21 18:44 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Günther Noack, Paul Moore, Konstantin Meskhidze, Jann Horn

Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag.  With this flag, a
given Landlock ruleset is applied to all threads of the calling
process, instead of only the current one.

Signed-off-by: Günther Noack <gnoack3000@gmail.com>
---
 include/uapi/linux/landlock.h |  10 ++
 security/landlock/syscalls.c  | 232 +++++++++++++++++++++++++++++++++-
 2 files changed, 237 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
index 33745642f7875..fb971e4e0fb2b 100644
--- a/include/uapi/linux/landlock.h
+++ b/include/uapi/linux/landlock.h
@@ -62,6 +62,16 @@ struct landlock_ruleset_attr {
 #define LANDLOCK_CREATE_RULESET_VERSION			(1U << 0)
 /* clang-format on */
 
+/*
+ * sys_landlock_restrict_self() flags:
+ *
+ * - %LANDLOCK_RESTRICT_SELF_TSYNC: Apply the given ruleset to all threads of
+ *    the current process.
+ */
+/* clang-format off */
+#define LANDLOCK_RESTRICT_SELF_TSYNC			(1U << 0)
+/* clang-format on */
+
 /**
  * enum landlock_rule_type - Landlock rule type
  *
diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
index a9760d252fc2d..63792a6cde5ca 100644
--- a/security/landlock/syscalls.c
+++ b/security/landlock/syscalls.c
@@ -23,6 +23,7 @@
 #include <linux/security.h>
 #include <linux/stddef.h>
 #include <linux/syscalls.h>
+#include <linux/task_work.h>
 #include <linux/types.h>
 #include <linux/uaccess.h>
 #include <uapi/linux/landlock.h>
@@ -425,17 +426,233 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
 
 /* Enforcement */
 
+/*
+ * Shared state between multiple threads which are enforcing Landlock rulesets
+ * in lockstep with each other.
+ */
+struct landlock_tsync_shared_context {
+	/* Expected existing Landlock domain on every thread. */
+	struct landlock_ruleset *old_dom;
+
+	/* Replacement Landlock domain to be applied if prerequisites are met */
+	struct landlock_ruleset *new_dom;
+
+	/* Barrier after preparation step. */
+	atomic_t num_preparing;
+	struct completion all_prepared;
+
+	/* An error encountered in preparation step, or 0. */
+	atomic_t preparation_error;
+
+	/*
+	 * Barrier after commit step (used by syscall impl to wait for
+	 * completion).
+	 */
+	atomic_t num_unfinished;
+	struct completion all_finished;
+};
+
+struct landlock_tsync_context {
+	struct callback_head work;
+	struct landlock_tsync_shared_context *ctx;
+};
+
+static struct cred *
+landlock_prepare_creds(const struct landlock_ruleset *old_dom,
+		       struct landlock_ruleset *new_dom)
+{
+	struct cred *new_cred;
+	struct landlock_cred_security *new_llcred;
+
+	/*
+	 * Similar checks as for seccomp(2), except that an -EPERM may be
+	 * returned.
+	 */
+	if (!task_no_new_privs(current) &&
+	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	/* Prepares new credentials. */
+	new_cred = prepare_creds();
+	if (!new_cred)
+		return ERR_PTR(-ENOMEM);
+
+	new_llcred = landlock_cred(new_cred);
+	if (new_llcred->domain != old_dom) {
+		abort_creds(new_cred);
+		return ERR_PTR(-ESRCH);
+	}
+
+	/* Replaces the old (prepared) domain. */
+	landlock_put_ruleset(new_llcred->domain);
+	landlock_get_ruleset(new_dom);
+	new_llcred->domain = new_dom;
+
+	return new_cred;
+}
+
+/*
+ * restrict_one_thread - update a thread's Landlock domain in lockstep with the
+ * other threads in the same process.
+ *
+ * When this is run, the same function gets run in all other threads in the same
+ * process.  The concurrently running function invocations coordinate to wait
+ * until they are all done with the preparation step and have reported back
+ * errors from the preparation step, if necessary.
+ *
+ * Afterwards, depending on the presence of an error, all threads either commit
+ * or abort the prepared credentials.  The commit operation can not fail any more.
+ */
+static void restrict_one_thread(struct landlock_tsync_shared_context *ctx)
+{
+	int res;
+	struct cred *cred = landlock_prepare_creds(ctx->old_dom, ctx->new_dom);
+
+	/* On error, set the error and continue. (Do not return early.) */
+	if (IS_ERR(cred))
+		atomic_set(&ctx->preparation_error, PTR_ERR(cred));
+
+	/*
+	 * Barrier: Wait until all threads are done preparing.
+	 * After this point, we can have no more failures.
+	 */
+	if (atomic_dec_return(&ctx->num_preparing) == 0)
+		complete_all(&ctx->all_prepared);
+	wait_for_completion(&ctx->all_prepared);
+
+	/*
+	 * Abort the commit if any of the threads had an error.
+	 * (It might also be this thread.)  Otherwise, commit.
+	 */
+	res = atomic_read(&ctx->preparation_error);
+	if (res) {
+		if (!IS_ERR(cred))
+			abort_creds(cred);
+	} else {
+		commit_creds(cred);
+	}
+
+	/* Notify the calling thread once all threads are done */
+	if (atomic_dec_return(&ctx->num_unfinished) == 0)
+		complete_all(&ctx->all_finished);
+}
+
+/*
+ * restrict_one_thread_callback - task_work callback for restricting a thread
+ *
+ * Calls restrict_one_thread with the struct landlock_shared_tsync_context
+ * and frees up the per-work_task landlock_tsync_context afterwards.
+ */
+static void restrict_one_thread_callback(struct callback_head *work)
+{
+	struct landlock_tsync_context *ctx =
+		container_of(work, struct landlock_tsync_context, work);
+	struct landlock_tsync_shared_context *sctx = ctx->ctx;
+
+	restrict_one_thread(sctx);
+	kfree(ctx);
+}
+
+/*
+ * restrict_all_threads - enables a Landlock policy for all threads
+ */
+static int restrict_all_threads(const int ruleset_fd)
+{
+	int res;
+	struct task_struct *thread, *caller;
+	struct landlock_tsync_shared_context sctx;
+	struct landlock_ruleset *ruleset;
+
+	/*
+	 * XXX: Need to double check race conditions and deadlocks before
+	 * merging this upstream. We probably need additional locking.
+	 */
+
+	/* Starting with 1, as we're already counting current. */
+	atomic_set(&sctx.num_preparing, 1);
+	atomic_set(&sctx.num_unfinished, 1);
+	init_completion(&sctx.all_prepared);
+	init_completion(&sctx.all_finished);
+	atomic_set(&sctx.preparation_error, 0);
+	sctx.old_dom = landlock_get_current_domain();
+
+	/* Gets and checks the ruleset. */
+	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
+	if (IS_ERR(ruleset))
+		return PTR_ERR(ruleset);
+
+	/*
+	 * We pre-merge the domain for all threads,
+	 * so that all threads end up with exactly the same domain.
+	 */
+	sctx.new_dom = landlock_merge_ruleset(sctx.old_dom, ruleset);
+	landlock_put_ruleset(ruleset);
+	if (IS_ERR(sctx.new_dom))
+		return PTR_ERR(sctx.new_dom);
+
+	landlock_get_ruleset(sctx.old_dom);
+
+	caller = current;
+	for_each_thread(caller, thread) {
+		/* Skip current, since it is initiating the sync. */
+		if (thread == caller)
+			continue;
+
+		/* Skip exited threads. */
+		if (thread->flags & PF_EXITING)
+			continue;
+
+		/* Deallocation is done by the task_work itself. */
+		struct landlock_tsync_context *ctx =
+			kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
+		if (!ctx) {
+			/*
+			 * On error, keep already started threads from
+			 * accidentally committing.  Do not start additional
+			 * threads.
+			 */
+			atomic_set(&sctx.preparation_error, -ENOMEM);
+			break;
+		}
+
+		ctx->ctx = &sctx;
+		atomic_inc(&sctx.num_preparing);
+		atomic_inc(&sctx.num_unfinished);
+		init_task_work(&ctx->work, restrict_one_thread_callback);
+		res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
+		if (res) {
+			atomic_set(&sctx.preparation_error, res);
+			atomic_dec(&sctx.num_preparing);
+			atomic_dec(&sctx.num_unfinished);
+		}
+	}
+
+	/* and do the same on the current thread */
+	restrict_one_thread(&sctx);
+
+	res = atomic_read(&sctx.preparation_error);
+	wait_for_completion(&sctx.all_finished);
+	landlock_put_ruleset(sctx.new_dom);
+	landlock_put_ruleset(sctx.old_dom);
+	return res;
+}
+
 /**
  * sys_landlock_restrict_self - Enforce a ruleset on the calling thread
  *
  * @ruleset_fd: File descriptor tied to the ruleset to merge with the target.
- * @flags: Must be 0.
+ * @flags: Flags to modify the behavior.
  *
- * This system call enables to enforce a Landlock ruleset on the current
- * thread.  Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
+ * This system call enforces a Landlock ruleset on the current thread.
+ * Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
  * namespace or is running with no_new_privs.  This avoids scenarios where
  * unprivileged tasks can affect the behavior of privileged children.
  *
+ * If %LANDLOCK_RESTRICT_SELF_TSYNC is specified in @flags, the ruleset will be
+ * applied to all threads of the current process.  For this, all threads must be
+ * in the same Landlock domain and fulfill the normal requirements for enforcing
+ * a Landlock ruleset.
+ *
  * Possible returned errors are:
  *
  * - %EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time;
@@ -447,6 +664,8 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
  *   %CAP_SYS_ADMIN in its namespace.
  * - %E2BIG: The maximum number of stacked rulesets is reached for the current
  *   thread.
+ * - %ESRCH: When called with %LANDLOCK_RESTRICT_SELF_TSYNC, the processes'
+ *   threads were in different Landlock domains.
  */
 SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
 		flags)
@@ -467,10 +686,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
 	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
 		return -EPERM;
 
-	/* No flag for now. */
-	if (flags)
+	/* TSYNC is the only supported flag. */
+	if (flags & ~LANDLOCK_RESTRICT_SELF_TSYNC)
 		return -EINVAL;
 
+	if (flags & LANDLOCK_RESTRICT_SELF_TSYNC)
+		return restrict_all_threads(ruleset_fd);
+
 	/* Gets and checks the ruleset. */
 	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
 	if (IS_ERR(ruleset))
-- 
2.48.1



* [RFC 2/2] landlock: selftests for LANDLOCK_RESTRICT_SELF_TSYNC
  2025-02-21 18:44 [RFC 0/2] landlock: Multithreaded policy enforcement Günther Noack
  2025-02-21 18:44 ` [RFC 1/2] landlock: Multithreading support for landlock_restrict_self() Günther Noack
@ 2025-02-21 18:44 ` Günther Noack
  2025-02-27 20:52 ` [RFC 0/2] landlock: Multithreaded policy enforcement Mickaël Salaün
  2 siblings, 0 replies; 18+ messages in thread
From: Günther Noack @ 2025-02-21 18:44 UTC (permalink / raw)
  To: linux-security-module, Mickaël Salaün
  Cc: Günther Noack, Paul Moore, Konstantin Meskhidze, Jann Horn

Exercise various scenarios where Landlock domains are enforced across
all of a process's threads.

Signed-off-by: Günther Noack <gnoack3000@gmail.com>
---
 tools/testing/selftests/landlock/tsync_test.c | 94 +++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 tools/testing/selftests/landlock/tsync_test.c

diff --git a/tools/testing/selftests/landlock/tsync_test.c b/tools/testing/selftests/landlock/tsync_test.c
new file mode 100644
index 0000000000000..e8261a58c13ba
--- /dev/null
+++ b/tools/testing/selftests/landlock/tsync_test.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - Enforcing the same restrictions across multiple threads
+ *
+ * Copyright © 2025 Günther Noack <gnoack3000@gmail.com>
+ */
+
+#define _GNU_SOURCE
+#include <pthread.h>
+#include <sys/prctl.h>
+#include <linux/landlock.h>
+
+#include "common.h"
+
+/* create_ruleset - Create a simple ruleset FD common to all tests */
+static int create_ruleset(struct __test_metadata *const _metadata)
+{
+	struct landlock_ruleset_attr ruleset_attr = {
+		.handled_access_fs = (LANDLOCK_ACCESS_FS_WRITE_FILE |
+				      LANDLOCK_ACCESS_FS_TRUNCATE),
+	};
+	const int ruleset_fd =
+		landlock_create_ruleset(&ruleset_attr, sizeof(ruleset_attr), 0);
+
+	ASSERT_LE(0, ruleset_fd);
+	return ruleset_fd;
+}
+
+TEST(single_threaded_success)
+{
+	const int ruleset_fd = create_ruleset(_metadata);
+
+	disable_caps(_metadata);
+
+	ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+	ASSERT_EQ(0, landlock_restrict_self(ruleset_fd,
+					    LANDLOCK_RESTRICT_SELF_TSYNC));
+
+	ASSERT_EQ(0, close(ruleset_fd));
+}
+
+void *idle(void *data)
+{
+	while (true)
+		sleep(1);
+}
+
+TEST(multi_threaded_success)
+{
+	pthread_t t1, t2;
+	const int ruleset_fd = create_ruleset(_metadata);
+
+	disable_caps(_metadata);
+
+	ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+
+	ASSERT_EQ(0, pthread_create(&t1, NULL, idle, NULL));
+	ASSERT_EQ(0, pthread_create(&t2, NULL, idle, NULL));
+
+	EXPECT_EQ(0, landlock_restrict_self(ruleset_fd,
+					    LANDLOCK_RESTRICT_SELF_TSYNC));
+
+	ASSERT_EQ(0, pthread_cancel(t1));
+	ASSERT_EQ(0, pthread_cancel(t2));
+	ASSERT_EQ(0, pthread_join(t1, NULL));
+	ASSERT_EQ(0, pthread_join(t2, NULL));
+	ASSERT_EQ(0, close(ruleset_fd));
+}
+
+TEST(multi_threaded_failure)
+{
+	pthread_t t1;
+	const int ruleset_fd = create_ruleset(_metadata);
+
+	disable_caps(_metadata);
+
+	ASSERT_EQ(0, pthread_create(&t1, NULL, idle, NULL));
+
+	/*
+	 * landlock_restrict_self is expected to fail, because the other thread
+	 * has neither the NO_NEW_PRIVS flag nor CAP_SYS_ADMIN, and thus, it may
+	 * not enforce the Landlock ruleset.
+	 */
+	ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
+	EXPECT_EQ(-1, landlock_restrict_self(ruleset_fd,
+					     LANDLOCK_RESTRICT_SELF_TSYNC));
+	EXPECT_EQ(errno, EPERM);
+
+	ASSERT_EQ(0, pthread_cancel(t1));
+	ASSERT_EQ(0, pthread_join(t1, NULL));
+	ASSERT_EQ(0, close(ruleset_fd));
+}
+
+TEST_HARNESS_MAIN
-- 
2.48.1



* Re: [RFC 0/2] landlock: Multithreaded policy enforcement
  2025-02-21 18:44 [RFC 0/2] landlock: Multithreaded policy enforcement Günther Noack
  2025-02-21 18:44 ` [RFC 1/2] landlock: Multithreading support for landlock_restrict_self() Günther Noack
  2025-02-21 18:44 ` [RFC 2/2] landlock: selftests for LANDLOCK_RESTRICT_SELF_TSYNC Günther Noack
@ 2025-02-27 20:52 ` Mickaël Salaün
  2 siblings, 0 replies; 18+ messages in thread
From: Mickaël Salaün @ 2025-02-27 20:52 UTC (permalink / raw)
  To: Günther Noack, Kees Cook, Paul Moore
  Cc: linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel

On Fri, Feb 21, 2025 at 07:44:16PM +0100, Günther Noack wrote:
> Hello!
> 
> This patch set adds the LANDLOCK_RESTRICT_SELF_TSYNC flag to
> landlock_restrict_self().  With this flag, the passed Landlock ruleset
> will not only be applied to the calling thread, but to all threads
> which belong to the same process.
> 
> I am sending this intentionally early.  At this point, I am mostly
> looking for high-level comments to check whether the general approach
> is feasible at all and whether I am using appropriate concurrency
> mechanisms for it.

Thanks Günther!  This feature is indeed very important to be able to
enforce consistent sandboxing.

> 
> Motivation
> ==========
> 
> TL;DR: The libpsx/nptl(7) signal hack which we use in user space for
> multi-threaded Landlock enforcement is incompatible with Landlock's
> signal scoping support.  Landlock can restrict the use of signals
> across Landlock domains, but we need signals ourselves in user space
> in ways that are not permitted any more under these restrictions.
> 
> Enabling Landlock proves to be difficult in processes that are already
> multi-threaded at the time of enforcement:
> 
> * Enforcement in only one thread is usually a mistake because threads
>   do not normally have proper security boundaries between them.
> 
> * Also, multithreading is unavoidable in some circumstances, such as
>   when using Landlock from a Go program.  Go programs are already
>   multithreaded by the time that they enter the "func main()".
> 
> So far, the approach in Go[1] was to use libpsx[2].  This library
> implements the mechanism described in nptl(7) [3]: It keeps track of
> all threads with a linker hack and then makes all threads do the same
> syscall by registering a signal handler for them and invoking it.
> 
> With commit 54a6e6bbf3be ("landlock: Add signal scoping"), Landlock
> gained the ability to restrict the use of signals across different
> Landlock domains.
> 
> Landlock's signal scoping support is incompatible with the libpsx
> approach of enabling Landlock:
> 
> (1) With libpsx, although all threads enforce the same ruleset object,
>     they technically do the operation separately and end up in
>     distinct Landlock domains.  When the ruleset uses
>     LANDLOCK_SCOPE_SIGNAL, it becomes impossible to use cross-thread
>     signals in that process.
> 
> (2) Cross-thread Signals are themselves needed to enforce further
>     nested Landlock domains across multiple threads.  So nested
>     Landlock policies become impossible there.
> 
> In addition to Landlock itself, cross-thread signals are also needed
> for other seemingly-harmless API calls like the setuid(2) [4] and for
> the use of libcap (co-developed with libpsx), which have the same
> problem where the underlying syscall only applies to the calling
> thread.
> 
> Implementation approach
> =======================
> 
> The implementation is inspired by Jann Horn's earlier patch set for
> cred_transfer() [5] as well as by the libpsx approach described in [2]
> and [3].  In some way, it is a libpsx-like implementation in the
> kernel:
> 
> When the flag is provided, the calling thread adds a task_work to all
> threads to be executed through a pseudo-signal and makes them run it.
> The logic is in the function restrict_one_thread().  The threads
> execute the following phases in lock step (with a barrier
> synchronization in between):
> 
> * Preparation Phase
>   * Check that the thread meets all prerequisites
>   * Create a new struct cred with prepare_creds()
>   * Manipulate the struct cred as needed for policy enforcement
>   * Write back possible errors to a shared memory location
> * BARRIER: Wait for all threads to catch up to this point
> * Commit Phase
>   * Check that none of the threads had an error.
>   * Invoke commit_creds() (or abort_creds() in case of error)
> 
> None of the functions used in the commit_phase() can return errors, so
> at the time where each thread decides to commit, we are already sure
> that it will work across all threads.
> 
> At the end, the calling thread waits for all task_work to finish
> before returning from the syscall.
> 
> (The calling thread itself obviously does not schedule a task_work for
> itself, but executes the same restrict_one_thread() logic inline.)
> 
> Open questions
> ==============
> 
> The patch set is in early stages, and there are some open questions
> for which I am seeking feedback:
> 
> Conceptual open questions
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> * To what extent should we expect the threads to be in a known state?
> 
>   The single-threaded variant currently requires the
>   PR_SET_NO_NEW_PRIVS flag or CAP_SYS_ADMIN as prerequisite for
>   enforcing a Landlock ruleset.
> 
>   When we are in the multi-threaded variant, there are two main approaches:
> 
>   a) Expect same state on all threads: The simplest implementation
>      would be that we expect all threads to also have
>      PR_SET_NO_NEW_PRIVS or CAP_SYS_ADMIN, and to already be part of
>      the same Landlock domain.  Otherwise, the operation is aborted on
>      all threads.
>   
>   b) Pull all threads towards the same state: The 'synchronization'
>      option would be that we implicitly establish the same
>      PR_SET_NO_NEW_PRIVS and Landlock domain configuration on all
>      threads.
> 
>      Weird case: If the calling thread has CAP_SYS_ADMIN but not
>      PR_SET_NO_NEW_PRIVS, does this mean that a Landlock domain can be
>      enabled on a thread without PR_SET_NO_NEW_PRIVS?  (We surely
>      should not implicitly grant CAP_SYS_ADMIN to another thread?)

We should assume that there are no security boundaries between threads,
but we should also minimize inconsistent results.  So, no, we should not
grant any particular privileges/capabilities to other threads but only
drop them in the least surprising way for users.  See my proposal in reply
to the first patch.

>   
>   Solutions in the middle between these two might also be possible.
>   Depending on the approach, we might want to change the flag name to
>   say something else but ..._TSYNC. (That name is just carried over
>   from the similarly-named SECCOMP_FILTER_FLAG_TSYNC)

The issue with Landlock also exists with PR_SET_NO_NEW_PRIVS, so we need
to be able to synchronize both properties (as already possible with
seccomp).

> 
> Implementation open questions
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> * Are there better synchronization mechanisms for this purpose than
>   task_work with barriers?

I think it's the right approach, but the difficult part is handling
errors in a safe and consistent way.

> 
> * Need to double check deadlock scenarios.
> 
> * Need to double check that we are not racing with a concurrent thread
>   creation.
> 
> * Use the multi-threaded code for the single-threaded variant as well?
> 
>   At the current stage, I left the single-threaded code intact and
>   copied some of its logic into the multi-threaded variant.  It is
>   technically possible to unify this, if we are OK with the single
>   threaded code using atomic operations and struct completion all by
>   itself (or by interleaving many if-checks, but not sure whether
>   that's worth it).

Most of the single-threaded code can be reused.

> 
> * Does landlock_restrict_self() need to be interruptible?
> 
>   (For reference, Jann's patch [5] used
>   wait_for_completion_interruptible())

We should follow seccomp's footsteps as much as possible.

> 
> 
> [1] https://github.com/landlock-lsm/go-landlock
> [2] https://sites.google.com/site/fullycapable/who-ordered-libpsx
> [3] https://man.gnoack.org/7/nptl
> [4] https://man.gnoack.org/2/setuid#VERSIONS
> [5] https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/
> 
> Günther Noack (2):
>   landlock: Multithreading support for landlock_restrict_self()
>   landlock: selftests for LANDLOCK_RESTRICT_SELF_TSYNC
> 
>  include/uapi/linux/landlock.h                 |  10 +
>  security/landlock/syscalls.c                  | 232 +++++++++++++++++-
>  tools/testing/selftests/landlock/tsync_test.c |  94 +++++++
>  3 files changed, 331 insertions(+), 5 deletions(-)
>  create mode 100644 tools/testing/selftests/landlock/tsync_test.c
> 
> 
> base-commit: 69e858e0b8b2ea07759e995aa383e8780d9d140c
> -- 
> 2.48.1
> 


* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-02-21 18:44 ` [RFC 1/2] landlock: Multithreading support for landlock_restrict_self() Günther Noack
@ 2025-02-27 20:53   ` Mickaël Salaün
  2025-02-28 17:33     ` Günther Noack
  0 siblings, 1 reply; 18+ messages in thread
From: Mickaël Salaün @ 2025-02-27 20:53 UTC (permalink / raw)
  To: Günther Noack, Kees Cook, Paul Moore
  Cc: linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel

On Fri, Feb 21, 2025 at 07:44:17PM +0100, Günther Noack wrote:
> Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag.  With this flag, a
> given Landlock ruleset is applied to all threads of the calling
> process, instead of only the current one.

I initially thought that we might pick another flag name, but I now think
reusing the same name as for seccomp would be less confusing for users.

> 
> Signed-off-by: Günther Noack <gnoack3000@gmail.com>
> ---
>  include/uapi/linux/landlock.h |  10 ++
>  security/landlock/syscalls.c  | 232 +++++++++++++++++++++++++++++++++-
>  2 files changed, 237 insertions(+), 5 deletions(-)
> 
> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> index 33745642f7875..fb971e4e0fb2b 100644
> --- a/include/uapi/linux/landlock.h
> +++ b/include/uapi/linux/landlock.h
> @@ -62,6 +62,16 @@ struct landlock_ruleset_attr {
>  #define LANDLOCK_CREATE_RULESET_VERSION			(1U << 0)
>  /* clang-format on */
>  
> +/*
> + * sys_landlock_restrict_self() flags:
> + *
> + * - %LANDLOCK_RESTRICT_SELF_TSYNC: Apply the given ruleset to all threads of
> + *    the current process.
> + */
> +/* clang-format off */
> +#define LANDLOCK_RESTRICT_SELF_TSYNC			(1U << 0)
> +/* clang-format on */
> +
>  /**
>   * enum landlock_rule_type - Landlock rule type
>   *
> diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> index a9760d252fc2d..63792a6cde5ca 100644
> --- a/security/landlock/syscalls.c
> +++ b/security/landlock/syscalls.c
> @@ -23,6 +23,7 @@
>  #include <linux/security.h>
>  #include <linux/stddef.h>
>  #include <linux/syscalls.h>
> +#include <linux/task_work.h>
>  #include <linux/types.h>
>  #include <linux/uaccess.h>
>  #include <uapi/linux/landlock.h>
> @@ -425,17 +426,233 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
>  
>  /* Enforcement */
>  
> +/*
> + * Shared state between multiple threads which are enforcing Landlock rulesets
> + * in lockstep with each other.
> + */
> +struct landlock_tsync_shared_context {
> +	/* Expected existing Landlock domain on every thread. */
> +	struct landlock_ruleset *old_dom;
> +
> +	/* Replacement Landlock domain to be applied if prerequisites are met */
> +	struct landlock_ruleset *new_dom;
> +
> +	/* Barrier after preparation step. */
> +	atomic_t num_preparing;
> +	struct completion all_prepared;
> +
> +	/* An error encountered in preparation step, or 0. */
> +	atomic_t preparation_error;
> +
> +	/*
> +	 * Barrier after commit step (used by syscall impl to wait for
> +	 * completion).
> +	 */
> +	atomic_t num_unfinished;
> +	struct completion all_finished;
> +};
> +
> +struct landlock_tsync_context {
> +	struct callback_head work;
> +	struct landlock_tsync_shared_context *ctx;
> +};
> +
> +static struct cred *
> +landlock_prepare_creds(const struct landlock_ruleset *old_dom,
> +		       struct landlock_ruleset *new_dom)
> +{
> +	struct cred *new_cred;
> +	struct landlock_cred_security *new_llcred;
> +
> +	/*
> +	 * Similar checks as for seccomp(2), except that an -EPERM may be
> +	 * returned.
> +	 */
> +	if (!task_no_new_privs(current) &&
> +	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> +		return ERR_PTR(-EPERM);

This should only be done for the calling thread, so just reusing the
existing code from landlock_restrict_self().

If the no_new_privs property is set for the current thread, it would
make sense to just set it for all other threads once we are sure we can
sandbox all of them (see seccomp_sync_threads).

For this we need to have two for_each_thread loops: one to check that
the current thread is legitimate to sandbox all the other threads and
make sure no error could arise (e.g. memory allocation), and another
loop to safely enforce the restrictions if the first one was OK.

The handling of the following prepare_cred()'s failure is the main
issue, but see my proposal below.

> +
> +	/* Prepares new credentials. */
> +	new_cred = prepare_creds();
> +	if (!new_cred)
> +		return ERR_PTR(-ENOMEM);

Instead of creating one new credential per thread, we should use the
newly created one (from landlock_restrict_self) and increment its
references.  In most cases that would avoid duplicating information and
wasting memory by creating one struct cred per thread.

> +
> +	new_llcred = landlock_cred(new_cred);
> +	if (new_llcred->domain != old_dom) {
> +		abort_creds(new_cred);
> +		return ERR_PTR(-ESRCH);

To avoid dealing with prepare_creds()'s error, I think we could just
check that all threads have (initially) the same credentials, and either
swap this same old credential with the new one (created by the caller),
or swap the Landlock domain (and set no_new_privs) if the credentials
changed between the initial check and now.  Actually it might be up to
three loops:

(lock cred_guard_mutex)

1. Check that all threads use the same credential, or abort the whole
   operation otherwise.

2. If the caller has no_new_privs (and passed the Landlock checks), set
   no_new_privs to all threads.

3. Run a work_task on each thread that will either:
   - if old thread's credential is still the same as the caller's
     credentials, swap it with the new one with get_cred()/put_cred();
   - otherwise, this means that the thread changed its credential after
     the first loop but because we set no_new_privs, this alternative
     credential should not give more privileges and we should be safe to
     only swap the Landlock domain with the new one (if different).
     Indeed, this alternative credential might be used by other threads
     from the same process, but AFAIK not by threads from other
     processes thanks to cred_guard_mutex.  This would require to bypass
     the const pointer though.  Holding cred_guard_mutex will also make
     sure that domains were not updated under our feet by another call
     to landlock_restrict_self().

(unlock cred_guard_mutex)
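
To make the ordering concrete, a rough sketch of that flow (assuming
new_cred was already prepared by the caller; the cred-swap helper name
is made up, and the RCU/tasklist locking around for_each_thread() is
omitted):

  struct task_struct *thread;
  int err = 0;

  mutex_lock(&current->signal->cred_guard_mutex);

  /* 1. Abort unless all threads share the caller's credentials. */
  for_each_thread(current, thread) {
          if (__task_cred(thread) != current_cred()) {
                  err = -ESRCH;
                  goto out_unlock;
          }
  }

  /* 2. The caller already passed the no_new_privs/CAP_SYS_ADMIN check,
   *    so propagate no_new_privs to the other threads if it is set. */
  if (task_no_new_privs(current))
          for_each_thread(current, thread)
                  task_set_no_new_privs(thread);

  /* 3. Queue a task_work on each thread that swaps in new_cred, or only
   *    the Landlock domain if that thread's cred changed since step 1.
   *    (Made-up helper.) */
  for_each_thread(current, thread)
          queue_landlock_cred_swap(thread, new_cred);

out_unlock:
  mutex_unlock(&current->signal->cred_guard_mutex);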

Another approach would be to leverage an LSM hook to deny credential
changes for a thread while we are changing them for the whole process,
but I'm not sure it would be better.

Kees, Paul, Jann, any better idea?

> +	}
> +
> +	/* Replaces the old (prepared) domain. */
> +	landlock_put_ruleset(new_llcred->domain);
> +	landlock_get_ruleset(new_dom);
> +	new_llcred->domain = new_dom;
> +
> +	return new_cred;
> +}
> +
> +/*
> + * restrict_one_thread - update a thread's Landlock domain in lockstep with the
> + * other threads in the same process.
> + *
> + * When this is run, the same function gets run in all other threads in the same
> + * process.  The concurrently running function invocations coordinate to wait
> + * until they are all done with the preparation step and have reported back
> + * errors from the preparation step, if necessary.
> + *
> + * Afterwards, depending on the presence of an error, all threads either commit
> + * or abort the prepared credentials.  The commit operation can not fail any more.
> + */
> +static void restrict_one_thread(struct landlock_tsync_shared_context *ctx)
> +{
> +	int res;
> +	struct cred *cred = landlock_prepare_creds(ctx->old_dom, ctx->new_dom);
> +
> +	/* On error, set the error and continue. (Do not return early.) */
> +	if (IS_ERR(cred))
> +		atomic_set(&ctx->preparation_error, PTR_ERR(cred));
> +
> +	/*
> +	 * Barrier: Wait until all threads are done preparing.
> +	 * After this point, we can have no more failures.
> +	 */
> +	if (atomic_dec_return(&ctx->num_preparing) == 0)
> +		complete_all(&ctx->all_prepared);
> +	wait_for_completion(&ctx->all_prepared);

The logic is OK but the side effect would be to block all threads, which
is an issue.  That's why we need at least two loops run by the calling
thread, and the other threads should not have to wait.

> +
> +	/*
> +	 * Abort the commit if any of the threads had an error.
> +	 * (It might also be this thread.)  Otherwise, commit.
> +	 */
> +	res = atomic_read(&ctx->preparation_error);
> +	if (res) {
> +		if (!IS_ERR(cred))
> +			abort_creds(cred);
> +	} else {
> +		commit_creds(cred);
> +	}
> +
> +	/* Notify the calling thread once all threads are done */
> +	if (atomic_dec_return(&ctx->num_unfinished) == 0)
> +		complete_all(&ctx->all_finished);
> +}
> +
> +/*
> + * restrict_one_thread_callback - task_work callback for restricting a thread
> + *
> + * Calls restrict_one_thread with the struct landlock_shared_tsync_context
> + * and frees up the per-work_task landlock_tsync_context afterwards.
> + */
> +static void restrict_one_thread_callback(struct callback_head *work)
> +{
> +	struct landlock_tsync_context *ctx =
> +		container_of(work, struct landlock_tsync_context, work);
> +	struct landlock_tsync_shared_context *sctx = ctx->ctx;
> +
> +	restrict_one_thread(sctx);
> +	kfree(ctx);
> +}
> +
> +/*
> + * restrict_all_threads - enables a Landlock policy for all threads
> + */
> +static int restrict_all_threads(const int ruleset_fd)
> +{
> +	int res;
> +	struct task_struct *thread, *caller;
> +	struct landlock_tsync_shared_context sctx;
> +	struct landlock_ruleset *ruleset;
> +
> +	/*
> +	 * XXX: Need to double check race conditions and deadlocks before
> +	 * merging this upstream. We probably need additional locking.
> +	 */
> +
> +	/* Starting with 1, as we're already counting current. */
> +	atomic_set(&sctx.num_preparing, 1);
> +	atomic_set(&sctx.num_unfinished, 1);
> +	init_completion(&sctx.all_prepared);
> +	init_completion(&sctx.all_finished);
> +	atomic_set(&sctx.preparation_error, 0);
> +	sctx.old_dom = landlock_get_current_domain();
> +
> +	/* Gets and checks the ruleset. */
> +	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
> +	if (IS_ERR(ruleset))
> +		return PTR_ERR(ruleset);
> +
> +	/*
> +	 * We pre-merge the domain for all threads,
> +	 * so that all threads end up with exactly the same domain.
> +	 */
> +	sctx.new_dom = landlock_merge_ruleset(sctx.old_dom, ruleset);
> +	landlock_put_ruleset(ruleset);
> +	if (IS_ERR(sctx.new_dom))
> +		return PTR_ERR(sctx.new_dom);

What is specific to this function should only start here.  The flag
checks (new ones are coming with the audit patch series), the ruleset
FD, the prepare_creds(), and the landlock_merge_ruleset() should be the
same.  We can initialize shared states from here.

> +
> +	landlock_get_ruleset(sctx.old_dom);
> +
> +	caller = current;
> +	for_each_thread(caller, thread) {
> +		/* Skip current, since it is initiating the sync. */
> +		if (thread == caller)
> +			continue;
> +
> +		/* Skip exited threads. */
> +		if (thread->flags & PF_EXITING)
> +			continue;
> +
> +		/* Deallocation is done by the task_work itself. */
> +		struct landlock_tsync_context *ctx =
> +			kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
> +		if (!ctx) {
> +			/*
> +			 * On error, keep already started threads from
> +			 * accidentally committing.  Do not start additional
> +			 * threads.
> +			 */
> +			atomic_set(&sctx.preparation_error, -ENOMEM);
> +			break;
> +		}
> +
> +		ctx->ctx = &sctx;
> +		atomic_inc(&sctx.num_preparing);
> +		atomic_inc(&sctx.num_unfinished);
> +		init_task_work(&ctx->work, restrict_one_thread_callback);
> +		res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> +		if (res) {
> +			atomic_set(&sctx.preparation_error, res);
> +			atomic_dec(&sctx.num_preparing);
> +			atomic_dec(&sctx.num_unfinished);
> +		}
> +	}
> +
> +	/* and do the same on the current thread */
> +	restrict_one_thread(&sctx);
> +
> +	res = atomic_read(&sctx.preparation_error);
> +	wait_for_completion(&sctx.all_finished);
> +	landlock_put_ruleset(sctx.new_dom);
> +	landlock_put_ruleset(sctx.old_dom);
> +	return res;
> +}
> +
>  /**
>   * sys_landlock_restrict_self - Enforce a ruleset on the calling thread
>   *
>   * @ruleset_fd: File descriptor tied to the ruleset to merge with the target.
> - * @flags: Must be 0.
> + * @flags: Flags to modify the behavior.
>   *
> - * This system call enables to enforce a Landlock ruleset on the current
> - * thread.  Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> + * This system call enforces a Landlock ruleset on the current thread.
> + * Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
>   * namespace or is running with no_new_privs.  This avoids scenarios where
>   * unprivileged tasks can affect the behavior of privileged children.
>   *
> + * If %LANDLOCK_RESTRICT_SELF_TSYNC is specified in @flags, the ruleset will be
> + * applied to all threads of the current process.  For this, all threads must be
> + * in the same Landlock domain and fulfill the normal requirements for enforcing
> + * a Landlock ruleset.
> + *
>   * Possible returned errors are:
>   *
>   * - %EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time;
> @@ -447,6 +664,8 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
>   *   %CAP_SYS_ADMIN in its namespace.
>   * - %E2BIG: The maximum number of stacked rulesets is reached for the current
>   *   thread.
> + * - %ESRCH: When called with %LANDLOCK_RESTRICT_SELF_TSYNC, the processes'
> + *   threads were in different Landlock domains.
>   */
>  SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
>  		flags)
> @@ -467,10 +686,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
>  	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
>  		return -EPERM;
>  
> -	/* No flag for now. */
> -	if (flags)
> +	/* TSYNC is the only supported flag. */
> +	if (flags & ~LANDLOCK_RESTRICT_SELF_TSYNC)
>  		return -EINVAL;
>  
> +	if (flags & LANDLOCK_RESTRICT_SELF_TSYNC)
> +		return restrict_all_threads(ruleset_fd);
> +

As explained above, restrict_all_threads()'s logic should be called
later.

Starting here, as seccomp does, we can lock siglock (whatever flags are
used) to be sure there is not concurrent call to
landlock_restrict_self() on other threads of the same process.

If LANDLOCK_RESTRICT_SELF_TSYNC is set, we also need to lock
cred_guard_mutex to avoid race conditions with concurrent execve calls.

>  	/* Gets and checks the ruleset. */
>  	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
>  	if (IS_ERR(ruleset))
> -- 
> 2.48.1
> 
> 


* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-02-27 20:53   ` Mickaël Salaün
@ 2025-02-28 17:33     ` Günther Noack
  2025-03-04 20:25       ` Mickaël Salaün
  0 siblings, 1 reply; 18+ messages in thread
From: Günther Noack @ 2025-02-28 17:33 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Kees Cook, Paul Moore, linux-security-module,
	Konstantin Meskhidze, Jann Horn, linux-kernel, Peter Newman,
	David Howells

Hello!

Thanks for the review!

I'm adding David Howells to this thread as well.  David, maybe you can
help us and suggest an appropriate way to update the struct cred across
multiple threads?

On Thu, Feb 27, 2025 at 09:53:11PM +0100, Mickaël Salaün wrote:
> On Fri, Feb 21, 2025 at 07:44:17PM +0100, Günther Noack wrote:
> > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag.  With this flag, a
> > given Landlock ruleset is applied to all threads of the calling
> > process, instead of only the current one.
> 
> I initially though that we may pick another flag name but I now think
> reusing the same name as for seccomp would be less confusing for users.

+1, under the assumption that we implement a "syncing" behaviour here,
I also think that the name is good.


> > diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> > index a9760d252fc2d..63792a6cde5ca 100644
> > --- a/security/landlock/syscalls.c
> > +++ b/security/landlock/syscalls.c

> > +static struct cred *
> > +landlock_prepare_creds(const struct landlock_ruleset *old_dom,
> > +		       struct landlock_ruleset *new_dom)
> > +{
> > +	struct cred *new_cred;
> > +	struct landlock_cred_security *new_llcred;
> > +
> > +	/*
> > +	 * Similar checks as for seccomp(2), except that an -EPERM may be
> > +	 * returned.
> > +	 */
> > +	if (!task_no_new_privs(current) &&
> > +	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> > +		return ERR_PTR(-EPERM);
> 
> This should only be done for the calling thread, so just reusing the
> existing code from landlock_restrict_self().
> 
> If the no_new_privs property is set for the current thread, it would
> make sense to just set it for all other threads once we are sure we can
> sandbox all of them (see seccomp_sync_threads).

(See discussion below regarding NO_NEW_PRIVS)


> For this we need to have two for_each_thread loops: one to check that
> the current thread is legitimate to sandbox all the other threads and
> make sure no error could arise (e.g. memory allocation), and another
> loop to safely enforce the restrictions if the first one was OK.
> 
> The handling of the following prepare_cred()'s failure is the main
> issue, but see my proposal below.
> 
> > +
> > +	/* Prepares new credentials. */
> > +	new_cred = prepare_creds();
> > +	if (!new_cred)
> > +		return ERR_PTR(-ENOMEM);
> 
> Instead of creating one new credential per thread, we should use the
> newly created one (from landlock_restrict_self) and increment its
> references.  In most cases that would avoid duplicating information and
> wasting memory by creating one struct cred per thread.

struct cred contains all kinds of other credentials as well, besides
Landlock and NO_NEW_PRIVS. *In normal scenarios* I believe that the
struct cred is the same for all threads, but this is not guaranteed.

We can make it safe by requiring that all threads point to the
same struct cred, as you are suggesting below.  But how are users
supposed to reason about whether all threads share the same struct
cred?  To my knowledge, it is an implementation detail of the kernel.
We can't expect callers to understand what properties belong to struct
cred, and that list of properties can potentially change in the future
as well.

This is why in the proposal I sent, the only thing that matters about
the preexisting struct cred is the Landlock domain which is stored in
it.  This is a lower bar to meet and seems more appropriate for this
Landlock-specific syscall.


> > +	new_llcred = landlock_cred(new_cred);
> > +	if (new_llcred->domain != old_dom) {
> > +		abort_creds(new_cred);
> > +		return ERR_PTR(-ESRCH);
> 
> To avoid dealing with prepare_creds()'s error, I think we could just
> check that all threads have (initially) the same credentials, and either
> swap this same old credential with the new one (created by the caller),
> or swap the Landlock domain (and set no_new_privs) if the credentials
> changed between the initial check and now.

This would be working around the API documented in
Documentation/security/credentials.rst, where the affected task should
do its own prepare_creds()/abort_creds()/commit_creds() dance.

After seeing how it had turned out in the keyctl case with an
additional LSM hook, I was hoping to use the suggested API and do
things as intended here.  prepare_creds() and commit_creds() do more
than just switching pointers and refcounts, it is not clear to me that
prepare_creds() on one task and commit_creds() on another would work
with the same struct cred.


> Actually it might be up to
> three loops:
> 
> (lock cred_guard_mutex)

FWIW, Peter Newman (CC'd) also pointed me at the tasklist lock, which
is also used in Seccomp.  I believe we would have to hold this lock
from *before* starting to iterate over the list of tasks until *after*
committing the last credentials, so that we don't spawn new tasks in
the middle of the operation which might still inherit the old
credentials.  (The drawback is that it is a global lock.)


> 1. Check that all threads use the same credential, or abort the whole
>    operation otherwise.
> 
> 2. If the caller has no_new_privs (and passed the Landlock checks), set
>    no_new_privs to all threads.

I am still conflicted about this.  This seems like a too implicit way
to set NO_NEW_PRIVS and might surprise callers.  After all, we are not
setting it in the single-threaded landlock_restrict_self() case
either.


> 3. Run a work_task on each thread that will either:
>    - if old thread's credential is still the same as the caller's
>      credentials, swap it with the new one with get_cred()/put_cred();
>    - otherwise, this means that the thread changed its credential after
>      the first loop but because we set no_new_privs, this alternative
>      credential should not give more privileges and we should be safe to
>      only swap the Landlock domain with the new one (if different).
>      Indeed, this alternative credential might be used by other threads
>      from the same process, but AFAIK not by threads from other
>      processes thanks to cred_guard_mutex.  This would require to bypass
>      the const pointer though.  Holding cred_guard_mutex will also make
>      sure that domains were not updated under our feet by another call
>      to landlock_restrict_self().

Hmm, in the second case we'd still have to use the
prepare_creds()/abort_creds()/commit_creds() API with all of its
potential failure modes.  Or, we would have to swizzle out contents of
struct cred directly, but that would be side-stepping that API again.

I would prefer not to side-step that cred-updating API, given how it
turned out in the keyctl case.  If it is absolutely necessary to
change struct cred differently, I would rather introduce a new
"official" way to change struct cred than to work around it from the
outside without David Howells' knowledge.  But I'd prefer to do it
with the existing update API, if we can.


> (unlock cred_guard_mutex)
> 
> Another approach would be to leverage an LSM hook to deny credential
> changes for a thread while we are changing them for the whole process,
> but I'm not sure it would be better.
> 
> Kees, Paul, Jann, any better idea?
> 
> > +	}
> > +
> > +	/* Replaces the old (prepared) domain. */
> > +	landlock_put_ruleset(new_llcred->domain);
> > +	landlock_get_ruleset(new_dom);
> > +	new_llcred->domain = new_dom;
> > +
> > +	return new_cred;
> > +}
> > +
> > +/*
> > + * restrict_one_thread - update a thread's Landlock domain in lockstep with the
> > + * other threads in the same process.
> > + *
> > + * When this is run, the same function gets run in all other threads in the same
> > + * process.  The concurrently running function invocations coordinate to wait
> > + * until they are all done with the preparation step and have reported back
> > + * errors from the preparation step, if necessary.
> > + *
> > + * Afterwards, depending on the presence of an error, all threads either commit
> > + * or abort the prepared credentials.  The commit operation can not fail any more.
> > + */
> > +static void restrict_one_thread(struct landlock_tsync_shared_context *ctx)
> > +{
> > +	int res;
> > +	struct cred *cred = landlock_prepare_creds(ctx->old_dom, ctx->new_dom);
> > +
> > +	/* On error, set the error and continue. (Do not return early.) */
> > +	if (IS_ERR(cred))
> > +		atomic_set(&ctx->preparation_error, PTR_ERR(cred));
> > +
> > +	/*
> > +	 * Barrier: Wait until all threads are done preparing.
> > +	 * After this point, we can have no more failures.
> > +	 */
> > +	if (atomic_dec_return(&ctx->num_preparing) == 0)
> > +		complete_all(&ctx->all_prepared);
> > +	wait_for_completion(&ctx->all_prepared);
> 
> The logic is OK but the side effect would be to block all threads, which
> is an issue.  That's why we need at least two loops run by the calling
> thread, and the other threads should not have to wait.

If we do "prepare" and "commit" in two different loops, we'd still
need to start a "task_work" for the threads in the "commit" loop so
that they update their struct cred.  And if the struct cred changed in
the meantime between the two loops, what do we do?  If we want
all-or-nothing semantics, we'd still have to check for modification in
the task_work and synchronize with a barrier between all tasks, right?
Doesn't that have the same problems?  (I would also expect that
wait-time-wise, the "landlock_prepare_creds" part would be quicker
than waiting for the "all_prepared" barrier, so this doesn't seem to
make a big difference?)

> 
> > +
> > +	/*
> > +	 * Abort the commit if any of the threads had an error.
> > +	 * (It might also be this thread.)  Otherwise, commit.
> > +	 */
> > +	res = atomic_read(&ctx->preparation_error);
> > +	if (res) {
> > +		if (!IS_ERR(cred))
> > +			abort_creds(cred);
> > +	} else {
> > +		commit_creds(cred);
> > +	}
> > +
> > +	/* Notify the calling thread once all threads are done */
> > +	if (atomic_dec_return(&ctx->num_unfinished) == 0)
> > +		complete_all(&ctx->all_finished);
> > +}
> > +
> > +/*
> > + * restrict_one_thread_callback - task_work callback for restricting a thread
> > + *
> > + * Calls restrict_one_thread with the struct landlock_shared_tsync_context
> > + * and frees up the per-work_task landlock_tsync_context afterwards.
> > + */
> > +static void restrict_one_thread_callback(struct callback_head *work)
> > +{
> > +	struct landlock_tsync_context *ctx =
> > +		container_of(work, struct landlock_tsync_context, work);
> > +	struct landlock_tsync_shared_context *sctx = ctx->ctx;
> > +
> > +	restrict_one_thread(sctx);
> > +	kfree(ctx);
> > +}
> > +
> > +/*
> > + * restrict_all_threads - enables a Landlock policy for all threads
> > + */
> > +static int restrict_all_threads(const int ruleset_fd)
> > +{
> > +	int res;
> > +	struct task_struct *thread, *caller;
> > +	struct landlock_tsync_shared_context sctx;
> > +	struct landlock_ruleset *ruleset;
> > +
> > +	/*
> > +	 * XXX: Need to double check race conditions and deadlocks before
> > +	 * merging this upstream. We probably need additional locking.
> > +	 */
> > +
> > +	/* Starting with 1, as we're already counting current. */
> > +	atomic_set(&sctx.num_preparing, 1);
> > +	atomic_set(&sctx.num_unfinished, 1);
> > +	init_completion(&sctx.all_prepared);
> > +	init_completion(&sctx.all_finished);
> > +	atomic_set(&sctx.preparation_error, 0);
> > +	sctx.old_dom = landlock_get_current_domain();
> > +
> > +	/* Gets and checks the ruleset. */
> > +	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
> > +	if (IS_ERR(ruleset))
> > +		return PTR_ERR(ruleset);
> > +
> > +	/*
> > +	 * We pre-merge the domain for all threads,
> > +	 * so that all threads end up with exactly the same domain.
> > +	 */
> > +	sctx.new_dom = landlock_merge_ruleset(sctx.old_dom, ruleset);
> > +	landlock_put_ruleset(ruleset);
> > +	if (IS_ERR(sctx.new_dom))
> > +		return PTR_ERR(sctx.new_dom);
> 
> What is specific to this function should only start here.  The flag
> checks (new ones are coming with the audit patch series), the ruleset
> FD, the prepare_creds(), and the landlock_merge_ruleset() should be the
> same.  We can initialize shared states from here.

I'm not sure I understand fully; is the suggestion that I reorder the
work at the top of restrict_all_threads() so that the parts that are
also done in the single-threaded case can be factored out and done in
the calling function?

> 
> > +
> > +	landlock_get_ruleset(sctx.old_dom);
> > +
> > +	caller = current;
> > +	for_each_thread(caller, thread) {
> > +		/* Skip current, since it is initiating the sync. */
> > +		if (thread == caller)
> > +			continue;
> > +
> > +		/* Skip exited threads. */
> > +		if (thread->flags & PF_EXITING)
> > +			continue;
> > +
> > +		/* Deallocation is done by the task_work itself. */
> > +		struct landlock_tsync_context *ctx =
> > +			kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
> > +		if (!ctx) {
> > +			/*
> > +			 * On error, keep already started threads from
> > +			 * accidentally committing.  Do not start additional
> > +			 * threads.
> > +			 */
> > +			atomic_set(&sctx.preparation_error, -ENOMEM);
> > +			break;
> > +		}
> > +
> > +		ctx->ctx = &sctx;
> > +		atomic_inc(&sctx.num_preparing);
> > +		atomic_inc(&sctx.num_unfinished);
> > +		init_task_work(&ctx->work, restrict_one_thread_callback);
> > +		res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> > +		if (res) {
> > +			atomic_set(&sctx.preparation_error, res);
> > +			atomic_dec(&sctx.num_preparing);
> > +			atomic_dec(&sctx.num_unfinished);
> > +		}
> > +	}
> > +
> > +	/* and do the same on the current thread */
> > +	restrict_one_thread(&sctx);
> > +
> > +	res = atomic_read(&sctx.preparation_error);
> > +	wait_for_completion(&sctx.all_finished);
> > +	landlock_put_ruleset(sctx.new_dom);
> > +	landlock_put_ruleset(sctx.old_dom);
> > +	return res;
> > +}
> > +
> >  /**
> >   * sys_landlock_restrict_self - Enforce a ruleset on the calling thread
> >   *
> >   * @ruleset_fd: File descriptor tied to the ruleset to merge with the target.
> > - * @flags: Must be 0.
> > + * @flags: Flags to modify the behavior.
> >   *
> > - * This system call enables to enforce a Landlock ruleset on the current
> > - * thread.  Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> > + * This system call enforces a Landlock ruleset on the current thread.
> > + * Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> >   * namespace or is running with no_new_privs.  This avoids scenarios where
> >   * unprivileged tasks can affect the behavior of privileged children.
> >   *
> > + * If %LANDLOCK_RESTRICT_SELF_TSYNC is specified in @flags, the ruleset will be
> > + * applied to all threads of the current process.  For this, all threads must be
> > + * in the same Landlock domain and fulfill the normal requirements for enforcing
> > + * a Landlock ruleset.
> > + *
> >   * Possible returned errors are:
> >   *
> >   * - %EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time;
> > @@ -447,6 +664,8 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
> >   *   %CAP_SYS_ADMIN in its namespace.
> >   * - %E2BIG: The maximum number of stacked rulesets is reached for the current
> >   *   thread.
> > + * - %ESRCH: When called with %LANDLOCK_RESTRICT_SELF_TSYNC, the process's
> > + *   threads were in different Landlock domains.
> >   */
> >  SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> >  		flags)
> > @@ -467,10 +686,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> >  	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> >  		return -EPERM;
> >  
> > -	/* No flag for now. */
> > -	if (flags)
> > +	/* TSYNC is the only supported flag. */
> > +	if (flags & ~LANDLOCK_RESTRICT_SELF_TSYNC)
> >  		return -EINVAL;
> >  
> > +	if (flags & LANDLOCK_RESTRICT_SELF_TSYNC)
> > +		return restrict_all_threads(ruleset_fd);
> > +
> 
> As explained above, restrict_all_threads()'s logic should be called
> later.
> 
> Starting here, as seccomp does, we can lock siglock (whatever flags are
> used) to be sure there is not concurrent call to
> landlock_restrict_self() on other threads of the same process.
> 
> If LANDLOCK_RESTRICT_SELF_TSYNC is set, we also need to lock
> cred_guard_mutex to avoid race conditions with concurrent execve calls.

I was confused about this lock, but found the remark in the seccomp
logic again as well, which says:

  /*
   * Make sure we cannot change seccomp or nnp state via TSYNC
   * while another thread is in the middle of calling exec.
   */

In my understanding, cred_guard_mutex gets locked in the early stages of
an execve() that prepares the "bprm".  If I understand this correctly,
this is the scenario where the other thread is starting to run
execve() (about to replace the entire process), but at the same time
it still looks like a normal thread that belongs to the process.  Is
that right?
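
If that is right, I imagine the TSYNC path in the syscall would simply
wrap the whole operation in that lock, along these lines (untested
sketch; the error value for the interrupted-lock case is only a
placeholder):

  if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
          int res;

          if (mutex_lock_killable(&current->signal->cred_guard_mutex))
                  return -EINTR;
          res = restrict_all_threads(ruleset_fd);
          mutex_unlock(&current->signal->cred_guard_mutex);
          return res;
  }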

> >  	/* Gets and checks the ruleset. */
> >  	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
> >  	if (IS_ERR(ruleset))
> > -- 
> > 2.48.1
> > 
> > 

Thanks for the review and for keeping the discussion at the more
conceptual level for now!

–Günther

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-02-28 17:33     ` Günther Noack
@ 2025-03-04 20:25       ` Mickaël Salaün
  2025-03-10 13:04         ` Günther Noack
  0 siblings, 1 reply; 18+ messages in thread
From: Mickaël Salaün @ 2025-03-04 20:25 UTC (permalink / raw)
  To: Günther Noack, David Howells
  Cc: Kees Cook, Paul Moore, linux-security-module,
	Konstantin Meskhidze, Jann Horn, linux-kernel, Peter Newman

On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> Hello!
> 
> Thanks for the review!
> 
> I'm adding David Howells to this thread as well.  David, maybe you can
> help us and suggest an appropriate way to update the struct cred across
> multiple threads?
> 
> On Thu, Feb 27, 2025 at 09:53:11PM +0100, Mickaël Salaün wrote:
> > On Fri, Feb 21, 2025 at 07:44:17PM +0100, Günther Noack wrote:
> > > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag.  With this flag, a
> > > given Landlock ruleset is applied to all threads of the calling
> > > process, instead of only the current one.
> > 
> > I initially though that we may pick another flag name but I now think
> > reusing the same name as for seccomp would be less confusing for users.
> 
> +1, under the assumption that we implement a "syncing" behaviour here,
> I also think that the name is good.
> 
> 
> > > diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> > > index a9760d252fc2d..63792a6cde5ca 100644
> > > --- a/security/landlock/syscalls.c
> > > +++ b/security/landlock/syscalls.c
> 
> > > +static struct cred *
> > > +landlock_prepare_creds(const struct landlock_ruleset *old_dom,
> > > +		       struct landlock_ruleset *new_dom)
> > > +{
> > > +	struct cred *new_cred;
> > > +	struct landlock_cred_security *new_llcred;
> > > +
> > > +	/*
> > > +	 * Similar checks as for seccomp(2), except that an -EPERM may be
> > > +	 * returned.
> > > +	 */
> > > +	if (!task_no_new_privs(current) &&
> > > +	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> > > +		return ERR_PTR(-EPERM);
> > 
> > This should only be done for the calling thread, so just reusing the
> > existing code from landlock_restrict_self().
> > 
> > If the no_new_privs property is set for the current thread, it would
> > make sense to just set it for all other threads once we are sure we can
> > sandbox all of them (see seccomp_sync_threads).
> 
> (See discussion below regarding NO_NEW_PRIVS)
> 
> 
> > For this we need to have two for_each_thread loops: one to check that
> > the current thread is legitimate to sandbox all the other threads and
> > make sure no error could arise (e.g. memory allocation), and another
> > loop to safely enforce the restrictions if the first one was OK.
> > 
> > The handling of the following prepare_cred()'s failure is the main
> > issue, but see my proposal below.
> > 
> > > +
> > > +	/* Prepares new credentials. */
> > > +	new_cred = prepare_creds();
> > > +	if (!new_cred)
> > > +		return ERR_PTR(-ENOMEM);
> > 
> > Instead of creating one new credential per thread, we should use the
> > newly created one (from landlock_restrict_self) and increment its
> > references.  In most cases that would avoid duplicating information and
> > wasting memory by creating one struct cred per thread.
> 
> struct cred contains all kinds of other credentials as well, besides
> Landlock and NO_NEW_PRIVS.

no_new_privs is stored per task, not per cred, but that does not change
much.

> *In normal scenarios* I believe that the
> struct cred is the same for all threads, but this is not guaranteed.

Yes

> 
> We can make it safe by requiring that all threads must point to the
> same struct cred, as you are suggesting below.  But how are users
> supposed to reason about whether all threads share the same struct
> cred?  To my knowledge, it is an implementation detail of the kernel.
> We can't expect callers to understand what properties belong to struct
> cred, and that list of properties can potentially change in the future
> as well.

That makes sense.

> 
> This is why in the proposal I sent, the only thing that matters about
> the preexisting struct cred is the Landlock domain which is stored in
> it.  This is a lower bar to meet and seems more appropriate for this
> Landlock-specific syscall.

My point was to *opportunistically* avoid useless cred allocation in the
case where the thread's old cred is the same as the caller's old cred.
There is no issue with that and it would avoid useless cred duplicates.

Checking the old/current Landlock domains should be a prerequisite
though, and user space would get a specific error code if not all
threads are in the same domain.

> 
> 
> > > +	new_llcred = landlock_cred(new_cred);
> > > +	if (new_llcred->domain != old_dom) {
> > > +		abort_creds(new_cred);
> > > +		return ERR_PTR(-ESRCH);
> > 
> > To avoid dealing with prepare_creds()'s error, I think we could just
> > check that all threads have (initially) the same credentials, and either
> > swap this same old credential with the new one (created by the caller),
> > or swap the Landlock domain (and set no_new_privs) if the credentials
> > changed between the initial check and now.
> 
> This would be working around the API documented in
> Documentation/security/credentials.rst, where the affected task should
> do its own prepare_creds()/abort_creds()/commit_creds() dance.

...and the creds could also be really read-only:
https://lore.kernel.org/all/20250203102809.1223255-1-kevin.brodsky@arm.com/

> 
> After seeing how it had turned out in the keyctl case with an
> additional LSM hook, I was hoping to use the suggested API and do
> things as intended here.  prepare_creds() and commit_creds() do more
> than just switching pointers and refcounts; it is not clear to me that
> prepare_creds() on one task and commit_creds() on another would work
> with the same struct cred.

Yes, let's use the proper API.

I'd really like to avoid blocking all threads because of prepare_creds()
potential errors though.

> 
> 
> > Actually it might be up to
> > three loops:
> > 
> > (lock cred_guard_mutex)
> 
> FWIW, Peter Newman (CC'd) also pointed me at the tasklist lock, which
> is also used in seccomp.  I believe we would also have to grab this
> one *before* starting to iterate over the list of tasks and release it
> only *after* committing the last credentials, so that we don't spawn
> new tasks in
> the middle of the operation, which might still inherit old
> credentials.  (The drawback is that it is a global lock.)

It would be a lock for Landlock and seccomp operations on one process,
which is OK.

> 
> 
> > 1. Check that all threads use the same credential, or abort the whole
> >    operation otherwise.
> > 
> > 2. If the caller has no_new_privs (and passed the Landlock checks), set
> >    no_new_privs to all threads.
> 
> I am still conflicted about this.  This seems like a too implicit way
> to set NO_NEW_PRIVS and might surprise callers.  After all, we are not
> setting it in the single-threaded landlock_restrict_self() case
> either.

seccomp does it, and I think it makes sense and simplifies usage.  It
should be documented though.

> 
> 
> > 3. Run a work_task on each thread that will either:
> >    - if old thread's credential is still the same as the caller's
> >      credentials, swap it with the new one with get_cred()/put_cred();
> >    - otherwise, this means that the thread changed its credential after
> >      the first loop but because we set no_new_privs, this alternative
> >      credential should not give more privileges and we should be safe to
> >      only swap the Landlock domain with the new one (if different).
> >      Indeed, this alternative credential might be used by other threads
> >      from the same process, but AFAIK not by threads from other
> >      processes thanks to cred_guard_mutex.  This would require to bypass
> >      the const pointer though.  Holding cred_guard_mutex will also make
> >      sure that domains were not updated under our feet by another call
> >      to landlock_restrict_self().
> 
> Hmm, in the second case we'd still have to use the
> prepare_creds()/abort_creds()/commit_creds() API with all of its
> potential failure modes.  Or, we would have to swizzle out contents of
> struct cred directly, but that would be side-stepping that API again.
> 
> I would prefer not to side-step that cred-updating API, given how it
> turned out in the keyctl case.  If it is absolutely necessary to
> change struct cred differently, I would rather introduce a new
> "official" way to change struct cred than to work around it from the
> outside without David Howells' knowledge.  But I'd prefer to do it
> with the existing update API, if we can.

So yes, extending the cred API would be an option but maybe not a good
one.

> 
> 
> > (unlock cred_guard_mutex)
> > 
> > Another approach would be to leverage an LSM hook to deny credential
> > changes for a thread while we are changing them for the whole process,
> > but I'm not sure it would be better.
> > 
> > Kees, Paul, Jann, any better idea?

What about a credential hook being used to run the equivalent of the
task_work if a thread tries to change its credentials in a racy way?
The hook implementation would check whether the group leader of current
is being Landlocked+TSYNC'd, and if that is the case, the hook would
itself call the task_work before continuing execution, letting the
original credential update be applied *after* the Landlock update.
If this happens, the real task_work will just do nothing once it sees
that the Landlock domain is already the new one.

> > 
> > > +	}
> > > +
> > > +	/* Replaces the old (prepared) domain. */
> > > +	landlock_put_ruleset(new_llcred->domain);
> > > +	landlock_get_ruleset(new_dom);
> > > +	new_llcred->domain = new_dom;
> > > +
> > > +	return new_cred;
> > > +}
> > > +
> > > +/*
> > > + * restrict_one_thread - update a thread's Landlock domain in lockstep with the
> > > + * other threads in the same process.
> > > + *
> > > + * When this is run, the same function gets run in all other threads in the same
> > > + * process.  The concurrently running function invocations coordinate to wait
> > > + * until they are all done with the preparation step and have reported back
> > > + * errors from the preparation step, if necessary.
> > > + *
> > > + * Afterwards, depending on the presence of an error, all threads either commit
> > > + * or abort the prepared credentials.  The commit operation can not fail any more.
> > > + */
> > > +static void restrict_one_thread(struct landlock_tsync_shared_context *ctx)
> > > +{
> > > +	int res;
> > > +	struct cred *cred = landlock_prepare_creds(ctx->old_dom, ctx->new_dom);
> > > +
> > > +	/* On error, set the error and continue. (Do not return early.) */
> > > +	if (IS_ERR(cred))
> > > +		atomic_set(&ctx->preparation_error, PTR_ERR(cred));
> > > +
> > > +	/*
> > > +	 * Barrier: Wait until all threads are done preparing.
> > > +	 * After this point, we can have no more failures.
> > > +	 */
> > > +	if (atomic_dec_return(&ctx->num_preparing) == 0)
> > > +		complete_all(&ctx->all_prepared);
> > > +	wait_for_completion(&ctx->all_prepared);
> > 
> > The logic is OK but the side effect would be to block all threads, which
> > is an issue.  That's why we need at least two loops run by the calling
> > thread, and the other threads should not have to wait.
> 
> If we do "prepare" and "commit" in two different loops, we'd still
> need to start a "task_work" for the threads in the "commit" loop so
> that they update their struct cred.  And if the struct cred changed in
> the meantime between the two loops, what do we do?  If we want
> all-or-nothing semantics, we'd still have to check for modification in
> the task_work and synchronize with a barrier between all tasks, right?
> Doesn't that have the same problems?  (I would also expect that
> wait-time-wise, the "landlock_prepare_creds" part would be quicker
> than waiting for the "all_prepared" barrier, so this doesn't seem to
> make a big difference?)

Leveraging the LSM hook and preparing a set of creds ahead of time (if
not the same as the current one) should enable Landlock to avoid locking
all threads and to avoid any race conditions.

> 
> > 
> > > +
> > > +	/*
> > > +	 * Abort the commit if any of the threads had an error.
> > > +	 * (It might also be this thread.)  Otherwise, commit.
> > > +	 */
> > > +	res = atomic_read(&ctx->preparation_error);
> > > +	if (res) {
> > > +		if (!IS_ERR(cred))
> > > +			abort_creds(cred);
> > > +	} else {
> > > +		commit_creds(cred);
> > > +	}
> > > +
> > > +	/* Notify the calling thread once all threads are done */
> > > +	if (atomic_dec_return(&ctx->num_unfinished) == 0)
> > > +		complete_all(&ctx->all_finished);
> > > +}
> > > +
> > > +/*
> > > + * restrict_one_thread_callback - task_work callback for restricting a thread
> > > + *
> > > + * Calls restrict_one_thread with the struct landlock_tsync_shared_context
> > > + * and frees up the per-task_work landlock_tsync_context afterwards.
> > > + */
> > > +static void restrict_one_thread_callback(struct callback_head *work)
> > > +{
> > > +	struct landlock_tsync_context *ctx =
> > > +		container_of(work, struct landlock_tsync_context, work);
> > > +	struct landlock_tsync_shared_context *sctx = ctx->ctx;
> > > +
> > > +	restrict_one_thread(sctx);
> > > +	kfree(ctx);
> > > +}
> > > +
> > > +/*
> > > + * restrict_all_threads - enables a Landlock policy for all threads
> > > + */
> > > +static int restrict_all_threads(const int ruleset_fd)
> > > +{
> > > +	int res;
> > > +	struct task_struct *thread, *caller;
> > > +	struct landlock_tsync_shared_context sctx;
> > > +	struct landlock_ruleset *ruleset;
> > > +
> > > +	/*
> > > +	 * XXX: Need to double check race conditions and deadlocks before
> > > +	 * merging this upstream. We probably need additional locking.
> > > +	 */
> > > +
> > > +	/* Starting with 1, as we're already counting current. */
> > > +	atomic_set(&sctx.num_preparing, 1);
> > > +	atomic_set(&sctx.num_unfinished, 1);
> > > +	init_completion(&sctx.all_prepared);
> > > +	init_completion(&sctx.all_finished);
> > > +	atomic_set(&sctx.preparation_error, 0);
> > > +	sctx.old_dom = landlock_get_current_domain();
> > > +
> > > +	/* Gets and checks the ruleset. */
> > > +	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
> > > +	if (IS_ERR(ruleset))
> > > +		return PTR_ERR(ruleset);
> > > +
> > > +	/*
> > > +	 * We pre-merge the domain for all threads,
> > > +	 * so that all threads end up with exactly the same domain.
> > > +	 */
> > > +	sctx.new_dom = landlock_merge_ruleset(sctx.old_dom, ruleset);
> > > +	landlock_put_ruleset(ruleset);
> > > +	if (IS_ERR(sctx.new_dom))
> > > +		return PTR_ERR(sctx.new_dom);
> > 
> > What is specific to this function should only start here.  The flag
> > checks (new ones are coming with the audit patch series), the ruleset
> > FD, the prepare_creds(), and the landlock_merge_ruleset() should be the
> > same.  We can initialize shared states from here.
> 
> I'm not sure I understand fully; is the suggestion that I reorder the
> work at the top of restrict_all_threads() so that the parts that are
> also done in the single-threaded case can be factored out and done in
> the calling function?

Yes, the initial part of landlock_restrict_self() should be reused as
much as possible in both cases.  The no_new_privs and capability checks
don't need to be duplicated and can stay in landlock_restrict_self().
The task_work would just need to check no_new_privs and call
task_set_no_new_privs(), but without any possible error.  The
prepare_creds() and landlock_merge_ruleset() will be called at least
> once but maybe several times, so we can move that into a dedicated helper.
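
As a rough sketch (the helper name is only illustrative), both paths
could share something like:

  static struct landlock_ruleset *
  merge_ruleset_from_fd(const int ruleset_fd, struct landlock_ruleset *old_dom)
  {
          struct landlock_ruleset *ruleset, *new_dom;

          ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
          if (IS_ERR(ruleset))
                  return ruleset;

          /* Pre-merge once, so that all threads get exactly the same domain. */
          new_dom = landlock_merge_ruleset(old_dom, ruleset);
          landlock_put_ruleset(ruleset);
          return new_dom;
  }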

 
> > 
> > > +
> > > +	landlock_get_ruleset(sctx.old_dom);
> > > +
> > > +	caller = current;
> > > +	for_each_thread(caller, thread) {
> > > +		/* Skip current, since it is initiating the sync. */
> > > +		if (thread == caller)
> > > +			continue;
> > > +
> > > +		/* Skip exited threads. */
> > > +		if (thread->flags & PF_EXITING)
> > > +			continue;
> > > +
> > > +		/* Deallocation is done by the task_work itself. */
> > > +		struct landlock_tsync_context *ctx =
> > > +			kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
> > > +		if (!ctx) {
> > > +			/*
> > > +			 * On error, keep already started threads from
> > > +			 * accidentally committing.  Do not start additional
> > > +			 * threads.
> > > +			 */
> > > +			atomic_set(&sctx.preparation_error, -ENOMEM);
> > > +			break;
> > > +		}
> > > +
> > > +		ctx->ctx = &sctx;
> > > +		atomic_inc(&sctx.num_preparing);
> > > +		atomic_inc(&sctx.num_unfinished);
> > > +		init_task_work(&ctx->work, restrict_one_thread_callback);
> > > +		res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> > > +		if (res) {
> > > +			atomic_set(&sctx.preparation_error, res);
> > > +			atomic_dec(&sctx.num_preparing);
> > > +			atomic_dec(&sctx.num_unfinished);
> > > +		}
> > > +	}
> > > +
> > > +	/* and do the same on the current thread */
> > > +	restrict_one_thread(&sctx);
> > > +
> > > +	res = atomic_read(&sctx.preparation_error);
> > > +	wait_for_completion(&sctx.all_finished);
> > > +	landlock_put_ruleset(sctx.new_dom);
> > > +	landlock_put_ruleset(sctx.old_dom);
> > > +	return res;
> > > +}
> > > +
> > >  /**
> > >   * sys_landlock_restrict_self - Enforce a ruleset on the calling thread
> > >   *
> > >   * @ruleset_fd: File descriptor tied to the ruleset to merge with the target.
> > > - * @flags: Must be 0.
> > > + * @flags: Flags to modify the behavior.
> > >   *
> > > - * This system call enables to enforce a Landlock ruleset on the current
> > > - * thread.  Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> > > + * This system call enforces a Landlock ruleset on the current thread.
> > > + * Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> > >   * namespace or is running with no_new_privs.  This avoids scenarios where
> > >   * unprivileged tasks can affect the behavior of privileged children.
> > >   *
> > > + * If %LANDLOCK_RESTRICT_SELF_TSYNC is specified in @flags, the ruleset will be
> > > + * applied to all threads of the current process.  For this, all threads must be
> > > + * in the same Landlock domain and fulfill the normal requirements for enforcing
> > > + * a Landlock ruleset.
> > > + *
> > >   * Possible returned errors are:
> > >   *
> > >   * - %EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time;
> > > @@ -447,6 +664,8 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
> > >   *   %CAP_SYS_ADMIN in its namespace.
> > >   * - %E2BIG: The maximum number of stacked rulesets is reached for the current
> > >   *   thread.
> > > + * - %ESRCH: When called with %LANDLOCK_RESTRICT_SELF_TSYNC, the process's
> > > + *   threads were in different Landlock domains.
> > >   */
> > >  SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> > >  		flags)
> > > @@ -467,10 +686,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> > >  	    !ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> > >  		return -EPERM;
> > >  
> > > -	/* No flag for now. */
> > > -	if (flags)
> > > +	/* TSYNC is the only supported flag. */
> > > +	if (flags & ~LANDLOCK_RESTRICT_SELF_TSYNC)
> > >  		return -EINVAL;
> > >  
> > > +	if (flags & LANDLOCK_RESTRICT_SELF_TSYNC)
> > > +		return restrict_all_threads(ruleset_fd);
> > > +
> > 
> > As explained above, restrict_all_threads()'s logic should be called
> > later.
> > 
> > Starting here, as seccomp does, we can lock siglock (whatever flags are
> > used) to be sure there is not concurrent call to
> > landlock_restrict_self() on other threads of the same process.
> > 
> > If LANDLOCK_RESTRICT_SELF_TSYNC is set, we also need to lock
> > cred_guard_mutex to avoid race conditions with concurrent execve calls.
> 
> I was confused about this lock, but found the remark in the seccomp
> logic again as well, which says:
> 
>   /*
>    * Make sure we cannot change seccomp or nnp state via TSYNC
>    * while another thread is in the middle of calling exec.
>    */
> 
> In my understanding, cred_guard_mutex gets locked in the early stages of
> an execve() that prepares the "bprm".  If I understand this correctly,
> this is the scenario where the other thread is starting to run
> execve() (about to replace the entire process), but at the same time
> it still looks like a normal thread that belongs to the process.  Is
> that right?

Yes, that's my understanding too.

> 
> > >  	/* Gets and checks the ruleset. */
> > >  	ruleset = get_ruleset_from_fd(ruleset_fd, FMODE_CAN_READ);
> > >  	if (IS_ERR(ruleset))
> > > -- 
> > > 2.48.1
> > > 
> > > 
> 
> Thanks for the review and for keeping the discussion at the more
> conceptual level for now!
> 
> –Günther
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-03-04 20:25       ` Mickaël Salaün
@ 2025-03-10 13:04         ` Günther Noack
  2025-03-11 14:32           ` Mickaël Salaün
  2025-05-27 14:26           ` Jann Horn
  0 siblings, 2 replies; 18+ messages in thread
From: Günther Noack @ 2025-03-10 13:04 UTC (permalink / raw)
  To: Paul Moore, Mickaël Salaün, sergeh
  Cc: David Howells, Kees Cook, Paul Moore, linux-security-module,
	Konstantin Meskhidze, Jann Horn, linux-kernel, Peter Newman

Hello Paul and Serge!

On Tue, Mar 04, 2025 at 09:25:51PM +0100, Mickaël Salaün wrote:
> On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> > Hello!
> > 
> > Thanks for the review!
> > 
> > I'm adding David Howells to this thread as well.  David, maybe you can
> > help us and suggest an appropriate way to update the struct cred across
> > multiple threads?

Paul and Serge, since you are volunteering to take ownership of
credentials, maybe you can advise on what is the best approach here?

To summarize the approaches that I have been discussing with Mickaël:

Approach 1: Use the creds API thread-by-thread (implemented here)

  * Each task calls prepare_creds() and commit_creds() on its own, in
    line with the way the API is designed to be used (from a single
    task).
  * Task work gets scheduled with a pseudo-signal and the task that
    invoked the syscall is waiting for all of them to return.
  * Task work can fail at the beginning due to prepare_creds(), in
    which case all tasks have to abort_creds(). Additional
    synchronization is needed for that.

  Drawback: We need to grab the system-global task lock to prevent new
  thread creation and also grab the per-process signal lock to prevent
  races with other creds accesses, for the entire time as we wait for
  each task to do the task work.

Approach 2: Attempt to do the prepare_creds() step in the calling task.

  * Would use an API similar to what keyctl uses for the
    parent-process update.
  * This side-steps the credentials update API as it is documented in
    Documentation, using the cred_alloc_blank() helper and replicating
    some prepare_creds() logic.

  Drawback: This would introduce another use of the cred_alloc_blank()
  API (and the cred_transfer LSM hook), which would otherwise be
  reasonable to delete if we can remove the keyctl use case.
  (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)

Approach 3: Store Landlock domains outside of credentials altogether

  * We could also store a task's Landlock domain as a pointer in the
    per-task security blob, and refcount these.  We would need to make
    sure that they get newly referenced and updated in the same
    scenarios as they do within struct cred today.
  * We could then guard accesses to a task's Landlock domain with a
    more classic locking mechanism.  This would make it possible to
    update the Landlock domain of all tasks in a process without
    having to go through pseudo-signals.

  Drawbacks:
  * Would have to make sure that the Landlock domain in the task's LSM
    blob behaves exactly the same as before in the struct cred.
  * Potentially slower to access Landlock domains that are guarded by
    a mutex.

I'd be interested to hear your opinion on how we should best approach
this.

P.S. This is probably already clear to everyone who read this far, but
the problem that creds can't be updated across tasks has already led
to other difficult APIs and also bleeds into user-level interfaces
such as the setuid() family of syscalls as well as capability
updating.  Both of these are solved from user space through the signal
hack documented in nptl(7), which is used in glibc for setuid()-style
calls and in libpsx for capabilities and Landlock's Go library.  If we
had a working way to do these cross-thread updates in the kernel, that
would simplify these userspace implementations.  (There are more links
in the cover letter at the top of this thread.)

Thanks,
–Günther

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-03-10 13:04         ` Günther Noack
@ 2025-03-11 14:32           ` Mickaël Salaün
  2025-05-18  7:40             ` Günther Noack
  2025-05-27 14:26           ` Jann Horn
  1 sibling, 1 reply; 18+ messages in thread
From: Mickaël Salaün @ 2025-03-11 14:32 UTC (permalink / raw)
  To: Günther Noack
  Cc: Paul Moore, sergeh, David Howells, Kees Cook,
	linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel, Peter Newman

On Mon, Mar 10, 2025 at 02:04:23PM +0100, Günther Noack wrote:
> Hello Paul and Serge!
> 
> On Tue, Mar 04, 2025 at 09:25:51PM +0100, Mickaël Salaün wrote:
> > On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> > > Hello!
> > > 
> > > Thanks for the review!
> > > 
> > > I'm adding David Howells to this thread as well.  David, maybe you can
> > > help us and suggest an appropriate way to update the struct cred across
> > > multiple threads?
> 
> Paul and Serge, since you are volunteering to take ownership of
> credentials, maybe you can advise on what is the best approach here?
> 
> To summarize the approaches that I have been discussing with Mickaël:
> 
> Approach 1: Use the creds API thread-by-thread (implemented here)
> 
>   * Each task calls prepare_creds() and commit_creds() on its own, in
>     line with the way the API is designed to be used (from a single
>     task).
>   * Task work gets scheduled with a pseudo-signal and the task that
>     invoked the syscall is waiting for all of them to return.
>   * Task work can fail at the beginning due to prepare_creds(), in
>     which case all tasks have to abort_creds(). Additional
>     synchronization is needed for that.
> 
>   Drawback: We need to grab the system-global task lock to prevent new
>   thread creation and also grab the per-process signal lock to prevent
>   races with other creds accesses, for the entire time as we wait for
>   each task to do the task work.

In other words, this approach blocks all threads from the same process.

> 
> Approach 2: Attempt to do the prepare_creds() step in the calling task.
> 
>   * Would use an API similar to what keyctl uses for the
>     parent-process update.
>   * This side-steps the credentials update API as it is documented in
>     Documentation, using the cred_alloc_blank() helper and replicating
>     some prepare_creds() logic.
> 
>   Drawback: This would introduce another use of the cred_alloc_blank()
>   API (and the cred_transfer LSM hook), which would otherwise be
>   reasonable to delete if we can remove the keyctl use case.
>   (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)

cred_alloc_blank() was designed to avoid dealing with -ENOMEM, which is
a required property for this Landlock TSYNC feature (i.e. atomic and
consistent synchronization).

I think it would make sense to replace most of the
key_change_session_keyring() code with a new cred_transfer() helper that
will memcpy the old cred to the new, increment the appropriate ref
counters, and call security_transfer_creds().  We could then use this
helper in Landlock too.
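
To make this concrete, such a helper could look roughly like the
following (a sketch loosely modeled on the existing
key_change_session_keyring() code; the exact set of fields to copy and
reference counts to take would need to be audited, and securebits,
ambient capabilities, keyrings and ucounts are left out here):

  static void cred_transfer(struct cred *new, const struct cred *old)
  {
          new->uid   = old->uid;
          new->gid   = old->gid;
          new->suid  = old->suid;
          new->sgid  = old->sgid;
          new->euid  = old->euid;
          new->egid  = old->egid;
          new->fsuid = old->fsuid;
          new->fsgid = old->fsgid;

          new->cap_inheritable = old->cap_inheritable;
          new->cap_permitted   = old->cap_permitted;
          new->cap_effective   = old->cap_effective;
          new->cap_bset        = old->cap_bset;

          new->user       = get_uid(old->user);
          new->user_ns    = get_user_ns(old->user_ns);
          new->group_info = get_group_info(old->group_info);

          /* Lets each LSM (including Landlock) copy its own state. */
          security_transfer_creds(new, old);
  }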

To properly handle race conditions with a thread changing its own
credentials, we would need a new LSM hook called by commit_creds().
For the Landlock implementation, this hook would check if the process is
being Landlocked+TSYNC and return -ERESTARTNOINTR if it is the case.
The newly created task_work would then be free to update each thread's
credentials while only blocking the calling thread (which is also a
required feature).

Alternatively, instead of a new LSM hook, commit_creds() could check
itself a new group leader's flag set if all the credentials from the
calling process are being updated, and return -ERESTARTNOINTR in this
case.

> 
> Approach 3: Store Landlock domains outside of credentials altogether
> 
>   * We could also store a task's Landlock domain as a pointer in the
>     per-task security blob, and refcount these.  We would need to make
>     sure that they get newly referenced and updated in the same
>     scenarios as they do within struct cred today.
>   * We could then guard accesses to a task's Landlock domain with a
>     more classic locking mechanism.  This would make it possible to
>     update the Landlock domain of all tasks in a process without
>     having to go through pseudo-signals.
> 
>   Drawbacks:
>   * Would have to make sure that the Landlock domain in the task's LSM
>     blob behaves exactly the same as before in the struct cred.
>   * Potentially slower to access Landlock domains that are guarded by
>     a mutex.

This would not work because the kernel (including LSM hooks) uses
credentials to check access.

> 
> I'd be interested to hear your opinion on how we should best approach
> this.
> 
> P.S. This is probably already clear to everyone who read this far, but
> the problem that creds can't be updated across tasks has already led
> to other difficult APIs and also bleeds into user-level interfaces
> such as the setuid() family of syscalls as well as capability
> updating.  Both of these are solved from user space through the signal
> hack documented in nptl(7), which is used in glibc for setuid()-style
> calls and in libpsx for capabilities and Landlock's Go library.  If we
> had a working way to do these cross-thread updates in the kernel, that
> would simplify these userspace implementations.  (There are more links
> in the cover letter at the top of this thread.)
> 
> Thanks,
> –Günther
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-03-11 14:32           ` Mickaël Salaün
@ 2025-05-18  7:40             ` Günther Noack
  2025-05-18 19:57               ` Mickaël Salaün
  0 siblings, 1 reply; 18+ messages in thread
From: Günther Noack @ 2025-05-18  7:40 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Paul Moore, sergeh, David Howells, Kees Cook,
	linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel, Peter Newman

Hello!

On Tue, Mar 11, 2025 at 03:32:53PM +0100, Mickaël Salaün wrote:
> On Mon, Mar 10, 2025 at 02:04:23PM +0100, Günther Noack wrote:
> > On Tue, Mar 04, 2025 at 09:25:51PM +0100, Mickaël Salaün wrote:
> > > On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> > > > Hello!
> > > > 
> > > > Thanks for the review!
> > > > 
> > > > I'm adding David Howells to this thread as well.  David, maybe you can
> > > > help us and suggest an appropriate way to update the struct cred across
> > > > multiple threads?
> > 
> > Paul and Serge, since you are volunteering to take ownership of
> > credentials, maybe you can advise on what is the best approach here?
> > 
> > To summarize the approaches that I have been discussing with Mickaël:
> > 
> > Approach 1: Use the creds API thread-by-thread (implemented here)
> > 
> >   * Each task calls prepare_creds() and commit_creds() on its own, in
> >     line with the way the API is designed to be used (from a single
> >     task).
> >   * Task work gets scheduled with a pseudo-signal and the task that
> >     invoked the syscall is waiting for all of them to return.
> >   * Task work can fail at the beginning due to prepare_creds(), in
> >     which case all tasks have to abort_creds(). Additional
> >     synchronization is needed for that.
> > 
> >   Drawback: We need to grab the system-global task lock to prevent new
> >   thread creation and also grab the per-process signal lock to prevent
> >   races with other creds accesses, for the entire time as we wait for
> >   each task to do the task work.
> 
> In other words, this approach blocks all threads from the same process.

It does, but that is still an improvement over the current
libpsx-based implementation in userspace.  That existing
implementation does not block, but it is running the risk that
prepare_creds() might fail on one of the threads (e.g. allocation
failure), which would leave the process's threads in an inconsistent
state.

Another upside of the in-kernel implementation is that it is hidden
behind an API, so if we can eventually find a better approach, we can
migrate to it.  It gives us
flexibility.

I guess a possible variant (approach 1B) would be to do the equivalent
to what userspace does today, and not make all threads wait for the
possible error of prepare_creds() on the other threads.


> > Approach 2: Attempt to do the prepare_creds() step in the calling task.
> > 
> >   * Would use an API similar to what keyctl uses for the
> >     parent-process update.
> >   * This side-steps the credentials update API as it is documented in
> >     Documentation, using the cred_alloc_blank() helper and replicating
> >     some prepare_creds() logic.
> > 
> >   Drawback: This would introduce another use of the cred_alloc_blank()
> >   API (and the cred_transfer LSM hook), which would otherwise be
> >   reasonable to delete if we can remove the keyctl use case.
> >   (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)
> 
> cred_alloc_blank() was designed to avoid dealing with -ENOMEM, which is
> a required property for this Landlock TSYNC feature (i.e. atomic and
> consistent synchronization).

As a side remark, I suspect that the error handling in nptl(7) probably
does not guarantee that either, including for setuid(2) and friends.


> I think it would make sense to replace most of the
> key_change_session_keyring() code with a new cred_transfer() helper that
> will memcpy the old cred to the new, increment the appropriate ref
> counters, and call security_transfer_creds().  We could then use this
> helper in Landlock too.
> 
> To properly handle race conditions with a thread changing its own
> credentials, we would need a new LSM hook called by commit_creds().
> For the Landlock implementation, this hook would check if the process is
> being Landlocked+TSYNC and return -ERESTARTNOINTR if it is the case.
> The newly created task_work would then be free to update each thread's
> credentials while only blocking the calling thread (which is also a
> required feature).
> 
> Alternatively, instead of a new LSM hook, commit_creds() could check
> itself a new group leader's flag set if all the credentials from the
> calling process are being updated, and return -ERESTARTNOINTR in this
> case.

commit_creds() is explicitly documented to never return errors.
It returns the integer 0 so that it lends itself to tail calls,
and some of those usages might also rely on it always working.
There are ~15 existing calls where the return value is discarded.

If commit_creds() returns -ERESTARTNOINTR, I assume that your idea is
that the task_work would retry the prepare-and-commit when
encountering that?

We would have to store the fact that the process is being
Landlock+TSYNC'd in a central place (e.g. group leader flag set).
When that is done, don't we need more synchronization mechanisms to
access that (which RCU was meant to avoid)?

I am having a hard time wrapping my head around these synchronization
schemes; I feel this is getting too complicated for what it is trying
to do and might become difficult to maintain if we implemented it.

> > Approach 3: Store Landlock domains outside of credentials altogether
> > 
> >   * We could also store a task's Landlock domain as a pointer in the
> >     per-task security blob, and refcount these.  We would need to make
> >     sure that they get newly referenced and updated in the same
> >     scenarios as they do within struct cred today.
> >   * We could then guard accesses to a task's Landlock domain with a
> >     more classic locking mechanism.  This would make it possible to
> >     update the Landlock domain of all tasks in a process without
> >     having to go through pseudo-signals.
> > 
> >   Drawbacks:
> >   * Would have to make sure that the Landlock domain in the task's LSM
> >     blob behaves exactly the same as before in the struct cred.
> >   * Potentially slower to access Landlock domains that are guarded by
> >     a mutex.
> 
> This would not work because the kernel (including LSM hooks) uses
> credentials to check access.

It's unclear to me what you mean by that.

Do you mean that it is hard to replicate for Landlock the cases where
the pointer would have to be copied, because the LSM hooks are not
suited for it?


Here is another possible approach which a colleague suggested in a
discussion:

Approach 4: Freeze and re-enforce the Landlock ruleset

Another option would be to have a different user space API for this,
with a flag LANDLOCK_RESTRICT_SELF_ENTER (name TBD) to enter a given
domain.

On first usage of landlock_restrict_self() with the flag, the enforced
ruleset would be frozen and linked to the Landlock domain which was
enforced at the end.

Subsequent attempts to add rules to the ruleset would fail when the
ruleset is frozen.  The ruleset FD is now representing the created
domain including all its nesting.

Subsequent usages of landlock_restrict_self() on a frozen ruleset would:

(a) check that the ruleset's domain is a narrower (nested) domain of
    the current thread's domain (so that we retain the property of
    only locking in a task further than it was before).

(b) set the task's domain to the domain attached to the ruleset

This way, we would keep a per-thread userspace API, avoiding the
issues discussed before.  It would become possible to use ruleset file
descriptors as handles for entering Landlock domains and pass them
around between processes.
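
From user space, usage could look roughly like this (a sketch only; the
flag name and value are made up for illustration, and the call would be
run once per thread, e.g. via a libpsx callback):

  #include <sys/syscall.h>
  #include <unistd.h>

  /* Hypothetical flag; name and value are placeholders. */
  #ifndef LANDLOCK_RESTRICT_SELF_ENTER
  #define LANDLOCK_RESTRICT_SELF_ENTER (1U << 1)
  #endif

  /* The first call freezes the ruleset and creates the domain;
   * subsequent calls (from the other threads) just enter it. */
  static int enter_landlock_domain(const int ruleset_fd)
  {
          return syscall(__NR_landlock_restrict_self, ruleset_fd,
                         LANDLOCK_RESTRICT_SELF_ENTER);
  }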

The only drawback I can see is that it has the same issues as libpsx
and nptl(7) in that the syscall can fail on individual threads due to
ENOMEM.

If we can not find a solution for "TSYNC", it seems that this might be
a viable alternative.  For multithreaded applications enforcing a
Landlock policy, it would become an application of libpsx with the
LANDLOCK_RESTRICT_SELF_ENTER flag.

Let me know what you think.

–Günther

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-05-18  7:40             ` Günther Noack
@ 2025-05-18 19:57               ` Mickaël Salaün
  2025-05-30 13:16                 ` Günther Noack
  0 siblings, 1 reply; 18+ messages in thread
From: Mickaël Salaün @ 2025-05-18 19:57 UTC (permalink / raw)
  To: Günther Noack
  Cc: Paul Moore, sergeh, David Howells, Kees Cook,
	linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel, Peter Newman, Andy Lutomirski, Will Drewry

On Sun, May 18, 2025 at 09:40:05AM +0200, Günther Noack wrote:
> Hello!
> 
> On Tue, Mar 11, 2025 at 03:32:53PM +0100, Mickaël Salaün wrote:
> > On Mon, Mar 10, 2025 at 02:04:23PM +0100, Günther Noack wrote:
> > > On Tue, Mar 04, 2025 at 09:25:51PM +0100, Mickaël Salaün wrote:
> > > > On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> > > > > Hello!
> > > > > 
> > > > > Thanks for the review!
> > > > > 
> > > > > I'm adding David Howells to this thread as well.  David, maybe you can
> > > > > help us and suggest an appropriate way to update the struct cred across
> > > > > multiple threads?
> > > 
> > > Paul and Serge, since you are volunteering to take ownership of
> > > credentials, maybe you can advise on what is the best approach here?
> > > 
> > > To summarize the approaches that I have been discussing with Mickaël:
> > > 
> > > Approach 1: Use the creds API thread-by-thread (implemented here)
> > > 
> > >   * Each task calls prepare_creds() and commit_creds() on its own, in
> > >     line with the way the API is designed to be used (from a single
> > >     task).
> > >   * Task work gets scheduled with a pseudo-signal and the task that
> > >     invoked the syscall is waiting for all of them to return.
> > >   * Task work can fail at the beginning due to prepare_creds(), in
> > >     which case all tasks have to abort_creds(). Additional
> > >     synchronization is needed for that.
> > > 
> > >   Drawback: We need to grab the system-global task lock to prevent new
> > >   thread creation and also grab the per-process signal lock to prevent
> > >   races with other creds accesses, for the entire time as we wait for
> > >   each task to do the task work.
> > 
> > In other words, this approach blocks all threads from the same process.
> 
> It does, but that is still an improvement over the current
> libpsx-based implementation in userspace.  That existing
> implementation does not block, but it is running the risk that
> prepare_creds() might fail on one of the threads (e.g. allocation
> > failure), which would leave the process's threads in an inconsistent
> state.
> 
> Another upside of the in-kernel implementation is that it is hidden
> behind an API, so if we can eventually find a better approach, we can
> migrate to it.  It gives us
> flexibility.



> 
> I guess a possible variant (approach 1B) would be to do the equivalent
> to what userspace does today, and not make all threads wait for the
> possible error of prepare_creds() on the other threads.

This 1B variant is not OK because it would remove the guarantee that the
whole process is restricted.

> 
> 
> > > Approach 2: Attempt to do the prepare_creds() step in the calling task.
> > > 
> > >   * Would use an API similar to what keyctl uses for the
> > >     parent-process update.
> > >   * This side-steps the credentials update API as it is documented in
> > >     Documentation, using the cred_alloc_blank() helper and replicating
> > >     some prepare_creds() logic.
> > > 
> > >   Drawback: This would introduce another use of the cred_alloc_blank()
> > >   API (and the cred_transfer LSM hook), which would otherwise be
> > >   reasonable to delete if we can remove the keyctl use case.
> > >   (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)
> > 
> > cred_alloc_blank() was designed to avoid dealing with -ENOMEM, which is
> > a required property for this Landlock TSYNC feature (i.e. atomic and
> > consistent synchronization).
> 
> As a side remark, I suspect that the error handling in nptl(7) probably
> does not guarantee that either, including for setuid(2) and friends.
> 
> 
> > I think it would make sense to replace most of the
> > key_change_session_keyring() code with a new cred_transfer() helper that
> > will memcpy the old cred to the new, increment the appropriate ref
> > counters, and call security_transfer_creds().  We could then use this
> > helper in Landlock too.
> > 
> > To properly handle race conditions with a thread changing its own
> > credentials, we would need a new LSM hook called by commit_creds().
> > For the Landlock implementation, this hook would check if the process is
> > being Landlocked+TSYNC and return -ERESTARTNOINTR if it is the case.
> > The newly created task_work would then be free to update each thread's
> > credentials while only blocking the calling thread (which is also a
> > required feature).
> > 
> > Alternatively, instead of a new LSM hook, commit_creds() could check
> > itself a new group leader's flag set if all the credentials from the
> > calling process are being updated, and return -ERESTARTNOINTR in this
> > case.
> 
> commit_creds() is explicitly documented to never return errors.
> It returns the integer 0 so that it lends itself to tail calls,
> and some of those usages might also rely on it always working.
> There are ~15 existing calls where the return value is discarded.

Indeed, commit_creds() should always return 0.  My full proposal does
not look safe enough, but the cred_transfer() helper can still be
useful.

> 
> If commit_creds() returns -ERESTARTNOINTR, I assume that your idea is
> that the task_work would retry the prepare-and-commit when
> encountering that?
> 
> We would have to store the fact that the process is being
> Landlock+TSYNC'd in a central place (e.g. group leader flag set).
> When that is done, don't we need more synchronization mechanisms to
> access that (which RCU was meant to avoid)?
> 
> I am having a hard time wrapping my head around these synchronization
> schemes; I feel this is getting too complicated for what it is trying
> to do and might become difficult to maintain if we implemented it.

Fair. ERESTARTNOINTR should only be used by a syscall implementation.

> 
> > > Approach 3: Store Landlock domains outside of credentials altogether
> > > 
> > >   * We could also store a task's Landlock domain as a pointer in the
> > >     per-task security blob, and refcount these.  We would need to make
> > >     sure that they get newly referenced and updated in the same
> > >     scenarios as they do within struct cred today.
> > >   * We could then guard accesses to a task's Landlock domain with a
> > >     more classic locking mechanism.  This would make it possible to
> > >     update the Landlock domain of all tasks in a process without
> > >     having to go through pseudo-signals.
> > > 
> > >   Drawbacks:
> > >   * Would have to make sure that the Landlock domain in the task's LSM
> > >     blob behaves exactly the same as before in the struct cred.
> > >   * Potentially slower to access Landlock domains that are guarded by
> > >     a mutex.
> > 
> > This would not work because the kernel (including LSM hooks) uses
> > credentials to check access.
> 
> It's unclear to me what you mean by that.
> 
> Do you mean that it is hard to replicate for Landlock the cases where
> the pointer would have to be copied, because the LSM hooks are not
> suited for it?

struct cred is used to check if a task subject can access a task object.
Landlock's metadata must stay in struct cred to be available when
checking access to any kernel object.  The LSM hooks reflect this
rationale by only passing struct cred when checking a task (e.g.
security_task_kill()'s cred).
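
For reference, that hook's prototype looks roughly like this (from
memory):

  int (*task_kill)(struct task_struct *p, struct kernel_siginfo *info,
                   int sig, const struct cred *cred);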

seccomp only cares about filtering raw syscalls, and the seccomp filters
are just ignored when the kernel (with an LSM or not) checks task's
permission to access another task.

The per-task security blob could store some state though, e.g. to
identify if a domain needs to be updated, but I don't see a use case
here.

> 
> 
> Here is another possible approach which a colleague suggested in a
> discussion:
> 
> Approach 4: Freeze and re-enforce the Landlock ruleset
> 
> Another option would be to have a different user space API for this,
> with a flag LANDLOCK_RESTRICT_SELF_ENTER (name TBD) to enter a given
> domain.
> 
> On first usage of landlock_restrict_self() with the flag, the enforced
> ruleset would be frozen and linked to the Landlock domain which was
> enforced at the end.
> 
> Subsequent attempts to add rules to the ruleset would fail when the
> ruleset is frozen.  The ruleset FD is now representing the created
> domain including all its nesting.
> 
> Subsequent usages of landlock_restrict_self() on a frozen ruleset would:
> 
> (a) check that the ruleset's domain is a narrower (nested) domain of
>     the current thread's domain (so that we retain the property of
>     only locking in a task further than it was before).
> 
> (b) set the task's domain to the domain attached to the ruleset
> 
> This way, we would keep a per-thread userspace API, avoiding the
> issues discussed before.  It would become possible to use ruleset file
> descriptors as handles for entering Landlock domains and pass them
> around between processes.
> 
> The only drawback I can see is that it has the same issues as libpsx
> and nptl(7) in that the syscall can fail on individual threads due to
> ENOMEM.

Right. This approach is interesting, but it does not solve the main
issue here.

Anyway, being able to enter a Landlock domain would definitely be
useful. I would prefer using a pidfd to refer to a task's Landlock
domain, which would avoid race conditions and make the API clearer.  It
would be nice to be able to pass a pidfd (instead of a ruleset) to
landlock_restrict_self().  If we want to directly deal with a domain, we
should create a dedicated domain FD type.

> 
> If we can not find a solution for "TSYNC", it seems that this might be
> a viable alternative.  For multithreaded applications enforcing a
> Landlock policy, it would become an application of libpsx with the
> LANDLOCK_RESTRICT_SELF_ENTER flag.
> 
> Let me know what you think.
> 
> –Günther

Thinking more about this feature, it might actually make sense to
synchronize all threads from the same process without checking other
threads' Landlock domains.  The rationale is:
1. Linux threads are not security boundaries and it is allowed for a
   thread to control other threads' memory, which means changing their
   code flow.  In other words, a thread's permissions are the union of
   all threads' permissions in the same process.
2. libpsx and libc's set*id() ignore other threads' credentials and just
   blindly execute the same code on all threads.
3. It would be simpler and would avoid another error case.

An issue could happen if a Landlock domain restricting a test thread is
replaced.  I don't think the benefit of avoiding this issue is worth it
compared to the guarantee we get when forcing the sandboxing of a full
process without error.

We should rename the flag to LANDLOCK_RESTRICT_SELF_PROCESS to make it
clear what it does.

The remaining issues are still the potential memory allocation failures.
There are two things:

1. We should try as much as possible to limit useless credential
   duplications by not creating a new struct cred if parent credentials
   are the same.

2. To avoid the libpsx inconsistency (because of ENOMEM or EPERM),
   landlock_restrict_self(2) should handle memory allocation and
   transition the process from a known state to another known state.

What about this approach:
- "Freeze" all threads of the current process (not ideal but simple) to
  make sure their credentials don't get updated.
- Create a new blank credential for the calling thread.
- Walk through all threads and create a new blank credential for all
  threads with a different cred than the caller.
- Inject a task work that will, for each thread, either reuse the same new
  credential as the caller (incrementing its refcount) if the thread shares
  the caller's credentials, or populate and use one of the blank ones via
  cred_transfer() otherwise.

This may not efficiently deduplicate credentials for all threads but it
is a simple deduplication approach that should be useful in most cases.

The difficult part is mainly in the "freezing".  It would be nice to
change the cred API to avoid that, but I'm not sure how.
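
To make the per-thread decision concrete, here is a minimal sketch of what
the task work could do for one thread; cred_transfer() is the proposed
helper (it does not exist today) and the Landlock domain update is elided:

/*
 * Sketch only: caller_old_cred is the cred the calling thread had when it
 * started, new_cred is its prepared replacement (with the new domain), and
 * blank is the cred pre-allocated for this thread.
 */
static void landlock_sync_thread_cred(const struct cred *caller_old_cred,
                                      struct cred *new_cred,
                                      struct cred *blank)
{
  if (current_cred() == caller_old_cred) {
    /* Same creds as the caller: share the caller's new cred. */
    get_cred(new_cred);
    commit_creds(new_cred);
  } else {
    /* Different creds: fill the blank one from this thread's own cred. */
    cred_transfer(blank, current_cred());  /* proposed helper */
    /* ... attach the new Landlock domain to blank here ... */
    commit_creds(blank);
  }
}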

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-03-10 13:04         ` Günther Noack
  2025-03-11 14:32           ` Mickaël Salaün
@ 2025-05-27 14:26           ` Jann Horn
  1 sibling, 0 replies; 18+ messages in thread
From: Jann Horn @ 2025-05-27 14:26 UTC (permalink / raw)
  To: Günther Noack
  Cc: Paul Moore, Mickaël Salaün, sergeh, David Howells,
	Kees Cook, linux-security-module, Konstantin Meskhidze,
	linux-kernel, Peter Newman

On Mon, Mar 10, 2025 at 2:04 PM Günther Noack <gnoack3000@gmail.com> wrote:
> Hello Paul and Serge!
>
> On Tue, Mar 04, 2025 at 09:25:51PM +0100, Mickaël Salaün wrote:
> > On Fri, Feb 28, 2025 at 06:33:55PM +0100, Günther Noack wrote:
> > > Hello!
> > >
> > > Thanks for the review!
> > >
> > > I'm adding David Howells to this thread as well.  David, maybe you can
> > > help us and suggest a appropriate way to update the struct cred across
> > > multiple threads?
>
> Paul and Serge, since you are volunteering to take ownership of
> credentials, maybe you can advise on what is the best approach here?
>
> To summarize the approaches that I have been discussing with Mickaël:
>
> Approach 1: Use the creds API thread-by-thread (implemented here)
>
>   * Each task calls prepare_creds() and commit_creds() on its own, in
>     line with the way the API is designed to be used (from a single
>     task).
>   * Task work gets scheduled with a pseudo-signal and the task that
>     invoked the syscall is waiting for all of them to return.
>   * Task work can fail at the beginning due to prepare_creds(), in
>     which case all tasks have to abort_creds(). Additional
>     synchronization is needed for that.
>
>   Drawback: We need to grab the system-global task lock to prevent new
>   thread creation and also grab the per-process signal lock to prevent
>   races with other creds accesses, for the entire time as we wait for
>   each task to do the task work.

The tasklist_lock and the siglock can't be used while we're waiting;
they're not sleepable locks.  Also, copy_process() currently copies
creds to the new task with copy_creds() without holding any locks yet.

I think one way to implement this option without introducing new locks
in core kernel code would be to have the thread initiating the
credential change do this in three separate steps:

 - loop over the other threads under RCU (for searching for new
threads to send task work to)
 - allocate state for threads (outside RCU, in sleepable context)
 - wait until all threads are in task work


I guess that might look roughly like this, as a rough sketch:


struct landlock_sync_state {
  atomic_t num_threads_entered;
  atomic_t num_threads_done;
  bool abort;
  wait_queue_head_t wq_for_initiator;
  wait_queue_head_t wq_for_others;
};
struct landlock_tw_state {
  struct task_struct *task;
  struct callback_head work;
  struct landlock_sync_state *shared;
};

size_t num_threads_signalled = 0;
size_t num_threads_allocated = 0;
size_t num_threads_to_allocate;
struct landlock_tw_state **thread_works = NULL;
struct landlock_sync_state state;
struct task_struct *thread;
bool all_threads_signalled = false;

init_waitqueue_head(&state.wq_for_initiator);
init_waitqueue_head(&state.wq_for_others);
atomic_set(&state.num_threads_entered, 0);
atomic_set(&state.num_threads_done, 0);
state.abort = false;
while (true) {
  /* scan for threads we haven't sent task work to yet */
  rcu_read_lock();
  num_threads_to_allocate = 0;
  for_each_thread(current, thread) {
    /*
     * this PF_EXITING check is a bit dodgy but not much worse than
     * what seccomp already does
    */
    if (thread == current || (READ_ONCE(thread->flags) & PF_EXITING))
      continue;

    for (int i=0; i<num_threads_signalled; i++) {
      if (thread_works[i]->task == thread)
        goto next_thread;
    }

    all_threads_signalled = false;
    if (num_threads_allocated > num_threads_signalled) {
      struct landlock_tw_state *ltws = thread_works[num_threads_signalled];

      ltws->task = thread;
      ltws->shared = &state;
      /* landlock_tw_func is the per-thread task_work handler (sketched below) */
      init_task_work(&ltws->work, landlock_tw_func);
      if (task_work_add(thread, &ltws->work, TWA_SIGNAL) == 0)
        num_threads_signalled++;
    } else {
      num_threads_to_allocate++;
    }

next_thread:;
  }
  rcu_read_unlock();

  if (all_threads_signalled)
    break; /* success */

  /*
   * If we need state for more threads, allocate it now and immediately retry.
   * Normally we only need to go through this path once, more if we race with
   * clone().
   */
  if (num_threads_to_allocate) {
    size_t new_count = num_threads_allocated + num_threads_to_allocate;
    struct landlock_tw_state **reallocd = kvrealloc(thread_works,
        size_mul(sizeof(struct landlock_tw_state *), new_count), GFP_KERNEL);
    if (!reallocd)
      goto bailout;
    thread_works = reallocd;
    /* allocate the per-thread state that the grown array now has room for */
    while (num_threads_allocated < new_count) {
      thread_works[num_threads_allocated] =
          kzalloc(sizeof(struct landlock_tw_state), GFP_KERNEL);
      if (!thread_works[num_threads_allocated])
        goto bailout;
      num_threads_allocated++;
    }
    continue;
  }

  /* wait for all the threads we signalled */
  if (wait_event_interruptible(state.wq_for_initiator,
          atomic_read(&state.num_threads_entered) == num_threads_signalled))
    goto bailout;

  /*
   * Now loop at least once more to check that none of the threads called
   * clone() in the meantime.
   */
  all_threads_signalled = true;
}

/* success */

/* ... coordinate the cred prepare and commit across threads here ... */

bailout:
for (int i=0; i<num_threads_signalled; i++) {
  if (task_work_cancel(thread_works[i]->task, &thread_works[i]->work))
    atomic_inc(&state.num_threads_done);
}
WRITE_ONCE(state.abort, true);
wake_up(&state.wq_for_others);
wait_event(state.wq_for_initiator,
           atomic_read(&state.num_threads_done) == num_threads_signalled);


This would be somewhat complex, but at least it would keep the
complexity isolated in the code for updating creds across tasks, so I
think this would be reasonable.
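
For completeness, here is a rough sketch (same caveats as above) of what the
per-thread task work handler could look like; it assumes one extra "proceed"
flag in landlock_sync_state that the initiator sets once the new credentials
are ready, and it leaves the actual credential commit elided:

static void landlock_tw_func(struct callback_head *work)
{
  struct landlock_tw_state *ltws =
      container_of(work, struct landlock_tw_state, work);
  struct landlock_sync_state *state = ltws->shared;

  /* Tell the initiator this thread has reached the task work. */
  atomic_inc(&state->num_threads_entered);
  wake_up(&state->wq_for_initiator);

  /* Park until the initiator aborts or signals that the creds are ready. */
  wait_event(state->wq_for_others,
             READ_ONCE(state->abort) || READ_ONCE(state->proceed));

  if (!READ_ONCE(state->abort)) {
    /* ... commit this thread's prepared credentials here ... */
  }

  atomic_inc(&state->num_threads_done);
  wake_up(&state->wq_for_initiator);
}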

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-05-18 19:57               ` Mickaël Salaün
@ 2025-05-30 13:16                 ` Günther Noack
  2025-05-30 15:11                   ` Mickaël Salaün
  0 siblings, 1 reply; 18+ messages in thread
From: Günther Noack @ 2025-05-30 13:16 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Günther Noack, Paul Moore, sergeh, David Howells, Kees Cook,
	linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel, Peter Newman, Andy Lutomirski, Will Drewry

On Sun, May 18, 2025 at 09:57:32PM +0200, Mickaël Salaün wrote:
> On Sun, May 18, 2025 at 09:40:05AM +0200, Günther Noack wrote:
> > On Tue, Mar 11, 2025 at 03:32:53PM +0100, Mickaël Salaün wrote:
> > > On Mon, Mar 10, 2025 at 02:04:23PM +0100, Günther Noack wrote:

> > > > Approach 1: Use the creds API thread-by-thread (implemented here)
> > > > 
> > > >   * Each task calls prepare_creds() and commit_creds() on its own, in
> > > >     line with the way the API is designed to be used (from a single
> > > >     task).
> > > >   * Task work gets scheduled with a pseudo-signal and the task that
> > > >     invoked the syscall is waiting for all of them to return.
> > > >   * Task work can fail at the beginning due to prepare_creds(), in
> > > >     which case all tasks have to abort_creds(). Additional
> > > >     synchronization is needed for that.
> > > > 
> > > >   Drawback: We need to grab the system-global task lock to prevent new
> > > >   thread creation and also grab the per-process signal lock to prevent
> > > >   races with other creds accesses, for the entire time as we wait for
> > > >   each task to do the task work.
> > > 
> > > In other words, this approach blocks all threads from the same process.
> > 
> > It does, but that is still an improvement over the current
> > libpsx-based implementation in userspace.  That existing
> > implementation does not block, but it is running the risk that
> > prepare_creds() might fail on one of the threads (e.g. allocation
> > failure), which would leave the processes' threads in an inconsistent
> > state.
> > 
> > Another upside that the in-kernel implementation has is that the
> > implementation of that is hidden behind an API, so if we can
> > eventually find a better approach, we can migrate to it.  It gives us
> > flexibility.
> 
> > I guess a possible variant (approach 1B) would be to do the equivalent
> > to what userspace does today, and not make all threads wait for the
> > possible error of prepare_creds() on the other threads.
> 
> This 1B variant is not OK because it would remove the guarantee that the
> whole process is restricted.

👍 Agreed.


> > > > Approach 2: Attempt to do the prepare_creds() step in the calling task.
> > > > 
> > > >   * Would use an API similar to what keyctl uses for the
> > > >     parent-process update.
> > > >   * This side-steps the credentials update API as it is documented in
> > > >     Documentation, using the cred_alloc_blank() helper and replicating
> > > >     some prepare_creds() logic.
> > > > 
> > > >   Drawback: This would introduce another use of the cred_alloc_blank()
> > > >   API (and the cred_transfer LSM hook), which would otherwise be
> > > >   reasonable to delete if we can remove the keyctl use case.
> > > >   (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)
> > > 
> > > cred_alloc_blank() was designed to avoid dealing with -ENOMEM, which is
> > > a required property for this Landlock TSYNC feature (i.e. atomic and
> > > consistent synchronization).
> > 
> > Remark on the side, I suspect that the error handling in nptl(7)
> > probably also does not guarantee that, also for setuid(2) and friends.
> > 
> > 
> > > I think it would make sense to replace most of the
> > > key_change_session_keyring() code with a new cred_transfer() helper that
> > > will memcpy the old cred to the new, increment the appropriate ref
> > > counters, and call security_transfer_creds().  We could then use this
> > > helper in Landlock too.
> > > 
> > > To properly handle race conditions with a thread changing its own
> > > credentials, we would need a new LSM hook called by commit_creds().
> > > For the Landlock implementation, this hook would check if the process is
> > > being Landlocked+TSYNC and return -ERESTARTNOINTR if it is the case.
> > > The newly created task_work would then be free to update each thread's
> > > credentials while only blocking the calling thread (which is also a
> > > required feature).
> > > 
> > > Alternatively, instead of a new LSM hook, commit_creds() could check
> > > itself a new group leader's flag set if all the credentials from the
> > > calling process are being updated, and return -ERESTARTNOINTR in this
> > > case.
> > 
> > commit_creds() is explicitly documented to never return errors.
> > It returns a 0 integer so that it lends itself for tail calls,
> > and some of those usages might also rely on it always working.
> > There are ~15 existing calls where the return value is discarded.
> 
> Indeed, commit_creds() should always return 0.  My full proposal does
> not look safe enough, but the cred_transfer() helper can still be
> useful.
> 
> > 
> > If commit_creds() returns -ERESTARTNOINTR, I assume that your idea is
> > that the task_work would retry the prepare-and-commit when
> > encountering that?
> > 
> > We would have to store the fact that the process is being
> > Landlock+TSYNC'd in a central place (e.g. group leader flag set).
> > When that is done, don't we need more synchronization mechanisms to
> > access that (which RCU was meant to avoid)?
> > 
> > I am having a hard time wrapping my head around these synchronization
> > schemes, I feel this is getting too complicated for what it is trying
> > to do and might become difficult to maintain if we implemented it.
> 
> Fair. ERESTARTNOINTR should only be used by a syscall implementation.
> 
> > 
> > > > Approach 3: Store Landlock domains outside of credentials altogether
> > > > 
> > > >   * We could also store a task's Landlock domain as a pointer in the
> > > >     per-task security blob, and refcount these.  We would need to make
> > > >     sure that they get newly referenced and updated in the same
> > > >     scenarios as they do within struct cred today.
> > > >   * We could then guard accesses to a task's Landlock domain with a
> > > >     more classic locking mechanism.  This would make it possible to
> > > >     update the Landlock domain of all tasks in a process without
> > > >     having to go through pseudo-signals.
> > > > 
> > > >   Drawbacks:
> > > >   * Would have to make sure that the Landlock domain the task's LSM
> > > >     blob behaves exactly the same as before in the struct cred.
> > > >   * Potentially slower to access Landlock domains that are guarded by
> > > >     a mutex.
> > > 
> > > This would not work because the kernel (including LSM hooks) uses
> > > credentials to check access.
> > 
> > It's unclear to me what you mean by that.
> > 
> > Do you mean that it is hard to replicate for Landlock the cases where
> > the pointer would have to be copied, because the LSM hooks are not
> > suited for it?
> 
> struct cred is used to check if a task subject can access a task object.
> Landlock's metadata must stay in struct cred to be available when
> checking access to any kernel object.  The LSM hooks reflect this
> rationale by only passing struct cred when checking a task (e.g.
> security_task_kill()'s cred).
> 
> seccomp only cares about filtering raw syscalls, and the seccomp filters
> are just ignored when the kernel (with an LSM or not) checks task's
> permission to access another task.
> 
> The per-task security blob could store some state though, e.g. to
> identify if a domain needs to be updated, but I don't see a use case
> here.

(Side remark on the idea of storing "pending domain updates" in the task blob:

I have pondered such an idea as well, where we do not store the Landlock domain
itself in the task blob, but only a "pending" update that we need to do to the
Landlock domain in creds, and then to apply that opportunistically/lazily as
part of other Landlock LSM calls.

I believe that in this approach, it becomes hard to control whether that update
can ever actually get applied.  So to be safe, we would always have to run under the
assumption that it does not get applied, and then we might as well store the
Landlock domain directly in the task blob.

I also don't think this makes sense.)


> > Here is another possible approach which a colleague suggested in a
> > discussion:
> > 
> > Approach 4: Freeze-and re-enforce the Landlock ruleset
> > 
> > Another option would be to have a different user space API for this,
> > with a flag LANDLOCK_RESTRICT_SELF_ENTER (name TBD) to enter a given
> > domain.
> > 
> > On first usage of landlock_restrict_self() with the flag, the enforced
> > ruleset would be frozen and linked to the Landlock domain which was
> > enforced at the end.
> > 
> > Subsequent attempts to add rules to the ruleset would fail when the
> > ruleset is frozen.  The ruleset FD is now representing the created
> > domain including all its nesting.
> > 
> > Subsequent usages of landlock_restrict_self() on a frozen ruleset would:
> > 
> > (a) check that the ruleset's domain is a narrower (nested) domain of
> >     the current thread's domain (so that we retain the property of
> >     only locking in a task further than it was before).
> > 
> > (b) set the task's domain to the domain attached to the ruleset
> > 
> > This way, we would keep a per-thread userspace API, avoiding the
> > issues discussed before.  It would become possible to use ruleset file
> > descriptors as handles for entering Landlock domains and pass them
> > around between processes.
> > 
> > The only drawback I can see is that it has the same issues as libpsx
> > and nptl(7) in that the syscall can fail on individual threads due to
> > ENOMEM.
> 
> Right. This approach is interesting, but it does not solve the main
> issue here.

It doesn't?

In my mind, the main goal of the patch set is that we can enable Landlock in
multithreaded processes like in Go programs or in multithreaded C(++).

With Approach 4, we would admittedly still have to do some work in userspace,
and it would not have the nice all-or-nothing semantics, but at least it would
be possible to get all threads to join the same Landlock domain.  (And after
all, setuid(2) also does not have all-or-nothing semantics, from what I can
tell.)


> Anyway, being able to enter a Landlock domain would definitely be
> useful. I would prefer using a pidfd to refer to a task's Landlock
> domain, which would avoid race condition and make the API clearer.  It
> would be nice to be able to pass a pidfd (instead of a ruleset) to
> landlock_restrict_self().  If we want to directly deal with a domain, we
> should create a dedicated domain FD type.

Fair enough, a different FD type for that would also be possible.


> > If we can not find a solution for "TSYNC", it seems that this might be
> > a viable alternative.  For multithreaded applications enforcing a
> > Landlock policy, it would become an application of libpsx with the
> > LANDLOCK_RESTRICT_SELF_ENTER flag.
> > 
> > Let me know what you think.
> > 
> > –Günther
> 
> Thinking more about this feature, it might actually make sense to
> synchronize all threads from the same process without checking other
> threads' Landlock domain. The rationale are:
> 1. Linux threads are not security boundaries and it is allowed for a
>    thread to control other threads' memory, which means changing their
>    code flow.  In other words, thread's permissions are the union of all
>    thread's permissions in the same process.
> 2. libpsx and libc's set*id() ignore other thread's credentials and just
>    blindly execute the same code on all threads.
> 3. It would be simpler and would avoid another error case.

+1, agreed.  That would let us skip the check for the pre-existing domain on
these threads.


> An issue could happen if a Landlock domain restricting a test thread is
> replaced.

You mean for Landlock's selftests?  I thought these were running in their own
forked-off subprocess?  I'm probably misunderstanding you here. :)


> I don't think the benefit of avoiding this issue is worth it
> compared to the guarantee we get when forcing the sandboxing of a full
> process without error.
> 



> We should rename the flag to LANDLOCK_RESTRICT_SELF_PROCESS to make it
> clear what it does.
> 
> The remaining issues are still the potential memory allocation failures.
> There are two things:
> 
> 1. We should try as much as possible to limit useless credential
>    duplications by not creating a new struct cred if parent credentials
>    are the same.
> 
> 2. To avoid the libpsx inconsistency (because of ENOMEM or EPERM),
>    landlock_restrict_self(2) should handle memory allocation and
>    transition the process from a known state to another known state.
> 
> What about this approach:
> - "Freeze" all threads of the current process (not ideal but simple) to
>   make sure their credentials don't get updated.
> - Create a new blank credential for the calling thread.
> - Walk through all threads and create a new blank credential for all
>   threads with a different cred than the caller.
> - Inject a task work that will call cred_transfer() for all threads with
>   either the same new credential used by the caller (incrementing the
>   refcount), or it will populate and use a blank one if it has different
>   credentials than the caller.
> 
> This may not efficiently deduplicate credentials for all threads but it
> is a simple deduplication approach that should be useful in most cases.
> 
> The difficult part is mainly in the "fleezing". It would be nice to
> change the cred API to avoid that but I'm not sure how.

I don't see a way to freeze the credentials of other threads:

To freeze a task's credentials, we would have to prevent commit_creds() from
succeeding on that task, and I don't see how that could be done - we can not
prevent these tasks from calling commit_creds() [1], and once commit_creds()
gets called, it is guaranteed to succeed.

So in my mind, we have to somehow deal with the possibility that a task has a
new and not-previously-seen struct cred by the time that its task_work gets
called.  As a consequence, I think a call to prepare_creds() would then be
unavoidable in the task_work?


—Günther


[1] We might be able to keep cred_prepare() and maybe cred_alloc_blank() from
    succeeding, but that does not mean that no one can call commit_creds() -
    there is still the possibility that commit_creds() gets called with a struct
    cred* that was acquired before we decided to freeze.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-05-30 13:16                 ` Günther Noack
@ 2025-05-30 15:11                   ` Mickaël Salaün
  2025-05-30 16:26                     ` Günther Noack
  0 siblings, 1 reply; 18+ messages in thread
From: Mickaël Salaün @ 2025-05-30 15:11 UTC (permalink / raw)
  To: Günther Noack
  Cc: Günther Noack, Paul Moore, sergeh, David Howells, Kees Cook,
	linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel, Peter Newman, Andy Lutomirski, Will Drewry

On Fri, May 30, 2025 at 03:16:20PM +0200, Günther Noack wrote:
> On Sun, May 18, 2025 at 09:57:32PM +0200, Mickaël Salaün wrote:
> > On Sun, May 18, 2025 at 09:40:05AM +0200, Günther Noack wrote:
> > > On Tue, Mar 11, 2025 at 03:32:53PM +0100, Mickaël Salaün wrote:
> > > > On Mon, Mar 10, 2025 at 02:04:23PM +0100, Günther Noack wrote:
> 
> > > > > Approach 1: Use the creds API thread-by-thread (implemented here)
> > > > > 
> > > > >   * Each task calls prepare_creds() and commit_creds() on its own, in
> > > > >     line with the way the API is designed to be used (from a single
> > > > >     task).
> > > > >   * Task work gets scheduled with a pseudo-signal and the task that
> > > > >     invoked the syscall is waiting for all of them to return.
> > > > >   * Task work can fail at the beginning due to prepare_creds(), in
> > > > >     which case all tasks have to abort_creds(). Additional
> > > > >     synchronization is needed for that.
> > > > > 
> > > > >   Drawback: We need to grab the system-global task lock to prevent new
> > > > >   thread creation and also grab the per-process signal lock to prevent
> > > > >   races with other creds accesses, for the entire time as we wait for
> > > > >   each task to do the task work.
> > > > 
> > > > In other words, this approach blocks all threads from the same process.
> > > 
> > > It does, but that is still an improvement over the current
> > > libpsx-based implementation in userspace.  That existing
> > > implementation does not block, but it is running the risk that
> > > prepare_creds() might fail on one of the threads (e.g. allocation
> > > failure), which would leave the processes' threads in an inconsistent
> > > state.
> > > 
> > > Another upside that the in-kernel implementation has is that the
> > > implementation of that is hidden behind an API, so if we can
> > > eventually find a better approach, we can migrate to it.  It gives us
> > > flexibility.
> > 
> > > I guess a possible variant (approach 1B) would be to do the equivalent
> > > to what userspace does today, and not make all threads wait for the
> > > possible error of prepare_creds() on the other threads.
> > 
> > This 1B variant is not OK because it would remove the guarantee that the
> > whole process is restricted.
> 
> 👍 Agreed.
> 
> 
> > > > > Approach 2: Attempt to do the prepare_creds() step in the calling task.
> > > > > 
> > > > >   * Would use an API similar to what keyctl uses for the
> > > > >     parent-process update.
> > > > >   * This side-steps the credentials update API as it is documented in
> > > > >     Documentation, using the cred_alloc_blank() helper and replicating
> > > > >     some prepare_creds() logic.
> > > > > 
> > > > >   Drawback: This would introduce another use of the cred_alloc_blank()
> > > > >   API (and the cred_transfer LSM hook), which would otherwise be
> > > > >   reasonable to delete if we can remove the keyctl use case.
> > > > >   (https://lore.kernel.org/all/20240805-remove-cred-transfer-v2-0-a2aa1d45e6b8@google.com/)
> > > > 
> > > > cred_alloc_blank() was designed to avoid dealing with -ENOMEM, which is
> > > > a required property for this Landlock TSYNC feature (i.e. atomic and
> > > > consistent synchronization).
> > > 
> > > Remark on the side, I suspect that the error handling in nptl(7)
> > > probably also does not guarantee that, also for setuid(2) and friends.
> > > 
> > > 
> > > > I think it would make sense to replace most of the
> > > > key_change_session_keyring() code with a new cred_transfer() helper that
> > > > will memcpy the old cred to the new, increment the appropriate ref
> > > > counters, and call security_transfer_creds().  We could then use this
> > > > helper in Landlock too.
> > > > 
> > > > To properly handle race conditions with a thread changing its own
> > > > credentials, we would need a new LSM hook called by commit_creds().
> > > > For the Landlock implementation, this hook would check if the process is
> > > > being Landlocked+TSYNC and return -ERESTARTNOINTR if it is the case.
> > > > The newly created task_work would then be free to update each thread's
> > > > credentials while only blocking the calling thread (which is also a
> > > > required feature).
> > > > 
> > > > Alternatively, instead of a new LSM hook, commit_creds() could check
> > > > itself a new group leader's flag set if all the credentials from the
> > > > calling process are being updated, and return -ERESTARTNOINTR in this
> > > > case.
> > > 
> > > commit_creds() is explicitly documented to never return errors.
> > > It returns a 0 integer so that it lends itself for tail calls,
> > > and some of those usages might also rely on it always working.
> > > There are ~15 existing calls where the return value is discarded.
> > 
> > Indeed, commit_creds() should always return 0.  My full proposal does
> > not look safe enough, but the cred_transfer() helper can still be
> > useful.
> > 
> > > 
> > > If commit_creds() returns -ERESTARTNOINTR, I assume that your idea is
> > > that the task_work would retry the prepare-and-commit when
> > > encountering that?
> > > 
> > > We would have to store the fact that the process is being
> > > Landlock+TSYNC'd in a central place (e.g. group leader flag set).
> > > When that is done, don't we need more synchronization mechanisms to
> > > access that (which RCU was meant to avoid)?
> > > 
> > > I am having a hard time wrapping my head around these synchronization
> > > schemes, I feel this is getting too complicated for what it is trying
> > > to do and might become difficult to maintain if we implemented it.
> > 
> > Fair. ERESTARTNOINTR should only be used by a syscall implementation.
> > 
> > > 
> > > > > Approach 3: Store Landlock domains outside of credentials altogether
> > > > > 
> > > > >   * We could also store a task's Landlock domain as a pointer in the
> > > > >     per-task security blob, and refcount these.  We would need to make
> > > > >     sure that they get newly referenced and updated in the same
> > > > >     scenarios as they do within struct cred today.
> > > > >   * We could then guard accesses to a task's Landlock domain with a
> > > > >     more classic locking mechanism.  This would make it possible to
> > > > >     update the Landlock domain of all tasks in a process without
> > > > >     having to go through pseudo-signals.
> > > > > 
> > > > >   Drawbacks:
> > > > >   * Would have to make sure that the Landlock domain the task's LSM
> > > > >     blob behaves exactly the same as before in the struct cred.
> > > > >   * Potentially slower to access Landlock domains that are guarded by
> > > > >     a mutex.
> > > > 
> > > > This would not work because the kernel (including LSM hooks) uses
> > > > credentials to check access.
> > > 
> > > It's unclear to me what you mean by that.
> > > 
> > > Do you mean that it is hard to replicate for Landlock the cases where
> > > the pointer would have to be copied, because the LSM hooks are not
> > > suited for it?
> > 
> > struct cred is used to check if a task subject can access a task object.
> > Landlock's metadata must stay in struct cred to be available when
> > checking access to any kernel object.  The LSM hooks reflect this
> > rationale by only passing struct cred when checking a task (e.g.
> > security_task_kill()'s cred).
> > 
> > seccomp only cares about filtering raw syscalls, and the seccomp filters
> > are just ignored when the kernel (with an LSM or not) checks task's
> > permission to access another task.
> > 
> > The per-task security blob could store some state though, e.g. to
> > identify if a domain needs to be updated, but I don't see a use case
> > here.
> 
> (Side remark on the idea of storing "pending domain updates" in the task blob:
> 
> I have pondered such an idea as well, where we do not store the Landlock domain
> itself in the task blob, but only a "pending" update that we need to do to the
> Landlock domain in creds, and then to apply that opportunistically/lazily as
> part of other Landlock LSM calls.
> 
> I believe in this approach, it becomes hard to control whether that update can
> actually ever get applied.  So to be sure, we would always have to run under the
> assumption that it does not get applied, and then we might as well store the
> Landlock domain directly in the task blob.
> 
> I also don't think this makes sense.)
> 
> 
> > > Here is another possible approach which a colleague suggested in a
> > > discussion:
> > > 
> > > Approach 4: Freeze-and re-enforce the Landlock ruleset
> > > 
> > > Another option would be to have a different user space API for this,
> > > with a flag LANDLOCK_RESTRICT_SELF_ENTER (name TBD) to enter a given
> > > domain.
> > > 
> > > On first usage of landlock_restrict_self() with the flag, the enforced
> > > ruleset would be frozen and linked to the Landlock domain which was
> > > enforced at the end.
> > > 
> > > Subsequent attempts to add rules to the ruleset would fail when the
> > > ruleset is frozen.  The ruleset FD is now representing the created
> > > domain including all its nesting.
> > > 
> > > Subsequent usages of landlock_restrict_self() on a frozen ruleset would:
> > > 
> > > (a) check that the ruleset's domain is a narrower (nested) domain of
> > >     the current thread's domain (so that we retain the property of
> > >     only locking in a task further than it was before).
> > > 
> > > (b) set the task's domain to the domain attached to the ruleset
> > > 
> > > This way, we would keep a per-thread userspace API, avoiding the
> > > issues discussed before.  It would become possible to use ruleset file
> > > descriptors as handles for entering Landlock domains and pass them
> > > around between processes.
> > > 
> > > The only drawback I can see is that it has the same issues as libpsx
> > > and nptl(7) in that the syscall can fail on individual threads due to
> > > ENOMEM.
> > 
> > Right. This approach is interesting, but it does not solve the main
> > issue here.
> 
> It doesn't?
> 
> In my mind, the main goal of the patch set is that we can enable Landlock in
> multithreaded processes like in Go programs or in multithreaded C(++).

Yes, but it looks like replicating a user space hack in the kernel.

> 
> With Approach 4, we would admittedly still have to do some work in userspace,
> and it would not have the nice all-or-nothing semantics, but at least, it would
> be possible to get all threads joining the same Landlock domain.  (And after
> all, setuid(0) also does not have the all-or-nothing semantics, from what I can
> tell.)
> 
> 
> > Anyway, being able to enter a Landlock domain would definitely be
> > useful. I would prefer using a pidfd to refer to a task's Landlock
> > domain, which would avoid race condition and make the API clearer.  It
> > would be nice to be able to pass a pidfd (instead of a ruleset) to
> > landlock_restrict_self().  If we want to directly deal with a domain, we
> > should create a dedicated domain FD type.
> 
> Fair enough, a different FD type for that would also be possible.

For this use case, a pidfd would be appropriate to avoid
inconsistencies.

> 
> 
> > > If we can not find a solution for "TSYNC", it seems that this might be
> > > a viable alternative.  For multithreaded applications enforcing a
> > > Landlock policy, it would become an application of libpsx with the
> > > LANDLOCK_RESTRICT_SELF_ENTER flag.
> > > 
> > > Let me know what you think.
> > > 
> > > –Günther
> > 
> > Thinking more about this feature, it might actually make sense to
> > synchronize all threads from the same process without checking other
> > threads' Landlock domain. The rationale are:
> > 1. Linux threads are not security boundaries and it is allowed for a
> >    thread to control other threads' memory, which means changing their
> >    code flow.  In other words, thread's permissions are the union of all
> >    thread's permissions in the same process.
> > 2. libpsx and libc's set*id() ignore other thread's credentials and just
> >    blindly execute the same code on all threads.
> > 3. It would be simpler and would avoid another error case.
> 
> +1, agreed.  That would let us skip the check for the pre-existing domain on
> these threads.
> 
> 
> > An issue could happen if a Landlock domain restricting a test thread is
> > replaced.
> 
> You mean for Landlock's selftests?  I thought these were running in their own
> forked-off subprocess?  I'm probably misunderstanding you here. :)

I was thinking about potential (far-fetched) issues that
LANDLOCK_RESTRICT_SELF_PROCESS could cause to existing code.  The main
use case for enforcing a Landlock domain on only one thread would be for
testing purposes, but this is indeed not the case for kselftest (except
for explicit pthread tests).  So yeah, not a big deal, but it should be
mentioned in the commit message and the doc that without
LANDLOCK_RESTRICT_SELF_PROCESS, threads can remove restrictions that are
only enforced on their siblings.

> 
> 
> > I don't think the benefit of avoiding this issue is worth it
> > compared to the guarantee we get when forcing the sandboxing of a full
> > process without error.
> > 
> 
> 
> 
> > We should rename the flag to LANDLOCK_RESTRICT_SELF_PROCESS to make it
> > clear what it does.
> > 
> > The remaining issues are still the potential memory allocation failures.
> > There are two things:
> > 
> > 1. We should try as much as possible to limit useless credential
> >    duplications by not creating a new struct cred if parent credentials
> >    are the same.
> > 
> > 2. To avoid the libpsx inconsistency (because of ENOMEM or EPERM),
> >    landlock_restrict_self(2) should handle memory allocation and
> >    transition the process from a known state to another known state.
> > 
> > What about this approach:
> > - "Freeze" all threads of the current process (not ideal but simple) to
> >   make sure their credentials don't get updated.
> > - Create a new blank credential for the calling thread.
> > - Walk through all threads and create a new blank credential for all
> >   threads with a different cred than the caller.
> > - Inject a task work that will call cred_transfer() for all threads with
> >   either the same new credential used by the caller (incrementing the
> >   refcount), or it will populate and use a blank one if it has different
> >   credentials than the caller.
> > 
> > This may not efficiently deduplicate credentials for all threads but it
> > is a simple deduplication approach that should be useful in most cases.
> > 
> > The difficult part is mainly in the "fleezing". It would be nice to
> > change the cred API to avoid that but I'm not sure how.
> 
> I don't see an option how we could freeze the credentials of other threads:
> 
> To freeze a task's credentials, we would have to inhibit that commit_creds()
> succeeds on that task, and I don't see how that would be done - we can not
> prevent these tasks from calling commit_creds() [1], and when commit_creds()
> gets called, it is guaranteed to work.
> 
> So in my mind, we have to somehow deal with the possibility that a task has a
> new and not-previously-seen struct creds, by the time that its task_work gets
> called.  As a consequence, I think a call to prepare_creds() would then be
> unavoidable in the task_work?

OK, we don't need to freeze all threads, just to block thread creation.

What about that:
1. lock thread creation for this process
2. call prepare_creds() for the calling thread (called new_cred)
3. call cred_alloc_blank() for all other threads, store them in a list,
   and exit if ENOMEM
4. asynchronously walk through all threads, and for each:
  a. if its creds are the same (i.e. same pointer) as the calling
     thread's ones, then call get_cred(new_cred) and
     commit_creds(new_cred).
  b. otherwise, take a blank cred, call cred_transfer(), add the
     Landlock domain, and commit_creds() with it.
5. free all unused blank creds (most)
6. call commit_creds(new_cred) and return

Pros:
- do not block threads
- minimize cred duplication
- atomic operation (from the point of view of the caller): all or
  nothing (with an error)
- almost no change to existing cred API

Cons:
- block thread creation
- initially allocate one cred per thread (but free most of them after)
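
A rough sketch of the caller-side flow for the steps above (the step-1 lock
and the step-4 task work are the open questions and only appear as comments;
blank_cred_entry is a made-up bookkeeping struct, and error unwinding is
abbreviated):

struct blank_cred_entry {
  struct list_head node;
  struct cred *blank;
};

static int landlock_restrict_process(struct landlock_ruleset *ruleset)
{
  struct blank_cred_entry *entry, *tmp;
  struct task_struct *thread;
  struct cred *new_cred;
  LIST_HEAD(blanks);
  size_t nr_siblings = 0;
  int err = 0;

  /* 1. block thread creation for this process (mechanism TBD) */

  /* 2. prepare the calling thread's new credentials */
  new_cred = prepare_creds();
  if (!new_cred)
    return -ENOMEM;
  /* ... attach the new Landlock domain (from ruleset) to new_cred ... */

  /* 3. pre-allocate one blank cred per sibling thread */
  rcu_read_lock();
  for_each_thread(current, thread)
    if (thread != current)
      nr_siblings++;
  rcu_read_unlock();
  while (nr_siblings--) {
    entry = kzalloc(sizeof(*entry), GFP_KERNEL);
    if (entry)
      entry->blank = cred_alloc_blank();
    if (!entry || !entry->blank) {
      kfree(entry);
      err = -ENOMEM;
      goto out;
    }
    list_add(&entry->node, &blanks);
  }

  /*
   * 4. queue a task work on each sibling thread: it commits a reference to
   *    new_cred if the thread shares the caller's old creds, or fills and
   *    commits one of the blank creds otherwise.
   */

out:
  /* 5. free the blank creds that ended up unused (most of them) */
  list_for_each_entry_safe(entry, tmp, &blanks, node) {
    list_del(&entry->node);
    put_cred(entry->blank);
    kfree(entry);
  }

  /* 6. commit the caller's own new credentials, or abort on error */
  if (err) {
    abort_creds(new_cred);
    return err;
  }
  return commit_creds(new_cred);
}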

> 
> 
> —Günther
> 
> 
> [1] We might be able to keep cred_prepare() and maybe cred_alloc_blank() from
>     succeeding, but that does not mean that no one can call commit_creds() -
>     there is still the possibility that commit_creds() gets called with a struct
>     cred* that was acquired before decided to freeze.
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-05-30 15:11                   ` Mickaël Salaün
@ 2025-05-30 16:26                     ` Günther Noack
  2025-05-30 16:52                       ` Casey Schaufler
  2025-06-02  6:45                       ` Mickaël Salaün
  0 siblings, 2 replies; 18+ messages in thread
From: Günther Noack @ 2025-05-30 16:26 UTC (permalink / raw)
  To: Mickaël Salaün, Casey Schaufler
  Cc: Günther Noack, Paul Moore, sergeh, David Howells, Kees Cook,
	linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel, Peter Newman, Andy Lutomirski, Will Drewry,
	Jarkko Sakkinen

On Fri, May 30, 2025 at 05:11:34PM +0200, Mickaël Salaün wrote:
> On Fri, May 30, 2025 at 03:16:20PM +0200, Günther Noack wrote:
> > On Sun, May 18, 2025 at 09:57:32PM +0200, Mickaël Salaün wrote:
> > > We should rename the flag to LANDLOCK_RESTRICT_SELF_PROCESS to make it
> > > clear what it does.
> > > 
> > > The remaining issues are still the potential memory allocation failures.
> > > There are two things:
> > > 
> > > 1. We should try as much as possible to limit useless credential
> > >    duplications by not creating a new struct cred if parent credentials
> > >    are the same.
> > > 
> > > 2. To avoid the libpsx inconsistency (because of ENOMEM or EPERM),
> > >    landlock_restrict_self(2) should handle memory allocation and
> > >    transition the process from a known state to another known state.
> > > 
> > > What about this approach:
> > > - "Freeze" all threads of the current process (not ideal but simple) to
> > >   make sure their credentials don't get updated.
> > > - Create a new blank credential for the calling thread.
> > > - Walk through all threads and create a new blank credential for all
> > >   threads with a different cred than the caller.
> > > - Inject a task work that will call cred_transfer() for all threads with
> > >   either the same new credential used by the caller (incrementing the
> > >   refcount), or it will populate and use a blank one if it has different
> > >   credentials than the caller.
> > > 
> > > This may not efficiently deduplicate credentials for all threads but it
> > > is a simple deduplication approach that should be useful in most cases.
> > > 
> > > The difficult part is mainly in the "fleezing". It would be nice to
> > > change the cred API to avoid that but I'm not sure how.
> > 
> > I don't see an option how we could freeze the credentials of other threads:
> > 
> > To freeze a task's credentials, we would have to inhibit that commit_creds()
> > succeeds on that task, and I don't see how that would be done - we can not
> > prevent these tasks from calling commit_creds() [1], and when commit_creds()
> > gets called, it is guaranteed to work.
> > 
> > So in my mind, we have to somehow deal with the possibility that a task has a
> > new and not-previously-seen struct creds, by the time that its task_work gets
> > called.  As a consequence, I think a call to prepare_creds() would then be
> > unavoidable in the task_work?
> 
> OK, we don't need to freeze all threads, just to block thread creation.
> 
> What about that:
> 1. lock thread creation for this process
> 2. call prepare_creds() for the calling thread (called new_cred)
> 3. call cred_alloc_blank() for all other threads, store them in a list,
>    and exit if ENOMEM
> 4. asynchronously walk through all threads, and for each:
>   a. if its creds are the same (i.e. same pointer) as the calling
>      thread's ones, then call get_cred(new_cred) and
>      commit_creds(new_cred).
>   b. otherwise, take a blank cred, call cred_transfer(), add the
>      Landlock domain, and commit_creds() with it.
> 5. free all unused blank creds (most)
> 6. call commit_creds(new_creds) and return
> 
> Pros:
> - do not block threads
> - minimize cred duplication
> - atomic operation (from the point of view of the caller): all or
>   nothing (with an error)
> - almost no change to existing cred API
> 
> Cons:
> - block thread creation
> - initially allocate one cred per thread (but free most of them after)

The proposal is growing on me.

One way to view transfer_creds() and have it nicely fit into the credentials API
would be to view prepare_creds() as a convenience wrapper around
transfer_creds(), so that prepare_creds() is implemented as a function which:

  1) allocates a new struct cred (this may fail)
  2) calls transfer_creds() with the new struct cred to populate

We could then move the bulk of the existing prepare_creds() implementation into
the new transfer_creds(), and could also switch the keyctl implementation over
to use it.
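
As a rough illustration of that shape (transfer_creds() does not exist yet;
this is only the outline of the refactoring, not the full copy that today's
prepare_creds() performs):

/*
 * Proposed helper: copy *old into *new, take the needed references, and
 * call security_transfer_creds().  The intent is that it cannot fail.
 */
void transfer_creds(struct cred *new, const struct cred *old);

struct cred *prepare_creds(void)
{
  struct cred *new = cred_alloc_blank();  /* the only step that can fail */

  if (!new)
    return NULL;
  transfer_creds(new, current_cred());
  return new;
}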

A remaining problem: the underlying assumption is that transfer_creds() must
never fail when it runs in the task work, if we want to avoid the additional
synchronization.  The existing cases in which the credentials preparation
logic can return an error are:

* Allocation failure for struct cred  (we would get rid of this)
* get_ucounts(new->ucounts) returns NULL  (not supposed to happen, AFAICT)
* security_prepare_creds() fails.  Existing LSM implementations are:
  * Tomoyo, SELinux, AppArmor, Landlock: always return 0
  * SMACK: May return -ENOMEM on allocation failure


So summing up, in my understanding, the prerequisites for this approach are that:

  1) get_ucounts() never returns NULL, and

  2) SMACK does not bubble up allocation errors from the cred_prepare hook

     Casey, Jarkko: do you think this is realistic for SMACK to change?

and that

  3) We can block new thread creation

     Mickaël, do you have a specific approach in mind for that?

     As Jann pointed out in [1], the tasklist_lock and siglock are not sleepable
     and can't be used while waiting, which is why he proposed an approach where
     we retry in a loop until no new threads show up any more, while getting the
     existing threads stuck in the task_work as well (where they can't spawn new
     threads).


Thanks,

—Günther

[1] https://lore.kernel.org/all/CAG48ez0pWg3OTABfCKRk5sWrURM-HdJhQMcWedEppc_z1rrVJw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-05-30 16:26                     ` Günther Noack
@ 2025-05-30 16:52                       ` Casey Schaufler
  2025-06-02  6:45                       ` Mickaël Salaün
  1 sibling, 0 replies; 18+ messages in thread
From: Casey Schaufler @ 2025-05-30 16:52 UTC (permalink / raw)
  To: Günther Noack, Mickaël Salaün
  Cc: Günther Noack, Paul Moore, sergeh, David Howells, Kees Cook,
	linux-security-module, Konstantin Meskhidze, Jann Horn,
	linux-kernel, Peter Newman, Andy Lutomirski, Will Drewry,
	Jarkko Sakkinen, Casey Schaufler

On 5/30/2025 9:26 AM, Günther Noack wrote:
> On Fri, May 30, 2025 at 05:11:34PM +0200, Mickaël Salaün wrote:
>> On Fri, May 30, 2025 at 03:16:20PM +0200, Günther Noack wrote:
>>> On Sun, May 18, 2025 at 09:57:32PM +0200, Mickaël Salaün wrote:
>>>> We should rename the flag to LANDLOCK_RESTRICT_SELF_PROCESS to make it
>>>> clear what it does.
>>>>
>>>> The remaining issues are still the potential memory allocation failures.
>>>> There are two things:
>>>>
>>>> 1. We should try as much as possible to limit useless credential
>>>>    duplications by not creating a new struct cred if parent credentials
>>>>    are the same.
>>>>
>>>> 2. To avoid the libpsx inconsistency (because of ENOMEM or EPERM),
>>>>    landlock_restrict_self(2) should handle memory allocation and
>>>>    transition the process from a known state to another known state.
>>>>
>>>> What about this approach:
>>>> - "Freeze" all threads of the current process (not ideal but simple) to
>>>>   make sure their credentials don't get updated.
>>>> - Create a new blank credential for the calling thread.
>>>> - Walk through all threads and create a new blank credential for all
>>>>   threads with a different cred than the caller.
>>>> - Inject a task work that will call cred_transfer() for all threads with
>>>>   either the same new credential used by the caller (incrementing the
>>>>   refcount), or it will populate and use a blank one if it has different
>>>>   credentials than the caller.
>>>>
>>>> This may not efficiently deduplicate credentials for all threads but it
>>>> is a simple deduplication approach that should be useful in most cases.
>>>>
>>>> The difficult part is mainly in the "fleezing". It would be nice to
>>>> change the cred API to avoid that but I'm not sure how.
>>> I don't see an option how we could freeze the credentials of other threads:
>>>
>>> To freeze a task's credentials, we would have to inhibit that commit_creds()
>>> succeeds on that task, and I don't see how that would be done - we can not
>>> prevent these tasks from calling commit_creds() [1], and when commit_creds()
>>> gets called, it is guaranteed to work.
>>>
>>> So in my mind, we have to somehow deal with the possibility that a task has a
>>> new and not-previously-seen struct creds, by the time that its task_work gets
>>> called.  As a consequence, I think a call to prepare_creds() would then be
>>> unavoidable in the task_work?
>> OK, we don't need to freeze all threads, just to block thread creation.
>>
>> What about that:
>> 1. lock thread creation for this process
>> 2. call prepare_creds() for the calling thread (called new_cred)
>> 3. call cred_alloc_blank() for all other threads, store them in a list,
>>    and exit if ENOMEM
>> 4. asynchronously walk through all threads, and for each:
>>   a. if its creds are the same (i.e. same pointer) as the calling
>>      thread's ones, then call get_cred(new_cred) and
>>      commit_creds(new_cred).
>>   b. otherwise, take a blank cred, call cred_transfer(), add the
>>      Landlock domain, and commit_creds() with it.
>> 5. free all unused blank creds (most)
>> 6. call commit_creds(new_creds) and return
>>
>> Pros:
>> - do not block threads
>> - minimize cred duplication
>> - atomic operation (from the point of view of the caller): all or
>>   nothing (with an error)
>> - almost no change to existing cred API
>>
>> Cons:
>> - block thread creation
>> - initially allocate one cred per thread (but free most of them after)
> The proposal is growing on me.
>
> One way to view transfer_creds() and have it nicely fit into the credentials API
> would be to view prepare_creds() as a convenience wrapper around
> transfer_creds(), so that prepare_creds() is implemented as a function which:
>
>   1) allocates a new struct cred (this may fail)
>   2) calls transfer_creds() with the new struct cred to populate
>
> We could then move the bulk of its existing prepare_creds() implementation into
> the new transfer_creds(), and could also move the keyctl implementation to use
> that.
>
> A remaining problem is: The caveat and the underlying assumption is that
> transfer_creds() must never fail when it runs in the task work, if we want to
> avoid the additional synchronization.  The existing cases in which the
> credentials preparation logic can return an error are:
>
> * Allocation failure for struct cred  (we would get rid of this)
> * get_ucounts(new->ucounts) returns NULL  (not supposed to happen, AFAICT)
> * security_prepare_creds() fails.  Existing LSM implementations are:
>   * Tomoyo, SELinux, AppArmor, Landlock: always return 0
>   * SMACK: May return -ENOMEM on allocation failure
>
>
> So summing up, in my understanding, the prerequisites for this approach are that:
>
>   1) get_ucounts() never returns NULL, and
>
>   2) SMACK does not bubble up allocation errors from the cred_prepare hook
>
>      Casey, Jarkko: do you think this is realistic for SMACK to change?

It's possible, but not trivial. The task-based access rule list and relabel
list would need to move from the Smack cred blob to a Smack task blob. There
wasn't a task blob when these lists were introduced, so the cred was the only
option. I would entertain patches.

>
> and that
>
>   3) We can block new thread creation
>
>      Mickaël, do you have a specific approach in mind for that?
>
>      As Jann pointed out in [1], the tasklist_lock and siglock are not sleepable
>      and can't be used while waiting, which is why he proposed an approach where
>      we retry in a loop until no new threads show up any more, while getting the
>      existing threads stuck in the task_work as well (where they can't spawn new
>      threads).
>
>
> Thanks,
>
> —Günther
>
> [1] https://lore.kernel.org/all/CAG48ez0pWg3OTABfCKRk5sWrURM-HdJhQMcWedEppc_z1rrVJw@mail.gmail.com/
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-05-30 16:26                     ` Günther Noack
  2025-05-30 16:52                       ` Casey Schaufler
@ 2025-06-02  6:45                       ` Mickaël Salaün
  2025-06-12 11:51                         ` Günther Noack
  1 sibling, 1 reply; 18+ messages in thread
From: Mickaël Salaün @ 2025-06-02  6:45 UTC (permalink / raw)
  To: Günther Noack
  Cc: Casey Schaufler, Günther Noack, Paul Moore, sergeh,
	David Howells, Kees Cook, linux-security-module,
	Konstantin Meskhidze, Jann Horn, linux-kernel, Peter Newman,
	Andy Lutomirski, Will Drewry, Jarkko Sakkinen

On Fri, May 30, 2025 at 06:26:07PM +0200, Günther Noack wrote:
> On Fri, May 30, 2025 at 05:11:34PM +0200, Mickaël Salaün wrote:
> > On Fri, May 30, 2025 at 03:16:20PM +0200, Günther Noack wrote:
> > > On Sun, May 18, 2025 at 09:57:32PM +0200, Mickaël Salaün wrote:
> > > > We should rename the flag to LANDLOCK_RESTRICT_SELF_PROCESS to make it
> > > > clear what it does.
> > > > 
> > > > The remaining issues are still the potential memory allocation failures.
> > > > There are two things:
> > > > 
> > > > 1. We should try as much as possible to limit useless credential
> > > >    duplications by not creating a new struct cred if parent credentials
> > > >    are the same.
> > > > 
> > > > 2. To avoid the libpsx inconsistency (because of ENOMEM or EPERM),
> > > >    landlock_restrict_self(2) should handle memory allocation and
> > > >    transition the process from a known state to another known state.
> > > > 
> > > > What about this approach:
> > > > - "Freeze" all threads of the current process (not ideal but simple) to
> > > >   make sure their credentials don't get updated.
> > > > - Create a new blank credential for the calling thread.
> > > > - Walk through all threads and create a new blank credential for all
> > > >   threads with a different cred than the caller.
> > > > - Inject a task work that will call cred_transfer() for all threads with
> > > >   either the same new credential used by the caller (incrementing the
> > > >   refcount), or it will populate and use a blank one if it has different
> > > >   credentials than the caller.
> > > > 
> > > > This may not efficiently deduplicate credentials for all threads but it
> > > > is a simple deduplication approach that should be useful in most cases.
> > > > 
> > > > The difficult part is mainly in the "fleezing". It would be nice to
> > > > change the cred API to avoid that but I'm not sure how.
> > > 
> > > I don't see an option how we could freeze the credentials of other threads:
> > > 
> > > To freeze a task's credentials, we would have to inhibit that commit_creds()
> > > succeeds on that task, and I don't see how that would be done - we can not
> > > prevent these tasks from calling commit_creds() [1], and when commit_creds()
> > > gets called, it is guaranteed to work.
> > > 
> > > So in my mind, we have to somehow deal with the possibility that a task has a
> > > new and not-previously-seen struct creds, by the time that its task_work gets
> > > called.  As a consequence, I think a call to prepare_creds() would then be
> > > unavoidable in the task_work?
> > 
> > OK, we don't need to freeze all threads, just to block thread creation.
> > 
> > What about that:
> > 1. lock thread creation for this process
> > 2. call prepare_creds() for the calling thread (called new_cred)
> > 3. call cred_alloc_blank() for all other threads, store them in a list,
> >    and exit if ENOMEM
> > 4. asynchronously walk through all threads, and for each:
> >   a. if its creds are the same (i.e. same pointer) as the calling
> >      thread's ones, then call get_cred(new_cred) and
> >      commit_creds(new_cred).
> >   b. otherwise, take a blank cred, call cred_transfer(), add the
> >      Landlock domain, and commit_creds() with it.
> > 5. free all unused blank creds (most)
> > 6. call commit_creds(new_creds) and return
> > 
> > Pros:
> > - do not block threads
> > - minimize cred duplication
> > - atomic operation (from the point of view of the caller): all or
> >   nothing (with an error)
> > - almost no change to existing cred API
> > 
> > Cons:
> > - block thread creation
> > - initially allocate one cred per thread (but free most of them after)
> 
> The proposal is growing on me.
> 
> One way to view transfer_creds() and have it nicely fit into the credentials API
> would be to view prepare_creds() as a convenience wrapper around
> transfer_creds(), so that prepare_creds() is implemented as a function which:
> 
>   1) allocates a new struct cred (this may fail)
>   2) calls transfer_creds() with the new struct cred to populate
> 
> We could then move the bulk of its existing prepare_creds() implementation into
> the new transfer_creds(), and could also move the keyctl implementation to use
> that.

Yes

> 
> A remaining problem is: The caveat and the underlying assumption is that
> transfer_creds() must never fail when it runs in the task work, if we want to
> avoid the additional synchronization.  The existing cases in which the
> credentials preparation logic can return an error are:
> 
> * Allocation failure for struct cred  (we would get rid of this)
> * get_ucounts(new->ucounts) returns NULL  (not supposed to happen, AFAICT)
> * security_prepare_creds() fails.  Existing LSM implementations are:
>   * Tomoyo, SELinux, AppArmor, Landlock: always return 0
>   * SMACK: May return -ENOMEM on allocation failure
> 
> 
> So summing up, in my understanding, the prerequisites for this approach are that:
> 
>   1) get_ucounts() never returns NULL, and
> 
>   2) SMACK does not bubble up allocation errors from the cred_prepare hook
> 
>      Casey, Jarkko: do you think this is realistic for SMACK to change?

There is no issue using cred_alloc_blank() and then transfer_creds().
The allocation is done up front and transfer_creds() cannot fail.

> 
> and that
> 
>   3) We can block new thread creation
> 
>      Mickaël, do you have a specific approach in mind for that?

I was thinking of doing the same as seccomp: take siglock and
check/synchronize credentials in a new LSM hook called just after
copy_seccomp() (while siglock is held).  However, as Jann pointed out,
that would not be enough, because this new LSM hook would not be able
to allocate new credentials, and siglock is only taken after
copy_creds().

> 
>      As Jann pointed out in [1], the tasklist_lock and siglock are not sleepable
>      locks and cannot be held while waiting, which is why he proposed an approach
>      where we retry in a loop until no new threads show up anymore, while also
>      getting the existing threads stuck in the task_work (where they cannot spawn
>      new threads).

This looks good.  Too bad we need to block all threads.
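
To make sure we are talking about the same thing, here is how I understand
Jann's retry loop.  This is a sketch only: marked(), mark() and work_for()
are hypothetical bookkeeping helpers, and error handling is omitted:

  static void queue_until_stable(void)
  {
          struct task_struct *t;
          bool added;

          do {
                  added = false;
                  rcu_read_lock();
                  for_each_thread(current, t) {
                          if (t == current || marked(t))
                                  continue;
                          /* task_work_add() only fails for exiting threads */
                          if (task_work_add(t, work_for(t), TWA_SIGNAL))
                                  continue;
                          mark(t);
                          added = true;
                  }
                  rcu_read_unlock();
          } while (added);
  }

The loop terminates because each pass either marks at least one new thread or
finds nothing left to mark, and marked threads end up parked in the task_work,
where they cannot clone new threads.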

> 
> 
> Thanks,
> 
> —Günther
> 
> [1] https://lore.kernel.org/all/CAG48ez0pWg3OTABfCKRk5sWrURM-HdJhQMcWedEppc_z1rrVJw@mail.gmail.com/
> 


* Re: [RFC 1/2] landlock: Multithreading support for landlock_restrict_self()
  2025-06-02  6:45                       ` Mickaël Salaün
@ 2025-06-12 11:51                         ` Günther Noack
  0 siblings, 0 replies; 18+ messages in thread
From: Günther Noack @ 2025-06-12 11:51 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Casey Schaufler, Günther Noack, Paul Moore, sergeh,
	David Howells, Kees Cook, linux-security-module,
	Konstantin Meskhidze, Jann Horn, linux-kernel, Peter Newman,
	Andy Lutomirski, Will Drewry, Jarkko Sakkinen

On Mon, Jun 02, 2025 at 08:45:06AM +0200, Mickaël Salaün wrote:
> On Fri, May 30, 2025 at 06:26:07PM +0200, Günther Noack wrote:
> >      As Jann pointed out in [1], the tasklist_lock and siglock are not sleepable
> >      locks and cannot be held while waiting, which is why he proposed an approach
> >      where we retry in a loop until no new threads show up anymore, while also
> >      getting the existing threads stuck in the task_work (where they cannot spawn
> >      new threads).
> 
> This looks good.  Too bad we need to block all threads.

OK, I'll take that route then.

In my understanding, if we are already blocking all threads, we might as well
use prepare_creds() in those threads again; it does not cost us much more to
collect the potential errors now.  Does that sound reasonable?
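
Concretely, I picture the per-thread task_work roughly like this.  Only a
sketch: tsync_thread, tsync_ctx and wait_for_decision() are made-up names,
and the barrier/phase logic from the RFC is left out:

  static void tsync_task_work(struct callback_head *cb)
  {
          struct tsync_thread *tt = container_of(cb, struct tsync_thread, cb);
          struct tsync_ctx *ctx = tt->ctx;
          struct cred *new;

          new = prepare_creds();          /* may fail; record the error */
          if (!new)
                  atomic_cmpxchg(&ctx->err, 0, -ENOMEM);

          /* Rendezvous with the other threads; the caller decides to go
           * ahead only if no thread reported an error (barrier omitted). */
          wait_for_decision(ctx);

          if (new && !atomic_read(&ctx->err)) {
                  /* ... install the Landlock domain into "new" ... */
                  commit_creds(new);      /* consumes the reference */
          } else if (new) {
                  abort_creds(new);
          }
  }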

—Günther


Thread overview: 18+ messages
2025-02-21 18:44 [RFC 0/2] landlock: Multithreaded policy enforcement Günther Noack
2025-02-21 18:44 ` [RFC 1/2] landlock: Multithreading support for landlock_restrict_self() Günther Noack
2025-02-27 20:53   ` Mickaël Salaün
2025-02-28 17:33     ` Günther Noack
2025-03-04 20:25       ` Mickaël Salaün
2025-03-10 13:04         ` Günther Noack
2025-03-11 14:32           ` Mickaël Salaün
2025-05-18  7:40             ` Günther Noack
2025-05-18 19:57               ` Mickaël Salaün
2025-05-30 13:16                 ` Günther Noack
2025-05-30 15:11                   ` Mickaël Salaün
2025-05-30 16:26                     ` Günther Noack
2025-05-30 16:52                       ` Casey Schaufler
2025-06-02  6:45                       ` Mickaël Salaün
2025-06-12 11:51                         ` Günther Noack
2025-05-27 14:26           ` Jann Horn
2025-02-21 18:44 ` [RFC 2/2] landlock: selftests for LANDLOCK_RESTRICT_SELF_TSYNC Günther Noack
2025-02-27 20:52 ` [RFC 0/2] landlock: Multithreaded policy enforcement Mickaël Salaün
