* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
From: Mickaël Salaün @ 2025-10-17 15:04 UTC (permalink / raw)
To: Günther Noack, Jann Horn
Cc: Konstantin Meskhidze, Tingmao Wang, Paul Moore,
linux-security-module
This patch series is complex, but it looks very good, thanks for all the
explanations! I don't have any major concerns, just a few comments.
I'm wondering if this could be simplified but, except for the flat array, I
couldn't find a better approach.
Jann, could you please take a look?
On Wed, Oct 01, 2025 at 01:18:06PM +0200, Günther Noack wrote:
> Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> given Landlock ruleset is applied to all threads of the calling
> process, instead of only the current one.
>
> Without this flag, multithreaded userspace programs currently resort
> to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> which is also used by libcap and for setuid(2). Using this scheme,
> the threads of a process enforce the same Landlock ruleset, but the
> resulting Landlock domains are still separate, which makes a
> difference for Landlock's "scoped" access rights, where the domain
> identity and nesting is used. As a result, when using
> LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> working. This is a problem for programming languages and frameworks
> which are inherently multithreaded (e.g. Go).
Having different Landlock domains also means that we get different
domain IDs, and the audit logs might confuse users.
>
> Cc: Mickaël Salaün <mic@digikod.net>
> Cc: Paul Moore <paul@paul-moore.com>
> Cc: linux-security-module@vger.kernel.org
> Suggested-by: Jann Horn <jannh@google.com>
> Signed-off-by: Günther Noack <gnoack@google.com>
> ---
> include/uapi/linux/landlock.h | 4 +
> security/landlock/cred.h | 12 +
> security/landlock/limits.h | 2 +-
> security/landlock/syscalls.c | 433 +++++++++++++++++++++++++++++++++-
You can move most of this code to a new tsync.c file.
> 4 files changed, 448 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> index f030adc462ee..7c6c7f004a41 100644
> --- a/include/uapi/linux/landlock.h
> +++ b/include/uapi/linux/landlock.h
> @@ -117,11 +117,15 @@ struct landlock_ruleset_attr {
> * future nested domains, not the one being created. It can also be used
> * with a @ruleset_fd value of -1 to mute subdomain logs without creating a
> * domain.
> + *
> + * %LANDLOCK_RESTRICT_SELF_TSYNC
> + * Apply the given ruleset atomically to all threads of the current process.
Please bump the Landlock ABI version and update the doc.
> */
> /* clang-format off */
> #define LANDLOCK_RESTRICT_SELF_LOG_SAME_EXEC_OFF (1U << 0)
> #define LANDLOCK_RESTRICT_SELF_LOG_NEW_EXEC_ON (1U << 1)
> #define LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF (1U << 2)
> +#define LANDLOCK_RESTRICT_SELF_TSYNC (1U << 3)
> /* clang-format on */
>
> /**
> diff --git a/security/landlock/cred.h b/security/landlock/cred.h
> index c82fe63ec598..eb28eeade760 100644
> --- a/security/landlock/cred.h
> +++ b/security/landlock/cred.h
> @@ -65,6 +65,18 @@ landlock_cred(const struct cred *cred)
> return cred->security + landlock_blob_sizes.lbs_cred;
> }
>
> +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> + const struct landlock_cred_security *src)
> +{
You can simplify by removing the domain checks which are already done by
landlock_put/get_ruleset().
> + if (dst->domain)
> + landlock_put_ruleset(dst->domain);
> +
> + *dst = *src;
> +
> + if (dst->domain)
> + landlock_get_ruleset(src->domain);
> +}
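I.e. something like this sketch (untested), relying on both helpers
tolerating a NULL domain as noted above:

	static inline void
	landlock_cred_copy(struct landlock_cred_security *dst,
			   const struct landlock_cred_security *src)
	{
		/* Both helpers ignore a NULL domain. */
		landlock_put_ruleset(dst->domain);
		*dst = *src;
		landlock_get_ruleset(dst->domain);
	}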
> +
> static inline struct landlock_ruleset *landlock_get_current_domain(void)
> {
> return landlock_cred(current_cred())->domain;
> diff --git a/security/landlock/limits.h b/security/landlock/limits.h
> index 65b5ff051674..eb584f47288d 100644
> --- a/security/landlock/limits.h
> +++ b/security/landlock/limits.h
> @@ -31,7 +31,7 @@
> #define LANDLOCK_MASK_SCOPE ((LANDLOCK_LAST_SCOPE << 1) - 1)
> #define LANDLOCK_NUM_SCOPE __const_hweight64(LANDLOCK_MASK_SCOPE)
>
> -#define LANDLOCK_LAST_RESTRICT_SELF LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF
> +#define LANDLOCK_LAST_RESTRICT_SELF LANDLOCK_RESTRICT_SELF_TSYNC
> #define LANDLOCK_MASK_RESTRICT_SELF ((LANDLOCK_LAST_RESTRICT_SELF << 1) - 1)
>
> /* clang-format on */
> diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> index 0116e9f93ffe..5ba14d641c11 100644
> --- a/security/landlock/syscalls.c
> +++ b/security/landlock/syscalls.c
> @@ -14,6 +14,7 @@
> #include <linux/capability.h>
> #include <linux/cleanup.h>
> #include <linux/compiler_types.h>
> +#include <linux/completion.h>
> #include <linux/dcache.h>
> #include <linux/err.h>
> #include <linux/errno.h>
> @@ -25,6 +26,7 @@
> #include <linux/security.h>
> #include <linux/stddef.h>
> #include <linux/syscalls.h>
> +#include <linux/task_work.h>
> #include <linux/types.h>
> #include <linux/uaccess.h>
> #include <uapi/linux/landlock.h>
> @@ -445,6 +447,416 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
>
> /* Enforcement */
>
> +/*
> + * Shared state between multiple threads which are enforcing Landlock rulesets
> + * in lockstep with each other.
> + */
> +struct tsync_shared_context {
> + /* The old and tentative new creds of the calling thread. */
> + const struct cred *old_cred;
> + const struct cred *new_cred;
> +
> + /* An error encountered in preparation step, or 0. */
> + atomic_t preparation_error;
> +
> + /*
> + * Barrier after preparation step in the inner loop.
> + * The calling thread waits for completion.
> + *
> + * Re-initialized on every round of looking for newly spawned threads.
> + */
> + atomic_t num_preparing;
> + struct completion all_prepared;
> +
> + /* Sibling threads wait for completion. */
> + struct completion ready_to_commit;
> +
> + /*
> + * Barrier after commit step (used by syscall impl to wait for
> + * completion).
> + */
> + atomic_t num_unfinished;
> + struct completion all_finished;
> +};
> +
> +struct tsync_work {
> + struct callback_head work;
> + struct task_struct *task;
> + struct tsync_shared_context *shared_ctx;
> +};
> +
> +/*
> + * restrict_one_thread - update a thread's Landlock domain in lockstep with the
> + * other threads in the same process
> + *
> + * When this is run, the same function gets run in all other threads in the same
> + * process (except for the calling thread which called landlock_restrict_self).
> + * The concurrently running invocations of restrict_one_thread coordinate
> + * through the shared ctx object to do their work in lockstep to implement
> + * all-or-nothing semantics for enforcing the new Landlock domain.
> + *
> + * Afterwards, depending on the presence of an error, all threads either commit
> + * or abort the prepared credentials. The commit operation can not fail any more.
> + */
> +static void restrict_one_thread(struct tsync_shared_context *ctx)
> +{
> + int res;
For consistency with existing code, please use "err" instead of "res".
> + struct cred *cred = NULL;
> + const struct cred *current_cred = current_cred();
No need for this variable.
> +
> + if (current_cred == ctx->old_cred) {
> + /*
> + * As a shortcut, switch out old_cred with new_cred, if
> + * possible.
> + *
> + * Note: We are intentionally dropping the const qualifier here,
> + * because it is required by commit_creds() and abort_creds().
> + */
> + cred = (struct cred *)get_cred(ctx->new_cred);
Good. You can extend the comment to explain that this optimization
avoids creating new credentials in most cases and thus saves memory.
> + } else {
> + /* Else, prepare new creds and populate them. */
> + cred = prepare_creds();
> +
> + if (!cred) {
> + atomic_set(&ctx->preparation_error, -ENOMEM);
> +
> + /*
> + * Even on error, we need to adhere to the protocol and
> + * coordinate with concurrently running invocations.
> + */
> + if (atomic_dec_return(&ctx->num_preparing) == 0)
> + complete_all(&ctx->all_prepared);
> +
> + goto out;
> + }
> +
> + landlock_cred_copy(landlock_cred(cred),
> + landlock_cred(ctx->new_cred));
> + }
> +
> + /*
> + * Barrier: Wait until all threads are done preparing.
> + * After this point, we can have no more failures.
> + */
> + if (atomic_dec_return(&ctx->num_preparing) == 0)
> + complete_all(&ctx->all_prepared);
> +
> + /*
> + * Wait for signal from calling thread that it's safe to read the
> + * preparation error now and we are ready to commit (or abort).
> + */
> + wait_for_completion(&ctx->ready_to_commit);
> +
> + /* Abort the commit if any of the other threads had an error. */
> + res = atomic_read(&ctx->preparation_error);
> + if (res) {
> + abort_creds(cred);
> + goto out;
> + }
> +
> + /* If needed, establish enforcement prerequisites. */
> + if (!ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> + task_set_no_new_privs(current);
We should always set PR_SET_NO_NEW_PRIVS if it is set on the calling
thread, as seccomp does. We should just store the result of
task_no_new_privs() in tsync_shared_context and use it as the condition here.
This should be explained in the documentation.
This also means that if the calling thread has CAP_SYS_ADMIN but not
PR_SET_NO_NEW_PRIVS, then a sibling thread could end up with neither
CAP_SYS_ADMIN nor PR_SET_NO_NEW_PRIVS. This would be a risky state, but
mainly because of the CAP_SYS_ADMIN inconsistency, not really the missing
PR_SET_NO_NEW_PRIVS.
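Roughly, as a sketch (the field name here is just a placeholder):

	/* New member in struct tsync_shared_context: */
	bool no_new_privs;

	/* In restrict_sibling_threads(), before scheduling the task works: */
	shared_ctx.no_new_privs = task_no_new_privs(current);

	/* In restrict_one_thread(), instead of the ns_capable check: */
	if (ctx->no_new_privs)
		task_set_no_new_privs(current);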
> +
> + commit_creds(cred);
> +
> +out:
> + /* Notify the calling thread once all threads are done */
> + if (atomic_dec_return(&ctx->num_unfinished) == 0)
> + complete_all(&ctx->all_finished);
> +}
> +
> +/*
> + * restrict_one_thread_callback - task_work callback for restricting a thread
> + *
> + * Calls restrict_one_thread with the struct landlock_shared_tsync_context.
> + */
> +static void restrict_one_thread_callback(struct callback_head *work)
> +{
> + struct tsync_work *ctx = container_of(work, struct tsync_work, work);
> +
> + restrict_one_thread(ctx->shared_ctx);
> +}
> +
> +/*
> + * struct tsync_works - a growable array of per-task contexts
> + *
> + * The zero-initialized struct represents the empty array.
> + */
> +struct tsync_works {
> + struct tsync_work **works;
> + size_t size;
> + size_t capacity;
> +};
> +
> +/*
> + * tsync_works_provide - provides a preallocated tsync_work for the given task
> + *
> + * This also stores a task pointer in the context and increments the reference
> + * count of the task.
> + *
> + * Returns:
> + * A pointer to the preallocated context struct, with task filled in.
> + *
> + * NULL, if we ran out of preallocated context structs.
> + */
> +static struct tsync_work *tsync_works_provide(struct tsync_works *s,
> + struct task_struct *task)
> +{
> + struct tsync_work *ctx;
> +
> + if (s->size >= s->capacity)
In which case can this happen? Should we wrap this in a WARN_ON_ONCE()?
> + return NULL;
> +
> + ctx = s->works[s->size];
> + s->size++;
> +
> + ctx->task = get_task_struct(task);
> + return ctx;
> +}
> +
> +/*
> + * tsync_works_grow_by - preallocates space for n more contexts in s
> + *
> + * Returns:
> + * -ENOMEM if the (re)allocation fails
> + * 0 if the allocation succeeds, partially succeeds, or no reallocation was needed
> + */
> +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> +{
> + int i;
size_t i;
> + size_t new_capacity = s->capacity + n;
> + struct tsync_work **works;
> +
> + if (new_capacity <= s->capacity)
This integer overflow check should return an error instead.
> + return 0;
> +
> + works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> + flags);
> + if (IS_ERR(works))
> + return PTR_ERR(works);
> +
> + s->works = works;
> +
> + for (i = s->capacity; i < new_capacity; i++) {
> + s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
We should use a local variable to avoid storing an error code in
s->works[i] and potentially dereferencing it later (e.g. in
tsync_works_free()).
Why can't we avoid this loop entirely and allocate a flat array with
only one call to krealloc_array()? Why does struct tsync_works->works need
to be a pointer to a pointer?
> + if (IS_ERR(s->works[i])) {
> + /*
> + * Leave the object in a consistent state,
> + * but return an error.
> + */
> + s->capacity = i;
> + return PTR_ERR(s->works[i]);
> + }
> + }
> + s->capacity = new_capacity;
> + return 0;
> +}
> +
> +/*
> + * tsync_works_contains - checks for presence of task in s
> + */
> +static bool tsync_works_contains_task(struct tsync_works *s,
> + struct task_struct *task)
> +{
> + size_t i;
> +
> + for (i = 0; i < s->size; i++)
> + if (s->works[i]->task == task)
> + return true;
> + return false;
> +}
> +
> +/*
> + * tsync_works_free - free memory held by s and drop all task references
> + */
> +static void tsync_works_free(struct tsync_works *s)
tsync_works_put() would be more appropriate since this function doesn't
free s.
> +{
> + int i;
> +
> + for (i = 0; i < s->size; i++)
> + put_task_struct(s->works[i]->task);
> + for (i = 0; i < s->capacity; i++)
> + kfree(s->works[i]);
> + kfree(s->works);
> + s->works = NULL;
> + s->size = 0;
> + s->capacity = 0;
> +}
> +
> +/*
> + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> + */
> +static int restrict_sibling_threads(const struct cred *old_cred,
> + const struct cred *new_cred)
> +{
> + int res;
> + struct task_struct *thread, *caller;
> + struct tsync_shared_context shared_ctx;
> + struct tsync_works works = {};
> + size_t newly_discovered_threads;
> + bool found_more_threads;
> + struct tsync_work *ctx;
> +
> + atomic_set(&shared_ctx.preparation_error, 0);
> + init_completion(&shared_ctx.all_prepared);
> + init_completion(&shared_ctx.ready_to_commit);
> + atomic_set(&shared_ctx.num_unfinished, 0);
> + init_completion(&shared_ctx.all_finished);
> + shared_ctx.old_cred = old_cred;
> + shared_ctx.new_cred = new_cred;
> +
> + caller = current;
> +
> + /*
> + * We schedule a pseudo-signal task_work for each of the calling task's
> + * sibling threads. In the task work, each thread:
> + *
> + * 1) runs prepare_creds() and writes back the error to
> + * shared_ctx.preparation_error, if needed.
> + *
> + * 2) signals that it's done with prepare_creds() to the calling task.
> + * (completion "all_prepared").
> + *
> + * 3) waits for the completion "ready_to_commit". This is sent by the
> + * calling task after ensuring that all sibling threads have done
> + * with the "preparation" stage.
> + *
> + * After this barrier is reached, it's safe to read
> + * shared_ctx.preparation_error.
> + *
> + * 4) reads shared_ctx.preparation_error and then either does
> + * commit_creds() or abort_creds().
> + *
> + * 5) signals that it's done altogether (barrier synchronization
> + * "all_finished")
> + */
> + do {
> + found_more_threads = false;
> +
> + /*
> + * The "all_prepared" barrier is used locally to the inner loop,
> + * this use of for_each_thread(). We can reset it on each loop
> + * iteration because all previous loop iterations are done with
> + * it already.
> + *
> + * num_preparing is initialized to 1 so that the counter can not
> + * go to 0 and mark the completion as done before all task works
> + * are registered. (We decrement it at the end of this loop.)
> + */
> + atomic_set(&shared_ctx.num_preparing, 1);
> + reinit_completion(&shared_ctx.all_prepared);
> +
> + /* In RCU read-lock, count the threads we need. */
> + newly_discovered_threads = 0;
> + rcu_read_lock();
> + for_each_thread(caller, thread) {
> + /* Skip current, since it is initiating the sync. */
> + if (thread == caller)
> + continue;
> +
> + /* Skip exited threads. */
> + if (thread->flags & PF_EXITING)
> + continue;
> +
> + /* Skip threads that we have already seen. */
> + if (tsync_works_contains_task(&works, thread))
> + continue;
> +
> + newly_discovered_threads++;
> + }
> + rcu_read_unlock();
> +
> + if (newly_discovered_threads == 0)
> + break; /* done */
> +
> + res = tsync_works_grow_by(&works, newly_discovered_threads,
> + GFP_KERNEL_ACCOUNT);
> + if (res) {
> + atomic_set(&shared_ctx.preparation_error, res);
> + break;
> + }
> +
> + rcu_read_lock();
> + for_each_thread(caller, thread) {
> + /* Skip current, since it is initiating the sync. */
> + if (thread == caller)
> + continue;
> +
> + /* Skip exited threads. */
> + if (thread->flags & PF_EXITING)
> + continue;
> +
> + /* Skip threads that we already looked at. */
> + if (tsync_works_contains_task(&works, thread))
> + continue;
> +
> + /*
> + * We found a sibling thread that is not doing its
> + * task_work yet, and which might spawn new threads
> + * before our task work runs, so we need at least one
> + * more round in the outer loop.
> + */
> + found_more_threads = true;
> +
> + ctx = tsync_works_provide(&works, thread);
> + if (!ctx) {
> + /*
> + * We ran out of preallocated contexts -- we
> + * need to try again with this thread at a later
> + * time! found_more_threads is already true
> + * at this point.
> + */
> + break;
> + }
> +
> + ctx->shared_ctx = &shared_ctx;
> +
> + atomic_inc(&shared_ctx.num_preparing);
> + atomic_inc(&shared_ctx.num_unfinished);
> +
> + init_task_work(&ctx->work,
> + restrict_one_thread_callback);
> + res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> + if (res) {
> + /*
> + * Remove the task from ctx so that we will
> + * revisit the task at a later stage, if it
> + * still exists.
> + */
> + put_task_struct_rcu_user(ctx->task);
> + ctx->task = NULL;
> +
> + atomic_set(&shared_ctx.preparation_error, res);
> + atomic_dec(&shared_ctx.num_preparing);
> + atomic_dec(&shared_ctx.num_unfinished);
> + }
> + }
> + rcu_read_unlock();
> +
> + /*
> + * Decrement num_preparing for current, to undo that we
> + * initialized it to 1 at the beginning of the inner loop.
> + */
> + if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> + wait_for_completion(&shared_ctx.all_prepared);
> + } while (found_more_threads &&
> + !atomic_read(&shared_ctx.preparation_error));
> +
> + /*
> + * We now have all sibling threads blocking and in "prepared" state in
> + * the task work. Ask all threads to commit.
> + */
> + complete_all(&shared_ctx.ready_to_commit);
> +
> + if (works.size)
> + wait_for_completion(&shared_ctx.all_finished);
> +
> + tsync_works_free(&works);
> +
> + return atomic_read(&shared_ctx.preparation_error);
> +}
> +
> /**
> * sys_landlock_restrict_self - Enforce a ruleset on the calling thread
> *
> @@ -454,12 +866,20 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
> * - %LANDLOCK_RESTRICT_SELF_LOG_SAME_EXEC_OFF
> * - %LANDLOCK_RESTRICT_SELF_LOG_NEW_EXEC_ON
> * - %LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF
> + * - %LANDLOCK_RESTRICT_SELF_TSYNC
> *
> - * This system call enables to enforce a Landlock ruleset on the current
> - * thread. Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> + * This system call enforces a Landlock ruleset on the current thread.
> + * Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> * namespace or is running with no_new_privs. This avoids scenarios where
> * unprivileged tasks can affect the behavior of privileged children.
> *
> + * If %LANDLOCK_RESTRICT_SELF_TSYNC is specified in @flags, all other threads of
> + * the process will be brought into the exact same Landlock configuration as the
> + * calling thread. This includes both the enforced ruleset and logging
> + * configuration, and happens irrespective of previously established rulesets
> + * and logging configurations on these threads. If required, this operation
> + * also enables the no_new_privs flag for these threads.
> + *
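For context, typical userspace usage of this flag would look roughly like
this (illustrative sketch only, not part of the patch; the function name is
made up, and the flag requires a kernel with this series applied):

	#include <linux/landlock.h>
	#include <sys/prctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int restrict_all_threads(const int ruleset_fd)
	{
		/* Required unless the task has CAP_SYS_ADMIN in its namespace. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
			return -1;

		/* Enforce the ruleset on all threads of the calling process. */
		return syscall(__NR_landlock_restrict_self, ruleset_fd,
			       LANDLOCK_RESTRICT_SELF_TSYNC);
	}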
> * Possible returned errors are:
> *
> * - %EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time;
> @@ -484,6 +904,7 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> struct landlock_cred_security *new_llcred;
> bool __maybe_unused log_same_exec, log_new_exec, log_subdomains,
> prev_log_subdomains;
> + int res;
>
> if (!is_initialized())
> return -EOPNOTSUPP;
> @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> #endif /* CONFIG_AUDIT */
>
> + if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> + res = restrict_sibling_threads(current_cred(), new_cred);
> + if (res != 0) {
if (err) {
> + abort_creds(new_cred);
> + return res;
> + }
> + }
> +
> return commit_creds(new_cred);
> }
> --
> 2.51.0.618.g983fd99d29-goog
>
>
* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
From: Jann Horn @ 2025-10-24 21:18 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Günther Noack, Konstantin Meskhidze, Tingmao Wang,
Paul Moore, linux-security-module
On Fri, Oct 17, 2025 at 5:04 PM Mickaël Salaün <mic@digikod.net> wrote:
> On Wed, Oct 01, 2025 at 01:18:06PM +0200, Günther Noack wrote:
> > + /* If needed, establish enforcement prerequisites. */
> > + if (!ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> > + task_set_no_new_privs(current);
>
> We should always set PR_SET_NO_NEW_PRIVS if it is set on the calling
> thread as done by seccomp. We should just store the result of
> task_no_new_privs() in tsync_shared_context and use it as condition here.
> This should be explained in the documentation.
>
> This also mean that if the calling thread has CAP_SYS_ADMIN but not
> PR_SET_NO_NEW_PRIVS, then a sibling thread could not have CAP_SYS_ADMIN
> nor PR_SET_NO_NEW_PRIVS. This would be a risky state but mainly because
> of the CAP_SYS_ADMIN inconsistency, not really the missing
> PR_SET_NO_NEW_PRIVS.
Agreed, it would be nice to have behavior that is consistent with seccomp.
[...]
> > +/*
> > + * tsync_works_provide - provides a preallocated tsync_work for the given task
> > + *
> > + * This also stores a task pointer in the context and increments the reference
> > + * count of the task.
> > + *
> > + * Returns:
> > + * A pointer to the preallocated context struct, with task filled in.
> > + *
> > + * NULL, if we ran out of preallocated context structs.
> > + */
> > +static struct tsync_work *tsync_works_provide(struct tsync_works *s,
> > + struct task_struct *task)
> > +{
> > + struct tsync_work *ctx;
> > +
> > + if (s->size >= s->capacity)
>
> In which case can this happen? Should we wrap this in a WARN_ON_ONCE()?
No, this can legitimately happen if new sibling threads are created
between the time we pre-allocate memory and the time we loop over them
to call tsync_works_provide().
[...]
> > + return 0;
> > +
> > + works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> > + flags);
> > + if (IS_ERR(works))
> > + return PTR_ERR(works);
> > +
> > + s->works = works;
> > +
> > + for (i = s->capacity; i < new_capacity; i++) {
> > + s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
>
> We should use a local variable to avoid storing an error code in
> s->works[i] and potentially dereferencing it later (e.g. in
> tsync_work_free).
>
> Why can't we avoid this loop entirely and allocate a flat array with
> only one call to krealloc_array()? Why struct tsync_works->works needs
> to be a pointer to a pointer?
Pointers to some "struct tsync_work" items might already be in use as
task work or such, so we can't move them to a different address anymore
at this point.
* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
From: Günther Noack @ 2025-11-27 9:32 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Jann Horn, Konstantin Meskhidze, Tingmao Wang, Paul Moore,
linux-security-module
Hello!
Thank you for the review! It has taken a while, but now V3 is almost
ready and I have the answers to these questions.
On Fri, Oct 17, 2025 at 05:04:23PM +0200, Mickaël Salaün wrote:
> On Wed, Oct 01, 2025 at 01:18:06PM +0200, Günther Noack wrote:
> > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> > given Landlock ruleset is applied to all threads of the calling
> > process, instead of only the current one.
> >
> > Without this flag, multithreaded userspace programs currently resort
> > to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> > which is also used by libcap and for setuid(2). Using this scheme,
> > the threads of a process enforce the same Landlock ruleset, but the
> > resulting Landlock domains are still separate, which makes a
> > difference for Landlock's "scoped" access rights, where the domain
> > identity and nesting is used. As a result, when using
> > LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> > working. This is a problem for programming languages and frameworks
> > which are inherently multithreaded (e.g. Go).
>
> Having different Landlock domains also means that we get different
> domains ID, and the audit logs might confuse users.
True, I added that to the commit message.
> > Cc: Mickaël Salaün <mic@digikod.net>
> > Cc: Paul Moore <paul@paul-moore.com>
> > Cc: linux-security-module@vger.kernel.org
> > Suggested-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Günther Noack <gnoack@google.com>
> > ---
> > include/uapi/linux/landlock.h | 4 +
> > security/landlock/cred.h | 12 +
> > security/landlock/limits.h | 2 +-
> > security/landlock/syscalls.c | 433 +++++++++++++++++++++++++++++++++-
>
> You can move most of this code to a new tsync.c file.
Done.
> > --- a/include/uapi/linux/landlock.h
> > +++ b/include/uapi/linux/landlock.h
> > @@ -117,11 +117,15 @@ struct landlock_ruleset_attr {
> > * future nested domains, not the one being created. It can also be used
> > * with a @ruleset_fd value of -1 to mute subdomain logs without creating a
> > * domain.
> > + *
> > + * %LANDLOCK_RESTRICT_SELF_TSYNC
> > + * Apply the given ruleset atomically to all threads of the current process.
>
> Please bump the Landlock ABI version and update the doc.
Done.
Docs: I added the flag description to the landlock.h header so that it
appears in the documentation for the system call, and added an entry to
the section where we describe the differences between ABI versions.
> > diff --git a/security/landlock/cred.h b/security/landlock/cred.h
> > index c82fe63ec598..eb28eeade760 100644
> > --- a/security/landlock/cred.h
> > +++ b/security/landlock/cred.h
> > @@ -65,6 +65,18 @@ landlock_cred(const struct cred *cred)
> > return cred->security + landlock_blob_sizes.lbs_cred;
> > }
> >
> > +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> > + const struct landlock_cred_security *src)
> > +{
>
> You can simplify by removing the domain checks which are already done by
> landlock_put/get_ruleset().
>
> > + if (dst->domain)
> > + landlock_put_ruleset(dst->domain);
> > +
> > + *dst = *src;
> > +
> > + if (dst->domain)
> > + landlock_get_ruleset(src->domain);
Done.
> > +/*
> > + * restrict_one_thread - update a thread's Landlock domain in lockstep with the
> > + * other threads in the same process
> > + *
> > + * When this is run, the same function gets run in all other threads in the same
> > + * process (except for the calling thread which called landlock_restrict_self).
> > + * The concurrently running invocations of restrict_one_thread coordinate
> > + * through the shared ctx object to do their work in lockstep to implement
> > + * all-or-nothing semantics for enforcing the new Landlock domain.
> > + *
> > + * Afterwards, depending on the presence of an error, all threads either commit
> > + * or abort the prepared credentials. The commit operation can not fail any more.
> > + */
> > +static void restrict_one_thread(struct tsync_shared_context *ctx)
> > +{
> > + int res;
>
> For consistency with existing code, please use "err" instead of "res".
Done.
> > + struct cred *cred = NULL;
> > + const struct cred *current_cred = current_cred();
>
> No need for this variable.
Done.
> > +
> > + if (current_cred == ctx->old_cred) {
> > + /*
> > + * As a shortcut, switch out old_cred with new_cred, if
> > + * possible.
> > + *
> > + * Note: We are intentionally dropping the const qualifier here,
> > + * because it is required by commit_creds() and abort_creds().
> > + */
> > + cred = (struct cred *)get_cred(ctx->new_cred);
>
> Good. You can extend the comment to explain that this optimization
> avoid creating new credentials in most cases, and then save memory.
Sounds good, I extended that comment a bit.
> > + } else {
> > + /* Else, prepare new creds and populate them. */
> > + cred = prepare_creds();
> > +
> > + if (!cred) {
> > + atomic_set(&ctx->preparation_error, -ENOMEM);
> > +
> > + /*
> > + * Even on error, we need to adhere to the protocol and
> > + * coordinate with concurrently running invocations.
> > + */
> > + if (atomic_dec_return(&ctx->num_preparing) == 0)
> > + complete_all(&ctx->all_prepared);
> > +
> > + goto out;
> > + }
> > +
> > + landlock_cred_copy(landlock_cred(cred),
> > + landlock_cred(ctx->new_cred));
> > + }
> > +
> > + /*
> > + * Barrier: Wait until all threads are done preparing.
> > + * After this point, we can have no more failures.
> > + */
> > + if (atomic_dec_return(&ctx->num_preparing) == 0)
> > + complete_all(&ctx->all_prepared);
> > +
> > + /*
> > + * Wait for signal from calling thread that it's safe to read the
> > + * preparation error now and we are ready to commit (or abort).
> > + */
> > + wait_for_completion(&ctx->ready_to_commit);
> > +
> > + /* Abort the commit if any of the other threads had an error. */
> > + res = atomic_read(&ctx->preparation_error);
> > + if (res) {
> > + abort_creds(cred);
> > + goto out;
> > + }
> > +
> > + /* If needed, establish enforcement prerequisites. */
> > + if (!ns_capable_noaudit(current_user_ns(), CAP_SYS_ADMIN))
> > + task_set_no_new_privs(current);
>
> We should always set PR_SET_NO_NEW_PRIVS if it is set on the calling
> thread as done by seccomp. We should just store the result of
> task_no_new_privs() in tsync_shared_context and use it as condition here.
> This should be explained in the documentation.
>
> This also mean that if the calling thread has CAP_SYS_ADMIN but not
> PR_SET_NO_NEW_PRIVS, then a sibling thread could not have CAP_SYS_ADMIN
> nor PR_SET_NO_NEW_PRIVS. This would be a risky state but mainly because
> of the CAP_SYS_ADMIN inconsistency, not really the missing
> PR_SET_NO_NEW_PRIVS.
OK, done. In line with seccomp now, where the logic is (pseudocode):

	if (task_no_new_privs(caller))
		task_set_no_new_privs(current);
As you say as well, the situation where the caller has CAP_SYS_ADMIN
but not PR_SET_NO_NEW_PRIVS results in sibling threads that also end up
without PR_SET_NO_NEW_PRIVS. But this is also consistent with seccomp.
> > +
> > + commit_creds(cred);
> > +
> > +out:
> > + /* Notify the calling thread once all threads are done */
> > + if (atomic_dec_return(&ctx->num_unfinished) == 0)
> > + complete_all(&ctx->all_finished);
> > +}
> > +
> > +/*
> > + * restrict_one_thread_callback - task_work callback for restricting a thread
> > + *
> > + * Calls restrict_one_thread with the struct landlock_shared_tsync_context.
> > + */
> > +static void restrict_one_thread_callback(struct callback_head *work)
> > +{
> > + struct tsync_work *ctx = container_of(work, struct tsync_work, work);
> > +
> > + restrict_one_thread(ctx->shared_ctx);
> > +}
> > +
> > +/*
> > + * struct tsync_works - a growable array of per-task contexts
> > + *
> > + * The zero-initialized struct represents the empty array.
> > + */
> > +struct tsync_works {
> > + struct tsync_work **works;
> > + size_t size;
> > + size_t capacity;
> > +};
> > +
> > +/*
> > + * tsync_works_provide - provides a preallocated tsync_work for the given task
> > + *
> > + * This also stores a task pointer in the context and increments the reference
> > + * count of the task.
> > + *
> > + * Returns:
> > + * A pointer to the preallocated context struct, with task filled in.
> > + *
> > + * NULL, if we ran out of preallocated context structs.
> > + */
> > +static struct tsync_work *tsync_works_provide(struct tsync_works *s,
> > + struct task_struct *task)
> > +{
> > + struct tsync_work *ctx;
> > +
> > + if (s->size >= s->capacity)
>
> In which case can this happen? Should we wrap this in a WARN_ON_ONCE()?
As Jann also said in his reply, it can happen if the number of
existing threads differs between the first for_each_thread() loop
where we count the threads and the second for_each_thread() loop where
we fill the allocated slots in the array. If it happens, we deal with
it by doing an additional loop to discover new threads.
> > + return NULL;
> > +
> > + ctx = s->works[s->size];
> > + s->size++;
> > +
> > + ctx->task = get_task_struct(task);
> > + return ctx;
> > +}
> > +
> > +/*
> > + * tsync_works_grow_by - preallocates space for n more contexts in s
> > + *
> > + * Returns:
> > + * -ENOMEM if the (re)allocation fails
> > + * 0 if the allocation succeeds, partially succeeds, or no reallocation was needed
> > + */
> > +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> > +{
> > + int i;
>
> size_t i;
Done.
> > + size_t new_capacity = s->capacity + n;
>
> > + struct tsync_work **works;
> > +
> > + if (new_capacity <= s->capacity)
>
> This integer overflow check should return an error instead.
Oh, this was not actually meant to check for overflow.
As Jann remarks in [1], what should really have happened here is that
the capacity should have been increased to s->size + n instead of
s->capacity + n a few lines above. With that logic, it also makes more
sense to check whether the new capacity is less than or equal to the old
capacity, as that can legitimately happen when the previously allocated
space was not used up in its entirety. In that case, we may already have
sufficient space for the additional n items, and it is reasonable to
return 0. (The semantics of the function are "make sure that there is
enough space for n additional items", and that is fulfilled without
error.)
[1] http://lore.kernel.org/all/CAG48ez1oS9kANZBq1bt+D76MX03DPHAFp76GJt7z5yx-Na1VLQ@mail.gmail.com
I did add an overflow check as well, for good measure (it's practically
for free).
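For reference, that ends up looking roughly like this (sketch only; the
actual V3 code may differ in details; check_add_overflow() is from
<linux/overflow.h>):

	static int tsync_works_grow_by(struct tsync_works *s, size_t n,
				       gfp_t flags)
	{
		size_t i, new_capacity;
		struct tsync_work **works;

		if (check_add_overflow(s->size, n, &new_capacity))
			return -ENOMEM;

		/* We may already have room for n more entries. */
		if (new_capacity <= s->capacity)
			return 0;

		works = krealloc_array(s->works, new_capacity,
				       sizeof(s->works[0]), flags);
		if (!works)
			return -ENOMEM;
		s->works = works;

		for (i = s->capacity; i < new_capacity; i++) {
			struct tsync_work *work = kzalloc(sizeof(*work), flags);

			if (!work) {
				/* Keep the object consistent, but report the error. */
				s->capacity = i;
				return -ENOMEM;
			}
			s->works[i] = work;
		}
		s->capacity = new_capacity;
		return 0;
	}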
>
> > + return 0;
> > +
> > + works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> > + flags);
> > + if (IS_ERR(works))
> > + return PTR_ERR(works);
> > +
> > + s->works = works;
> > +
> > + for (i = s->capacity; i < new_capacity; i++) {
> > + s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
>
> We should use a local variable to avoid storing an error code in
> s->works[i] and potentially dereferencing it later (e.g. in
> tsync_work_free).
As Jann also pointed out in the other mail, this was a mistake -
kzalloc returns NULL on failure, not an error code. (Fixed)
I extracted a local variable anyway, it's a bit clearer to read.
> Why can't we avoid this loop entirely and allocate a flat array with
> only one call to krealloc_array()? Why struct tsync_works->works needs
> to be a pointer to a pointer?
We pass out pointers to the struct tsync_work elements when calling
task_work_add(), and these pointers are used by the task_work logic
and by our own implementation of that task_work,
restrict_one_thread_callback(). If these elements were directly in
the array, realloc would move them to new locations and these pointers
would then point into freed memory.
Discussed by Jann in [2] as well.
[2] https://lore.kernel.org/all/CAG48ez2KoF6hVSJwdPfUpN3oroMww6Mu1+-bsBSbO8C5f91P6Q@mail.gmail.com/
If we were to take a very close look at all the affected code, we
could potentially convince ourselves that we can discard these objects
after the all_prepared point, but that seems like a brittle coding
pattern that would go against normal conventions and could break in a
future version. (E.g. when task_work_run() invokes "work->func(work)",
the memory behind that "work" pointer would get freed halfway through
that invocation.)
I'd prefer not to base the code's correctness on such complicated
assumptions about other corners of the code.
> > + if (IS_ERR(s->works[i])) {
> > + /*
> > + * Leave the object in a consistent state,
> > + * but return an error.
> > + */
> > + s->capacity = i;
> > + return PTR_ERR(s->works[i]);
> > + }
> > + }
> > + s->capacity = new_capacity;
> > + return 0;
> > +}
> > +
> > +/*
> > + * tsync_works_contains - checks for presence of task in s
> > + */
> > +static bool tsync_works_contains_task(struct tsync_works *s,
> > + struct task_struct *task)
> > +{
> > + size_t i;
> > +
> > + for (i = 0; i < s->size; i++)
> > + if (s->works[i]->task == task)
> > + return true;
> > + return false;
> > +}
> > +
> > +/*
> > + * tsync_works_free - free memory held by s and drop all task references
> > + */
> > +static void tsync_works_free(struct tsync_works *s)
>
> tsync_works_put() would be more appropriate since this function doesn't
> free s.
put() is usually a reference count decrement, and we don't do that either.
After a bit of thinking, I settled on tsync_works_release(), which is
a slightly more generic wording that does not sound like a refcount.
> > @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> > new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> > #endif /* CONFIG_AUDIT */
> >
> > + if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> > + res = restrict_sibling_threads(current_cred(), new_cred);
> > + if (res != 0) {
>
> if (err) {
Done.
(Corrected it in another place in landlock_restrict_sibling_threads() as well)
—Günther
* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
From: Mickaël Salaün @ 2025-10-20 20:12 UTC (permalink / raw)
To: Günther Noack
Cc: Konstantin Meskhidze, Tingmao Wang, Paul Moore,
linux-security-module, Jann Horn
On Wed, Oct 01, 2025 at 01:18:06PM +0200, Günther Noack wrote:
> Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> given Landlock ruleset is applied to all threads of the calling
> process, instead of only the current one.
>
> Without this flag, multithreaded userspace programs currently resort
> to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> which is also used by libcap and for setuid(2). Using this scheme,
> the threads of a process enforce the same Landlock ruleset, but the
> resulting Landlock domains are still separate, which makes a
> difference for Landlock's "scoped" access rights, where the domain
> identity and nesting is used. As a result, when using
> LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> working. This is a problem for programming languages and frameworks
> which are inherently multithreaded (e.g. Go).
>
> Cc: Mickaël Salaün <mic@digikod.net>
> Cc: Paul Moore <paul@paul-moore.com>
> Cc: linux-security-module@vger.kernel.org
> Suggested-by: Jann Horn <jannh@google.com>
> Signed-off-by: Günther Noack <gnoack@google.com>
> ---
> include/uapi/linux/landlock.h | 4 +
> security/landlock/cred.h | 12 +
> security/landlock/limits.h | 2 +-
> security/landlock/syscalls.c | 433 +++++++++++++++++++++++++++++++++-
> 4 files changed, 448 insertions(+), 3 deletions(-)
> +/*
> + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> + */
> +static int restrict_sibling_threads(const struct cred *old_cred,
> + const struct cred *new_cred)
> +{
> + int res;
> + struct task_struct *thread, *caller;
> + struct tsync_shared_context shared_ctx;
> + struct tsync_works works = {};
> + size_t newly_discovered_threads;
> + bool found_more_threads;
> + struct tsync_work *ctx;
> +
> + atomic_set(&shared_ctx.preparation_error, 0);
> + init_completion(&shared_ctx.all_prepared);
> + init_completion(&shared_ctx.ready_to_commit);
> + atomic_set(&shared_ctx.num_unfinished, 0);
> + init_completion(&shared_ctx.all_finished);
> + shared_ctx.old_cred = old_cred;
> + shared_ctx.new_cred = new_cred;
> +
> + caller = current;
> +
> + /*
> + * We schedule a pseudo-signal task_work for each of the calling task's
> + * sibling threads. In the task work, each thread:
> + *
> + * 1) runs prepare_creds() and writes back the error to
> + * shared_ctx.preparation_error, if needed.
> + *
> + * 2) signals that it's done with prepare_creds() to the calling task.
> + * (completion "all_prepared").
> + *
> + * 3) waits for the completion "ready_to_commit". This is sent by the
> + * calling task after ensuring that all sibling threads have done
> + * with the "preparation" stage.
> + *
> + * After this barrier is reached, it's safe to read
> + * shared_ctx.preparation_error.
> + *
> + * 4) reads shared_ctx.preparation_error and then either does
> + * commit_creds() or abort_creds().
> + *
> + * 5) signals that it's done altogether (barrier synchronization
> + * "all_finished")
> + */
> + do {
> + found_more_threads = false;
> +
> + /*
> + * The "all_prepared" barrier is used locally to the inner loop,
> + * this use of for_each_thread(). We can reset it on each loop
> + * iteration because all previous loop iterations are done with
> + * it already.
> + *
> + * num_preparing is initialized to 1 so that the counter can not
> + * go to 0 and mark the completion as done before all task works
> + * are registered. (We decrement it at the end of this loop.)
> + */
> + atomic_set(&shared_ctx.num_preparing, 1);
> + reinit_completion(&shared_ctx.all_prepared);
> +
> + /* In RCU read-lock, count the threads we need. */
> + newly_discovered_threads = 0;
> + rcu_read_lock();
> + for_each_thread(caller, thread) {
> + /* Skip current, since it is initiating the sync. */
> + if (thread == caller)
> + continue;
> +
> + /* Skip exited threads. */
> + if (thread->flags & PF_EXITING)
> + continue;
> +
> + /* Skip threads that we have already seen. */
> + if (tsync_works_contains_task(&works, thread))
> + continue;
> +
> + newly_discovered_threads++;
> + }
> + rcu_read_unlock();
This RCU block could be moved into a dedicated helper that returns the
number of newly discovered threads. In this helper, we could use
guard(rcu)().
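E.g. something roughly like this (sketch only; the helper name is just a
placeholder):

	static size_t tsync_count_new_threads(struct tsync_works *works)
	{
		struct task_struct *thread;
		size_t count = 0;

		guard(rcu)();

		for_each_thread(current, thread) {
			/* Skip current, since it is initiating the sync. */
			if (thread == current)
				continue;

			/* Skip exited threads. */
			if (thread->flags & PF_EXITING)
				continue;

			/* Skip threads that we have already seen. */
			if (tsync_works_contains_task(works, thread))
				continue;

			count++;
		}
		return count;
	}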
> +
> + if (newly_discovered_threads == 0)
> + break; /* done */
> +
> + res = tsync_works_grow_by(&works, newly_discovered_threads,
> + GFP_KERNEL_ACCOUNT);
> + if (res) {
> + atomic_set(&shared_ctx.preparation_error, res);
> + break;
> + }
> +
> + rcu_read_lock();
> + for_each_thread(caller, thread) {
> + /* Skip current, since it is initiating the sync. */
> + if (thread == caller)
> + continue;
> +
> + /* Skip exited threads. */
> + if (thread->flags & PF_EXITING)
> + continue;
> +
> + /* Skip threads that we already looked at. */
> + if (tsync_works_contains_task(&works, thread))
> + continue;
> +
> + /*
> + * We found a sibling thread that is not doing its
> + * task_work yet, and which might spawn new threads
> + * before our task work runs, so we need at least one
> + * more round in the outer loop.
> + */
> + found_more_threads = true;
> +
> + ctx = tsync_works_provide(&works, thread);
> + if (!ctx) {
> + /*
> + * We ran out of preallocated contexts -- we
> + * need to try again with this thread at a later
> + * time! found_more_threads is already true
> + * at this point.
> + */
> + break;
> + }
> +
> + ctx->shared_ctx = &shared_ctx;
> +
> + atomic_inc(&shared_ctx.num_preparing);
> + atomic_inc(&shared_ctx.num_unfinished);
> +
> + init_task_work(&ctx->work,
> + restrict_one_thread_callback);
> + res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> + if (res) {
> + /*
> + * Remove the task from ctx so that we will
> + * revisit the task at a later stage, if it
> + * still exists.
> + */
> + put_task_struct_rcu_user(ctx->task);
> + ctx->task = NULL;
> +
> + atomic_set(&shared_ctx.preparation_error, res);
> + atomic_dec(&shared_ctx.num_preparing);
> + atomic_dec(&shared_ctx.num_unfinished);
> + }
> + }
> + rcu_read_unlock();
As for the other RCU block, it might help to move this one into a
dedicated helper as well. It would be easier to read and maintain.
> +
> + /*
> + * Decrement num_preparing for current, to undo that we
> + * initialized it to 1 at the beginning of the inner loop.
> + */
> + if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> + wait_for_completion(&shared_ctx.all_prepared);
> + } while (found_more_threads &&
> + !atomic_read(&shared_ctx.preparation_error));
Is this safe against inconsistencies wrt execve? seccomp uses
cred_guard_mutex (new code should probably use exec_update_lock), so why
should Landlock not do the same?
Why shouldn't we lock sighand as well?
The answers to these questions should be explained in comments.
> +
> + /*
> + * We now have all sibling threads blocking and in "prepared" state in
> + * the task work. Ask all threads to commit.
> + */
> + complete_all(&shared_ctx.ready_to_commit);
> +
> + if (works.size)
> + wait_for_completion(&shared_ctx.all_finished);
> +
> + tsync_works_free(&works);
> +
> + return atomic_read(&shared_ctx.preparation_error);
> +}
> +
> /**
> * sys_landlock_restrict_self - Enforce a ruleset on the calling thread
> *
> @@ -454,12 +866,20 @@ SYSCALL_DEFINE4(landlock_add_rule, const int, ruleset_fd,
> * - %LANDLOCK_RESTRICT_SELF_LOG_SAME_EXEC_OFF
> * - %LANDLOCK_RESTRICT_SELF_LOG_NEW_EXEC_ON
> * - %LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF
> + * - %LANDLOCK_RESTRICT_SELF_TSYNC
> *
> - * This system call enables to enforce a Landlock ruleset on the current
> - * thread. Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> + * This system call enforces a Landlock ruleset on the current thread.
> + * Enforcing a ruleset requires that the task has %CAP_SYS_ADMIN in its
> * namespace or is running with no_new_privs. This avoids scenarios where
> * unprivileged tasks can affect the behavior of privileged children.
> *
> + * If %LANDLOCK_RESTRICT_SELF_TSYNC is specified in @flags, all other threads of
> + * the process will be brought into the exact same Landlock configuration as the
> + * calling thread. This includes both the enforced ruleset and logging
> + * configuration, and happens irrespective of previously established rulesets
> + * and logging configurations on these threads. If required, this operation
> + * also enables the no_new_privs flag for these threads.
> + *
> * Possible returned errors are:
> *
> * - %EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time;
> @@ -484,6 +904,7 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> struct landlock_cred_security *new_llcred;
> bool __maybe_unused log_same_exec, log_new_exec, log_subdomains,
> prev_log_subdomains;
> + int res;
>
> if (!is_initialized())
> return -EOPNOTSUPP;
> @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> #endif /* CONFIG_AUDIT */
>
> + if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> + res = restrict_sibling_threads(current_cred(), new_cred);
> + if (res != 0) {
> + abort_creds(new_cred);
> + return res;
> + }
> + }
> +
> return commit_creds(new_cred);
> }
> --
> 2.51.0.618.g983fd99d29-goog
>
>
* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
From: Jann Horn @ 2025-10-24 21:29 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Günther Noack, Konstantin Meskhidze, Tingmao Wang,
Paul Moore, linux-security-module
On Mon, Oct 20, 2025 at 10:12 PM Mickaël Salaün <mic@digikod.net> wrote:
> On Wed, Oct 01, 2025 at 01:18:06PM +0200, Günther Noack wrote:
> > +
> > + /*
> > + * Decrement num_preparing for current, to undo that we
> > + * initialized it to 1 at the beginning of the inner loop.
> > + */
> > + if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> > + wait_for_completion(&shared_ctx.all_prepared);
> > + } while (found_more_threads &&
> > + !atomic_read(&shared_ctx.preparation_error));
>
> Is it safe to prevent inconsistencies wrt execve? seccomp uses
> cred_guard_mutex (new code should probably use exec_update_lock), why
> should Landlock not do the same?
We don't have to worry about interactions with execve because, unlike
seccomp, we don't directly change properties of another running
thread; we ask other threads to change their credentials _themselves_.
From a locking perspective, restrict_one_thread() essentially runs in the
same kind of context as a syscall, and doesn't need any more locking
than the existing landlock_restrict_self().
> Why shouldn't we lock sighand as well?
seccomp uses siglock for the following reasons:
1. to protect against concurrent access to one thread's seccomp filter
information from multiple threads; we don't do anything like that
2. to protect the for_each_thread() loop; we use RCU for that (we
could also use siglock but there's no reason to do that, and RCU is
more lightweight than the siglock which requires disabling interrupts)
3. to ensure that threads' seccomp states don't change between its two
loops over other threads (seccomp_can_sync_threads() and
seccomp_sync_threads()); we don't do anything like that
* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
From: Günther Noack @ 2025-11-27 9:36 UTC (permalink / raw)
To: Jann Horn
Cc: Mickaël Salaün, Konstantin Meskhidze, Tingmao Wang,
Paul Moore, linux-security-module
On Fri, Oct 24, 2025 at 11:29:51PM +0200, Jann Horn wrote:
> On Mon, Oct 20, 2025 at 10:12 PM Mickaël Salaün <mic@digikod.net> wrote:
> > Is it safe to prevent inconsistencies wrt execve? seccomp uses
> > cred_guard_mutex (new code should probably use exec_update_lock), why
> > should Landlock not do the same?
>
> We don't have to worry about interactions with execve because, unlike
> seccomp, we don't directly change properties of another running
> thread; we ask other threads to change their credentials _themselves_.
> From a locking context, restrict_one_thread() essentially runs in the
> same kind of context as a syscall, and doesn't need any more locking
> than the existing landlock_restrict_self().
>
> > Why shouldn't we lock sighand as well?
>
> seccomp uses siglock for the following reasons:
>
> 1. to protect against concurrent access to one thread's seccomp filter
> information from multiple threads; we don't do anything like that
> 2. to protect the for_each_thread() loop; we use RCU for that (we
> could also use siglock but there's no reason to do that, and RCU is
> more lightweight than the siglock which requires disabling interrupts)
> 3. to ensure that threads' seccomp states don't change between its two
> loops over other threads (seccomp_can_sync_threads() and
> seccomp_sync_threads()); we don't do anything like that
Thanks for digging this up! I used a (reworded) variant of these three
points to document the locking rationale in the code.
—Günther
* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
From: Günther Noack @ 2025-11-27 9:56 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Konstantin Meskhidze, Tingmao Wang, Paul Moore,
linux-security-module, Jann Horn
Hello!
On Mon, Oct 20, 2025 at 10:12:44PM +0200, Mickaël Salaün wrote:
> On Wed, Oct 01, 2025 at 01:18:06PM +0200, Günther Noack wrote:
> > +/*
> > + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> > + */
> > +static int restrict_sibling_threads(const struct cred *old_cred,
> > + const struct cred *new_cred)
> > +{
> > + int res;
> > + struct task_struct *thread, *caller;
> > + struct tsync_shared_context shared_ctx;
> > + struct tsync_works works = {};
> > + size_t newly_discovered_threads;
> > + bool found_more_threads;
> > + struct tsync_work *ctx;
> > +
> > + atomic_set(&shared_ctx.preparation_error, 0);
> > + init_completion(&shared_ctx.all_prepared);
> > + init_completion(&shared_ctx.ready_to_commit);
> > + atomic_set(&shared_ctx.num_unfinished, 0);
> > + init_completion(&shared_ctx.all_finished);
> > + shared_ctx.old_cred = old_cred;
> > + shared_ctx.new_cred = new_cred;
> > +
> > + caller = current;
> > +
> > + /*
> > + * We schedule a pseudo-signal task_work for each of the calling task's
> > + * sibling threads. In the task work, each thread:
> > + *
> > + * 1) runs prepare_creds() and writes back the error to
> > + * shared_ctx.preparation_error, if needed.
> > + *
> > + * 2) signals that it's done with prepare_creds() to the calling task.
> > + * (completion "all_prepared").
> > + *
> > + * 3) waits for the completion "ready_to_commit". This is sent by the
> > + * calling task after ensuring that all sibling threads have done
> > + * with the "preparation" stage.
> > + *
> > + * After this barrier is reached, it's safe to read
> > + * shared_ctx.preparation_error.
> > + *
> > + * 4) reads shared_ctx.preparation_error and then either does
> > + * commit_creds() or abort_creds().
> > + *
> > + * 5) signals that it's done altogether (barrier synchronization
> > + * "all_finished")
> > + */
> > + do {
> > + found_more_threads = false;
> > +
> > + /*
> > + * The "all_prepared" barrier is used locally to the inner loop,
> > + * this use of for_each_thread(). We can reset it on each loop
> > + * iteration because all previous loop iterations are done with
> > + * it already.
> > + *
> > + * num_preparing is initialized to 1 so that the counter can not
> > + * go to 0 and mark the completion as done before all task works
> > + * are registered. (We decrement it at the end of this loop.)
> > + */
> > + atomic_set(&shared_ctx.num_preparing, 1);
> > + reinit_completion(&shared_ctx.all_prepared);
> > +
>
> > + /* In RCU read-lock, count the threads we need. */
> > + newly_discovered_threads = 0;
> > + rcu_read_lock();
> > + for_each_thread(caller, thread) {
> > + /* Skip current, since it is initiating the sync. */
> > + if (thread == caller)
> > + continue;
> > +
> > + /* Skip exited threads. */
> > + if (thread->flags & PF_EXITING)
> > + continue;
> > +
> > + /* Skip threads that we have already seen. */
> > + if (tsync_works_contains_task(&works, thread))
> > + continue;
> > +
> > + newly_discovered_threads++;
> > + }
> > + rcu_read_unlock();
>
> This RCU block could be moved in a dedicated helper that will return the
> number of newly discovered threads. In this helper, we could use
> guard()(rcu).
Done.
> > +
> > + if (newly_discovered_threads == 0)
> > + break; /* done */
> > +
> > + res = tsync_works_grow_by(&works, newly_discovered_threads,
> > + GFP_KERNEL_ACCOUNT);
> > + if (res) {
> > + atomic_set(&shared_ctx.preparation_error, res);
> > + break;
> > + }
> > +
> > + rcu_read_lock();
> > + for_each_thread(caller, thread) {
> > + /* Skip current, since it is initiating the sync. */
> > + if (thread == caller)
> > + continue;
> > +
> > + /* Skip exited threads. */
> > + if (thread->flags & PF_EXITING)
> > + continue;
> > +
> > + /* Skip threads that we already looked at. */
> > + if (tsync_works_contains_task(&works, thread))
> > + continue;
> > +
> > + /*
> > + * We found a sibling thread that is not doing its
> > + * task_work yet, and which might spawn new threads
> > + * before our task work runs, so we need at least one
> > + * more round in the outer loop.
> > + */
> > + found_more_threads = true;
> > +
> > + ctx = tsync_works_provide(&works, thread);
> > + if (!ctx) {
> > + /*
> > + * We ran out of preallocated contexts -- we
> > + * need to try again with this thread at a later
> > + * time! found_more_threads is already true
> > + * at this point.
> > + */
> > + break;
> > + }
> > +
> > + ctx->shared_ctx = &shared_ctx;
> > +
> > + atomic_inc(&shared_ctx.num_preparing);
> > + atomic_inc(&shared_ctx.num_unfinished);
> > +
> > + init_task_work(&ctx->work,
> > + restrict_one_thread_callback);
> > + res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> > + if (res) {
> > + /*
> > + * Remove the task from ctx so that we will
> > + * revisit the task at a later stage, if it
> > + * still exists.
> > + */
> > + put_task_struct_rcu_user(ctx->task);
> > + ctx->task = NULL;
> > +
> > + atomic_set(&shared_ctx.preparation_error, res);
> > + atomic_dec(&shared_ctx.num_preparing);
> > + atomic_dec(&shared_ctx.num_unfinished);
> > + }
> > + }
> > + rcu_read_unlock();
>
> As for the other RCU block, it might help to move this one into a
> dedicated helper as well. That would make it easier to read and
> maintain.
Done.
> > + /*
> > + * Decrement num_preparing for current, to undo that we
> > + * initialized it to 1 at the beginning of the inner loop.
> > + */
> > + if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> > + wait_for_completion(&shared_ctx.all_prepared);
> > + } while (found_more_threads &&
> > + !atomic_read(&shared_ctx.preparation_error));
>
> Is this safe against inconsistencies wrt execve? seccomp uses
> cred_guard_mutex (new code should probably use exec_update_lock); why
> shouldn't Landlock do the same?
>
> Why shouldn't we lock sighand as well?
>
> Answers to these questions should be explained in comments.
Added something to the top of the loop, based on Jann's explanation in [1].
[1] https://lore.kernel.org/all/CAG48ez3MxN524ge_sZeTwL0FEDASaSTb-gm1vPO8UwpijTeHqw@mail.gmail.com/
—Günther
* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
2025-10-01 11:18 ` [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self() Günther Noack
2025-10-17 15:04 ` Mickaël Salaün
2025-10-20 20:12 ` Mickaël Salaün
@ 2025-10-24 21:11 ` Jann Horn
2025-11-27 10:32 ` Günther Noack
2 siblings, 1 reply; 14+ messages in thread
From: Jann Horn @ 2025-10-24 21:11 UTC (permalink / raw)
To: Günther Noack
Cc: Mickaël Salaün, Konstantin Meskhidze, Tingmao Wang,
Paul Moore, linux-security-module
On Wed, Oct 1, 2025 at 1:18 PM Günther Noack <gnoack@google.com> wrote:
> Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> given Landlock ruleset is applied to all threads of the calling
> process, instead of only the current one.
>
> Without this flag, multithreaded userspace programs currently resort
> to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> which is also used by libcap and for setuid(2). Using this scheme,
> the threads of a process enforce the same Landlock ruleset, but the
> resulting Landlock domains are still separate, which makes a
> difference for Landlock's "scoped" access rights, where the domain
> identity and nesting is used. As a result, when using
> LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> working. This is a problem for programming languages and frameworks
> which are inherently multithreaded (e.g. Go).
This looks good to me overall, though there are a couple details to fix.
[...]
> +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> + const struct landlock_cred_security *src)
> +{
> + if (dst->domain)
> + landlock_put_ruleset(dst->domain);
> +
> + *dst = *src;
nit: I would add a short comment at the definition of struct
landlock_cred_security noting that this function memcpy's the entire
struct
> +
> + if (dst->domain)
> + landlock_get_ruleset(src->domain);
> +}
[...]
> +/*
> + * tsync_works_grow_by - preallocates space for n more contexts in s
> + *
> + * Returns:
> + * -ENOMEM if the (re)allocation fails
> + * 0 if the allocation succeeds, partially succeeds, or no reallocation was needed
> + */
> +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> +{
> + int i;
> + size_t new_capacity = s->capacity + n;
(You only have to grow to `s->size + n` but I guess this works too.)
> + struct tsync_work **works;
> +
> + if (new_capacity <= s->capacity)
> + return 0;
> +
> + works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> + flags);
> + if (IS_ERR(works))
> + return PTR_ERR(works);
The kmalloc function family returns NULL on failure, so you have to
check for NULL here instead of IS_ERR(), and then return -ENOMEM
instead of PTR_ERR().
> +
> + s->works = works;
> +
> + for (i = s->capacity; i < new_capacity; i++) {
> + s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
> + if (IS_ERR(s->works[i])) {
(again, kzalloc() returns NULL on failure)
> + /*
> + * Leave the object in a consistent state,
> + * but return an error.
> + */
> + s->capacity = i;
> + return PTR_ERR(s->works[i]);
> + }
> + }
> + s->capacity = new_capacity;
> + return 0;
> +}
[...]
> +/*
> + * tsync_works_free - free memory held by s and drop all task references
> + */
> +static void tsync_works_free(struct tsync_works *s)
> +{
> + int i;
> +
> + for (i = 0; i < s->size; i++)
> + put_task_struct(s->works[i]->task);
You'll need a NULL check before calling put_task_struct(), since the
task_work_add() failure path can NULL out ->task. (Alternatively you
could leave the task pointer intact in the task_work_add() failure
path, since task_work_add() only fails if the task is already
PF_EXITING. The &work_exited marker which causes task_work_add() to
fail is only put on the task work list when task_work_run() runs on a
PF_EXITING task.)
> + for (i = 0; i < s->capacity; i++)
> + kfree(s->works[i]);
> + kfree(s->works);
> + s->works = NULL;
> + s->size = 0;
> + s->capacity = 0;
> +}
> +
> +/*
> + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> + */
> +static int restrict_sibling_threads(const struct cred *old_cred,
> + const struct cred *new_cred)
> +{
> + int res;
> + struct task_struct *thread, *caller;
> + struct tsync_shared_context shared_ctx;
> + struct tsync_works works = {};
> + size_t newly_discovered_threads;
> + bool found_more_threads;
> + struct tsync_work *ctx;
> +
> + atomic_set(&shared_ctx.preparation_error, 0);
> + init_completion(&shared_ctx.all_prepared);
> + init_completion(&shared_ctx.ready_to_commit);
> + atomic_set(&shared_ctx.num_unfinished, 0);
I think num_unfinished should be initialized to 1 here and decremented
later on, similar to how num_preparing works. Though it only
matters in the edge case where the first thread we send task work to
immediately fails the memory allocation. (And then you can also remove
that "if (works.size)" check before
"wait_for_completion(&shared_ctx.all_finished)".)
> + init_completion(&shared_ctx.all_finished);
> + shared_ctx.old_cred = old_cred;
> + shared_ctx.new_cred = new_cred;
> +
> + caller = current;
[...]
> + init_task_work(&ctx->work,
> + restrict_one_thread_callback);
> + res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> + if (res) {
> + /*
> + * Remove the task from ctx so that we will
> + * revisit the task at a later stage, if it
> + * still exists.
> + */
> + put_task_struct_rcu_user(ctx->task);
The complement to get_task_struct() is put_task_struct(), which I see
you also used in tsync_works_free(). put_task_struct_rcu_user() is for
a different, special type of task_struct reference.
> + ctx->task = NULL;
> +
> + atomic_set(&shared_ctx.preparation_error, res);
I think you don't want to set preparation_error here - that would
cause the syscall to return -ESRCH if we happen to race with an
exiting thread. Just remove that line - in the next iteration, we'll
skip this thread even if it still exists, because it has PF_EXITING
set by this point.
> + atomic_dec(&shared_ctx.num_preparing);
> + atomic_dec(&shared_ctx.num_unfinished);
> + }
> + }
> + rcu_read_unlock();
> +
> + /*
> + * Decrement num_preparing for current, to undo that we
> + * initialized it to 1 at the beginning of the inner loop.
> + */
> + if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> + wait_for_completion(&shared_ctx.all_prepared);
I'm sorry, because this will make the patch a little bit more
complicated, but... I don't think you can use wait_for_completion()
here. Consider the scenario where two userspace threads of the same
process call this functionality (or a kernel subsystem that does
something similar) simultaneously. Each thread will wait for the other
indefinitely, and userspace won't even be able to resolve the deadlock
by killing the processes.
Similar issues would probably apply if, for example, GDB tried to
attach to the process with bad timing - if GDB ptrace-stops another
thread before you schedule task work for it, and then tries to
ptrace-stop this thread, I think this thread could essentially be in a
deadlock with GDB.
You'll have to do something else here. I think the best solution would
be to use wait_for_completion_interruptible() instead; then if that
fails, tear down all the task work stuff that was already scheduled,
and return with error -ERESTARTNOINTR. Something like (entirely
untested):
/* interruptible wait to avoid deadlocks while waiting for other tasks
   to enter our task work */
if (wait_for_completion_interruptible(&shared_ctx.all_prepared)) {
        atomic_set(&shared_ctx.preparation_error, -ERESTARTNOINTR);
        for (int i=0; i<works.size; i++) {
                if (task_work_cancel(works.works[i]->task, &works.works[i]->work))
                        if (atomic_dec_return(&shared_ctx.num_preparing))
                                complete_all(&shared_ctx.all_prepared);
        }
        /* at this point we're only waiting for tasks that are already
           executing the task work */
        wait_for_completion(&shared_ctx.all_prepared);
}
Note that if the syscall returns -ERESTARTNOINTR, that won't be
visible to userspace (except for debugging tools like strace/gdb); the
kernel ensures that the syscall will transparently re-execute
immediately. (It literally decrements the saved userspace instruction
pointer by the size of a syscall instruction, so that when the kernel
returns to userspace, the next instruction that executes will redo the
syscall.) This allows us to break the deadlock without having to write
any ugly retry logic or throwing userspace-visible errors.
> + } while (found_more_threads &&
> + !atomic_read(&shared_ctx.preparation_error));
> +
> + /*
> + * We now have all sibling threads blocking and in "prepared" state in
> + * the task work. Ask all threads to commit.
> + */
> + complete_all(&shared_ctx.ready_to_commit);
> +
> + if (works.size)
> + wait_for_completion(&shared_ctx.all_finished);
> +
> + tsync_works_free(&works);
> +
> + return atomic_read(&shared_ctx.preparation_error);
> +}
[...]
> @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> #endif /* CONFIG_AUDIT */
>
> + if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> + res = restrict_sibling_threads(current_cred(), new_cred);
> + if (res != 0) {
> + abort_creds(new_cred);
> + return res;
> + }
> + }
Annoyingly, there is a special-case path above for the case where
LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF is set without actually
applying any ruleset. In that case you won't reach this point, and so
LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF would only affect the
current thread in that case. I doubt it'd be very noticeable, but
still, it might be a good idea to rearrange things here a bit... maybe
instead of the current `if (!ruleset) return commit_creds(new_cred);`,
put some of the subsequent stuff in a `if (ruleset) {` block?
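For illustration, the tail of the syscall could then be shaped roughly
like this (entirely untested, surrounding code abbreviated):

        if (ruleset) {
                /* ... build the new domain, update the audit state, ... */
        }

        if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
                res = restrict_sibling_threads(current_cred(), new_cred);
                if (res) {
                        abort_creds(new_cred);
                        return res;
                }
        }

        return commit_creds(new_cred);
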
* Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
2025-10-24 21:11 ` Jann Horn
@ 2025-11-27 10:32 ` Günther Noack
0 siblings, 0 replies; 14+ messages in thread
From: Günther Noack @ 2025-11-27 10:32 UTC (permalink / raw)
To: Jann Horn
Cc: Mickaël Salaün, Konstantin Meskhidze, Tingmao Wang,
Paul Moore, linux-security-module
On Fri, Oct 24, 2025 at 11:11:10PM +0200, Jann Horn wrote:
> On Wed, Oct 1, 2025 at 1:18 PM Günther Noack <gnoack@google.com> wrote:
> > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> > given Landlock ruleset is applied to all threads of the calling
> > process, instead of only the current one.
> >
> > Without this flag, multithreaded userspace programs currently resort
> > to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> > which is also used by libcap and for setuid(2). Using this scheme,
> > the threads of a process enforce the same Landlock ruleset, but the
> > resulting Landlock domains are still separate, which makes a
> > difference for Landlock's "scoped" access rights, where the domain
> > identity and nesting is used. As a result, when using
> > LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> > working. This is a problem for programming languages and frameworks
> > which are inherently multithreaded (e.g. Go).
>
> This looks good to me overall, though there are a couple details to fix.
>
> [...]
> > +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> > + const struct landlock_cred_security *src)
> > +{
> > + if (dst->domain)
> > + landlock_put_ruleset(dst->domain);
> > +
> > + *dst = *src;
>
> nit: I would add a short comment at the definition of struct
> landlock_cred_security noting that this function memcpy's the entire
> struct
Sounds good. I added a small remark "when updating this, also update
landlock_cred_copy() if needed".
> > +
> > + if (dst->domain)
> > + landlock_get_ruleset(src->domain);
> > +}
> [...]
> > +/*
> > + * tsync_works_grow_by - preallocates space for n more contexts in s
> > + *
> > + * Returns:
> > + * -ENOMEM if the (re)allocation fails
> > + * 0 if the allocation succeeds, partially succeeds, or no reallocation was needed
> > + */
> > +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> > +{
> > + int i;
> > + size_t new_capacity = s->capacity + n;
>
> (You only have to grow to `s->size + n` but I guess this works too.)
Thanks, well spotted. This was indeed the intended behavior; the
new_capacity <= s->capacity check also makes much more sense that
way. (I have a more detailed answer in another reply.) I fixed this
and also added an overflow check for good measure.
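For illustration, the size computation now goes roughly like this
(sketch; the exact overflow check used in the patch may differ):

        size_t new_capacity;

        /* Hypothetical sketch: grow to s->size + n, guarding against overflow. */
        if (check_add_overflow(s->size, n, &new_capacity))
                return -ENOMEM;

        if (new_capacity <= s->capacity)
                return 0;
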
> > + struct tsync_work **works;
> > +
> > + if (new_capacity <= s->capacity)
> > + return 0;
> > +
> > + works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> > + flags);
> > + if (IS_ERR(works))
> > + return PTR_ERR(works);
>
> The kmalloc function family returns NULL on failure, so you have to
> check for NULL here instead of IS_ERR(), and then return -ENOMEM
> instead of PTR_ERR().
Thanks, fixed.
> > + s->works = works;
> > +
> > + for (i = s->capacity; i < new_capacity; i++) {
> > + s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
> > + if (IS_ERR(s->works[i])) {
>
> (again, kzalloc() returns NULL on failure)
Done.
> > + /*
> > + * Leave the object in a consistent state,
> > + * but return an error.
> > + */
> > + s->capacity = i;
> > + return PTR_ERR(s->works[i]);
> > + }
> > + }
> > + s->capacity = new_capacity;
> > + return 0;
> > +}
> [...]
> > +/*
> > + * tsync_works_free - free memory held by s and drop all task references
> > + */
> > +static void tsync_works_free(struct tsync_works *s)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < s->size; i++)
> > + put_task_struct(s->works[i]->task);
>
> You'll need a NULL check before calling put_task_struct(), since the
> task_work_add() failure path can NULL out ->task. (Alternatively you
> could leave the task pointer intact in the task_work_add() failure
> path, since task_work_add() only fails if the task is already
> PF_EXITING. The &work_exited marker which causes task_work_add() to
> fail is only put on the task work list when task_work_run() runs on a
> PF_EXITING task.)
Thanks for spotting this, this is correct! I added a NULL check.
> > + for (i = 0; i < s->capacity; i++)
> > + kfree(s->works[i]);
> > + kfree(s->works);
> > + s->works = NULL;
> > + s->size = 0;
> > + s->capacity = 0;
> > +}
> > +
> > +/*
> > + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> > + */
> > +static int restrict_sibling_threads(const struct cred *old_cred,
> > + const struct cred *new_cred)
> > +{
> > + int res;
> > + struct task_struct *thread, *caller;
> > + struct tsync_shared_context shared_ctx;
> > + struct tsync_works works = {};
> > + size_t newly_discovered_threads;
> > + bool found_more_threads;
> > + struct tsync_work *ctx;
> > +
> > + atomic_set(&shared_ctx.preparation_error, 0);
> > + init_completion(&shared_ctx.all_prepared);
> > + init_completion(&shared_ctx.ready_to_commit);
> > + atomic_set(&shared_ctx.num_unfinished, 0);
>
> I think num_unfinished should be initialized to 1 here and decremented
> later on, similar to how num_preparing works. Though it only
> matters in the edge case where the first thread we send task work to
> immediately fails the memory allocation. (And then you can also remove
> that "if (works.size)" check before
> "wait_for_completion(&shared_ctx.all_finished)".)
Thank you, good catch!
The works.size check was inaccurate, because in the case of an error
during task_work_add(), it wasn't actually counting the number of
scheduled task works, but overestimating it. The scenario is a bit
obscure, but initializing num_unfinished to 1 is a more robust approach
that rules out that class of bugs.
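In other words, roughly (sketch, abbreviated):

        atomic_set(&shared_ctx.num_unfinished, 1);

        /* ... outer loop: schedule task works for the sibling threads ... */

        complete_all(&shared_ctx.ready_to_commit);

        /* Undo the initial count of 1 for current; only wait if task works remain. */
        if (atomic_dec_return(&shared_ctx.num_unfinished) > 0)
                wait_for_completion(&shared_ctx.all_finished);
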
> > + init_completion(&shared_ctx.all_finished);
> > + shared_ctx.old_cred = old_cred;
> > + shared_ctx.new_cred = new_cred;
> > +
> > + caller = current;
> [...]
> > + init_task_work(&ctx->work,
> > + restrict_one_thread_callback);
> > + res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> > + if (res) {
> > + /*
> > + * Remove the task from ctx so that we will
> > + * revisit the task at a later stage, if it
> > + * still exists.
> > + */
> > + put_task_struct_rcu_user(ctx->task);
>
> The complement to get_task_struct() is put_task_struct(), which I see
> you also used in tsync_works_free(). put_task_struct_rcu_user() is for
> a different, special type of task_struct reference.
Thanks, done.
> > + ctx->task = NULL;
> > +
> > + atomic_set(&shared_ctx.preparation_error, res);
>
> I think you don't want to set preparation_error here - that would
> cause the syscall to return -ESRCH if we happen to race with an
> exiting thread. Just remove that line - in the next iteration, we'll
> skip this thread even if it still exists, because it has PF_EXITING
> set by this point.
Thanks, that is correct and I fixed it as you suggested. -- The thread
exiting is the only reason why task_work_add() can fail. In the
(perfectly valid) case where one of the sibling threads happens to
exit, we do not want the landlock_restrict_self() syscall to fail just
because of that.
> > + atomic_dec(&shared_ctx.num_preparing);
> > + atomic_dec(&shared_ctx.num_unfinished);
> > + }
> > + }
> > + rcu_read_unlock();
> > +
> > + /*
> > + * Decrement num_preparing for current, to undo that we
> > + * initialized it to 1 at the beginning of the inner loop.
> > + */
> > + if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> > + wait_for_completion(&shared_ctx.all_prepared);
>
> I'm sorry, because this will make the patch a little bit more
> complicated, but... I don't think you can use wait_for_completion()
> here. Consider the scenario where two userspace threads of the same
> process call this functionality (or a kernel subsystem that does
> something similar) simultaneously. Each thread will wait for the other
> indefinitely, and userspace won't even be able to resolve the deadlock
> by killing the processes.
> Similar issues would probably apply if, for example, GDB tried to
> attach to the process with bad timing - if GDB ptrace-stops another
> thread before you schedule task work for it, and then tries to
> ptrace-stop this thread, I think this thread could essentially be in a
> deadlock with GDB.
>
> You'll have to do something else here. I think the best solution would
> be to use wait_for_completion_interruptible() instead; then if that
> fails, tear down all the task work stuff that was already scheduled,
> and return with error -ERESTARTNOINTR. Something like (entirely
> untested):
>
> /* interruptible wait to avoid deadlocks while waiting for other tasks
>    to enter our task work */
> if (wait_for_completion_interruptible(&shared_ctx.all_prepared)) {
>         atomic_set(&shared_ctx.preparation_error, -ERESTARTNOINTR);
>         for (int i=0; i<works.size; i++) {
>                 if (task_work_cancel(works.works[i]->task, &works.works[i]->work))
>                         if (atomic_dec_return(&shared_ctx.num_preparing))
>                                 complete_all(&shared_ctx.all_prepared);
>         }
>         /* at this point we're only waiting for tasks that are already
>            executing the task work */
>         wait_for_completion(&shared_ctx.all_prepared);
> }
>
> Note that if the syscall returns -ERESTARTNOINTR, that won't be
> visible to userspace (except for debugging tools like strace/gdb); the
> kernel ensures that the syscall will transparently re-execute
> immediately. (It literally decrements the saved userspace instruction
> pointer by the size of a syscall instruction, so that when the kernel
> returns to userspace, the next instruction that executes will redo the
> syscall.) This allows us to break the deadlock without having to write
> any ugly retry logic or throwing userspace-visible errors.
Thank you again for catching this!
I used your suggestion, with the following (minor) differences:
1. Factored it out as a helper function (the indentation level got too
   high otherwise); see the sketch below.
2. Fixed a typo: complete_all() is called only if atomic_dec_return()
   returns 0.
3. Did the same barrier synchronization dance with
   num_unfinished/all_finished as well.
...and I also implemented a selftest for the case where
landlock_restrict_self() gets called by two sibling threads at the
same time.
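For reference, the factored-out cancellation path looks roughly like
this (paraphrased sketch; the helper name is illustrative):

static void tsync_cancel_pending_works(struct tsync_shared_context *shared_ctx,
                                       struct tsync_works *works)
{
        int i;

        atomic_set(&shared_ctx->preparation_error, -ERESTARTNOINTR);
        for (i = 0; i < works->size; i++) {
                /* Skip entries whose task_work_add() already failed. */
                if (!works->works[i]->task)
                        continue;
                if (!task_work_cancel(works->works[i]->task,
                                      &works->works[i]->work))
                        continue;
                /*
                 * The cancelled work will never run, so account for it here.
                 * (The same is done for num_unfinished/all_finished.)
                 */
                if (atomic_dec_return(&shared_ctx->num_preparing) == 0)
                        complete_all(&shared_ctx->all_prepared);
        }
}

It gets called when wait_for_completion_interruptible(&shared_ctx.all_prepared)
returns nonzero, followed by an uninterruptible wait for the task works
that are already running.
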
I'll write down my reasoning for why this works, for reference:
The problem in V2 is that we can run into a deadlock in the case where
two threads call landlock_restrict_self() at the same time. In that
case, each thread schedules a task_work for the other and then blocks
uninterruptibly in the syscall. Each then waits for the other's
task_work to execute, which never happens, because neither thread can
run its pending task_work while it is blocked in an uninterruptible
wait.
This is resolved by using an interruptible wait. With the
interruptible wait, the task can detect the condition where a signal
(or task work) comes in, execute that task_work and bail out of the
system call cleanly (we un-schedule the task_works for other threads
that are still pending and we abort all the task_works that have
already started to run by setting the shared_ctx.preparation_error).
After returning from the system call with -ERESTARTNOINTR, it gets
retried automatically to recover from the problem.
In all the other cases where we wait_for_completion() uninterruptibly,
we can reason that the wait will return, because these waits happen
under circumstances where we know that the task works we are waiting
for have all been started already.
I have reproduced the deadlock and verified the fixed implementation
with the selftest, by temporarily adding mdelay() calls and a lot of
logging in strategic places.
>
> > + } while (found_more_threads &&
> > + !atomic_read(&shared_ctx.preparation_error));
> > +
> > + /*
> > + * We now have all sibling threads blocking and in "prepared" state in
> > + * the task work. Ask all threads to commit.
> > + */
> > + complete_all(&shared_ctx.ready_to_commit);
> > +
> > + if (works.size)
> > + wait_for_completion(&shared_ctx.all_finished);
> > +
> > + tsync_works_free(&works);
> > +
> > + return atomic_read(&shared_ctx.preparation_error);
> > +}
> [...]
> > @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> > new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> > #endif /* CONFIG_AUDIT */
> >
> > + if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> > + res = restrict_sibling_threads(current_cred(), new_cred);
> > + if (res != 0) {
> > + abort_creds(new_cred);
> > + return res;
> > + }
> > + }
>
> Annoyingly, there is a special-case path above for the case where
> LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF is set without actually
> applying any ruleset. In that case you won't reach this point, and so
> LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF would only affect the
> current thread in that case. I doubt it'd be very noticeable, but
> still, it might be a good idea to rearrange things here a bit... maybe
> instead of the current `if (!ruleset) return commit_creds(new_cred);`,
> put some of the subsequent stuff in a `if (ruleset) {` block?
Thanks, I fixed that one as well.
—Günther