Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Günther Noack" <gnoack@google.com>
To: Jann Horn <jannh@google.com>
Cc: "Mickaël Salaün" <mic@digikod.net>,
	"Konstantin Meskhidze" <konstantin.meskhidze@huawei.com>,
	"Tingmao Wang" <m@maowtm.org>, "Paul Moore" <paul@paul-moore.com>,
	linux-security-module@vger.kernel.org
Subject: Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
Date: Thu, 27 Nov 2025 11:32:07 +0100	[thread overview]
Message-ID: <aSgopxqRkzKhvUXw@google.com> (raw)
In-Reply-To: <CAG48ez1oS9kANZBq1bt+D76MX03DPHAFp76GJt7z5yx-Na1VLQ@mail.gmail.com>

On Fri, Oct 24, 2025 at 11:11:10PM +0200, Jann Horn wrote:
> On Wed, Oct 1, 2025 at 1:18 PM Günther Noack <gnoack@google.com> wrote:
> > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag.  With this flag, a
> > given Landlock ruleset is applied to all threads of the calling
> > process, instead of only the current one.
> >
> > Without this flag, multithreaded userspace programs currently resort
> > to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> > which is also used by libcap and for setuid(2).  Using this scheme,
> > the threads of a process enforce the same Landlock ruleset, but the
> > resulting Landlock domains are still separate, which makes a
> > difference for Landlock's "scoped" access rights, where the domain
> > identity and nesting is used.  As a result, when using
> > LANLDOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> > working.  This is a problem for programming languages and frameworks
> > which are inherently multithreaded (e.g. Go).
> 
> This looks good to me overall, though there are a couple details to fix.
> 
> [...]
> > +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> > +                                     const struct landlock_cred_security *src)
> > +{
> > +       if (dst->domain)
> > +               landlock_put_ruleset(dst->domain);
> > +
> > +       *dst = *src;
> 
> nit: I would add a short comment at the definition of struct
> landlock_cred_security noting that this function memcpy's the entire
> struct

Sounds good. I added a small remark "when updating this, also update
landlock_cred_copy() if needed".

> > +
> > +       if (dst->domain)
> > +               landlock_get_ruleset(src->domain);
> > +}
> [...]
> > +/*
> > + * tsync_works_grow_by - preallocates space for n more contexts in s
> > + *
> > + * Returns:
> > + *   -ENOMEM if the (re)allocation fails
> > + *   0       if the allocation succeeds, partially succeeds, or no reallocation was needed
> > + */
> > +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> > +{
> > +       int i;
> > +       size_t new_capacity = s->capacity + n;
> 
> (You only have to grow to `s->size + n` but I guess this works too.)

Thanks, well spotted. This was indeed the intended behavior, the
new_capacity <= s->capacity check also makes much more sense that
way. (I have a more detailed answer in another reply.)  I fixed this
and also added an overflow check for good measure.


> > +       struct tsync_work **works;
> > +
> > +       if (new_capacity <= s->capacity)
> > +               return 0;
> > +
> > +       works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> > +                              flags);
> > +       if (IS_ERR(works))
> > +               return PTR_ERR(works);
> 
> The kmalloc function family returns NULL on failure, so you have to
> check for NULL here instead of IS_ERR(), and then return -ENOMEM
> instead of PTR_ERR().

Thanks, fixed.


> > +       s->works = works;
> > +
> > +       for (i = s->capacity; i < new_capacity; i++) {
> > +               s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
> > +               if (IS_ERR(s->works[i])) {
> 
> (again, kzalloc() returns NULL on failure)

Done.


> > +                       /*
> > +                        * Leave the object in a consistent state,
> > +                        * but return an error.
> > +                        */
> > +                       s->capacity = i;
> > +                       return PTR_ERR(s->works[i]);
> > +               }
> > +       }
> > +       s->capacity = new_capacity;
> > +       return 0;
> > +}
> [...]
> > +/*
> > + * tsync_works_free - free memory held by s and drop all task references
> > + */
> > +static void tsync_works_free(struct tsync_works *s)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < s->size; i++)
> > +               put_task_struct(s->works[i]->task);
> 
> You'll need a NULL check before calling put_task_struct(), since the
> task_work_add() failure path can NULL out ->task. (Alternatively you
> could leave the task pointer intact in the task_work_add() failure
> path, since task_work_add() only fails if the task is already
> PF_EXITING. The &work_exited marker which causes task_work_add() to
> fail is only put on the task work list when task_work_run() runs on a
> PF_EXITING task.)

Thanks for spotting this, this is correct!  I added a NULL check.

> > +       for (i = 0; i < s->capacity; i++)
> > +               kfree(s->works[i]);
> > +       kfree(s->works);
> > +       s->works = NULL;
> > +       s->size = 0;
> > +       s->capacity = 0;
> > +}
> > +
> > +/*
> > + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> > + */
> > +static int restrict_sibling_threads(const struct cred *old_cred,
> > +                                   const struct cred *new_cred)
> > +{
> > +       int res;
> > +       struct task_struct *thread, *caller;
> > +       struct tsync_shared_context shared_ctx;
> > +       struct tsync_works works = {};
> > +       size_t newly_discovered_threads;
> > +       bool found_more_threads;
> > +       struct tsync_work *ctx;
> > +
> > +       atomic_set(&shared_ctx.preparation_error, 0);
> > +       init_completion(&shared_ctx.all_prepared);
> > +       init_completion(&shared_ctx.ready_to_commit);
> > +       atomic_set(&shared_ctx.num_unfinished, 0);
> 
> I think num_unfinished should be initialized to 1 here and decremented
> later on, I think, similar to how num_preparing works. Though it only
> matters in the edge case where the first thread we send task work to
> immediately fails the memory allocation. (And then you can also remove
> that "if (works.size)" check before
> "wait_for_completion(&shared_ctx.all_finished)".)

Thank you, good catch!

The works.size check was inaccurate, because in the case of an error
during task_work_add(), it wasn't actually counting the number of
scheduled task works, but overestimating it.  The scenario is a bit
obscure, but initializing num_unfinished is a more robust approach
that rules out that variant of bugs.


> > +       init_completion(&shared_ctx.all_finished);
> > +       shared_ctx.old_cred = old_cred;
> > +       shared_ctx.new_cred = new_cred;
> > +
> > +       caller = current;
> [...]
> > +                       init_task_work(&ctx->work,
> > +                                      restrict_one_thread_callback);
> > +                       res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> > +                       if (res) {
> > +                               /*
> > +                                * Remove the task from ctx so that we will
> > +                                * revisit the task at a later stage, if it
> > +                                * still exists.
> > +                                */
> > +                               put_task_struct_rcu_user(ctx->task);
> 
> The complement to get_task_struct() is put_task_struct(), which I see
> you also used in tsync_works_free(). put_task_struct_rcu_user() is for
> a different, special type of task_struct reference.

Thanks, done.


> > +                               ctx->task = NULL;
> > +
> > +                               atomic_set(&shared_ctx.preparation_error, res);
> 
> I think you don't want to set preparation_error here - that would
> cause the syscall to return -ESRCH if we happen to race with an
> exiting thread. Just remove that line - in the next iteration, we'll
> skip this thread even if it still exists, because it has PF_EXITING
> set by this point.

Thanks, that is correct and I fixed it as you suggested. -- The thread
exiting is the only reason why task_work_add() can fail.  In the
(perfectly valid) case where one of the sibling threads happens to
exit, we do not want the landlock_restrict_self() syscall to fail just
because of that.


> > +                               atomic_dec(&shared_ctx.num_preparing);
> > +                               atomic_dec(&shared_ctx.num_unfinished);
> > +                       }
> > +               }
> > +               rcu_read_unlock();
> > +
> > +               /*
> > +                * Decrement num_preparing for current, to undo that we
> > +                * initialized it to 1 at the beginning of the inner loop.
> > +                */
> > +               if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> > +                       wait_for_completion(&shared_ctx.all_prepared);
> 
> I'm sorry, because this will make the patch a little bit more
> complicated, but... I don't think you can use wait_for_completion()
> here. Consider the scenario where two userspace threads of the same
> process call this functionality (or a kernel subsystem that does
> something similar) simultaneously. Each thread will wait for the other
> indefinitely, and userspace won't even be able to resolve the deadlock
> by killing the processes.
> Similar issues would probably apply if, for example, GDB tried to
> attach to the process with bad timing - if GDB ptrace-stops another
> thread before you schedule task work for it, and then tries to
> ptrace-stop this thread, I think this thread could essentially be in a
> deadlock with GDB.
> 
> You'll have to do something else here. I think the best solution would
> be to use wait_for_completion_interruptible() instead; then if that
> fails, tear down all the task work stuff that was already scheduled,
> and return with error -ERESTARTNOINTR. Something like (entirely
> untested):
> 
> /* interruptible wait to avoid deadlocks while waiting for other tasks
> to enter our task work */
> if (wait_for_completion_interruptible(&shared_ctx.all_prepared)) {
>   atomic_set(&shared_ctx.preparation_error, -ERESTARTNOINTR);
>   for (int i=0; i<works.size; i++) {
>     if (task_work_cancel(works.works[i]->task, &works.works[i]->work))
>       if (atomic_dec_return(&shared_ctx.num_preparing))
>         complete_all(&shared_ctx.all_prepared);
>   }
>   /* at this point we're only waiting for tasks that are already
> executing the task work */
>   wait_for_completion(&shared_ctx.all_prepared);
> }
> 
> Note that if the syscall returns -ERESTARTNOINTR, that won't be
> visible to userspace (except for debugging tools like strace/gdb); the
> kernel ensures that the syscall will transparently re-execute
> immediately. (It literally decrements the saved userspace instruction
> pointer by the size of a syscall instruction, so that when the kernel
> returns to userspace, the next instruction that executes will redo the
> syscall.) This allows us to break the deadlock without having to write
> any ugly retry logic or throwing userspace-visible errors.

Thank you again for catching this!

I used your suggestion, with the following (minor) differences:

1. Factored it out as a function (indentation level got too high...)
2. typo: call complete_all() only if atomic_dec_return() returns 0
3. Do the same barrier synchronization dance with num_unfinished/all_finished as well.

...and I also implemented a selftest for the case where
landlock_restrict_self() gets called by two adjacent threads at the
same time.

I'll write down my reasoning why this works for reference:

The problem in V2 is that we can run into a deadlock in the case where
two thread call landlock_restrict_self().  In that case, they'll both
become uninterruptible at syscall entry and try to schedule a
task_work for each other.  Then, they proceed to wait for each other's
task_work to execute, which never happens because the task_work never
gets scheduled.

This is resolved by using an interruptible wait.  With the
interruptible wait, the task can detect the condition where a signal
(or task work) comes in, execute that task_work and bail out of the
system call cleanly (we un-schedule the task_works for other threads
that are still pending and we abort all the task_works that have
already started to run by setting the shared_ctx.preparation_error).
After returning from the system call with -ERESTARTNOINTR, it gets
retried automatically to recover from the problem.

In all the other cases where we wait_for_completion() uninterruptibly,
we can reason about that returning, because these happen under
circumstances where we know that the task works we are waiting for
have all been started already.

I have reproduced the deadlock and verified the fixed implementation
with the selftest, by temporarily adding mdelay() calls and a lot of
logging in strategic places.

> 
> > +       } while (found_more_threads &&
> > +                !atomic_read(&shared_ctx.preparation_error));
> > +
> > +       /*
> > +        * We now have all sibling threads blocking and in "prepared" state in
> > +        * the task work. Ask all threads to commit.
> > +        */
> > +       complete_all(&shared_ctx.ready_to_commit);
> > +
> > +       if (works.size)
> > +               wait_for_completion(&shared_ctx.all_finished);
> > +
> > +       tsync_works_free(&works);
> > +
> > +       return atomic_read(&shared_ctx.preparation_error);
> > +}
> [...]
> > @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> >         new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> >  #endif /* CONFIG_AUDIT */
> >
> > +       if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> > +               res = restrict_sibling_threads(current_cred(), new_cred);
> > +               if (res != 0) {
> > +                       abort_creds(new_cred);
> > +                       return res;
> > +               }
> > +       }
> 
> Annoyingly, there is a special-case path above for the case where
> LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF is set without actually
> applying any ruleset. In that case you won't reach this point, and so
> LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF would only affect the
> current thread in that case. I doubt it'd be very noticeable, but
> still, it might be a good idea to rearrange things here a bit... maybe
> instead of the current `if (!ruleset) return commit_creds(new_cred);`,
> put some of the subsequent stuff in a `if (ruleset) {` block?

Thanks, I fixed that one as well.

—Günther

next prev parent reply	other threads:[~2025-11-27 10:32 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-01 11:23 [PATCH v2 0/2] Landlock multithreaded enforcement Günther Noack
2025-10-01 11:18 ` [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self() Günther Noack
2025-10-17 15:04   ` Mickaël Salaün
2025-10-24 21:18     ` Jann Horn
2025-11-27  9:34       ` Günther Noack
2025-11-27  9:32     ` Günther Noack
2025-10-20 20:12   ` Mickaël Salaün
2025-10-24 21:29     ` Jann Horn
2025-11-27  9:36       ` Günther Noack
2025-11-27  9:56     ` Günther Noack
2025-10-24 21:11   ` Jann Horn
2025-11-27 10:32     ` Günther Noack [this message]
2025-10-01 11:18 ` [PATCH v2 2/2] landlock: selftests for LANDLOCK_RESTRICT_SELF_TSYNC Günther Noack
2025-10-17 15:05   ` Mickaël Salaün

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aSgopxqRkzKhvUXw@google.com \
    --to=gnoack@google.com \
    --cc=jannh@google.com \
    --cc=konstantin.meskhidze@huawei.com \
    --cc=linux-security-module@vger.kernel.org \
    --cc=m@maowtm.org \
    --cc=mic@digikod.net \
    --cc=paul@paul-moore.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.