From: Jann Horn
Date: Fri, 24 Oct 2025 23:11:10 +0200
Subject: Re: [PATCH v2 1/2] landlock: Multithreading support for landlock_restrict_self()
To: Günther Noack
Cc: Mickaël Salaün, Konstantin Meskhidze, Tingmao Wang, Paul Moore,
 linux-security-module@vger.kernel.org
In-Reply-To: <20251001111807.18902-2-gnoack@google.com>
References: <20251001111807.18902-1-gnoack@google.com>
 <20251001111807.18902-2-gnoack@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Wed, Oct 1, 2025 at 1:18 PM Günther Noack wrote:
> Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> given Landlock ruleset is applied to all threads of the calling
> process, instead of only the current one.
>
> Without this flag, multithreaded userspace programs currently resort
> to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> which is also used by libcap and for setuid(2). Using this scheme,
> the threads of a process enforce the same Landlock ruleset, but the
> resulting Landlock domains are still separate, which makes a
> difference for Landlock's "scoped" access rights, where the domain
> identity and nesting are used. As a result, when using
> LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> working. This is a problem for programming languages and frameworks
> which are inherently multithreaded (e.g. Go).

This looks good to me overall, though there are a couple of details
to fix.

[...]

> +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> +                                      const struct landlock_cred_security *src)
> +{
> +        if (dst->domain)
> +                landlock_put_ruleset(dst->domain);
> +
> +        *dst = *src;

nit: I would add a short comment at the definition of struct
landlock_cred_security noting that this function memcpy's the entire
struct.
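Something like this at the struct definition (rough sketch, members
other than ->domain elided):

/*
 * Note: landlock_cred_copy() duplicates this entire struct with a plain
 * struct assignment, so every member added here (including any
 * CONFIG_AUDIT-only ones) is copied along automatically.
 */
struct landlock_cred_security {
        struct landlock_ruleset *domain;
        /* ... */
};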
> +
> +        if (dst->domain)
> +                landlock_get_ruleset(src->domain);
> +}

[...]

> +/*
> + * tsync_works_grow_by - preallocates space for n more contexts in s
> + *
> + * Returns:
> + *   -ENOMEM if the (re)allocation fails
> + *   0 if the allocation succeeds, partially succeeds, or no reallocation was needed
> + */
> +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> +{
> +        int i;
> +        size_t new_capacity = s->capacity + n;

(You only have to grow to `s->size + n` but I guess this works too.)

> +        struct tsync_work **works;
> +
> +        if (new_capacity <= s->capacity)
> +                return 0;
> +
> +        works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> +                               flags);
> +        if (IS_ERR(works))
> +                return PTR_ERR(works);

The kmalloc function family returns NULL on failure, so you have to
check for NULL here instead of IS_ERR(), and then return -ENOMEM
instead of PTR_ERR().

> +
> +        s->works = works;
> +
> +        for (i = s->capacity; i < new_capacity; i++) {
> +                s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
> +                if (IS_ERR(s->works[i])) {

(again, kzalloc() returns NULL on failure)

> +                        /*
> +                         * Leave the object in a consistent state,
> +                         * but return an error.
> +                         */
> +                        s->capacity = i;
> +                        return PTR_ERR(s->works[i]);
> +                }
> +        }
> +        s->capacity = new_capacity;
> +        return 0;
> +}

[...]

> +/*
> + * tsync_works_free - free memory held by s and drop all task references
> + */
> +static void tsync_works_free(struct tsync_works *s)
> +{
> +        int i;
> +
> +        for (i = 0; i < s->size; i++)
> +                put_task_struct(s->works[i]->task);

You'll need a NULL check before calling put_task_struct(), since the
task_work_add() failure path can NULL out ->task.

(Alternatively you could leave the task pointer intact in the
task_work_add() failure path, since task_work_add() only fails if the
task is already PF_EXITING. The &work_exited marker which causes
task_work_add() to fail is only put on the task work list when
task_work_run() runs on a PF_EXITING task.)

> +        for (i = 0; i < s->capacity; i++)
> +                kfree(s->works[i]);
> +        kfree(s->works);
> +        s->works = NULL;
> +        s->size = 0;
> +        s->capacity = 0;
> +}
> +
> +/*
> + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> + */
> +static int restrict_sibling_threads(const struct cred *old_cred,
> +                                    const struct cred *new_cred)
> +{
> +        int res;
> +        struct task_struct *thread, *caller;
> +        struct tsync_shared_context shared_ctx;
> +        struct tsync_works works = {};
> +        size_t newly_discovered_threads;
> +        bool found_more_threads;
> +        struct tsync_work *ctx;
> +
> +        atomic_set(&shared_ctx.preparation_error, 0);
> +        init_completion(&shared_ctx.all_prepared);
> +        init_completion(&shared_ctx.ready_to_commit);
> +        atomic_set(&shared_ctx.num_unfinished, 0);

I think num_unfinished should be initialized to 1 here and
decremented later on, similar to how num_preparing works. Though it
only matters in the edge case where the first thread we send task
work to immediately fails the memory allocation. (And then you can
also remove that "if (works.size)" check before
"wait_for_completion(&shared_ctx.all_finished)".)

> +        init_completion(&shared_ctx.all_finished);
> +        shared_ctx.old_cred = old_cred;
> +        shared_ctx.new_cred = new_cred;
> +
> +        caller = current;

[...]

> +                        init_task_work(&ctx->work,
> +                                       restrict_one_thread_callback);
> +                        res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> +                        if (res) {
> +                                /*
> +                                 * Remove the task from ctx so that we will
> +                                 * revisit the task at a later stage, if it
> +                                 * still exists.
> +                                 */
> +                                put_task_struct_rcu_user(ctx->task);

The complement to get_task_struct() is put_task_struct(), which I see
you also used in tsync_works_free(). put_task_struct_rcu_user() is
for a different, special type of task_struct reference.

> +                                ctx->task = NULL;
> +
> +                                atomic_set(&shared_ctx.preparation_error, res);

I think you don't want to set preparation_error here - that would
cause the syscall to return -ESRCH if we happen to race with an
exiting thread. Just remove that line - in the next iteration, we'll
skip this thread even if it still exists, because it has PF_EXITING
set by this point.
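Putting those two points together, the failure branch would then look
roughly like this (untested sketch, keeping the rest of your loop
as-is):

                        if (res) {
                                /*
                                 * Drop our reference and forget the task;
                                 * it is already PF_EXITING, so later
                                 * iterations will skip it anyway.
                                 */
                                put_task_struct(ctx->task);
                                ctx->task = NULL;
                                atomic_dec(&shared_ctx.num_preparing);
                                atomic_dec(&shared_ctx.num_unfinished);
                        }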
> +                                atomic_dec(&shared_ctx.num_preparing);
> +                                atomic_dec(&shared_ctx.num_unfinished);
> +                        }
> +                }
> +                rcu_read_unlock();
> +
> +                /*
> +                 * Decrement num_preparing for current, to undo that we
> +                 * initialized it to 1 at the beginning of the inner loop.
> +                 */
> +                if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> +                        wait_for_completion(&shared_ctx.all_prepared);

I'm sorry, because this will make the patch a little bit more
complicated, but... I don't think you can use wait_for_completion()
here.

Consider the scenario where two userspace threads of the same process
call this functionality (or a kernel subsystem that does something
similar) simultaneously. Each thread will wait for the other
indefinitely, and userspace won't even be able to resolve the
deadlock by killing the processes.

Similar issues would probably apply if, for example, GDB tried to
attach to the process with bad timing - if GDB ptrace-stops another
thread before you schedule task work for it, and then tries to
ptrace-stop this thread, I think this thread could essentially be in
a deadlock with GDB.

You'll have to do something else here. I think the best solution
would be to use wait_for_completion_interruptible() instead; then if
that fails, tear down all the task work stuff that was already
scheduled, and return with error -ERESTARTNOINTR. Something like
(entirely untested):

        /*
         * Interruptible wait to avoid deadlocks while waiting for other
         * tasks to enter our task work.
         */
        if (wait_for_completion_interruptible(&shared_ctx.all_prepared)) {
                atomic_set(&shared_ctx.preparation_error, -ERESTARTNOINTR);
                for (int i = 0; i < works.size; i++) {
                        if (task_work_cancel(works.works[i]->task,
                                             &works.works[i]->work))
                                if (atomic_dec_return(&shared_ctx.num_preparing) == 0)
                                        complete_all(&shared_ctx.all_prepared);
                }
                /*
                 * At this point we're only waiting for tasks that are
                 * already executing the task work.
                 */
                wait_for_completion(&shared_ctx.all_prepared);
        }

Note that if the syscall returns -ERESTARTNOINTR, that won't be
visible to userspace (except for debugging tools like strace/gdb);
the kernel ensures that the syscall will transparently re-execute
immediately. (It literally decrements the saved userspace instruction
pointer by the size of a syscall instruction, so that when the kernel
returns to userspace, the next instruction that executes will redo
the syscall.) This allows us to break the deadlock without having to
write any ugly retry logic or throwing userspace-visible errors.
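For illustration, this is roughly the restart logic on x86 when we
return to userspace without invoking a signal handler, paraphrased
from memory from arch/x86/kernel/signal.c (abridged; other
architectures and kernel versions differ in the details):

        /* Did we come from a system call? */
        if (syscall_get_nr(current, regs) != -1) {
                /* Restart the system call - no handlers present. */
                switch (syscall_get_error(current, regs)) {
                case -ERESTARTNOHAND:
                case -ERESTARTSYS:
                case -ERESTARTNOINTR:
                        /*
                         * Restore the syscall number as the return value
                         * and rewind the user IP past the syscall
                         * instruction, so it is re-executed.
                         */
                        regs->ax = regs->orig_ax;
                        regs->ip -= 2;
                        break;
                }
        }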
> +        } while (found_more_threads &&
> +                 !atomic_read(&shared_ctx.preparation_error));
> +
> +        /*
> +         * We now have all sibling threads blocking and in "prepared" state in
> +         * the task work. Ask all threads to commit.
> +         */
> +        complete_all(&shared_ctx.ready_to_commit);
> +
> +        if (works.size)
> +                wait_for_completion(&shared_ctx.all_finished);
> +
> +        tsync_works_free(&works);
> +
> +        return atomic_read(&shared_ctx.preparation_error);
> +}

[...]

> @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
>          new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
>  #endif /* CONFIG_AUDIT */
>
> +        if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> +                res = restrict_sibling_threads(current_cred(), new_cred);
> +                if (res != 0) {
> +                        abort_creds(new_cred);
> +                        return res;
> +                }
> +        }

Annoyingly, there is a special-case path above for the case where
LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF is set without actually
applying any ruleset. In that case you won't reach this point, and so
LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF would only affect the
current thread. I doubt it'd be very noticeable, but still, it might
be a good idea to rearrange things here a bit... maybe instead of the
current `if (!ruleset) return commit_creds(new_cred);`, put some of
the subsequent stuff in a `if (ruleset) {` block?
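I.e., the tail of the syscall could end up with roughly this shape
(sketch only, glossing over the existing details of the function):

        /* ... LOG_SUBDOMAINS_OFF handling on new_llcred as before ... */

        if (ruleset) {
                /*
                 * Merge the ruleset into the new domain, set
                 * domain_exec under CONFIG_AUDIT, etc., as the
                 * function already does today.
                 */
        }

        if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
                res = restrict_sibling_threads(current_cred(), new_cred);
                if (res != 0) {
                        abort_creds(new_cred);
                        return res;
                }
        }

        return commit_creds(new_cred);

That way the TSYNC handling also runs when no ruleset is applied and
only the logging flags change.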