Date: Thu, 27 Nov 2025 11:32:07 +0100
From: Günther Noack
To: Jann Horn
Cc: Mickaël Salaün, Konstantin Meskhidze, Tingmao Wang, Paul Moore,
 linux-security-module@vger.kernel.org
Subject: Re: [PATCH v2 1/2] landlock: Multithreading support for
 landlock_restrict_self()
References: <20251001111807.18902-1-gnoack@google.com>
 <20251001111807.18902-2-gnoack@google.com>

On Fri, Oct 24, 2025 at 11:11:10PM +0200, Jann Horn wrote:
> On Wed, Oct 1, 2025 at 1:18 PM Günther Noack wrote:
> > Introduce the LANDLOCK_RESTRICT_SELF_TSYNC flag. With this flag, a
> > given Landlock ruleset is applied to all threads of the calling
> > process, instead of only the current one.
> >
> > Without this flag, multithreaded userspace programs currently resort
> > to using the nptl(7)/libpsx hack for multithreaded policy enforcement,
> > which is also used by libcap and for setuid(2). Using this scheme,
> > the threads of a process enforce the same Landlock ruleset, but the
> > resulting Landlock domains are still separate, which makes a
> > difference for Landlock's "scoped" access rights, where the domain
> > identity and nesting are used. As a result, when using
> > LANDLOCK_SCOPE_SIGNAL, signaling between sibling threads stops
> > working. This is a problem for programming languages and frameworks
> > which are inherently multithreaded (e.g. Go).
>
> This looks good to me overall, though there are a couple of details
> to fix.
>
> [...]
> > +static inline void landlock_cred_copy(struct landlock_cred_security *dst,
> > +				      const struct landlock_cred_security *src)
> > +{
> > +	if (dst->domain)
> > +		landlock_put_ruleset(dst->domain);
> > +
> > +	*dst = *src;
>
> nit: I would add a short comment at the definition of struct
> landlock_cred_security noting that this function memcpy's the entire
> struct.

Sounds good. I added a small remark: "when updating this, also update
landlock_cred_copy() if needed".

> > +
> > +	if (dst->domain)
> > +		landlock_get_ruleset(src->domain);
> > +}
> [...]
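For concreteness, the remark sits at the struct definition, roughly
like this (a sketch only; the real struct has more fields than shown
here, e.g. the audit-related domain_exec bits):

/*
 * When updating struct landlock_cred_security, also update
 * landlock_cred_copy() if needed: it copies the whole struct with a
 * plain assignment, including any fields added later.
 */
struct landlock_cred_security {
	struct landlock_ruleset *domain;
	/* ... further fields, e.g. domain_exec under CONFIG_AUDIT ... */
};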
> > +/*
> > + * tsync_works_grow_by - preallocates space for n more contexts in s
> > + *
> > + * Returns:
> > + * -ENOMEM if the (re)allocation fails
> > + * 0 if the allocation succeeds, partially succeeds, or no
> > + *   reallocation was needed
> > + */
> > +static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
> > +{
> > +	int i;
> > +	size_t new_capacity = s->capacity + n;
>
> (You only have to grow to `s->size + n` but I guess this works too.)

Thanks, well spotted. This was indeed the intended behavior; the
new_capacity <= s->capacity check also makes much more sense that way.
(I have a more detailed answer in another reply.) I fixed this and also
added an overflow check for good measure.

> > +	struct tsync_work **works;
> > +
> > +	if (new_capacity <= s->capacity)
> > +		return 0;
> > +
> > +	works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
> > +			       flags);
> > +	if (IS_ERR(works))
> > +		return PTR_ERR(works);
>
> The kmalloc function family returns NULL on failure, so you have to
> check for NULL here instead of IS_ERR(), and then return -ENOMEM
> instead of PTR_ERR().

Thanks, fixed.

> > +	s->works = works;
> > +
> > +	for (i = s->capacity; i < new_capacity; i++) {
> > +		s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
> > +		if (IS_ERR(s->works[i])) {
>
> (again, kzalloc() returns NULL on failure)

Done.

> > +			/*
> > +			 * Leave the object in a consistent state,
> > +			 * but return an error.
> > +			 */
> > +			s->capacity = i;
> > +			return PTR_ERR(s->works[i]);
> > +		}
> > +	}
> > +	s->capacity = new_capacity;
> > +	return 0;
> > +}
> [...]
> > +/*
> > + * tsync_works_free - free memory held by s and drop all task references
> > + */
> > +static void tsync_works_free(struct tsync_works *s)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < s->size; i++)
> > +		put_task_struct(s->works[i]->task);
>
> You'll need a NULL check before calling put_task_struct(), since the
> task_work_add() failure path can NULL out ->task. (Alternatively you
> could leave the task pointer intact in the task_work_add() failure
> path, since task_work_add() only fails if the task is already
> PF_EXITING. The &work_exited marker which causes task_work_add() to
> fail is only put on the task work list when task_work_run() runs on a
> PF_EXITING task.)

Thanks for spotting this, this is correct! I added a NULL check.
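For reference, with the NULL checks and the overflow check folded in,
the two helpers look roughly like this (a sketch of the direction for
v3, not the literal patch; check_add_overflow() is one way to do the
overflow check):

static int tsync_works_grow_by(struct tsync_works *s, size_t n, gfp_t flags)
{
	size_t new_capacity, i;
	struct tsync_work **works;

	/* Grow from size, not capacity, and guard against overflow. */
	if (check_add_overflow(s->size, n, &new_capacity))
		return -ENOMEM;
	if (new_capacity <= s->capacity)
		return 0;

	works = krealloc_array(s->works, new_capacity, sizeof(s->works[0]),
			       flags);
	if (!works) /* the kmalloc family returns NULL on failure */
		return -ENOMEM;
	s->works = works;

	for (i = s->capacity; i < new_capacity; i++) {
		s->works[i] = kzalloc(sizeof(*s->works[i]), flags);
		if (!s->works[i]) {
			/* Leave the object consistent, but return an error. */
			s->capacity = i;
			return -ENOMEM;
		}
	}
	s->capacity = new_capacity;
	return 0;
}

static void tsync_works_free(struct tsync_works *s)
{
	size_t i;

	for (i = 0; i < s->size; i++) {
		/* ->task is NULL when task_work_add() failed for it. */
		if (s->works[i]->task)
			put_task_struct(s->works[i]->task);
	}
	/* ... freeing of the works array itself stays as quoted below ... */
}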
> > +	for (i = 0; i < s->capacity; i++)
> > +		kfree(s->works[i]);
> > +	kfree(s->works);
> > +	s->works = NULL;
> > +	s->size = 0;
> > +	s->capacity = 0;
> > +}
> > +
> > +/*
> > + * restrict_sibling_threads - enables a Landlock policy for all sibling threads
> > + */
> > +static int restrict_sibling_threads(const struct cred *old_cred,
> > +				    const struct cred *new_cred)
> > +{
> > +	int res;
> > +	struct task_struct *thread, *caller;
> > +	struct tsync_shared_context shared_ctx;
> > +	struct tsync_works works = {};
> > +	size_t newly_discovered_threads;
> > +	bool found_more_threads;
> > +	struct tsync_work *ctx;
> > +
> > +	atomic_set(&shared_ctx.preparation_error, 0);
> > +	init_completion(&shared_ctx.all_prepared);
> > +	init_completion(&shared_ctx.ready_to_commit);
> > +	atomic_set(&shared_ctx.num_unfinished, 0);
>
> I think num_unfinished should be initialized to 1 here and decremented
> later on, similar to how num_preparing works. Though it only matters
> in the edge case where the first thread we send task work to
> immediately fails the memory allocation. (And then you can also remove
> that "if (works.size)" check before
> "wait_for_completion(&shared_ctx.all_finished)".)

Thank you, good catch! The works.size check was inaccurate: in the case
of an error during task_work_add(), it wasn't actually counting the
number of scheduled task works, but overestimating it. The scenario is
a bit obscure, but initializing num_unfinished to 1 is a more robust
approach that rules out that class of bugs.

> > +	init_completion(&shared_ctx.all_finished);
> > +	shared_ctx.old_cred = old_cred;
> > +	shared_ctx.new_cred = new_cred;
> > +
> > +	caller = current;
> [...]
> > +			init_task_work(&ctx->work,
> > +				       restrict_one_thread_callback);
> > +			res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
> > +			if (res) {
> > +				/*
> > +				 * Remove the task from ctx so that we will
> > +				 * revisit the task at a later stage, if it
> > +				 * still exists.
> > +				 */
> > +				put_task_struct_rcu_user(ctx->task);
>
> The complement to get_task_struct() is put_task_struct(), which I see
> you also used in tsync_works_free(). put_task_struct_rcu_user() is for
> a different, special type of task_struct reference.

Thanks, done.

> > +				ctx->task = NULL;
> > +
> > +				atomic_set(&shared_ctx.preparation_error, res);
>
> I think you don't want to set preparation_error here - that would
> cause the syscall to return -ESRCH if we happen to race with an
> exiting thread. Just remove that line - in the next iteration, we'll
> skip this thread even if it still exists, because it has PF_EXITING
> set by this point.

Thanks, that is correct; I fixed it as you suggested. The thread
exiting is the only reason why task_work_add() can fail. In the
(perfectly valid) case where one of the sibling threads happens to
exit, we do not want the landlock_restrict_self() syscall to fail just
because of that.
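Putting both of those fixes together, the task_work_add() failure path
now looks roughly like this (a sketch, not the literal v3 code):

res = task_work_add(thread, &ctx->work, TWA_SIGNAL);
if (res) {
	/*
	 * task_work_add() can only fail for a PF_EXITING thread. An
	 * exiting sibling is not an error for the whole syscall, so we
	 * deliberately do not record res in preparation_error.
	 */
	put_task_struct(ctx->task); /* pairs with get_task_struct() */
	ctx->task = NULL;           /* tsync_works_free() skips NULL tasks */
	atomic_dec(&shared_ctx.num_preparing);
	atomic_dec(&shared_ctx.num_unfinished);
}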
> > +				atomic_dec(&shared_ctx.num_preparing);
> > +				atomic_dec(&shared_ctx.num_unfinished);
> > +			}
> > +		}
> > +		rcu_read_unlock();
> > +
> > +		/*
> > +		 * Decrement num_preparing for current, to undo that we
> > +		 * initialized it to 1 at the beginning of the inner loop.
> > +		 */
> > +		if (atomic_dec_return(&shared_ctx.num_preparing) > 0)
> > +			wait_for_completion(&shared_ctx.all_prepared);
>
> I'm sorry, because this will make the patch a little bit more
> complicated, but... I don't think you can use wait_for_completion()
> here. Consider the scenario where two userspace threads of the same
> process call this functionality (or a kernel subsystem that does
> something similar) simultaneously. Each thread will wait for the other
> indefinitely, and userspace won't even be able to resolve the deadlock
> by killing the processes.
> Similar issues would probably apply if, for example, GDB tried to
> attach to the process with bad timing - if GDB ptrace-stops another
> thread before you schedule task work for it, and then tries to
> ptrace-stop this thread, I think this thread could essentially be in a
> deadlock with GDB.
>
> You'll have to do something else here. I think the best solution would
> be to use wait_for_completion_interruptible() instead; then if that
> fails, tear down all the task work stuff that was already scheduled,
> and return with error -ERESTARTNOINTR. Something like (entirely
> untested):
>
> /* interruptible wait to avoid deadlocks while waiting for other
>    tasks to enter our task work */
> if (wait_for_completion_interruptible(&shared_ctx.all_prepared)) {
> 	atomic_set(&shared_ctx.preparation_error, -ERESTARTNOINTR);
> 	for (int i = 0; i < works.size; i++) {
> 		if (task_work_cancel(works.works[i]->task, &works.works[i]->work))
> 			if (atomic_dec_return(&shared_ctx.num_preparing))
> 				complete_all(&shared_ctx.all_prepared);
> 	}
> 	/* at this point we're only waiting for tasks that are already
> 	   executing the task work */
> 	wait_for_completion(&shared_ctx.all_prepared);
> }
>
> Note that if the syscall returns -ERESTARTNOINTR, that won't be
> visible to userspace (except for debugging tools like strace/gdb); the
> kernel ensures that the syscall will transparently re-execute
> immediately. (It literally decrements the saved userspace instruction
> pointer by the size of a syscall instruction, so that when the kernel
> returns to userspace, the next instruction that executes will redo the
> syscall.) This allows us to break the deadlock without having to write
> any ugly retry logic or throwing userspace-visible errors.

Thank you again for catching this! I used your suggestion, with the
following (minor) differences:

1. Factored it out as a function (indentation level got too high...)
2. Typo: call complete_all() only if atomic_dec_return() returns 0.
3. Do the same barrier synchronization dance with
   num_unfinished/all_finished as well.

...and I also implemented a selftest for the case where
landlock_restrict_self() gets called by two sibling threads at the
same time.

I'll write down my reasoning why this works, for reference:

The problem in v2 is that we can run into a deadlock in the case where
two threads call landlock_restrict_self() at the same time. In that
case, they'll both become uninterruptible at syscall entry and try to
schedule a task work for each other. Then, they proceed to wait for
each other's task work to execute, which never happens, because the
task work never gets scheduled.

This is resolved by using an interruptible wait. With the interruptible
wait, the task can detect the condition where a signal (or task work)
comes in, execute that task work, and bail out of the system call
cleanly (we un-schedule the task works for other threads that are still
pending, and we abort all the task works that have already started to
run by setting shared_ctx.preparation_error). After returning from the
system call with -ERESTARTNOINTR, the syscall gets retried
automatically, which recovers from the problem.

In all the other places where we wait_for_completion() uninterruptibly,
we can reason that the wait terminates, because those waits only happen
under circumstances where we know that the task works we are waiting
for have all been started already.

I have reproduced the deadlock and verified the fixed implementation
with the selftest, by temporarily adding mdelay() calls and a lot of
logging in strategic places.

> > +	} while (found_more_threads &&
> > +		 !atomic_read(&shared_ctx.preparation_error));
> > +
> > +	/*
> > +	 * We now have all sibling threads blocking and in "prepared"
> > +	 * state in the task work. Ask all threads to commit.
> > +	 */
> > +	complete_all(&shared_ctx.ready_to_commit);
> > +
> > +	if (works.size)
> > +		wait_for_completion(&shared_ctx.all_finished);
> > +
> > +	tsync_works_free(&works);
> > +
> > +	return atomic_read(&shared_ctx.preparation_error);
> > +}
> [...]
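The selftest is essentially the obvious race: two sibling threads
applying the same ruleset concurrently. Roughly like this (a
simplified, standalone sketch rather than the actual selftest code;
LANDLOCK_RESTRICT_SELF_TSYNC is the flag proposed by this series, so
this only builds against headers that include it):

#include <linux/landlock.h>
#include <pthread.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int ruleset_fd;
static pthread_barrier_t barrier;

static void *restrict_self_fn(void *arg)
{
	/*
	 * Maximize the chance that both threads enter the syscall at
	 * the same time; with v2's uninterruptible wait this could
	 * deadlock, with the interruptible wait both calls succeed
	 * (the -ERESTARTNOINTR restart is invisible to userspace).
	 */
	pthread_barrier_wait(&barrier);
	return (void *)syscall(SYS_landlock_restrict_self, ruleset_fd,
			       LANDLOCK_RESTRICT_SELF_TSYNC);
}

int main(void)
{
	struct landlock_ruleset_attr attr = {
		.handled_access_fs = LANDLOCK_ACCESS_FS_EXECUTE,
	};
	pthread_t t1, t2;
	void *r1, *r2;

	ruleset_fd = syscall(SYS_landlock_create_ruleset, &attr,
			     sizeof(attr), 0);
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	pthread_barrier_init(&barrier, NULL, 2);
	pthread_create(&t1, NULL, restrict_self_fn, NULL);
	pthread_create(&t2, NULL, restrict_self_fn, NULL);
	pthread_join(t1, &r1);
	pthread_join(t2, &r2);
	return (r1 || r2) ? 1 : 0;
}

(Build with -lpthread; error handling for ruleset creation is elided
for brevity.)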
> > @@ -566,5 +987,13 @@ SYSCALL_DEFINE2(landlock_restrict_self, const int, ruleset_fd, const __u32,
> >  	new_llcred->domain_exec |= BIT(new_dom->num_layers - 1);
> >  #endif /* CONFIG_AUDIT */
> >
> > +	if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
> > +		res = restrict_sibling_threads(current_cred(), new_cred);
> > +		if (res != 0) {
> > +			abort_creds(new_cred);
> > +			return res;
> > +		}
> > +	}
>
> Annoyingly, there is a special-case path above for the case where
> LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF is set without actually
> applying any ruleset. In that case you won't reach this point, and so
> LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF would only affect the
> current thread in that case. I doubt it'd be very noticeable, but
> still, it might be a good idea to rearrange things here a bit... maybe
> instead of the current `if (!ruleset) return commit_creds(new_cred);`,
> put some of the subsequent stuff in a `if (ruleset) {` block?

Thanks, I fixed that one as well.
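For the record, the rearranged tail of the syscall then has roughly
this shape (a hypothetical sketch of the suggested restructuring; the
domain-merging details are elided and the actual v3 code may differ):

	if (ruleset) {
		/* ... merge ruleset into the new domain and update
		 * new_llcred, as before ... */
	}

	if (flags & LANDLOCK_RESTRICT_SELF_TSYNC) {
		/*
		 * Reached on both paths now, so a call that only sets
		 * LANDLOCK_RESTRICT_SELF_LOG_SUBDOMAINS_OFF also
		 * propagates to the sibling threads.
		 */
		res = restrict_sibling_threads(current_cred(), new_cred);
		if (res != 0) {
			abort_creds(new_cred);
			return res;
		}
	}

	return commit_creds(new_cred);

—Günther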