From: "Luck, Tony" <tony.luck@intel.com>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: Borislav Petkov <bp@alien8.de>, <x86@kernel.org>,
Fenghua Yu <fenghuay@nvidia.com>,
Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>,
Peter Newman <peternewman@google.com>,
James Morse <james.morse@arm.com>,
Babu Moger <babu.moger@amd.com>,
"Drew Fustini" <dfustini@baylibre.com>,
Dave Martin <Dave.Martin@arm.com>, Chen Yu <yu.c.chen@intel.com>,
<linux-kernel@vger.kernel.org>, <patches@lists.linux.dev>
Subject: Re: [PATCH] fs/resctrl: Fix deadlock for errors during mount
Date: Mon, 4 May 2026 09:25:00 -0700
Message-ID: <afjIXKwVpSAh5kAA@agluck-desk3>
In-Reply-To: <1cdef1e9-e484-4929-be2a-793e42a49cca@intel.com>
On Fri, May 01, 2026 at 04:17:18PM -0700, Reinette Chatre wrote:
> Hi Tony,
>
> On 5/1/26 11:56 AM, Tony Luck wrote:
> > Sashiko noticed[1] a deadlock in the resctrl mount code.
> >
> > rdt_get_tree() acquires rdtgroup_mutex before calling kernfs_get_tree(). If
> > superblock setup fails inside kernfs_get_tree(), the VFS calls kill_sb on
> > the same thread before the call returns. rdt_kill_sb() unconditionally
> > attempts to acquire rdtgroup_mutex and deadlock occurs.
>
> Thank you for addressing this.
>
> >
> > Add a boolean rdt_kill_sb_locked flag. Set it for the duration of
> > kernfs_get_tree() and check in rdt_kill_sb() to determine if locks
> > are already held.
> >
>
> ...
>
> > diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> > index 5dfdaa6f9d8f..8544020ef420 100644
> > --- a/fs/resctrl/rdtgroup.c
> > +++ b/fs/resctrl/rdtgroup.c
> > @@ -2782,6 +2782,9 @@ static void schemata_list_destroy(void)
> > }
> > }
> >
> > +/* Protected by the serialized mount path (rdtgroup_mutex + resctrl_mounted). */
>
> I interpret above to mean that every access to rdt_kill_sb_locked can be expected to
> be done with rdtgroup_mutex held ...
The comment could be much more descriptive about the locking and the
limited use case.
> > +static bool rdt_kill_sb_locked;
> > +
> > static int rdt_get_tree(struct fs_context *fc)
> > {
> > struct rdt_fs_context *ctx = rdt_fc2context(fc);
> > @@ -2855,7 +2858,9 @@ static int rdt_get_tree(struct fs_context *fc)
> > if (ret)
> > goto out_mondata;
> >
> > + rdt_kill_sb_locked = true;
> > ret = kernfs_get_tree(fc);
> > + rdt_kill_sb_locked = false;
> > if (ret < 0)
> > goto out_psl;
> >
> > @@ -3173,8 +3178,10 @@ static void rdt_kill_sb(struct super_block *sb)
> > {
> > struct rdt_resource *r;
> >
> > - cpus_read_lock();
> > - mutex_lock(&rdtgroup_mutex);
> > + if (!rdt_kill_sb_locked) {
> > + cpus_read_lock();
> > + mutex_lock(&rdtgroup_mutex);
>
> ... but here clearly rdt_kill_sb_locked can be accessed without rdtgroup_mutex held.
A much better name for this flag would be "resctrl_mount_in_progress",
with the header comment noting that it is set and cleared inside
rdtgroup_mutex-protected code and that it is used only in rdt_kill_sb().
This specific use case seems safe as there are only two call chains
leading to rdt_kill_sb():
1) Error cleanup from failure of kernfs_fill_super() within the
   call to kernfs_get_tree() [rdtgroup_mutex still held in this
   case]
2) From a user call to unmount the filesystem, in which case
   rdt_get_tree() must have completed successfully. Any new mount
   attempts are blocked from changing this flag by the early exit
   based on resctrl_mounted.
>
> It appears that while this change claims that rdt_kill_sb_locked is protected the
> implementation instead seems to actually be "this works for the scenarios cared
> about here" which I understand to be based on considerations of how the filesystem
> code interacts with resctrl callbacks _today_.
>
> > + }
> >
> > rdt_disable_ctx();
> >
> > @@ -3189,8 +3196,10 @@ static void rdt_kill_sb(struct super_block *sb)
> > resctrl_arch_disable_mon();
> > resctrl_mounted = false;
> > kernfs_kill_sb(sb);
> > - mutex_unlock(&rdtgroup_mutex);
> > - cpus_read_unlock();
> > + if (!rdt_kill_sb_locked) {
> > + mutex_unlock(&rdtgroup_mutex);
> > + cpus_read_unlock();
> > + }
> > }
> >
> > static struct file_system_type rdt_fs_type = {
>
> Did you or your AI assistant consider running kernfs_get_tree() without rdtgroup_mutex
> and CPU hotplug lock held? Consider, for example:
Not considered. Thanks for the suggestion ... But, see below.
> diff --git a/fs/resctrl/rdtgroup.c b/fs/resctrl/rdtgroup.c
> index 36d21652616e..9ee6295d6521 100644
> --- a/fs/resctrl/rdtgroup.c
> +++ b/fs/resctrl/rdtgroup.c
> @@ -2892,10 +2892,6 @@ static int rdt_get_tree(struct fs_context *fc)
> if (ret)
> goto out_mondata;
>
> - ret = kernfs_get_tree(fc);
> - if (ret < 0)
> - goto out_psl;
> -
> if (resctrl_arch_alloc_capable())
> resctrl_arch_enable_alloc();
> if (resctrl_arch_mon_capable())
> @@ -2911,10 +2907,10 @@ static int rdt_get_tree(struct fs_context *fc)
> RESCTRL_PICK_ANY_CPU);
> }
>
> - goto out;
> + mutex_unlock(&rdtgroup_mutex);
> + cpus_read_unlock();
> + return kernfs_get_tree(fc);
>
> -out_psl:
> - rdt_pseudo_lock_release();
> out_mondata:
> if (resctrl_arch_mon_capable())
> kernfs_remove(kn_mondata);
>
>
> This seems simpler by:
> * avoiding introduction of additional state (rdt_kill_sb_locked) with unclear protection,
> * avoiding double-cleanup on failure (rdt_kill_sb() called and then all rdt_get_tree()'s
> failure path),
> * maintaining symmetry with rdt_kill_sb() by providing it the state it is
> expected to be called with (i.e resctrl_mounted = true).
All these are excellent points in favor of this approach.
>
> From what I can tell it is safe to call kernfs_kill_sb() on failure of kernfs_get_tree(),
> but this needs to have been considered as part of this submission anyway.
Looks OK to me too.
> Oh, maybe there is a new lock ordering issue with this that I am missing?
I can't see any lock issues.
But ... there is a problem. kernfs_get_tree() can fail for many reasons.
Only the specific case of failure in kernfs_get_super() makes the cleanup
call to rdt_kill_sb(). rdt_get_tree() has no way to tell from the error
code from kernfs_get_tree() whether cleanup has been done.
Plausibly I could do some surgery on the kernfs subsystem to make
kernfs_get_tree() take a second argument "bool *did_i_call_kill_sb".
The only other user is the cgroup code, so this might not be too invasive.
Or, I could fix up the comments to justify use of "resctrl_mount_in_progress"
and also fix up rdt_kill_sb() to look like this:
static void rdt_kill_sb(struct super_block *sb)
{
if (resctrl_mount_in_progress) {
resctrl_clean_up_failed_mount();
return;
}
... existing unmount path code here ...
}
Or ... do you have some other suggestion?
>
> Reinette
-Tony
Thread overview: 5+ messages
2026-05-01 18:56 [PATCH] fs/resctrl: Fix deadlock for errors during mount Tony Luck
2026-05-01 23:17 ` Reinette Chatre
2026-05-04 16:25 ` Luck, Tony [this message]
2026-05-04 17:43 ` Reinette Chatre
2026-05-04 17:52 ` Luck, Tony