All of lore.kernel.org
 help / color / mirror / Atom feed
From: Reinette Chatre <reinette.chatre@intel.com>
To: Qinyun Tan <qinyuntan@linux.alibaba.com>,
	Tony Luck <tony.luck@intel.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>, <linux-kernel@vger.kernel.org>,
	<x86@kernel.org>
Subject: Re: [PATCH] x86/resctrl: Remove unnecessary references to cacheinfo in the resctrl subsystem.
Date: Tue, 27 May 2025 16:36:17 -0700	[thread overview]
Message-ID: <c4382156-a51e-4d07-9ccb-e6db2ca9d719@intel.com> (raw)
In-Reply-To: <20250526073744.62520-1-qinyuntan@linux.alibaba.com>

Hi Qinyun,

Thank you for catching this.

On 5/26/25 12:37 AM, Qinyun Tan wrote:
> In the resctrl subsystem's Sub-NUMA Cluster (SNC) mode, the rdt_mon_domain
> structure previously relied on the cacheinfo interface to store L3 cache

"previously relied" -> "relies"

> information (e.g., shared_cpu_map) for monitoring. However, this approach
> introduces risks when CPUs go offline:
> 
> 1. Inconsistency: The ci field in rdt_mon_domain was initialized using the

"was initialized" -> "is initialized"

> first online CPU of a NUMA node. If this CPU later goes offline, the
> shared_cpu_map (managed by the cache subsystem) is cleared, leading to

"is cleared" sounds like the shared_cpu_map is empty. Looking at
cache_shared_cpu_map_remove() it seems that the worst case is when the
CPU goes offline the shared_cpu_map only contains the offline CPU itself.
If there remains a reference to that shared_cpu_map then it will thus be
to a cpumask with an offline CPU. Are you seeing other scenarios?

> incorrect or undefined behavior when accessed via rdt_mon_domain.

Could you please elaborate what "incorrect and undefined behavior" you
encountered?
It looks like reading the top level event would not be able to read
an event from one (or more?) of the online domains since the shared_cpu_map
will contain an offline CPU causing smp_call_on_cpu() to return with a failure
that is not caught ... the symptom may this be that there are no errors but
data is wrong?

> 
> 2. Lifecycle dependency: The cacheinfo structure's lifecycle is managed
> by the cache subsystem, making it unsafe for resctrl to hold
> long-term references.

This is not obvious to me. Could you please elaborate how resctrl could
have a reference to a removed structure?

> 
> To resolve these issues and align with design principles:
> 
> 1. Replace direct cacheinfo references in struct rdt_mon_domain and struct
> rmid_read with the cacheinfo ID (a unique identifier for the L3 cache).
> 
> 2. Use hdr.cpu_mask (already maintained by resctrl) to replace
> shared_cpu_map logic for determining valid CPUs for RMID counter reads
> via the MSR interface.

I think it will help to explain why it is ok now to use hdr.cpu_mask instead
of the shared_cpu_map. In support of this you could mention that the original
solution aimed to read the counters on any CPU associated with the L3 cache
that the sub-numa domain forms part of, but this solution uses the
cpumask of one of the sub-numa domains that is known to be associated with
the L3 cache. This is a reduced set of CPUs from previously intended but
known to be accurate. Alternative may be to build a local cpumask from the
all the sub-numa domains but it is not clear to me how much this will
improve things.

Considering this I do not think the references are "unnecessary" as the
subject claims since the solution does not replace the original cpumask with
an identical one.

> 

Needs a "Fixes:" tag.

> Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
> ---

Reinette

  parent reply	other threads:[~2025-05-27 23:36 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-26  7:37 [PATCH] x86/resctrl: Remove unnecessary references to cacheinfo in the resctrl subsystem Qinyun Tan
2025-05-27 21:25 ` Luck, Tony
2025-05-28  2:30   ` qinyuntan
2025-05-27 23:36 ` Reinette Chatre [this message]
2025-05-28  0:14   ` Luck, Tony
2025-05-28  2:59     ` Reinette Chatre
2025-05-28  3:32   ` qinyuntan
2025-05-28  4:49     ` Reinette Chatre
2025-05-28  6:37       ` qinyuntan
2025-05-28  6:40         ` qinyuntan
2025-05-28 17:38         ` Reinette Chatre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c4382156-a51e-4d07-9ccb-e6db2ca9d719@intel.com \
    --to=reinette.chatre@intel.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=qinyuntan@linux.alibaba.com \
    --cc=tony.luck@intel.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.