[PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID
@ 2025-10-10 17:08 Babu Moger
  2025-10-10 21:20 ` Reinette Chatre
  2025-10-13 19:36 ` [tip: x86/urgent] x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID tip-bot2 for Babu Moger
  0 siblings, 2 replies; 6+ messages in thread
From: Babu Moger @ 2025-10-10 17:08 UTC (permalink / raw)
  To: babu.moger, tony.luck, reinette.chatre, Dave.Martin, james.morse,
	tglx, mingo, bp, dave.hansen
  Cc: x86, hpa, linux-kernel, peternewman, eranian, gautham.shenoy

Users can create as many monitoring groups as the number of RMIDs supported
by the hardware. However, on AMD systems, only a limited number of RMIDs
are guaranteed to be actively tracked by the hardware. RMIDs that exceed
this limit are placed in an "Unavailable" state. When a bandwidth counter
is read for such an RMID, the hardware sets MSR_IA32_QM_CTR.Unavailable
(bit 62). When such an RMID starts being tracked again the hardware counter
is reset to zero. MSR_IA32_QM_CTR.Unavailable remains set on first read
after tracking re-starts and is clear on all subsequent reads as long as
the RMID is tracked.

resctrl miscounts the bandwidth events after an RMID transitions from the
"Unavailable" state back to being tracked. This happens because when the
hardware starts counting again after resetting the counter to zero, resctrl
in turn compares the new count against the counter value stored from the
previous time the RMID was tracked. This results in resctrl computing an
event value that is either undercounting (when new counter is more than
stored counter) or a mistaken overflow (when new counter is less than
stored counter).

Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to
zero whenever the RMID is in the "Unavailable" state to ensure accurate
counting after the RMID resets to zero when it starts to be tracked again.

Example scenario that results in mistaken overflow
==================================================
1. The resctrl filesystem is mounted, and a task is assigned to a
   monitoring group.

   $mount -t resctrl resctrl /sys/fs/resctrl
   $mkdir /sys/fs/resctrl/mon_groups/test1/
   $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   21323            <- Total bytes on domain 0
   "Unavailable"    <- Total bytes on domain 1

   Task is running on domain 0. Counter on domain 1 is "Unavailable".

2. The task runs on domain 0 for a while and then moves to domain 1. The
   counter starts incrementing on domain 1.

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   7345357          <- Total bytes on domain 0
   4545             <- Total bytes on domain 1

3. At some point, the RMID in domain 0 transitions to the "Unavailable"
   state because the task is no longer executing in that domain.

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   "Unavailable"    <- Total bytes on domain 0
   434341           <- Total bytes on domain 1

4.  Since the task continues to migrate between domains, it may eventually
    return to domain 0.

    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
    17592178699059  <- Overflow on domain 0
    3232332         <- Total bytes on domain 1

In this case, the RMID on domain 0 transitions from "Unavailable" state to
active state. The hardware sets MSR_IA32_QM_CTR.Unavailable (bit 62) when
the counter is read and begins tracking the RMID counting from 0.
Subsequent reads succeed but returns a value smaller than the previously
saved MSR value (7345357). Consequently, the resctrl's overflow logic is
triggered, it compares the previous value (7345357) with the new, smaller
value and incorrectly interprets this as a counter overflow, adding a large
delta. In reality, this is a false positive: the counter did not overflow
but was simply reset when the RMID transitioned from "Unavailable" back to
active state.

Here is the text from APM [1] available from [2].

"In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
first QM_CTR read when it begins tracking an RMID that it was not
previously tracking. The U bit will be zero for all subsequent reads from
that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
read with the U bit set when that RMID is in use by a processor can be
considered 0 when calculating the difference with a subsequent read."

[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
    Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
    Bandwidth (MBM).

Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Cc: stable@vger.kernel.org # needs adjustments for <= v6.17
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
---
v4: Removed switch and replaced with if else in the code. Tested again.
    Removed a stray tab in changelog.

v3: Rephrasing changelog considering undercounting problem.
    Checked again for special charactors (grep -P '[^\t\n\x20-\x7E]').
    Removed few of them now. Thanks Reinette.

v2: Fixed few systax issues.
    Checked for special charachars.
    Added Fixes tag.
    Added CC to stable kernel.
    Rephrased most of the changelog.

v1: Tested this on multiple AMD systems, but not on Intel systems.
    Need help with that. If everything goes well, this patch needs to go
    to all the stable kernels.

v1: https://lore.kernel.org/lkml/515a38328989e48d403ef5a7d6dd321ba3343a61.1759791957.git.babu.moger@amd.com/
v2: https://lore.kernel.org/lkml/2ead4ece804b0f61aab01f47385d9a125714991e.1759952394.git.babu.moger@amd.com/
v3: https://lore.kernel.org/lkml/8eace6e47bb7e124ffb6c10150d1b7d38c2f5494.1760034251.git.babu.moger@amd.com/
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c8945610d455..2cd25a0d4637 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -242,7 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *ignored)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
+	struct arch_mbm_state *am;
 	u64 msr_val;
 	u32 prmid;
 	int ret;
@@ -251,12 +253,16 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,

 	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 	ret = __rmid_read_phys(prmid, eventid, &msr_val);
-	if (ret)
-		return ret;

-	*val = get_corrected_val(r, d, rmid, eventid, msr_val);
+	if (!ret) {
+		*val = get_corrected_val(r, d, rmid, eventid, msr_val);
+	} else if (ret == -EINVAL) {
+		am = get_arch_mbm_state(hw_dom, rmid, eventid);
+		if (am)
+			am->prev_msr = 0;
+	}

-	return 0;
+	return ret;
 }

 static int __cntr_id_read(u32 cntr_id, u64 *val)
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID
  2025-10-10 17:08 [PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID Babu Moger
@ 2025-10-10 21:20 ` Reinette Chatre
  2025-10-10 23:37   ` Moger, Babu
  2025-10-13 19:36 ` [tip: x86/urgent] x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID tip-bot2 for Babu Moger
  1 sibling, 1 reply; 6+ messages in thread
From: Reinette Chatre @ 2025-10-10 21:20 UTC (permalink / raw)
  To: Babu Moger, tony.luck, Dave.Martin, james.morse, tglx, mingo, bp,
	dave.hansen
  Cc: x86, hpa, linux-kernel, peternewman, eranian, gautham.shenoy

Hi Babu,

On 10/10/25 10:08 AM, Babu Moger wrote:
> Users can create as many monitoring groups as the number of RMIDs supported
> by the hardware. However, on AMD systems, only a limited number of RMIDs
> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
> this limit are placed in an "Unavailable" state. When a bandwidth counter
> is read for such an RMID, the hardware sets MSR_IA32_QM_CTR.Unavailable
> (bit 62). When such an RMID starts being tracked again the hardware counter
> is reset to zero. MSR_IA32_QM_CTR.Unavailable remains set on first read
> after tracking re-starts and is clear on all subsequent reads as long as
> the RMID is tracked.
> 
> resctrl miscounts the bandwidth events after an RMID transitions from the
> "Unavailable" state back to being tracked. This happens because when the
> hardware starts counting again after resetting the counter to zero, resctrl
> in turn compares the new count against the counter value stored from the
> previous time the RMID was tracked. This results in resctrl computing an
> event value that is either undercounting (when new counter is more than
> stored counter) or a mistaken overflow (when new counter is less than
> stored counter).
> 
> Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to
> zero whenever the RMID is in the "Unavailable" state to ensure accurate
> counting after the RMID resets to zero when it starts to be tracked again.
> 
> Example scenario that results in mistaken overflow
> ==================================================
> 1. The resctrl filesystem is mounted, and a task is assigned to a
>    monitoring group.
> 
>    $mount -t resctrl resctrl /sys/fs/resctrl
>    $mkdir /sys/fs/resctrl/mon_groups/test1/
>    $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    21323            <- Total bytes on domain 0
>    "Unavailable"    <- Total bytes on domain 1
> 
>    Task is running on domain 0. Counter on domain 1 is "Unavailable".
> 
> 2. The task runs on domain 0 for a while and then moves to domain 1. The
>    counter starts incrementing on domain 1.
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    7345357          <- Total bytes on domain 0
>    4545             <- Total bytes on domain 1
> 
> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
>    state because the task is no longer executing in that domain.
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    "Unavailable"    <- Total bytes on domain 0
>    434341           <- Total bytes on domain 1
> 
> 4.  Since the task continues to migrate between domains, it may eventually
>     return to domain 0.
> 
>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>     17592178699059  <- Overflow on domain 0
>     3232332         <- Total bytes on domain 1
> 
> In this case, the RMID on domain 0 transitions from "Unavailable" state to
> active state. The hardware sets MSR_IA32_QM_CTR.Unavailable (bit 62) when
> the counter is read and begins tracking the RMID counting from 0.
> Subsequent reads succeed but returns a value smaller than the previously
> saved MSR value (7345357). Consequently, the resctrl's overflow logic is
> triggered, it compares the previous value (7345357) with the new, smaller
> value and incorrectly interprets this as a counter overflow, adding a large
> delta. In reality, this is a false positive: the counter did not overflow
> but was simply reset when the RMID transitioned from "Unavailable" back to
> active state.
> 
> Here is the text from APM [1] available from [2].
> 
> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
> first QM_CTR read when it begins tracking an RMID that it was not
> previously tracking. The U bit will be zero for all subsequent reads from
> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
> read with the U bit set when that RMID is in use by a processor can be
> considered 0 when calculating the difference with a subsequent read."
> 
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>     Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
>     Bandwidth (MBM).
> 
> Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> Cc: stable@vger.kernel.org # needs adjustments for <= v6.17
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
> ---
> v4: Removed switch and replaced with if else in the code. Tested again.
>     Removed a stray tab in changelog.
> 
> v3: Rephrasing changelog considering undercounting problem.
>     Checked again for special charactors (grep -P '[^\t\n\x20-\x7E]').
>     Removed few of them now. Thanks Reinette.
> 
> v2: Fixed few systax issues.
>     Checked for special charachars.
>     Added Fixes tag.
>     Added CC to stable kernel.
>     Rephrased most of the changelog.
> 
> v1: Tested this on multiple AMD systems, but not on Intel systems.
>     Need help with that. If everything goes well, this patch needs to go
>     to all the stable kernels.

The behavior of the counter is different on Intel where there are enough
counters backing the RMID and the "Unavailable" bit is not set when counter
starts counting but instead the counter returns "0". For example, when
running equivalent of "step 1" on an Intel system it looks like:

	# cd /sys/fs/resctrl
	# mkdir mon_groups/test1
	# echo $$ > mon_groups/test1/tasks
	# cat mon_groups/test1/mon_data/*/mbm_total_bytes
	0
	1835008

I am not aware of resctrl being able to trigger an Unavailable counter via
these counter registers on existing Intel hardware and as a consequence do
not expect this new flow (ret == -EINVAL) to be triggered on Intel. There may
be some differences with RDT MMIO counters that I need to look into but since
that is not supported by resctrl yet it is not relevant to this fix.

Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Thank you very much.

Reinette


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID
  2025-10-10 21:20 ` Reinette Chatre
@ 2025-10-10 23:37   ` Moger, Babu
  2025-10-13 15:35     ` Luck, Tony
  0 siblings, 1 reply; 6+ messages in thread
From: Moger, Babu @ 2025-10-10 23:37 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, tony.luck, Dave.Martin, james.morse,
	tglx, mingo, bp, dave.hansen
  Cc: x86, hpa, linux-kernel, peternewman, eranian, gautham.shenoy

Hi Reinette,

On 10/10/2025 4:20 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 10/10/25 10:08 AM, Babu Moger wrote:
>> Users can create as many monitoring groups as the number of RMIDs supported
>> by the hardware. However, on AMD systems, only a limited number of RMIDs
>> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
>> this limit are placed in an "Unavailable" state. When a bandwidth counter
>> is read for such an RMID, the hardware sets MSR_IA32_QM_CTR.Unavailable
>> (bit 62). When such an RMID starts being tracked again the hardware counter
>> is reset to zero. MSR_IA32_QM_CTR.Unavailable remains set on first read
>> after tracking re-starts and is clear on all subsequent reads as long as
>> the RMID is tracked.
>>
>> resctrl miscounts the bandwidth events after an RMID transitions from the
>> "Unavailable" state back to being tracked. This happens because when the
>> hardware starts counting again after resetting the counter to zero, resctrl
>> in turn compares the new count against the counter value stored from the
>> previous time the RMID was tracked. This results in resctrl computing an
>> event value that is either undercounting (when new counter is more than
>> stored counter) or a mistaken overflow (when new counter is less than
>> stored counter).
>>
>> Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to
>> zero whenever the RMID is in the "Unavailable" state to ensure accurate
>> counting after the RMID resets to zero when it starts to be tracked again.
>>
>> Example scenario that results in mistaken overflow
>> ==================================================
>> 1. The resctrl filesystem is mounted, and a task is assigned to a
>>     monitoring group.
>>
>>     $mount -t resctrl resctrl /sys/fs/resctrl
>>     $mkdir /sys/fs/resctrl/mon_groups/test1/
>>     $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
>>
>>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>>     21323            <- Total bytes on domain 0
>>     "Unavailable"    <- Total bytes on domain 1
>>
>>     Task is running on domain 0. Counter on domain 1 is "Unavailable".
>>
>> 2. The task runs on domain 0 for a while and then moves to domain 1. The
>>     counter starts incrementing on domain 1.
>>
>>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>>     7345357          <- Total bytes on domain 0
>>     4545             <- Total bytes on domain 1
>>
>> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
>>     state because the task is no longer executing in that domain.
>>
>>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>>     "Unavailable"    <- Total bytes on domain 0
>>     434341           <- Total bytes on domain 1
>>
>> 4.  Since the task continues to migrate between domains, it may eventually
>>      return to domain 0.
>>
>>      $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>>      17592178699059  <- Overflow on domain 0
>>      3232332         <- Total bytes on domain 1
>>
>> In this case, the RMID on domain 0 transitions from "Unavailable" state to
>> active state. The hardware sets MSR_IA32_QM_CTR.Unavailable (bit 62) when
>> the counter is read and begins tracking the RMID counting from 0.
>> Subsequent reads succeed but returns a value smaller than the previously
>> saved MSR value (7345357). Consequently, the resctrl's overflow logic is
>> triggered, it compares the previous value (7345357) with the new, smaller
>> value and incorrectly interprets this as a counter overflow, adding a large
>> delta. In reality, this is a false positive: the counter did not overflow
>> but was simply reset when the RMID transitioned from "Unavailable" back to
>> active state.
>>
>> Here is the text from APM [1] available from [2].
>>
>> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
>> first QM_CTR read when it begins tracking an RMID that it was not
>> previously tracking. The U bit will be zero for all subsequent reads from
>> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
>> read with the U bit set when that RMID is in use by a processor can be
>> considered 0 when calculating the difference with a subsequent read."
>>
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>      Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
>>      Bandwidth (MBM).
>>
>> Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
>> Signed-off-by: Babu Moger <babu.moger@amd.com>
>> Cc: stable@vger.kernel.org # needs adjustments for <= v6.17
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
>> ---
>> v4: Removed switch and replaced with if else in the code. Tested again.
>>      Removed a stray tab in changelog.
>>
>> v3: Rephrasing changelog considering undercounting problem.
>>      Checked again for special charactors (grep -P '[^\t\n\x20-\x7E]').
>>      Removed few of them now. Thanks Reinette.
>>
>> v2: Fixed few systax issues.
>>      Checked for special charachars.
>>      Added Fixes tag.
>>      Added CC to stable kernel.
>>      Rephrased most of the changelog.
>>
>> v1: Tested this on multiple AMD systems, but not on Intel systems.
>>      Need help with that. If everything goes well, this patch needs to go
>>      to all the stable kernels.
> 
> The behavior of the counter is different on Intel where there are enough
> counters backing the RMID and the "Unavailable" bit is not set when counter
> starts counting but instead the counter returns "0". For example, when
> running equivalent of "step 1" on an Intel system it looks like:
> 
> 	# cd /sys/fs/resctrl
> 	# mkdir mon_groups/test1
> 	# echo $$ > mon_groups/test1/tasks
> 	# cat mon_groups/test1/mon_data/*/mbm_total_bytes
> 	0
> 	1835008
> 

Thanks. That is good to know.

> I am not aware of resctrl being able to trigger an Unavailable counter via
> these counter registers on existing Intel hardware and as a consequence do
> not expect this new flow (ret == -EINVAL) to be triggered on Intel. There may
> be some differences with RDT MMIO counters that I need to look into but since
> that is not supported by resctrl yet it is not relevant to this fix.
> 
> Tested-by: Reinette Chatre <reinette.chatre@intel.com>
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
> 

Thanks. Much appreciated.
- Babu Moger


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID
  2025-10-10 23:37   ` Moger, Babu
@ 2025-10-13 15:35     ` Luck, Tony
  2025-10-13 18:16       ` Reinette Chatre
  0 siblings, 1 reply; 6+ messages in thread
From: Luck, Tony @ 2025-10-13 15:35 UTC (permalink / raw)
  To: Moger, Babu, Chatre, Reinette, Babu Moger, Dave.Martin@arm.com,
	james.morse@arm.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com
  Cc: x86@kernel.org, hpa@zytor.com, linux-kernel@vger.kernel.org,
	peternewman@google.com, Eranian, Stephane, gautham.shenoy@amd.com

> > The behavior of the counter is different on Intel where there are enough
> > counters backing the RMID and the "Unavailable" bit is not set when counter
> > starts counting but instead the counter returns "0". For example, when

Note that the h/w counter doesn't really return "0" (except for the first time
after CPU reset).

> > running equivalent of "step 1" on an Intel system it looks like:
> >
> >     # cd /sys/fs/resctrl
> >     # mkdir mon_groups/test1

While making the directory mon_add_all_files() does this:

                if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
                        mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);

Which in __mon_event_count() does:

        if (rr->first) {
                if (rr->is_mbm_cntr)
                        resctrl_arch_reset_cntr(rr->r, rr->d, closid, rmid, cntr_id, rr->evtid);
                else
                        resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
                m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
                if (m)
                        memset(m, 0, sizeof(struct mbm_state));
                return 0;
        }

If you dig into resctrl_arch_reset_rmid() you will see that it reads the h/w counter and
then that becomes the start point for subsequent values reported when a user reads
from the resctrl event file.

> >     # echo $$ > mon_groups/test1/tasks
> >     # cat mon_groups/test1/mon_data/*/mbm_total_bytes
> >     0
> >     1835008
> >
>
> Thanks. That is good to know.

-Tony

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID
  2025-10-13 15:35     ` Luck, Tony
@ 2025-10-13 18:16       ` Reinette Chatre
  0 siblings, 0 replies; 6+ messages in thread
From: Reinette Chatre @ 2025-10-13 18:16 UTC (permalink / raw)
  To: Luck, Tony, Moger, Babu, Babu Moger, Dave.Martin@arm.com,
	james.morse@arm.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com
  Cc: x86@kernel.org, hpa@zytor.com, linux-kernel@vger.kernel.org,
	peternewman@google.com, Eranian, Stephane, gautham.shenoy@amd.com

Hi Tony,

On 10/13/25 8:35 AM, Luck, Tony wrote:
>>> The behavior of the counter is different on Intel where there are enough
>>> counters backing the RMID and the "Unavailable" bit is not set when counter
>>> starts counting but instead the counter returns "0". For example, when
> 
> Note that the h/w counter doesn't really return "0" (except for the first time
> after CPU reset).

Correct. 
In this example both the hardware counter and the event returned zero. The
main point was that it does not return "Unavailable".

The goal with the example related to this issue was to demonstrate no impact on Intel
when resetting arch_mbm_state::prev_msr on receipt of "Unavailable". Do you see things
differently?

> 
>>> running equivalent of "step 1" on an Intel system it looks like:
>>>
>>>     # cd /sys/fs/resctrl
>>>     # mkdir mon_groups/test1
> 
> While making the directory mon_add_all_files() does this:
> 
>                 if (!do_sum && resctrl_is_mbm_event(mevt->evtid))
>                         mon_event_read(&rr, r, d, prgrp, &d->hdr.cpu_mask, mevt->evtid, true);
> 
> Which in __mon_event_count() does:
> 
>         if (rr->first) {
>                 if (rr->is_mbm_cntr)
>                         resctrl_arch_reset_cntr(rr->r, rr->d, closid, rmid, cntr_id, rr->evtid);
>                 else
>                         resctrl_arch_reset_rmid(rr->r, rr->d, closid, rmid, rr->evtid);
>                 m = get_mbm_state(rr->d, closid, rmid, rr->evtid);
>                 if (m)
>                         memset(m, 0, sizeof(struct mbm_state));
>                 return 0;
>         }
> 
> If you dig into resctrl_arch_reset_rmid() you will see that it reads the h/w counter and
> then that becomes the start point for subsequent values reported when a user reads
> from the resctrl event file.

I believe resctrl_arch_reset_rmid() already addresses the issue since resctrl_arch_reset_rmid()
always resets the architectural state before attempting to read the RMID. If __rmid_read_phys()
encounters "Unavailable"/-EINVAL then it is fine since arch_mbm_state::prev_msr will already be
zero and thus ready for a subsequent resctrl_arch_rmid_read(), whether hardware counter is ready
or not.

Reinette


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tip: x86/urgent] x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID
  2025-10-10 17:08 [PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID Babu Moger
  2025-10-10 21:20 ` Reinette Chatre
@ 2025-10-13 19:36 ` tip-bot2 for Babu Moger
  1 sibling, 0 replies; 6+ messages in thread
From: tip-bot2 for Babu Moger @ 2025-10-13 19:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Babu Moger, Borislav Petkov (AMD), Reinette Chatre,
	stable@vger.kernel.org # needs adjustments for, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     15292f1b4c55a3a7c940dbcb6cb8793871ed3d92
Gitweb:        https://git.kernel.org/tip/15292f1b4c55a3a7c940dbcb6cb8793871ed3d92
Author:        Babu Moger <babu.moger@amd.com>
AuthorDate:    Fri, 10 Oct 2025 12:08:35 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Mon, 13 Oct 2025 21:24:39 +02:00

x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID

Users can create as many monitoring groups as the number of RMIDs supported
by the hardware. However, on AMD systems, only a limited number of RMIDs
are guaranteed to be actively tracked by the hardware. RMIDs that exceed
this limit are placed in an "Unavailable" state.

When a bandwidth counter is read for such an RMID, the hardware sets
MSR_IA32_QM_CTR.Unavailable (bit 62). When such an RMID starts being tracked
again the hardware counter is reset to zero. MSR_IA32_QM_CTR.Unavailable
remains set on first read after tracking re-starts and is clear on all
subsequent reads as long as the RMID is tracked.

resctrl miscounts the bandwidth events after an RMID transitions from the
"Unavailable" state back to being tracked. This happens because when the
hardware starts counting again after resetting the counter to zero, resctrl
in turn compares the new count against the counter value stored from the
previous time the RMID was tracked.

This results in resctrl computing an event value that is either undercounting
(when new counter is more than stored counter) or a mistaken overflow (when
new counter is less than stored counter).

Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to
zero whenever the RMID is in the "Unavailable" state to ensure accurate
counting after the RMID resets to zero when it starts to be tracked again.

Example scenario that results in mistaken overflow
==================================================
1. The resctrl filesystem is mounted, and a task is assigned to a
   monitoring group.

   $mount -t resctrl resctrl /sys/fs/resctrl
   $mkdir /sys/fs/resctrl/mon_groups/test1/
   $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   21323            <- Total bytes on domain 0
   "Unavailable"    <- Total bytes on domain 1

   Task is running on domain 0. Counter on domain 1 is "Unavailable".

2. The task runs on domain 0 for a while and then moves to domain 1. The
   counter starts incrementing on domain 1.

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   7345357          <- Total bytes on domain 0
   4545             <- Total bytes on domain 1

3. At some point, the RMID in domain 0 transitions to the "Unavailable"
   state because the task is no longer executing in that domain.

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   "Unavailable"    <- Total bytes on domain 0
   434341           <- Total bytes on domain 1

4.  Since the task continues to migrate between domains, it may eventually
    return to domain 0.

    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
    17592178699059  <- Overflow on domain 0
    3232332         <- Total bytes on domain 1

In this case, the RMID on domain 0 transitions from "Unavailable" state to
active state. The hardware sets MSR_IA32_QM_CTR.Unavailable (bit 62) when
the counter is read and begins tracking the RMID counting from 0.

Subsequent reads succeed but return a value smaller than the previously
saved MSR value (7345357). Consequently, the resctrl's overflow logic is
triggered, it compares the previous value (7345357) with the new, smaller
value and incorrectly interprets this as a counter overflow, adding a large
delta.

In reality, this is a false positive: the counter did not overflow but was
simply reset when the RMID transitioned from "Unavailable" back to active
state.

Here is the text from APM [1] available from [2].

"In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
first QM_CTR read when it begins tracking an RMID that it was not
previously tracking. The U bit will be zero for all subsequent reads from
that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
read with the U bit set when that RMID is in use by a processor can be
considered 0 when calculating the difference with a subsequent read."

[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
    Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
    Bandwidth (MBM).

  [ bp: Split commit message into smaller paragraph chunks for better
    consumption. ]

Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org # needs adjustments for <= v6.17
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c894561..2cd25a0 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -242,7 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *ignored)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
+	struct arch_mbm_state *am;
 	u64 msr_val;
 	u32 prmid;
 	int ret;
@@ -251,12 +253,16 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,

 	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 	ret = __rmid_read_phys(prmid, eventid, &msr_val);
-	if (ret)
-		return ret;

-	*val = get_corrected_val(r, d, rmid, eventid, msr_val);
+	if (!ret) {
+		*val = get_corrected_val(r, d, rmid, eventid, msr_val);
+	} else if (ret == -EINVAL) {
+		am = get_arch_mbm_state(hw_dom, rmid, eventid);
+		if (am)
+			am->prev_msr = 0;
+	}

-	return 0;
+	return ret;
 }

 static int __cntr_id_read(u32 cntr_id, u64 *val)

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-10-13 19:36 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-10 17:08 [PATCH v4] x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID Babu Moger
2025-10-10 21:20 ` Reinette Chatre
2025-10-10 23:37   ` Moger, Babu
2025-10-13 15:35     ` Luck, Tony
2025-10-13 18:16       ` Reinette Chatre
2025-10-13 19:36 ` [tip: x86/urgent] x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID tip-bot2 for Babu Moger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox