[PATCH v2] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

Posted by Babu Moger 4 months ago

Users can create as many monitoring groups as the number of RMIDs supported
by the hardware. However, on AMD systems, only a limited number of RMIDs
are guaranteed to be actively tracked by the hardware. RMIDs that exceed
this limit are placed in an "Unavailable" state. When a bandwidth counter
is read for such an RMID, the hardware sets MSR_IA32_QM_CTR.Unavailable
(bit 62).

The problem occurs when an RMID transitions from the “Unavailable” state
back to the active state. When this happens, the hardware resets the
counter to zero, but the kernel compares this new smaller value with the
previously saved MSR value and mistakenly interprets it as an overflow.

Problem scenario:
1. The resctrl filesystem is mounted, and a task is assigned to a
   monitoring group.

   $mount -t resctrl resctrl /sys/fs/resctrl
   $mkdir /sys/fs/resctrl/mon_groups/test1/
   $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   21323            <- Total bytes on domain 0
   "Unavailable"    <- Total bytes on domain 1

   Task is running on domain 0. Counter on domain 1 is "Unavailable".

2. The task runs on domain 0 for a while and then moves to domain 1. The
   counter starts incrementing on domain 1.

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   7345357          <- Total bytes on domain 0
   4545             <- Total bytes on domain 1


3. At some point, the RMID in domain 0 transitions to the "Unavailable"
   state because the task is no longer executing in that domain.

   $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
   "Unavailable"    <- Total bytes on domain 0
   434341           <- Total bytes on domain 1

4.  Since the task continues to migrate between domains, it may eventually
    return to domain 0.

    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
    17592178699059  <- Overflow on domain 0
    3232332         <- Total bytes on domain 1

    In this case, the RMID on domain 0 transitions from “Unavailable”
    state to the active state. The hardware sets MSR_IA32_QM_CTR.Unavailable
    (bit 62) when the counter is read and begins tracking the RMID counting
    from 0. Subsequent reads succeed but may return a value smaller than the
    previously saved MSR value (7345357). Consequently, the kernel’s overflow
    logic is triggered—it compares the previous value (7345357) with the new,
    smaller value and incorrectly interprets this as a counter overflow,
    adding a large delta. In reality, this is a false positive: the counter
    did not overflow but was simply reset when the RMID transitioned from
    “Unavailable” back to active.
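
The arithmetic behind the mistaken overflow can be sketched in user space.
This is an illustrative model, not the kernel's code: the helper name
wrap_delta() and the 44-bit counter width are assumptions chosen for the
example. The helper computes the difference between two reads modulo the
counter width, which is how a wrap-aware delta treats a smaller new value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): difference between two
 * counter reads, modulo 2^width, so that a smaller new value is treated
 * as a wrap-around of the counter.
 */
static uint64_t wrap_delta(uint64_t prev, uint64_t cur, unsigned int width)
{
	unsigned int shift;

	assert(width >= 1 && width <= 64);
	shift = 64 - width;

	/* Shift both values up so the subtraction wraps at 2^width. */
	return ((cur << shift) - (prev << shift)) >> shift;
}
```

With a stored value of 7345357, a new read of 0, and the assumed width of
44 bits, wrap_delta(7345357, 0, 44) yields 17592178699059, that is
2^44 - 7345357, rather than 0.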

Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR, used
for handling counter overflows, whenever the RMID transitions to the
“Unavailable” state to resolve the issue.

Here is the text from APM [1] available from [2].

"In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
first QM_CTR read when it begins tracking an RMID that it was not
previously tracking. The U bit will be zero for all subsequent reads from
that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
read with the U bit set when that RMID is in use by a processor can be
considered 0 when calculating the difference with a subsequent read."

[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
    Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
    Bandwidth (MBM).

Cc: stable@vger.kernel.org # needs adjustments for <= v6.17
Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
---

v2: Fixed a few syntax issues.
    Checked for special characters.
    Added Fixes tag.
    Added CC to stable kernel.
    Rephrased most of the changelog.

v1:
Tested this on multiple AMD systems, but not on Intel systems.
Need help with that. If everything goes well, this patch needs to
go to all the stable kernels.

https://lore.kernel.org/lkml/515a38328989e48d403ef5a7d6dd321ba3343a61.1759791957.git.babu.moger@amd.com/
---
 arch/x86/kernel/cpu/resctrl/monitor.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c8945610d455..a685370dd160 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -242,7 +242,9 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 			   u32 unused, u32 rmid, enum resctrl_event_id eventid,
 			   u64 *val, void *ignored)
 {
+	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
 	int cpu = cpumask_any(&d->hdr.cpu_mask);
+	struct arch_mbm_state *am;
 	u64 msr_val;
 	u32 prmid;
 	int ret;
@@ -251,12 +253,21 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
 
 	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
 	ret = __rmid_read_phys(prmid, eventid, &msr_val);
-	if (ret)
-		return ret;
 
-	*val = get_corrected_val(r, d, rmid, eventid, msr_val);
+	switch (ret) {
+	case 0:
+		*val = get_corrected_val(r, d, rmid, eventid, msr_val);
+		break;
+	case -EINVAL:
+		am = get_arch_mbm_state(hw_dom, rmid, eventid);
+		if (am)
+			am->prev_msr = 0;
+		break;
+	default:
+		break;
+	}
 
-	return 0;
+	return ret;
 }
 
 static int __cntr_id_read(u32 cntr_id, u64 *val)
-- 
2.34.1
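
The effect of the change can be modeled outside the kernel. The sketch below
is a simplification under stated assumptions: struct mbm_model, model_read(),
the 44-bit width, and the -1 stand-in for the error return are all
hypothetical. Only the idea of zeroing the saved value on an "Unavailable"
read is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_WIDTH	44	/* assumed counter width, for illustration */
#define MODEL_UNAVAIL	(-1)	/* stand-in for the "Unavailable" error */

/* Hypothetical per-RMID state: saved raw value and accumulated total. */
struct mbm_model {
	uint64_t prev_msr;
	uint64_t total;
};

/*
 * Model of one read. 'unavailable' mimics the U bit and 'msr_val' is the
 * raw counter. Returns 0 on success and stores the running total in *val.
 */
static int model_read(struct mbm_model *m, int unavailable,
		      uint64_t msr_val, uint64_t *val)
{
	unsigned int shift = 64 - MODEL_WIDTH;

	assert(m && val);

	if (unavailable) {
		/* The fix: drop the stale value so counting restarts at 0. */
		m->prev_msr = 0;
		return MODEL_UNAVAIL;
	}

	/* Wrap-aware delta against the saved value, then remember the read. */
	m->total += ((msr_val << shift) - (m->prev_msr << shift)) >> shift;
	m->prev_msr = msr_val;
	*val = m->total;
	return 0;
}
```

In this model, a successful read of 7345357, then an "Unavailable" read, then
a successful read of 1000 gives a total of 7346357. Without the reset in the
"Unavailable" branch, the last read would be compared against 7345357 and
add a wrap-sized delta instead.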

Re: [PATCH v2] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

Posted by Reinette Chatre 4 months ago

Hi Babu,

On 10/8/25 12:39 PM, Babu Moger wrote:
> Users can create as many monitoring groups as the number of RMIDs supported
> by the hardware. However, on AMD systems, only a limited number of RMIDs
> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
> this limit are placed in an "Unavailable" state. When a bandwidth counter
> is read for such an RMID, the hardware sets MSR_IA32_QM_CTR.Unavailable
> (bit 62).

To make this context complete I think you can append something like: 
	When such an RMID starts being tracked again the hardware counter is
	reset to zero. MSR_IA32_QM_CTR.Unavailable remains set on first read after
	tracking re-starts and is clear on all subsequent reads as long as the
	RMID is tracked.

> 
> The problem occurs when an RMID transitions from the “Unavailable” state

Which problem? (Please let changelog stand on its own and not be continuation of subject)

> back to the active state. When this happens, the hardware resets the
> counter to zero, but the kernel compares this new smaller value with the
> previously saved MSR value and mistakenly interprets it as an overflow.

I do not think this is just about overflow. Certainly this is the
most visible symptom but the stored counter value may also be smaller than the new
counter value resulting in undercounting of bandwidth? (ignoring that not
counting at all while RMID is unavailable is technically also undercounting).

Would something like below be accurate?

	resctrl miscounts the bandwidth events after an RMID transitions
	from the "Unavailable" state back to being tracked. This happens
	because when the hardware starts counting again after resetting the
	counter to zero, resctrl in turn compares the new count against the
	counter value stored from the previous time the RMID was tracked.
	This results in resctrl computing an event value that is either
	undercounting (when new counter is more than stored counter) or a
	mistaken overflow (when new counter is less than stored counter).

If you agree with the summary then please update the subject to match. For example,
"x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID"

I think Dave's feedback about changelog length is valid. The changelog can present the
fix at this point and leave the detailed description of the overflow scenario to the end of
changelog with a heading that reader can use to decide to skip over if problem is clear or use as
reference to see the problem in action. 

I also recommend that the fix be specific and avoid vague statement like "to resolve the issue".
For example,

	Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to zero
	whenever the RMID is in the "Unavailable" state to ensure accurate
	counting after the RMID resets to zero when it starts to be tracked again

> 
> Problem scenario:

The portion below can have a heading to help reader identify its purpose. For example,

Example scenario that results in mistaken overflow
==================================================


> 1. The resctrl filesystem is mounted, and a task is assigned to a
>    monitoring group.
> 
>    $mount -t resctrl resctrl /sys/fs/resctrl
>    $mkdir /sys/fs/resctrl/mon_groups/test1/
>    $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    21323            <- Total bytes on domain 0
>    "Unavailable"    <- Total bytes on domain 1
> 
>    Task is running on domain 0. Counter on domain 1 is "Unavailable".
> 
> 2. The task runs on domain 0 for a while and then moves to domain 1. The
>    counter starts incrementing on domain 1.
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    7345357          <- Total bytes on domain 0
>    4545             <- Total bytes on domain 1
> 
> 
> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
>    state because the task is no longer executing in that domain.
> 
>    $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>    "Unavailable"    <- Total bytes on domain 0
>    434341           <- Total bytes on domain 1
> 
> 4.  Since the task continues to migrate between domains, it may eventually
>     return to domain 0.
> 
>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>     17592178699059  <- Overflow on domain 0
>     3232332         <- Total bytes on domain 1
> 

Is below intended to be indented?

>     In this case, the RMID on domain 0 transitions from “Unavailable”
>     state to the active state. The hardware sets MSR_IA32_QM_CTR.Unavailable

"active state" -> "tracked state" (to be consistent with terminology - not sure what
is preferred between "active" and "tracked" but please be consistent)

>     (bit 62) when the counter is read and begins tracking the RMID counting
>     from 0. Subsequent reads succeed but may return a value smaller than the

"may return" -> "returns"

>     previously saved MSR value (7345357). Consequently, the kernel’s overflow

"the kernel’s" -> "resctrl's"?

>     logic is triggered—it compares the previous value (7345357) with the new,
>     smaller value and incorrectly interprets this as a counter overflow,
>     adding a large delta. In reality, this is a false positive: the counter
>     did not overflow but was simply reset when the RMID transitioned from
>     “Unavailable” back to active.

Here is what I do to check for non-ascii characters:
$ b4 am <message ID>
$ grep -P '[^\t\n\x20-\x7E]' <downloaded patch>

Could you please try it out on this patch and fix the matches?
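
For reference, the byte class that the grep pattern accepts can also be
written out in C. This is only a sketch of the same check: the helper name
first_non_ascii() is hypothetical, and it operates on a buffer that is
assumed to already hold the patch text.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the offset of the first byte that is not a
 * tab, a newline, or printable ASCII (0x20-0x7E), or -1 if there is none.
 */
static long first_non_ascii(const char *buf, size_t len)
{
	size_t i;

	assert(buf || len == 0);

	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)buf[i];

		if (c == '\t' || c == '\n' || (c >= 0x20 && c <= 0x7E))
			continue;
		return (long)i;
	}
	return -1;
}
```

A buffer using plain ASCII quotes passes, while one starting with the UTF-8
encoding of a left curly quote (bytes E2 80 9C) is flagged at offset 0.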

> 
> Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR, used
> for handling counter overflows, whenever the RMID transitions to the
> “Unavailable” state to resolve the issue.
> 
> Here is the text from APM [1] available from [2].
> 
> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
> first QM_CTR read when it begins tracking an RMID that it was not
> previously tracking. The U bit will be zero for all subsequent reads from
> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
> read with the U bit set when that RMID is in use by a processor can be
> considered 0 when calculating the difference with a subsequent read."
> 
> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>     Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
>     Bandwidth (MBM).
> 
> Cc: stable@vger.kernel.org # needs adjustments for <= v6.17

Tag ordering guide "Ordering of commit tags" found in
Documentation/process/maintainer-tip.rst places the "Cc" just before
the "Link:" tag.

> Fixes: 4d05bf71f157d ("x86/resctrl: Introduce AMD QOS feature")
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
> ---

Reinette

Re: [PATCH v2] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

Posted by Babu Moger 4 months ago

Hi Reinette,

On 10/8/25 21:00, Reinette Chatre wrote:
> Hi Babu,
>
> On 10/8/25 12:39 PM, Babu Moger wrote:
>> Users can create as many monitoring groups as the number of RMIDs supported
>> by the hardware. However, on AMD systems, only a limited number of RMIDs
>> are guaranteed to be actively tracked by the hardware. RMIDs that exceed
>> this limit are placed in an "Unavailable" state. When a bandwidth counter
>> is read for such an RMID, the hardware sets MSR_IA32_QM_CTR.Unavailable
>> (bit 62).
> To make this context complete I think you can append something like:
> 	When such an RMID starts being tracked again the hardware counter is
> 	reset to zero. MSR_IA32_QM_CTR.Unavailable remains set on first read after
> 	tracking re-starts and is clear on all subsequent reads as long as the
> 	RMID is tracked.
Sure. Looks good.
>
>> The problem occurs when an RMID transitions from the “Unavailable” state
> Which problem? (Please let changelog stand on its own and not be continuation of subject)

Sure.


>
>> back to the active state. When this happens, the hardware resets the
>> counter to zero, but the kernel compares this new smaller value with the
>> previously saved MSR value and mistakenly interprets it as an overflow.
> I do not think this is just about overflow. Certainly this is the
> most visible symptom but the stored counter value may also be smaller than the new
> counter value resulting in undercounting of bandwidth? (ignoring that not
> counting at all while RMID is unavailable is technically also undercounting).
Yes. That can also happen during that window.
>
> Would something like below be accurate?
>
> 	resctrl miscounts the bandwidth events after an RMID transitions
> 	from the "Unavailable" state back to being tracked. This happens
> 	because when the hardware starts counting again after resetting the counter to
> 	zero, resctrl in turn compares the new count against the counter value
> 	stored from the previous time the RMID was tracked. This results in resctrl
> 	computing an event value that is either undercounting (when new counter is more than
> 	stored counter)	or a mistaken overflow (when new counter is less than stored counter).
Sure,
>
> If you agree with the summary then please update the subject to match. For example,
> "x86/resctrl: Fix miscount of bandwidth event when reactivating previously Unavailable RMID"
Sure.
>
> I think Dave's feedback about changelog length is valid. The changelog can present the
> fix at this point and leave the detailed description of the overflow scenario to the end of
> changelog with a heading that reader can use to decide to skip over if problem is clear or use as
> reference to see the problem in action.
>
> I also recommend that the fix be specific and avoid vague statement like "to resolve the issue".
> For example,
>
> 	Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR to zero
> 	whenever the RMID is in the "Unavailable" state to ensure accurate
> 	counting after the RMID resets to zero when it starts to be tracked again

Looks good.


>
>> Problem scenario:
> The portion below can have a heading to help reader identify its purpose. For example,
>
> Example scenario that results in mistaken overflow
> ==================================================
>
Sure.
>> 1. The resctrl filesystem is mounted, and a task is assigned to a
>>     monitoring group.
>>
>>     $mount -t resctrl resctrl /sys/fs/resctrl
>>     $mkdir /sys/fs/resctrl/mon_groups/test1/
>>     $echo 1234 > /sys/fs/resctrl/mon_groups/test1/tasks
>>
>>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>>     21323            <- Total bytes on domain 0
>>     "Unavailable"    <- Total bytes on domain 1
>>
>>     Task is running on domain 0. Counter on domain 1 is "Unavailable".
>>
>> 2. The task runs on domain 0 for a while and then moves to domain 1. The
>>     counter starts incrementing on domain 1.
>>
>>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>>     7345357          <- Total bytes on domain 0
>>     4545             <- Total bytes on domain 1
>>
>>
>> 3. At some point, the RMID in domain 0 transitions to the "Unavailable"
>>     state because the task is no longer executing in that domain.
>>
>>     $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>>     "Unavailable"    <- Total bytes on domain 0
>>     434341           <- Total bytes on domain 1
>>
>> 4.  Since the task continues to migrate between domains, it may eventually
>>      return to domain 0.
>>
>>      $cat /sys/fs/resctrl/mon_groups/test1/mon_data/mon_L3_*/mbm_total_bytes
>>      17592178699059  <- Overflow on domain 0
>>      3232332         <- Total bytes on domain 1
>>
> Is below intended to be indented?
Removed the indentation.
>>      In this case, the RMID on domain 0 transitions from “Unavailable”
>>      state to the active state. The hardware sets MSR_IA32_QM_CTR.Unavailable
> "active state" -> "tracked state" (to be consistent with terminology - not sure what
> is preferred between "active" and "tracked" but please be consistent)
Changed it to active state.
>
>>      (bit 62) when the counter is read and begins tracking the RMID counting
>>      from 0. Subsequent reads succeed but may return a value smaller than the
> "may return" -> "returns"
Sure.
>>      previously saved MSR value (7345357). Consequently, the kernel’s overflow
> "the kernel’s" -> "resctrl's"?
Sure.
>>      logic is triggered—it compares the previous value (7345357) with the new,
>>      smaller value and incorrectly interprets this as a counter overflow,
>>      adding a large delta. In reality, this is a false positive: the counter
>>      did not overflow but was simply reset when the RMID transitioned from
>>      “Unavailable” back to active.
> Here is what I do to check for non-ascii characters:
> $ b4 am <message ID>
> $ grep -P '[^\t\n\x20-\x7E]' <downloaded patch>
>
> Could you please try it out on this patch and fix the matches?

Yes. Now I see. Thanks fixed it.


>> Reset the stored value (arch_mbm_state::prev_msr) of MSR_IA32_QM_CTR, used
>> for handling counter overflows, whenever the RMID transitions to the
>> “Unavailable” state to resolve the issue.
>>
>> Here is the text from APM [1] available from [2].
>>
>> "In PQOS Version 2.0 or higher, the MBM hardware will set the U bit on the
>> first QM_CTR read when it begins tracking an RMID that it was not
>> previously tracking. The U bit will be zero for all subsequent reads from
>> that RMID while it is still tracked by the hardware. Therefore, a QM_CTR
>> read with the U bit set when that RMID is in use by a processor can be
>> considered 0 when calculating the difference with a subsequent read."
>>
>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>      Publication # 24593 Revision 3.41 section 19.3.3 Monitoring L3 Memory
>>      Bandwidth (MBM).
>>
>> Cc: stable@vger.kernel.org # needs adjustments for <= v6.17
> Tag ordering guide "Ordering of commit tags" found in
> Documentation/process/maintainer-tip.rst places the "Cc" just before
> the "Link:" tag.

Sure.

Thanks

Babu

RE: [PATCH v2] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

Posted by Luck, Tony 4 months ago

> Here is what I do to check for non-ascii characters:
> $ b4 am <message ID>
> $ grep -P '[^\t\n\x20-\x7E]' <downloaded patch>
>
> Could you please try it out on this patch and fix the matches?

Does the non-ascii rule include the cover letter? Or just the patches
that will be applied and included into the Linux GIT repository?

My AET patches are "clean", but the cover letter has some output
from the tree(1) command. So the grep kicks out this:

├── mon_PERF_PKG_00
│   ├── activity
│   └── core_energy
└── mon_PERF_PKG_01
    ├── activity
    └── core_energy

-Tony

Re: [PATCH v2] x86/resctrl: Fix buggy overflow when reactivating previously Unavailable RMID

Posted by Reinette Chatre 4 months ago

Hi Tony,

On 10/9/25 11:39 AM, Luck, Tony wrote:
>> Here is what I do to check for non-ascii characters:
>> $ b4 am <message ID>
>> $ grep -P '[^\t\n\x20-\x7E]' <downloaded patch>
>>
>> Could you please try it out on this patch and fix the matches?
> 
> Does the non-ascii rule include the cover letter? Or just the patches
> that will be applied and included into the Linux GIT repository?
> 
> My AET patches are "clean", but the cover letter has some output
> from the tree(1) command. So the grep kicks out this:
> 
> ├── mon_PERF_PKG_00
> │   ├── activity
> │   └── core_energy
> └── mon_PERF_PKG_01
>     ├── activity
>     └── core_energy

I think this is fine. I do not know if there is an official rule about this but in
this case the problem is the subtle differences in the text when it, like in
this changelog, switches between ascii and non-ascii for the "same" character.
For example, like when this changelog switches the quotes between "Unavailable" and
“Unavailable”. This is unnecessary obfuscation.

Reinette