The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch series, Intel advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters
in the same way. This allows monitoring features to be used, with the
caveat that Linux may migrate tasks more frequently between SNC nodes
than between "regular" NUMA nodes, so users may need to read counters
from all SNC nodes to get a complete picture of activity for a task.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.
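
To illustrate the partitioning, here is a minimal sketch (not code from
this series; the helper name and counter layout are assumptions based
on the description above) of how a logical RMID could be mapped to the
physical counter for the SNC node a CPU belongs to:

  /*
   * Sketch only: assumes snc_nodes_per_l3_cache is 1 when SNC is
   * disabled, and that each SNC node owns a contiguous block of
   * num_rmid hardware counters within the L3 cache's RMID space.
   */
  static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
  {
          struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

          if (snc_nodes_per_l3_cache == 1)
                  return lrmid;

          return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
  }
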
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Changes since v13:
Rebase to tip x86/cache
fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")
Applied all Reviewed and Tested tags
Tony Luck (8):
x86/resctrl: Prepare for new domain scope
x86/resctrl: Prepare to split rdt_domain structure
x86/resctrl: Prepare for different scope for control/monitor
operations
x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
x86/resctrl: Add node-scope to the options for feature scope
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Sub NUMA Cluster detection and enable
x86/resctrl: Update documentation with Sub-NUMA cluster changes
Documentation/arch/x86/resctrl.rst | 25 +-
include/linux/resctrl.h | 85 +++--
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 66 ++--
arch/x86/kernel/cpu/resctrl/core.c | 433 +++++++++++++++++-----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 58 +--
arch/x86/kernel/cpu/resctrl/monitor.c | 68 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 26 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 149 ++++----
9 files changed, 629 insertions(+), 282 deletions(-)
base-commit: fc747eebef734563cf68a512f57937c8f231834a
--
2.43.0
I've been wondering whether the SNC patches were creating way too much
infrastructure that isn't generally useful. Specifically the capability
for a resctrl resource to have different scope for monitoring and
control functions. That seems like a very strange thing.

History here is that Intel CMT, MBM and L3 CAT features arrived in
quick succession, closely followed by MBA and then L2 CAT. At the time
it made sense for the "L3" resource to cover all of CMT, MBM and L3
CAT. That became slightly odd when MBA (another L3 scoped resource) was
added. It was given its own "struct rdt_resource". AMD later added SMBA
as yet another L3-scoped resource with its own "struct rdt_resource".

I wondered how the SNC series would look if I went back to an approach
from one of the earlier versions. In that one I created a new resource
for SNC monitoring that was NODE scoped. And then made all the CMT and
MBM code switch over to using that one when SNC was enabled. That was a
bit hacky, and I moved from there to the dual domain lists per resource.

I just tried out an approach that splits the RDT_RESOURCE_L3 resource
into separate resources, one for control, one for monitoring. It makes
the code much simpler:

 8 files changed, 235 insertions(+), 30 deletions(-) [1]

vs.

 9 files changed, 629 insertions(+), 282 deletions(-)

Tradeoffs: The complex series (posted as v14) cleanly split the
"rdt_domain" structure into two. So it no longer carried all the fields
needed for both control and monitor, even though only one set was ever
used. But the cost was a lot of code churn that may never be useful for
anything other than SNC. On non-SNC systems the new series produces
separate linked lists of domains for L3 control & monitor, even though
the lists are the same, and the domain structures still carry all
fields for both control and monitor functions.

Question: Does anyone think that single domain with different scope for
control and monitor is inherently more useful than two separate domains?

While I get this sorted out, maybe Boris should take James' next set of
MPAM patches as they are, instead of on top of the complex SNC series.

-Tony

[1] I'll post this set just as soon as I can get time on an SNC machine
to make sure they actually work.
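For contrast, a rough sketch of the shape that the v14 "dual domain
lists" approach gives a resource (field names here are a paraphrase of
the description above, not the exact v14 code):

  #include <linux/list.h>

  enum resctrl_scope_sketch {
          SCOPE_L2_CACHE = 2,
          SCOPE_L3_CACHE = 3,
          SCOPE_NODE,
  };

  /*
   * One resource, two lists: control and monitor domains may have
   * different scopes (e.g. L3 scope for control, NODE scope for
   * monitoring when SNC is enabled).
   */
  struct rdt_resource_sketch {
          int rid;
          enum resctrl_scope_sketch ctrl_scope;
          enum resctrl_scope_sketch mon_scope;
          struct list_head ctrl_domains;
          struct list_head mon_domains;
  };

The two-resource alternative instead keeps one scope and one domain
list per resource, and simply registers an extra L3 resource for
monitoring.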
This is the re-worked version of this series that I promised to post
yesterday. Check that e-mail for the arguments for this alternate
approach.
https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
Apologies to Drew Fustini who I'd somehow dropped from later versions
of this series. Drew: you had made a comment at one point that having
different scopes within a single resource may be useful on RISC-V.
Version 14 included that, but it's gone here. Maybe multiple resctrl
"struct resource" for a single h/w entity like L3 as I'm doing in this
version could work for you too?
Patches 1-5 are almost completely rewritten based around the new
idea to give CMT and MBM their own "resource" instead of sharing
one with L3 CAT. This removes the need for separate domain lists,
and thus most of the churn of the previous version of this series.
Patches 6-8 are largely unchanged. But I removed all the Reviewed
and Tested tags since they are now built on a completely different
base.
Patches are against tip x86/cache:
fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")
Re-work doesn't affect the v14 cover letter, so pasting it here:
The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features. Prior to this
patch series, Intel advised that SNC and RDT are incompatible.

Some of these CPUs support an MSR that can partition the RMID counters
in the same way. This allows monitoring features to be used, with the
caveat that Linux may migrate tasks more frequently between SNC nodes
than between "regular" NUMA nodes, so users may need to read counters
from all SNC nodes to get a complete picture of activity for a task.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tony Luck (8):
x86/resctrl: Split the RDT_RESOURCE_L3 resource
x86/resctrl: Move all monitoring functions to RDT_RESOURCE_L3_MON
x86/resctrl: Prepare for non-cache-scoped resources
x86/resctrl: Add helper function to look up domain_id from scope
x86/resctrl: Add "NODE" as an option for resource scope
x86/resctrl: Introduce snc_nodes_per_l3_cache
x86/resctrl: Sub NUMA Cluster detection and enable
x86/resctrl: Update documentation with Sub-NUMA cluster changes
Documentation/arch/x86/resctrl.rst | 25 ++-
include/linux/resctrl.h | 10 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/internal.h | 3 +
arch/x86/kernel/cpu/resctrl/core.c | 181 +++++++++++++++++++++-
arch/x86/kernel/cpu/resctrl/monitor.c | 28 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 6 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 12 +-
8 files changed, 236 insertions(+), 30 deletions(-)
base-commit: fc747eebef734563cf68a512f57937c8f231834a
--
2.43.0
On Tue, Jan 30, 2024 at 02:20:26PM -0800, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.
>
> https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
>
> Apologies to Drew Fustini who I'd somehow dropped from later versions
> of this series. Drew: you had made a comment at one point that having
> different scopes within a single resource may be useful on RISC-V.
> Version 14 included that, but it's gone here. Maybe multiple resctrl
> "struct resource" for a single h/w entity like L3 as I'm doing in this
> version could work for you too?

Sorry for the latency. The RISC-V CBQRI specification [1] describes a
bandwidth controller register interface [2]. It allows a controller to
implement both bandwidth allocation and bandwidth usage monitoring.

The proof-of-concept resctrl implementation [3] that I worked on
created two domains for each memory controller in the example SoC. One
domain would contain the MBA resource and the other would contain the
L3 resource to represent MBM files like local_bytes:

  # cat /sys/fs/resctrl/schemata
  MB:4=  80;6=  80;8=  80
  L2:0=0fff;1=0fff
  L3:2=ffff;3=0000;5=0000;7=0000

Where:
  Domain 0 is L2 cache controller 0 capacity allocation
  Domain 1 is L2 cache controller 1 capacity allocation
  Domain 2 is L3 cache controller capacity allocation
  Domain 4 is Memory controller 0 bandwidth allocation
  Domain 6 is Memory controller 1 bandwidth allocation
  Domain 8 is Memory controller 2 bandwidth allocation
  Domain 3 is Memory controller 0 bandwidth monitoring
  Domain 5 is Memory controller 1 bandwidth monitoring
  Domain 7 is Memory controller 2 bandwidth monitoring

I think this scheme is confusing but I wasn't able to find a better way
to do it at the time.

> Patches 1-5 are almost completely rewritten based around the new
> idea to give CMT and MBM their own "resource" instead of sharing
> one with L3 CAT. This removes the need for separate domain lists,
> and thus most of the churn of the previous version of this series.

Very interesting. Do you think I would be able to create MBM files for
each memory controller without creating pointless L3 domains that show
up in schemata?

Thanks,
Drew

[1] https://github.com/riscv-non-isa/riscv-cbqri/releases/tag/v1.0-rc1
[2] https://github.com/riscv-non-isa/riscv-cbqri/blob/main/qos_bandwidth.adoc
[3] https://lore.kernel.org/linux-riscv/20230419111111.477118-1-dfustini@baylibre.com/
> > Patches 1-5 are almost completely rewritten based around the new
> > idea to give CMT and MBM their own "resource" instead of sharing
> > one with L3 CAT. This removes the need for separate domain lists,
> > and thus most of the churn of the previous version of this series.
>
> Very interesting. Do you think I would be able to create MBM files for
> each memory controller without creating pointless L3 domains that show
> up in schemata?

Entries only show up in the schemata file for resources that are
"alloc_capable". So you should be able to make as many rdt_hw_resource
structures as you need that are "mon_capable", but not "alloc_capable"
... though making more than one such resource may explore untested
areas of the code since there has historically only been one
mon_capable resource. It looks like the resource id from the "rid"
field is passed through to the "show" functions for MBM and CQM.

This patch series splits the one resource that is marked as both
mon_capable and alloc_capable into two. Maybe that's a useful cleanup,
but maybe not a requirement for what you need.

-Tony
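A minimal sketch of the kind of monitor-only resource described above
(the rid value and name are hypothetical; the flags are the
alloc_capable/mon_capable fields of struct rdt_resource):

  /*
   * Sketch: a resource that would create mon_data files but no
   * schemata entry. Id and name are made up for illustration.
   */
  static struct rdt_resource mc_mbm_sketch = {
          .rid           = 42,       /* hypothetical resource index */
          .name          = "MC_MBM", /* hypothetical name */
          .mon_capable   = true,     /* mon_data files are created */
          .alloc_capable = false,    /* no line in schemata */
  };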
Hi Tony,

On 1/30/2024 2:20 PM, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.
>
> https://lore.kernel.org/all/ZbhLRDvZrxBZDv2j@agluck-desk3/
>
> Apologies to Drew Fustini who I'd somehow dropped from later versions
> of this series. Drew: you had made a comment at one point that having
> different scopes within a single resource may be useful on RISC-V.
> Version 14 included that, but it's gone here. Maybe multiple resctrl
> "struct resource" for a single h/w entity like L3 as I'm doing in this
> version could work for you too?
>
> Patches 1-5 are almost completely rewritten based around the new
> idea to give CMT and MBM their own "resource" instead of sharing
> one with L3 CAT. This removes the need for separate domain lists,
> and thus most of the churn of the previous version of this series.

I do not see it as removing the need for separate domain lists but
instead keeping the idea of separate domain lists but in this case
duplicating the resource in order to host the second domain list.

This solution also keeps the same structures for control and monitoring
that the previous version cleaned up [1]. To me this thus seems like a
similar solution as v14 but with additional duplication due to an extra
resource and without the cleanup.

Reinette

[1] https://lore.kernel.org/lkml/20240126223837.21835-5-tony.luck@intel.com/
Hi Tony,

On 1/30/24 16:20, Tony Luck wrote:
> This is the re-worked version of this series that I promised to post
> yesterday. Check that e-mail for the arguments for this alternate
> approach.

To be honest, I like this series more than the previous series. I
always thought RDT_RESOURCE_L3_MON should have been a separate resource
by itself.

You need to separate the domain lists for RDT_RESOURCE_L3 and
RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in
this series. Also I have a few other comments as well.

Thanks
Babu
Hi Babu,

On 2/9/2024 7:27 AM, Moger, Babu wrote:
> To be honest, I like this series more than the previous series. I always
> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.

Would you prefer that your "Reviewed-by" tag be removed from the
previous series?

Reinette
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>
> Would you prefer that your "Reviewed-by" tag be removed from the
> previous series?

I'm thinking that I could continue splitting things and break "struct
rdt_resource" into separate "ctrl" and "mon" structures. Then we'd have
a clean split from top to bottom.

Doing that would get rid of the rdt_resources_all[] array. Replacing
with individual rdt_hw_ctrl_resource and rdt_hw_mon_resource
declarations for each feature.

Features found on a system would be added to a list of ctrl or list of
mon resources.

-Tony
Hi Tony,

On 2/12/2024 11:57 AM, Luck, Tony wrote:
>>> To be honest, I like this series more than the previous series. I always
>>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>
>> Would you prefer that your "Reviewed-by" tag be removed from the
>> previous series?
>
> I'm thinking that I could continue splitting things and break "struct rdt_resource" into
> separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.

It is not obvious what you mean with "continue splitting things". Are
you speaking about "continue splitting from v14" or "continue splitting
from v15-RFC"?

I think that any solution needs to consider what makes sense for
resctrl as a whole instead of how to support SNC with the smallest
patch possible. There should not be any changes that make resctrl
harder to understand and maintain, as exemplified by confusion
introduced by something as simple as resource name choice [1].

> Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
> rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
>
> Features found on a system would be added to a list of ctrl or list of mon resources.

Could you please elaborate what is architecturally wrong with v14 and
how this new proposal addresses that?

Reinette

[1] https://lore.kernel.org/lkml/ZcZyqs5hnQqZ5ZV0@agluck-desk3/
On Mon, Feb 12, 2024 at 01:43:56PM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 2/12/2024 11:57 AM, Luck, Tony wrote:
> >>> To be honest, I like this series more than the previous series. I always
> >>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
> >>
> >> Would you prefer that your "Reviewed-by" tag be removed from the
> >> previous series?
> >
> > I'm thinking that I could continue splitting things and break "struct rdt_resource" into
> > separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.
>
> It is not obvious what you mean with "continue splitting things". Are you
> speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?

I'm speaking of some future potential changes. Not proposing to
do this now.

> I think that any solution needs to consider what makes sense for resctrl
> as a whole instead of how to support SNC with the smallest patch possible.

I am officially abandoning my v15-RFC patches. I wasn't clear enough in
my e-mail earlier today.

https://lore.kernel.org/all/SJ1PR11MB608378D1304224D9E8A9016FFC482@SJ1PR11MB6083.namprd11.prod.outlook.com/

> There should not be any changes that make resctrl harder to understand
> and maintain, as exemplified by confusion introduced by something as
> simple as resource name choice [1].
>
> > Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
> > rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
> >
> > Features found on a system would be added to a list of ctrl or list of mon resources.
>
> Could you please elaborate what is architecturally wrong with v14 and how this
> new proposal addresses that?

There is nothing architecturally wrong with v14. I thought it was more
complex than it needed to be. You have convinced me that my v15-RFC
series, while simpler, is not a reasonable path for long-term resctrl
maintainability.

> Reinette
>
> [1] https://lore.kernel.org/lkml/ZcZyqs5hnQqZ5ZV0@agluck-desk3/

-Tony
Hello,

On 12/02/2024 22:05, Tony Luck wrote:
> On Mon, Feb 12, 2024 at 01:43:56PM -0800, Reinette Chatre wrote:
>> On 2/12/2024 11:57 AM, Luck, Tony wrote:
>>>>> To be honest, I like this series more than the previous series. I always
>>>>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>>>
>>>> Would you prefer that your "Reviewed-by" tag be removed from the
>>>> previous series?
>>>
>>> I'm thinking that I could continue splitting things and break "struct rdt_resource" into
>>> separate "ctrl" and "mon" structures. Then we'd have a clean split from top to bottom.
>>
>> It is not obvious what you mean with "continue splitting things". Are you
>> speaking about "continue splitting from v14" or "continue splitting from v15-RFC"?
>
> I'm speaking of some future potential changes. Not proposing to
> do this now.
>
>> I think that any solution needs to consider what makes sense for resctrl
>> as a whole instead of how to support SNC with the smallest patch possible.
>> There should not be any changes that make resctrl harder to understand
>> and maintain, as exemplified by confusion introduced by something as
>> simple as resource name choice [1].
>>
>>> Doing that would get rid of the rdt_resources_all[] array. Replacing with individual
>>> rdt_hw_ctrl_resource and rdt_hw_mon_resource declarations for each feature.
>>>
>>> Features found on a system would be added to a list of ctrl or list of mon resources.
>>
>> Could you please elaborate what is architecturally wrong with v14 and how this
>> new proposal addresses that?
>
> There is nothing architecturally wrong with v14. I thought it was more
> complex than it needed to be. You have convinced me that my v15-RFC
> series, while simpler, is not a reasonable path for long-term resctrl
> maintainability.

I'm not sure if it's helpful to describe a third approach at this point -
but on the off chance it's useful:

With SNC enabled, the L3 monitors are unaffected, but the controls
behave as if they were part of some other component in the system.

ACPI describes something called "memory side caches" [0] in the HMAT
table, which are outside the CPU cache hierarchy, and are associated
with a Proximity-Domain. I've heard that one of Arm's partners has
built a system with MPAM controls on something like this. How would we
support this - and would this be a better fit for the way SNC behaves?

I think this would be a new resource and schema, 'MSC'(?) with
domain-ids using the NUMA nid. As these aren't CPU caches, they
wouldn't appear in the same part of the sysfs hierarchy, and wouldn't
necessarily have a cache-id.

For SNC systems, I think this would look like CMT on the L3, and CAT on
the 'MSC'. Existing software wouldn't know to use the new schema, but
equally wouldn't be surprised by the domain-ids being something other
than the cache-id, and the controls and monitors not lining up.

Where it's not quite right for SNC is that sysfs may not describe a
memory side cache, but one would be present in resctrl. I don't think
that's a problem - unless these systems do also have a memory-side-cache
that behaves differently. (Other than the controls being applied at the
'near' side of the link - I don't think the difference matters.)

I'm a little nervous that the SNC support looks strange if we ever add
support for something like the above. Given it's described in ACPI, I
assume there are plenty of machines out there that look like this.

(Why aren't memory-side-caches a CPU cache? They live near the memory
controller and cache based on the PA, not the CPU that issued the
transaction)

Thanks,

James

[0] https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#memory-side-cache-overview
> With SNC enabled, the L3 monitors are unaffected, but the controls
> behave as if they were part of some other component in the system.

I don't think of it like that. See attached picture of a single socket
divided in two by SNC. [If the attachment is stripped off for those
reading this via mailing lists, if you want the picture, just send me
an e-mail.]

Everything in blue is node 0. Yellow for node 1. The rectangles in the
middle represent the L3 cache (12-way associative). When cores in node
0 access memory in node 0, it will be cached using the "top" half of
the cache indices. Similarly for node 1 using the "bottom" half.

Here's how each of the Intel L3 resctrl functions operates with SNC
enabled:

CQM: Reports how much of your half of the L3 cache is occupied.

MBM: Reports on memory traffic from your half of the cache to your
     memory controllers.

CAT: Still controls which ways of the cache are available for
     allocation (but each way has half the capacity.)

MBA: The same throttling levels applied to "blue" and "yellow" traffic
     (because there are only socket level controls).

> I'm a little nervous that the SNC support looks strange if we ever add
> support for something like the above. Given it's described in ACPI, I
> assume there are plenty of machines out there that look like this.

I'm also nervous as h/w designers find various ways to diverge from the
old paradigm of socket scope == L3 cache scope == NUMA node scope.

-Tony
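As an aside, that divergence is also what makes SNC visible to
software. A hedged sketch of one way the nodes-per-L3 ratio could be
inferred (not the detection code from this series; the helper name and
method are assumptions, and CPU-less or offline nodes are ignored):

  /*
   * Sketch: with SNC enabled there are more NUMA nodes than L3 cache
   * instances, so the ratio gives the number of SNC nodes per cache.
   */
  static __init unsigned int snc_nodes_per_l3_sketch(void)
  {
          DECLARE_BITMAP(l3_ids, NR_CPUS);
          unsigned int cpu, caches;
          int id;

          bitmap_zero(l3_ids, NR_CPUS);
          for_each_online_cpu(cpu) {
                  id = get_cpu_cacheinfo_id(cpu, 3);
                  if (id >= 0 && id < NR_CPUS)
                          set_bit(id, l3_ids);
          }
          caches = bitmap_weight(l3_ids, NR_CPUS);

          return caches ? num_online_nodes() / caches : 1;
  }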
On 2/12/24 13:44, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/9/2024 7:27 AM, Moger, Babu wrote:
>
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>
> Would you prefer that your "Reviewed-by" tag be removed from the
> previous series?

Sure. I will plan to review again the new series when Tony submits v16.

--
Thanks
Babu Moger
On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> On 1/30/24 16:20, Tony Luck wrote:
> > This is the re-worked version of this series that I promised to post
> > yesterday. Check that e-mail for the arguments for this alternate
> > approach.
>
> To be honest, I like this series more than the previous series. I always
> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>
> You need to separate the domain lists for RDT_RESOURCE_L3 and
> RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
> series. Also I have a few other comments as well.

They are separated. Each "struct rdt_resource" has its own domain list.

Or do you mean break up the struct rdt_domain into the control and
monitor versions as was done in the previous series?

> Thanks
> Babu
On 2/9/24 12:31, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 1/30/24 16:20, Tony Luck wrote:
>>> This is the re-worked version of this series that I promised to post
>>> yesterday. Check that e-mail for the arguments for this alternate
>>> approach.
>>
>> To be honest, I like this series more than the previous series. I always
>> thought RDT_RESOURCE_L3_MON should have been a separate resource by itself.
>>
>> You need to separate the domain lists for RDT_RESOURCE_L3 and
>> RDT_RESOURCE_L3_MON if you are going this route. I didn't see that in this
>> series. Also I have a few other comments as well.
>
> They are separated. Each "struct rdt_resource" has its own domain list.

Yea. You are right.

> Or do you mean break up the struct rdt_domain into the control and
> monitor versions as was done in the previous series?

No. Not required. Each resource has its own domain list. So, it is
separated already as far as I can see.

Reinette seems to have some concerns about this series. But, I am fine
with both these approaches. I feel this is a cleaner approach.

--
Thanks
Babu Moger
On 2/9/2024 11:38 AM, Moger, Babu wrote:
> On 2/9/24 12:31, Tony Luck wrote:
>> On Fri, Feb 09, 2024 at 09:27:56AM -0600, Moger, Babu wrote:
>>> On 1/30/24 16:20, Tony Luck wrote:
>
> Reinette seems to have some concerns about this series. But, I am fine
> with both these approaches. I feel this is a cleaner approach.

I questioned the motivation but never received a response.

Reinette
>> Reinette seems to have some concerns about this series. But, I am fine
>> with both these approaches. I feel this is a cleaner approach.
>
> I questioned the motivation but never received a response.

Reinette,

Sorry. My motivation was to reduce the amount of code churn that was
done by the previous incarnation.

9 files changed, 629 insertions(+), 282 deletions(-)

Vast amounts of that just added "_mon" or "_ctrl" to structure or
variable names.

-Tony
Hi Tony,

On 2/9/2024 1:36 PM, Luck, Tony wrote:
>>> Reinette seems to have some concerns about this series. But, I am fine
>>> with both these approaches. I feel this is a cleaner approach.
>>
>> I questioned the motivation but never received a response.
>
> Reinette,
>
> Sorry. My motivation was to reduce the amount of code churn that was
> done by the previous incarnation.
>
> 9 files changed, 629 insertions(+), 282 deletions(-)
>
> Vast amounts of that just added "_mon" or "_ctrl" to structure
> or variable names.

I actually had specific points that this response also ignores.
Let me repeat and highlight the same points:

1) You claim that this series "removes the need for separate domain
   lists" ... but then this series does just that (create a separate
   domain list), but in an obfuscated way (duplicate the resource to
   have the monitoring domain list in there).

2) You claim this series "reduces amount of code churn", but this is
   because this series keeps using the same original data structures
   for separate monitoring and control usages. The previous series
   made an effort to separate the structures for the different usages
   but this series does not. What makes it ok in this series to use
   the same data structures for different usages?

Additionally:

Regarding "Vast amounts of that just added "_mon" or "_ctrl" to
structure or variable names." ... that is because the structures are
actually split, no? It is not just renaming for unnecessary churn.

What is the benefit of keeping the data structures to be shared between
monitor and control usages? If there is a benefit to keeping these data
structures, why not just address this aspect in the previous solution?

Reinette
> I actually had specific points that this response also ignores.
> Let me repeat and highlight the same points:
>
> 1) You claim that this series "removes the need for separate domain
>    lists" ... but then this series does just that (create a separate
>    domain list), but in an obfuscated way (duplicate the resource to
>    have the monitoring domain list in there).

That was poorly worded on my part. I should have said "removes the need
for separate domain lists within a single rdt_resource".

Adding an extra domain list to a resource may be the start of a
slippery slope. What if there is some additional "L3"-like resctrl
operation that acts at the socket level (Intel has made products with
multiple L3 instances per socket before). Would you be OK adding a
third domain list to every struct rdt_resource to handle this? Or would
it be simpler to just add a new rdt_resource structure with socket
scoped domains?

> 2) You claim this series "reduces amount of code churn", but this is
>    because this series keeps using the same original data structures
>    for separate monitoring and control usages. The previous series
>    made an effort to separate the structures for the different usages
>    but this series does not. What makes it ok in this series to use
>    the same data structures for different usages?

Legacy resctrl has been using the same rdt_domain structure for both
usages since the dawn of time. So it has been OK up until now.

> Additionally:
>
> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to
> structure or variable names." ... that is because the structures are
> actually split, no? It is not just renaming for unnecessary churn.

Perhaps not "unnecessary" churn. But certainly a lot of code change for
what I perceive as very little real gain.

> What is the benefit of keeping the data structures to be shared
> between monitor and control usages?

Benefit is no code changes. Cost is continuing to waste memory with
structures that are slightly bigger than they need to be.

> If there is a benefit to keeping these data structures, why not just
> address this aspect in the previous solution?

The previous solution evolved to splitting these structures. But this
happened incrementally (remember that at an early stage the monitor
structures all got the "_mon" addition to their names, but the control
structures kept the original names). Only when I got to the end of this
process did I look at the magnitude of the change.

-Tony
Hi Tony,

On 2/9/2024 3:44 PM, Luck, Tony wrote:
>> I actually had specific points that this response also ignores.
>> Let me repeat and highlight the same points:
>>
>> 1) You claim that this series "removes the need for separate domain
>>    lists" ... but then this series does just that (create a separate
>>    domain list), but in an obfuscated way (duplicate the resource to
>>    have the monitoring domain list in there).
>
> That was poorly worded on my part. I should have said "removes the
> need for separate domain lists within a single rdt_resource".
>
> Adding an extra domain list to a resource may be the start of a
> slippery slope. What if there is some additional "L3"-like resctrl
> operation that acts at the socket level (Intel has made products with
> multiple L3 instances per socket before). Would you be OK adding a
> third domain list to every struct rdt_resource to handle this? Or
> would it be simpler to just add a new rdt_resource structure with
> socket scoped domains?

This should not be about what is simplest to patch into current
resctrl.

There is no need to support a new domain list for a new scope. The
domain lists support the functionality: control or monitoring. If
control has socket scope the existing implementation supports that.

If there is another operation supported by a resource apart from
control or monitoring then we can consider how to support it when we
know what it is. That would also be a great point to decide if the same
data structure should just grow to support an operation that not all
resources may support. That may depend on the amount of data needed to
support this hypothetical operation.

>> 2) You claim this series "reduces amount of code churn", but this is
>>    because this series keeps using the same original data structures
>>    for separate monitoring and control usages. The previous series
>>    made an effort to separate the structures for the different usages
>>    but this series does not. What makes it ok in this series to use
>>    the same data structures for different usages?
>
> Legacy resctrl has been using the same rdt_domain structure for both
> usages since the dawn of time. So it has been OK up until now.

This is not the same.

Legacy resctrl uses the same data structure in the same list for both
control and monitoring usages so it is fine to have both monitoring and
control data in the data structure.

What you are doing in both solutions is to place the same data
structure in separate lists for control and monitoring usages. In the
one list only the control data is used, in the other only the
monitoring data is used.

>> Additionally:
>>
>> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to
>> structure or variable names." ... that is because the structures are
>> actually split, no? It is not just renaming for unnecessary churn.
>
> Perhaps not "unnecessary" churn. But certainly a lot of code change for
> what I perceive as very little real gain.

ok. There may be little gain wrt saving space. One complication with
this single data structure is that its content may only be decided
based on which list it is part of. It should be obvious to developers
when which members are valid. Perhaps this can be addressed with clear
documentation of the data structures.

>> What is the benefit of keeping the data structures to be shared
>> between monitor and control usages?
>
> Benefit is no code changes. Cost is continuing to waste memory with
> structures that are slightly bigger than they need to be.

>> If there is a benefit to keeping these data structures, why not just
>> address this aspect in the previous solution?
>
> The previous solution evolved to splitting these structures. But this
> happened incrementally (remember that at an early stage the monitor
> structures all got the "_mon" addition to their names, but the control
> structures kept the original names). Only when I got to the end of
> this process did I look at the magnitude of the change.

Not answering my question.

Reinette
> >> I actually had specific points that this response also ignores.
> >> Let me repeat and highlight the same points:
> >>
> >> 1) You claim that this series "removes the need for separate domain
> >>    lists" ... but then this series does just that (create a separate
> >>    domain list), but in an obfuscated way (duplicate the resource to
> >>    have the monitoring domain list in there).
> >
> > That was poorly worded on my part. I should have said "removes the
> > need for separate domain lists within a single rdt_resource".
> >
> > Adding an extra domain list to a resource may be the start of a
> > slippery slope. What if there is some additional "L3"-like resctrl
> > operation that acts at the socket level (Intel has made products with
> > multiple L3 instances per socket before). Would you be OK adding a
> > third domain list to every struct rdt_resource to handle this? Or
> > would it be simpler to just add a new rdt_resource structure with
> > socket scoped domains?
>
> This should not be about what is simplest to patch into current
> resctrl.

I wanted to offer this in case Boris also thought that the previous
version was too much churn to support an obscure Intel-only (so far)
feature. But if you are going to Nack this new version on the grounds
that it muddies the water about usage of the rdt_domain structure, then
I will abandon it.

> There is no need to support a new domain list for a new scope. The
> domain lists support the functionality: control or monitoring. If
> control has socket scope the existing implementation supports that.
>
> If there is another operation supported by a resource apart from
> control or monitoring then we can consider how to support it when we
> know what it is. That would also be a great point to decide if the same
> data structure should just grow to support an operation that not all
> resources may support. That may depend on the amount of data needed to
> support this hypothetical operation.
>
> >> 2) You claim this series "reduces amount of code churn", but this is
> >>    because this series keeps using the same original data structures
> >>    for separate monitoring and control usages. The previous series
> >>    made an effort to separate the structures for the different usages
> >>    but this series does not. What makes it ok in this series to use
> >>    the same data structures for different usages?
> >
> > Legacy resctrl has been using the same rdt_domain structure for both
> > usages since the dawn of time. So it has been OK up until now.
>
> This is not the same.
>
> Legacy resctrl uses the same data structure in the same list for both
> control and monitoring usages so it is fine to have both monitoring and
> control data in the data structure.
>
> What you are doing in both solutions is to place the same data
> structure in separate lists for control and monitoring usages. In the
> one list only the control data is used, in the other only the
> monitoring data is used.
>
> >> Additionally:
> >>
> >> Regarding "Vast amounts of that just added "_mon" or "_ctrl" to
> >> structure or variable names." ... that is because the structures are
> >> actually split, no? It is not just renaming for unnecessary churn.
> >
> > Perhaps not "unnecessary" churn. But certainly a lot of code change for
> > what I perceive as very little real gain.
>
> ok. There may be little gain wrt saving space. One complication with
> this single data structure is that its content may only be decided
> based on which list it is part of. It should be obvious to developers
> when which members are valid. Perhaps this can be addressed with clear
> documentation of the data structures.
>
> >> What is the benefit of keeping the data structures to be shared
> >> between monitor and control usages?
> >
> > Benefit is no code changes. Cost is continuing to waste memory with
> > structures that are slightly bigger than they need to be.
>
> >> If there is a benefit to keeping these data structures, why not just
> >> address this aspect in the previous solution?
> >
> > The previous solution evolved to splitting these structures. But this
> > happened incrementally (remember that at an early stage the monitor
> > structures all got the "_mon" addition to their names, but the control
> > structures kept the original names). Only when I got to the end of
> > this process did I look at the magnitude of the change.
>
> Not answering my question.

I'm not exactly sure what "aspect" you thought could be addressed in
the previous series. But the point is moot now. This diversion from the
series has come to a dead end, and I hope that Boris will look at v14
(either before the next group of ARM patches, or after).

-Tony
The RDT_RESOURCE_L3 is unique in that it is used for both monitoring
an control functions. This made sense while both uses had the same
scope. But systems with Sub-NUMA clustering enabled do not follow this
pattern.
Create a new resource: RDT_RESOURCE_L3_MON ready to take over the
monitoring functions.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++++++
2 files changed, 11 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 52e7e7deee10..c6051bc70e96 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -429,6 +429,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
extern struct dentry *debugfs_resctrl;
enum resctrl_res_level {
+ RDT_RESOURCE_L3_MON,
RDT_RESOURCE_L3,
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index aa9810a64258..c50f55d7790e 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
struct rdt_hw_resource rdt_resources_all[] = {
+ [RDT_RESOURCE_L3_MON] =
+ {
+ .r_resctrl = {
+ .rid = RDT_RESOURCE_L3_MON,
+ .name = "L3",
+ .cache_level = 3,
+ .domains = domain_init(RDT_RESOURCE_L3_MON),
+ .fflags = RFTYPE_RES_CACHE,
+ },
+ },
[RDT_RESOURCE_L3] =
{
.r_resctrl = {
--
2.43.0
Hi Tony,
On 1/30/24 16:20, Tony Luck wrote:
> The RDT_RESOURCE_L3 is unique in that it is used for both monitoring
> an control functions. This made sense while both uses had the same
an/and
> scope. But systems with Sub-NUMA clustering enabled do not follow this
> pattern.
>
> Create a new resource: RDT_RESOURCE_L3_MON ready to take over the
> monitoring functions.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 10 ++++++++++
> 2 files changed, 11 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index 52e7e7deee10..c6051bc70e96 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -429,6 +429,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> extern struct dentry *debugfs_resctrl;
>
> enum resctrl_res_level {
> + RDT_RESOURCE_L3_MON,
> RDT_RESOURCE_L3,
How about?
RDT_RESOURCE_L3,
RDT_RESOURCE_L3_MON,
> RDT_RESOURCE_L2,
> RDT_RESOURCE_MBA,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index aa9810a64258..c50f55d7790e 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
>
> struct rdt_hw_resource rdt_resources_all[] = {
> + [RDT_RESOURCE_L3_MON] =
> + {
> + .r_resctrl = {
> + .rid = RDT_RESOURCE_L3_MON,
> + .name = "L3",
L3_MON ?
> + .cache_level = 3,
> + .domains = domain_init(RDT_RESOURCE_L3_MON),
> + .fflags = RFTYPE_RES_CACHE,
> + },
> + },
> [RDT_RESOURCE_L3] =
> {
> .r_resctrl = {
--
Thanks
Babu Moger
On Fri, Feb 09, 2024 at 09:28:16AM -0600, Moger, Babu wrote:
> > enum resctrl_res_level {
> > + RDT_RESOURCE_L3_MON,
> > RDT_RESOURCE_L3,
>
> How about?
> RDT_RESOURCE_L3,
> RDT_RESOURCE_L3_MON,
Does the order matter? I put the L3_MON one first because historically
cache occupancy was the first resctrl tool. But if you have a better
argument for the order, then I can change it.
> > RDT_RESOURCE_L2,
> > RDT_RESOURCE_MBA,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index aa9810a64258..c50f55d7790e 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
> > #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
> >
> > struct rdt_hw_resource rdt_resources_all[] = {
> > + [RDT_RESOURCE_L3_MON] =
> > + {
> > + .r_resctrl = {
> > + .rid = RDT_RESOURCE_L3_MON,
> > + .name = "L3",
>
> L3_MON ?
That was my first choice too. But I found:
$ ls /sys/fs/resctrl/info
L3 L3_MON_MON last_cmd_status MB
This would be easy to fix ... just change this code to not append
an extra "_MON" to the directory name:
for_each_mon_capable_rdt_resource(r) {
fflags = r->fflags | RFTYPE_MON_INFO;
sprintf(name, "%s_MON", r->name);
ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
if (ret)
goto out_destroy;
}
But I also saw this:
$ ls /sys/fs/resctrl/mon_data/
mon_L3_MON_00 mon_L3_MON_01
which didn't seem to have an easy fix. So I took the easy route and left
the ".name" field as "L3_MON".
-Tony
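
For reference, a sketch of the "easy to fix" info directory half of
that, had the name stayed "L3_MON" (the suffix check helper is made up;
the loop is the one quoted above, and the mon_data naming would still
need separate treatment):

  /* Skip appending "_MON" when the name already ends with it, so
   * "L3_MON" does not become "L3_MON_MON" under info/. */
  static bool name_has_mon_suffix(const char *name)
  {
          size_t len = strlen(name);

          return len >= 4 && !strcmp(name + len - 4, "_MON");
  }

  for_each_mon_capable_rdt_resource(r) {
          fflags = r->fflags | RFTYPE_MON_INFO;
          if (name_has_mon_suffix(r->name))
                  sprintf(name, "%s", r->name);
          else
                  sprintf(name, "%s_MON", r->name);
          ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
          if (ret)
                  goto out_destroy;
  }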
On 2/9/24 12:44, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:28:16AM -0600, Moger, Babu wrote:
>>> enum resctrl_res_level {
>>> + RDT_RESOURCE_L3_MON,
>>> RDT_RESOURCE_L3,
>>
>> How about?
>> RDT_RESOURCE_L3,
>> RDT_RESOURCE_L3_MON,
>
> Does the order matter? I put the L3_MON one first because historically
> cache occupancy was the first resctrl tool. But if you have a better
> argument for the order, then I can change it.
That is fine. No need to change.
>
>>> RDT_RESOURCE_L2,
>>> RDT_RESOURCE_MBA,
>>> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
>>> index aa9810a64258..c50f55d7790e 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/core.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/core.c
>>> @@ -60,6 +60,16 @@ mba_wrmsr_amd(struct rdt_domain *d, struct msr_param *m,
>>> #define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.domains)
>>>
>>> struct rdt_hw_resource rdt_resources_all[] = {
>>> + [RDT_RESOURCE_L3_MON] =
>>> + {
>>> + .r_resctrl = {
>>> + .rid = RDT_RESOURCE_L3_MON,
>>> + .name = "L3",
>>
>> L3_MON ?
>
> That was my first choice too. But I found:
>
> $ ls /sys/fs/resctrl/info
> L3 L3_MON_MON last_cmd_status MB
>
> This would be easy to fix ... just change this code to not append
> an extra "_MON" to the directory name:
>
> for_each_mon_capable_rdt_resource(r) {
> fflags = r->fflags | RFTYPE_MON_INFO;
> sprintf(name, "%s_MON", r->name);
> ret = rdtgroup_mkdir_info_resdir(r, name, fflags);
> if (ret)
> goto out_destroy;
> }
>
> But I also saw this:
> $ ls /sys/fs/resctrl/mon_data/
> mon_L3_MON_00 mon_L3_MON_01
>
> which didn't seem to have an easy fix. So I took the easy route and left
> the ".name" field as "L3_MON".
>
Ok. Sounds good.
--
Thanks
Babu Moger
Switch over all places that setup and use monitoring funtions to
use the new resource structure.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++--
arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++--------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
3 files changed, 9 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index c50f55d7790e..0828575c3e13 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
return;
}
- if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
+ if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d, 0);
}
+ }
+ if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
has_busy_rmid(r, d)) {
cancel_delayed_work(&d->cqm_limbo);
@@ -826,7 +828,7 @@ static __init bool get_rdt_alloc_resources(void)
static __init bool get_rdt_mon_resources(void)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3a6c069614eb..080cad0d7288 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -268,7 +268,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
*/
void __check_limbo(struct rdt_domain *d, bool force_free)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
struct rmid_entry *entry;
u32 crmid = 1, nrmid;
bool rmid_dirty;
@@ -333,7 +333,7 @@ int alloc_rmid(void)
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
- struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
struct rdt_domain *d;
int cpu, err;
u64 val = 0;
@@ -623,7 +623,7 @@ void cqm_handle_limbo(struct work_struct *work)
mutex_lock(&rdtgroup_mutex);
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
d = container_of(work, struct rdt_domain, cqm_limbo.work);
__check_limbo(d, false);
@@ -659,7 +659,7 @@ void mbm_handle_overflow(struct work_struct *work)
if (!static_branch_likely(&rdt_mon_enable_key))
goto out_unlock;
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
d = container_of(work, struct rdt_domain, mbm_over.work);
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
@@ -736,10 +736,6 @@ static struct mon_evt mbm_local_event = {
/*
* Initialize the event list for the resource.
- *
- * Note that MBM events are also part of RDT_RESOURCE_L3 resource
- * because as per the SDM the total and local memory bandwidth
- * are enumerated as part of L3 monitoring.
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index aa24343f1d23..9ee3a9906781 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2644,7 +2644,7 @@ static int rdt_get_tree(struct fs_context *fc)
static_branch_enable_cpuslocked(&rdt_enable_key);
if (is_mbm_enabled()) {
- r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
list_for_each_entry(dom, &r->domains, list)
mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
}
--
2.43.0
Tony,
On 1/30/24 16:20, Tony Luck wrote:
> Switch over all places that setup and use monitoring funtions to
functions?
> use the new resource structure.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++--
> arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++--------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 3 files changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index c50f55d7790e..0828575c3e13 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> return;
> }
>
> - if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> + if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
> if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> cancel_delayed_work(&d->mbm_over);
> mbm_setup_overflow_handler(d, 0);
> }
> + }
> + if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
RDT_RESOURCE_L3_MON?
> if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
> has_busy_rmid(r, d)) {
> cancel_delayed_work(&d->cqm_limbo);
> @@ -826,7 +828,7 @@ static __init bool get_rdt_alloc_resources(void)
>
> static __init bool get_rdt_mon_resources(void)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
>
> if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
> rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 3a6c069614eb..080cad0d7288 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -268,7 +268,7 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> */
> void __check_limbo(struct rdt_domain *d, bool force_free)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> struct rmid_entry *entry;
> u32 crmid = 1, nrmid;
> bool rmid_dirty;
> @@ -333,7 +333,7 @@ int alloc_rmid(void)
>
> static void add_rmid_to_limbo(struct rmid_entry *entry)
> {
> - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> struct rdt_domain *d;
> int cpu, err;
> u64 val = 0;
> @@ -623,7 +623,7 @@ void cqm_handle_limbo(struct work_struct *work)
>
> mutex_lock(&rdtgroup_mutex);
>
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> d = container_of(work, struct rdt_domain, cqm_limbo.work);
>
> __check_limbo(d, false);
> @@ -659,7 +659,7 @@ void mbm_handle_overflow(struct work_struct *work)
> if (!static_branch_likely(&rdt_mon_enable_key))
> goto out_unlock;
>
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> d = container_of(work, struct rdt_domain, mbm_over.work);
>
> list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
> @@ -736,10 +736,6 @@ static struct mon_evt mbm_local_event = {
>
> /*
> * Initialize the event list for the resource.
> - *
> - * Note that MBM events are also part of RDT_RESOURCE_L3 resource
> - * because as per the SDM the total and local memory bandwidth
> - * are enumerated as part of L3 monitoring.
> */
> static void l3_mon_evt_init(struct rdt_resource *r)
> {
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index aa24343f1d23..9ee3a9906781 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2644,7 +2644,7 @@ static int rdt_get_tree(struct fs_context *fc)
> static_branch_enable_cpuslocked(&rdt_enable_key);
>
> if (is_mbm_enabled()) {
> - r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> + r = &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl;
> list_for_each_entry(dom, &r->domains, list)
> mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
> }
--
Thanks
Babu Moger
On Fri, Feb 09, 2024 at 09:28:25AM -0600, Moger, Babu wrote:
> Tony,
>
> On 1/30/24 16:20, Tony Luck wrote:
> > Switch over all places that setup and use monitoring funtions to
>
> functions?
Yes. Will fix.
> > use the new resource structure.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > ---
> > arch/x86/kernel/cpu/resctrl/core.c | 6 ++++--
> > arch/x86/kernel/cpu/resctrl/monitor.c | 12 ++++--------
> > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> > 3 files changed, 9 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index c50f55d7790e..0828575c3e13 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -591,11 +591,13 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> > return;
> > }
> >
> > - if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
> > + if (r == &rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl) {
> > if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
> > cancel_delayed_work(&d->mbm_over);
> > mbm_setup_overflow_handler(d, 0);
> > }
> > + }
> > + if (r == &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl) {
>
> RDT_RESOURCE_L3_MON?
Good catch.
-Tony
Not all resources are scoped in line with some level of hardware cache.
Prepare by renaming the "cache_level" field to "scope" and changing
the type to an enum to ease adding new scopes.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 9 +++++++--
arch/x86/kernel/cpu/resctrl/core.c | 14 +++++++-------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
4 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..2155dc15e636 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -144,13 +144,18 @@ struct resctrl_membw {
struct rdt_parse_data;
struct resctrl_schema;
+enum resctrl_scope {
+ RESCTRL_L2_CACHE = 2,
+ RESCTRL_L3_CACHE = 3,
+};
+
/**
* struct rdt_resource - attributes of a resctrl resource
* @rid: The index of the resource
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @num_rmid: Number of RMIDs available
- * @cache_level: Which cache level defines scope of this resource
+ * @scope: Hardware scope for this resource
* @cache: Cache allocation related data
* @membw: If the component has bandwidth controls, their properties.
* @domains: All domains for this resource
@@ -168,7 +173,7 @@ struct rdt_resource {
bool alloc_capable;
bool mon_capable;
int num_rmid;
- int cache_level;
+ enum resctrl_scope scope;
struct resctrl_cache cache;
struct resctrl_membw membw;
struct list_head domains;
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 0828575c3e13..d89dce63397b 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3_MON,
.name = "L3",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3_MON),
.fflags = RFTYPE_RES_CACHE,
},
@@ -75,7 +75,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L3,
.name = "L3",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_L3),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -89,7 +89,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_L2,
.name = "L2",
- .cache_level = 2,
+ .scope = RESCTRL_L2_CACHE,
.domains = domain_init(RDT_RESOURCE_L2),
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
@@ -103,7 +103,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_MBA,
.name = "MB",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_MBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -115,7 +115,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
.r_resctrl = {
.rid = RDT_RESOURCE_SMBA,
.name = "SMBA",
- .cache_level = 3,
+ .scope = RESCTRL_L3_CACHE,
.domains = domain_init(RDT_RESOURCE_SMBA),
.parse_ctrlval = parse_bw,
.format_str = "%d=%*u",
@@ -514,7 +514,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_cpu_cacheinfo_id(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
@@ -564,7 +564,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
+ int id = get_cpu_cacheinfo_id(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 8f559eeae08e..6a72fb627aa5 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -311,7 +311,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == plr->s->res->cache_level) {
+ if (ci->info_list[i].level == plr->s->res->scope) {
plr->line_size = ci->info_list[i].coherency_line_size;
return 0;
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 9ee3a9906781..eff9d87547c9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1416,7 +1416,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
- if (ci->info_list[i].level == r->cache_level) {
+ if (ci->info_list[i].level == r->scope) {
size = ci->info_list[i].size / r->cache.cbm_len * num_b;
break;
}
--
2.43.0
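
One detail worth noting about the enum above: the values RESCTRL_L2_CACHE = 2
and RESCTRL_L3_CACHE = 3 are deliberately equal to the hardware cache levels
they name, which is why the domain_add_cpu() hunk can keep passing r->scope
straight to get_cpu_cacheinfo_id(), a function that expects a cache level.
A minimal sketch of that relationship (domain_id_for() is a hypothetical
name, not from the series):

static int domain_id_for(struct rdt_resource *r, int cpu)
{
	/* Valid only while the enum values equal the cache levels. */
	return get_cpu_cacheinfo_id(cpu, r->scope);
}

Later patches in the series add RESCTRL_NODE, which is exactly why a
switch() over the scope becomes necessary in get_domain_id_from_scope().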
Hi Tony,
On 1/30/24 16:20, Tony Luck wrote:
> Not all resources are scoped in line with some level of hardware cache.
same level?
Thanks
Babu
>
> Prepare by renaming the "cache_level" field to "scope" and change
> the type to an enum to ease adding new scopes.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 9 +++++++--
> arch/x86/kernel/cpu/resctrl/core.c | 14 +++++++-------
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 4 files changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..2155dc15e636 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -144,13 +144,18 @@ struct resctrl_membw {
> struct rdt_parse_data;
> struct resctrl_schema;
>
> +enum resctrl_scope {
> + RESCTRL_L2_CACHE = 2,
> + RESCTRL_L3_CACHE = 3,
> +};
> +
> /**
> * struct rdt_resource - attributes of a resctrl resource
> * @rid: The index of the resource
> * @alloc_capable: Is allocation available on this machine
> * @mon_capable: Is monitor feature available on this machine
> * @num_rmid: Number of RMIDs available
> - * @cache_level: Which cache level defines scope of this resource
> + * @scope: Hardware scope for this resource
> * @cache: Cache allocation related data
> * @membw: If the component has bandwidth controls, their properties.
> * @domains: All domains for this resource
> @@ -168,7 +173,7 @@ struct rdt_resource {
> bool alloc_capable;
> bool mon_capable;
> int num_rmid;
> - int cache_level;
> + enum resctrl_scope scope;
> struct resctrl_cache cache;
> struct resctrl_membw membw;
> struct list_head domains;
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 0828575c3e13..d89dce63397b 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -65,7 +65,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L3_MON,
> .name = "L3",
> - .cache_level = 3,
> + .scope = RESCTRL_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_L3_MON),
> .fflags = RFTYPE_RES_CACHE,
> },
> @@ -75,7 +75,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L3,
> .name = "L3",
> - .cache_level = 3,
> + .scope = RESCTRL_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_L3),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -89,7 +89,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_L2,
> .name = "L2",
> - .cache_level = 2,
> + .scope = RESCTRL_L2_CACHE,
> .domains = domain_init(RDT_RESOURCE_L2),
> .parse_ctrlval = parse_cbm,
> .format_str = "%d=%0*x",
> @@ -103,7 +103,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_MBA,
> .name = "MB",
> - .cache_level = 3,
> + .scope = RESCTRL_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_MBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",
> @@ -115,7 +115,7 @@ struct rdt_hw_resource rdt_resources_all[] = {
> .r_resctrl = {
> .rid = RDT_RESOURCE_SMBA,
> .name = "SMBA",
> - .cache_level = 3,
> + .scope = RESCTRL_L3_CACHE,
> .domains = domain_init(RDT_RESOURCE_SMBA),
> .parse_ctrlval = parse_bw,
> .format_str = "%d=%*u",
> @@ -514,7 +514,7 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> */
> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> + int id = get_cpu_cacheinfo_id(cpu, r->scope);
> struct list_head *add_pos = NULL;
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
> @@ -564,7 +564,7 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>
> static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->cache_level);
> + int id = get_cpu_cacheinfo_id(cpu, r->scope);
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
>
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 8f559eeae08e..6a72fb627aa5 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -311,7 +311,7 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> plr->size = rdtgroup_cbm_to_size(plr->s->res, plr->d, plr->cbm);
>
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == plr->s->res->cache_level) {
> + if (ci->info_list[i].level == plr->s->res->scope) {
> plr->line_size = ci->info_list[i].coherency_line_size;
> return 0;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 9ee3a9906781..eff9d87547c9 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1416,7 +1416,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {
> - if (ci->info_list[i].level == r->cache_level) {
> + if (ci->info_list[i].level == r->scope) {
> size = ci->info_list[i].size / r->cache.cbm_len * num_b;
> break;
> }
--
Thanks
Babu Moger
On Fri, Feb 09, 2024 at 09:28:35AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> On 1/30/24 16:20, Tony Luck wrote:
> > Not all resources are scoped in line with some level of hardware cache.
>
> same level?

No. "same" isn't what I meant here. If I shuffle this around:

  Not all resources are scoped to match the scope of a hardware
  cache level.

Is that more clear?

-Tony
On 2/9/24 12:57, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:28:35AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 1/30/24 16:20, Tony Luck wrote:
>>> Not all resources are scoped in line with some level of hardware cache.
>>
>> same level?
>
> No. "same" isn't what I meant here. If I shuffle this around:
>
>   Not all resources are scoped to match the scope of a hardware
>   cache level.
>
> Is that more clear?

Sure. Looks good.

--
Thanks
Babu Moger
Prepare for more options for scope of resources. Add some diagnostic
messages if lookup fails.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index d89dce63397b..59e6aa7abef5 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -499,6 +499,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
return 0;
}
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+ switch (scope) {
+ case RESCTRL_L2_CACHE:
+ case RESCTRL_L3_CACHE:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@@ -514,12 +527,18 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
*/
static void domain_add_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct list_head *add_pos = NULL;
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
int err;
+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
+ return;
+ }
+
d = rdt_find_domain(r, id, &add_pos);
if (IS_ERR(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
@@ -564,10 +583,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
static void domain_remove_cpu(int cpu, struct rdt_resource *r)
{
- int id = get_cpu_cacheinfo_id(cpu, r->scope);
+ int id = get_domain_id_from_scope(cpu, r->scope);
struct rdt_hw_domain *hw_dom;
struct rdt_domain *d;
+ if (id < 0) {
+ pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->scope, r->name);
+ return;
+ }
+
d = rdt_find_domain(r, id, NULL);
if (IS_ERR_OR_NULL(d)) {
pr_warn("Couldn't find cache id for CPU %d\n", cpu);
--
2.43.0
Hi Tony,
On 1/30/24 16:20, Tony Luck wrote:
> Prepare for more options for scope of resources. Add some diagnostic
> messages if lookup fails.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/core.c | 29 +++++++++++++++++++++++++++--
> 1 file changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index d89dce63397b..59e6aa7abef5 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -499,6 +499,19 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> return 0;
> }
>
> +static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> +{
> + switch (scope) {
> + case RESCTRL_L2_CACHE:
> + case RESCTRL_L3_CACHE:
> + return get_cpu_cacheinfo_id(cpu, scope);
> + default:
> + break;
> + }
> +
> + return -EINVAL;
> +}
> +
> /*
> * domain_add_cpu - Add a cpu to a resource's domain list.
> *
> @@ -514,12 +527,18 @@ static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_domain *hw_dom)
> */
> static void domain_add_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->scope);
> + int id = get_domain_id_from_scope(cpu, r->scope);
> struct list_head *add_pos = NULL;
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
> int err;
>
> + if (id < 0) {
> + pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->scope, r->name);
Will it be good to move pr_warn_once inside get_domain_id_from_scope
instead of repeating during every call?
> + return;
> + }
> +
> d = rdt_find_domain(r, id, &add_pos);
> if (IS_ERR(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);
> @@ -564,10 +583,16 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
>
> static void domain_remove_cpu(int cpu, struct rdt_resource *r)
> {
> - int id = get_cpu_cacheinfo_id(cpu, r->scope);
> + int id = get_domain_id_from_scope(cpu, r->scope);
> struct rdt_hw_domain *hw_dom;
> struct rdt_domain *d;
>
> + if (id < 0) {
> + pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> + cpu, r->scope, r->name);
Same comment as above. Will it be good to move pr_warn_once inside
get_domain_id_from_scope ?
> + return;
> + }
> +
> d = rdt_find_domain(r, id, NULL);
> if (IS_ERR_OR_NULL(d)) {
> pr_warn("Couldn't find cache id for CPU %d\n", cpu);
--
Thanks
Babu Moger
On Fri, Feb 09, 2024 at 09:28:52AM -0600, Moger, Babu wrote:
> > + if (id < 0) {
> > + pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> > + cpu, r->scope, r->name);
>
> Will it be good to move pr_warn_once inside get_domain_id_from_scope
> instead of repeating during every call?
Yes. Will move from here to get_domain_id_from_scope().
> > + if (id < 0) {
> > + pr_warn_once("Can't find domain id for CPU:%d scope:%d for resource %s\n",
> > + cpu, r->scope, r->name);
>
> Same comment as above. Will it be good to move pr_warn_once inside
> get_domain_id_from_scope ?
Moved this one too.
Thanks
-Tony
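
For reference, a minimal sketch of what the agreed-on change could look
like (hypothetical code, not posted in this thread). Note that moving the
message into the helper means it can no longer print r->name, since the
helper only receives the CPU and the scope:

static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
{
	switch (scope) {
	case RESCTRL_L2_CACHE:
	case RESCTRL_L3_CACHE:
		return get_cpu_cacheinfo_id(cpu, scope);
	default:
		break;
	}

	/* Warn here once instead of at every call site. */
	pr_warn_once("Can't find domain id for CPU:%d scope:%d\n", cpu, scope);
	return -EINVAL;
}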
Add RESCTRL_NODE to the enum, and to the helper function that looks
up a domain id from a scope.
There are a couple of places where the scope must be a cache scope.
Add some defensive WARN_ON checks to those.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
include/linux/resctrl.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 3 +++
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++++
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 +++
4 files changed, 11 insertions(+)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 2155dc15e636..e3cddf3f07f8 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -147,6 +147,7 @@ struct resctrl_schema;
enum resctrl_scope {
RESCTRL_L2_CACHE = 2,
RESCTRL_L3_CACHE = 3,
+ RESCTRL_NODE,
};
/**
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index 59e6aa7abef5..b741cbf61843 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -505,6 +505,9 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
case RESCTRL_L2_CACHE:
case RESCTRL_L3_CACHE:
return get_cpu_cacheinfo_id(cpu, scope);
+ case RESCTRL_NODE:
+ return cpu_to_node(cpu);
+
default:
break;
}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index 6a72fb627aa5..2bafc73b51e2 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
*/
static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
{
+ enum resctrl_scope scope = plr->s->res->scope;
struct cpu_cacheinfo *ci;
int ret;
int i;
+ if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
+ return -ENODEV;
+
/* Pick the first cpu we find that is associated with the cache. */
plr->cpu = cpumask_first(&plr->d->cpu_mask);
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index eff9d87547c9..770f2bf98462 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1413,6 +1413,9 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
unsigned int size = 0;
int num_b, i;
+ if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
+ return size;
+
num_b = bitmap_weight(&cbm, r->cache.cbm_len);
ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
for (i = 0; i < ci->num_leaves; i++) {
--
2.43.0
Hi Tony,
This patch probably needs to be merged with patch 7.
Thanks
On 1/30/24 16:20, Tony Luck wrote:
> Add RESCTRL_NODE to the enum, and to the helper function that looks
> up a domain id from a scope.
>
> There are a couple of places where the scope must be a cache scope.
> Add some defensive WARN_ON checks to those.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> include/linux/resctrl.h | 1 +
> arch/x86/kernel/cpu/resctrl/core.c | 3 +++
> arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++++
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 3 +++
> 4 files changed, 11 insertions(+)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 2155dc15e636..e3cddf3f07f8 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -147,6 +147,7 @@ struct resctrl_schema;
> enum resctrl_scope {
> RESCTRL_L2_CACHE = 2,
> RESCTRL_L3_CACHE = 3,
> + RESCTRL_NODE,
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index 59e6aa7abef5..b741cbf61843 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -505,6 +505,9 @@ static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
> case RESCTRL_L2_CACHE:
> case RESCTRL_L3_CACHE:
> return get_cpu_cacheinfo_id(cpu, scope);
> + case RESCTRL_NODE:
> + return cpu_to_node(cpu);
> +
> default:
> break;
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 6a72fb627aa5..2bafc73b51e2 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -292,10 +292,14 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
> */
> static int pseudo_lock_region_init(struct pseudo_lock_region *plr)
> {
> + enum resctrl_scope scope = plr->s->res->scope;
> struct cpu_cacheinfo *ci;
> int ret;
> int i;
>
> + if (WARN_ON_ONCE(scope != RESCTRL_L2_CACHE && scope != RESCTRL_L3_CACHE))
> + return -ENODEV;
> +
> /* Pick the first cpu we find that is associated with the cache. */
> plr->cpu = cpumask_first(&plr->d->cpu_mask);
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index eff9d87547c9..770f2bf98462 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1413,6 +1413,9 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> unsigned int size = 0;
> int num_b, i;
>
> + if (WARN_ON_ONCE(r->scope != RESCTRL_L2_CACHE && r->scope != RESCTRL_L3_CACHE))
> + return size;
> +
> num_b = bitmap_weight(&cbm, r->cache.cbm_len);
> ci = get_cpu_cacheinfo(cpumask_any(&d->cpu_mask));
> for (i = 0; i < ci->num_leaves; i++) {
--
Thanks
Babu Moger
On Fri, Feb 09, 2024 at 09:29:03AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> This patch probably needs to be merged with patch 7.

If it just added RESCTRL_NODE to the enum and the switch() I'd
agree (as patch 7 is where RESCTRL_NODE is first used). But this
patch also adds the sanity checks on scope where it has to be
a cache level.

Patch 7 is already on the big side (119 lines added to core.c).

If you really think this series needs to cut back the
number of patches, I could move the sanity check pieces
from here to patch 3 (where the enum is introduced) and
just the RESCTRL_NODE bits to patch 7.

-Tony
Hi Tony,

On 2/9/24 13:23, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:29:03AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> This patch probably needs to be merged with patch 7.
>
> If it just added RESCTRL_NODE to the enum and the switch() I'd
> agree (as patch 7 is where RESCTRL_NODE is first used). But this
> patch also adds the sanity checks on scope where it has to be
> a cache level.
>
> Patch 7 is already on the big side (119 lines added to core.c).
>
> If you really think this series needs to cut back the
> number of patches, I could move the sanity check pieces
> from here to patch 3 (where the enum is introduced) and
> just the RESCTRL_NODE bits to patch 7.

No. You don't have to cut back on the number of patches. I think it
is easier to review if these changes are merged with patch 7.

--
Thanks
Babu Moger
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.
This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This latency benefit may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.
Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.
The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower-numbered half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.
The other mode divides the RMIDs and renumbers the ones on the second
SNC node to start from zero.
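
As a concrete illustration of this renumbering (a sketch with assumed
numbers and illustrative names, not code from this series): with two SNC
nodes sharing an L3 cache that implements 256 physical RMIDs, each node
sees logical RMIDs 0-127 and the counter hardware applies a per-node
offset:

static unsigned int snc_physical_rmid(unsigned int logical_rmid,
				      unsigned int snc_node,
				      unsigned int rmids_per_node)
{
	/* Each SNC node owns a contiguous block of physical counters. */
	return logical_rmid + snc_node * rmids_per_node;
}

So with rmids_per_node = 128, logical RMID 5 counts into physical counter 5
on node 0 but counter 133 on node 1, which is the adjustment that
__rmid_read() makes in the diff below.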
Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.
Add a global integer "snc_nodes_per_l3_cache" that shows how many
SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
SNC mode is either not implemented, or not enabled.
Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) The RMID renumbering operates when using the value from the
IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
counter, adjust from the logical RMID to the physical
RMID value for the SNC node that it wishes to read and load the
adjusted value into the IA32_QM_EVTSEL MSR.
4) Divide the L3 cache between the SNC nodes. Divide the value
reported in the resctrl "size" file by the number of SNC
nodes because the effective amount of cache that can be allocated
is reduced by that factor.
5) Disable the "-o mba_MBps" mount option in SNC mode
because the monitoring is being done per SNC node, while the
bandwidth allocation is still done at the L3 cache scope.
Trying to use this feedback loop might result in contradictory
changes to the throttling level coming from each of the SNC
node bandwidth measurements.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
arch/x86/kernel/cpu/resctrl/monitor.c | 16 +++++++++++++---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++--
4 files changed, 24 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index c6051bc70e96..d9c6dcf30922 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -428,6 +428,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
extern struct dentry *debugfs_resctrl;
+extern unsigned int snc_nodes_per_l3_cache;
+
enum resctrl_res_level {
RDT_RESOURCE_L3_MON,
RDT_RESOURCE_L3,
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index b741cbf61843..dc886d2c9a33 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -48,6 +48,12 @@ int max_name_width, max_data_width;
*/
bool rdt_alloc_capable;
+/*
+ * Number of SNC nodes that share each L3 cache. Default is 1 for
+ * systems that do not support SNC, or have SNC disabled.
+ */
+unsigned int snc_nodes_per_l3_cache = 1;
+
static void
mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 080cad0d7288..357919bbadbe 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ int cpu = smp_processor_id();
+ int rmid_offset = 0;
u64 msr_val;
+ /*
+ * When SNC mode is on, need to compute the offset to read the
+ * physical RMID counter for the node to which this CPU belongs.
+ */
+ if (snc_nodes_per_l3_cache > 1)
+ rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
+
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
@@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
- wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
rdmsrl(MSR_IA32_QM_CTR, msr_val);
if (msr_val & RMID_VAL_ERROR)
@@ -757,8 +767,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
int ret;
resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
- hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
- r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 770f2bf98462..e639069f871a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
}
}
- return size;
+ return size / snc_nodes_per_l3_cache;
}
/*
@@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
return (is_mbm_local_enabled() &&
- r->alloc_capable && is_mba_linear());
+ r->alloc_capable && is_mba_linear() &&
+ snc_nodes_per_l3_cache == 1);
}
/*
--
2.43.0
Hi Tony,
On 1/30/24 16:20, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
>
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This latency benefit may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
>
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
>
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower-numbered half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
>
> The other mode divides the RMIDs and renumbers the ones on the second
> SNC node to start from zero.
>
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
>
> Add a global integer "snc_nodes_per_l3_cache" that shows how many
> SNC nodes share each L3 cache. When "snc_nodes_per_l3_cache" is "1",
> SNC mode is either not implemented, or not enabled.
>
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
> number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
> nodes.
> 3) The RMID renumbering operates when using the value from the
> IA32_PQR_ASSOC MSR to count accesses by a task. When reading an RMID
> counter, adjust from the logical RMID to the physical
> RMID value for the SNC node that it wishes to read and load the
> adjusted value into the IA32_QM_EVTSEL MSR.
> 4) Divide the L3 cache between the SNC nodes. Divide the value
> reported in the resctrl "size" file by the number of SNC
> nodes because the effective amount of cache that can be allocated
> is reduced by that factor.
> 5) Disable the "-o mba_MBps" mount option in SNC mode
> because the monitoring is being done per SNC node, while the
> bandwidth allocation is still done at the L3 cache scope.
> Trying to use this feedback loop might result in contradictory
> changes to the throttling level coming from each of the SNC
> node bandwidth measurements.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/internal.h | 2 ++
> arch/x86/kernel/cpu/resctrl/core.c | 6 ++++++
> arch/x86/kernel/cpu/resctrl/monitor.c | 16 +++++++++++++---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 5 +++--
> 4 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index c6051bc70e96..d9c6dcf30922 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -428,6 +428,8 @@ DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
> extern struct dentry *debugfs_resctrl;
>
> +extern unsigned int snc_nodes_per_l3_cache;
I feel this can be part of rdt_resource instead of global.
> +
> enum resctrl_res_level {
> RDT_RESOURCE_L3_MON,
> RDT_RESOURCE_L3,
> diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> index b741cbf61843..dc886d2c9a33 100644
> --- a/arch/x86/kernel/cpu/resctrl/core.c
> +++ b/arch/x86/kernel/cpu/resctrl/core.c
> @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
> */
> bool rdt_alloc_capable;
>
> +/*
> + * Number of SNC nodes that share each L3 cache. Default is 1 for
> + * systems that do not support SNC, or have SNC disabled.
> + */
> +unsigned int snc_nodes_per_l3_cache = 1;
> +
> static void
> mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> struct rdt_resource *r);
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 080cad0d7288..357919bbadbe 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
>
> static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> {
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
RDT_RESOURCE_L3_MON?
> + int cpu = smp_processor_id();
> + int rmid_offset = 0;
> u64 msr_val;
>
> + /*
> + * When SNC mode is on, need to compute the offset to read the
> + * physical RMID counter for the node to which this CPU belongs.
> + */
> + if (snc_nodes_per_l3_cache > 1)
> + rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
Not sure if you have tested or not. r->num_rmid is initialized for the
resource RDT_RESOURCE_L3_MON. For other resource it is always 0.
> +
> /*
> * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
> * with a valid event code for supported resource type and the bits
> @@ -158,7 +168,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
> * are error bits.
> */
> - wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> + wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid + rmid_offset);
> rdmsrl(MSR_IA32_QM_CTR, msr_val);
>
> if (msr_val & RMID_VAL_ERROR)
> @@ -757,8 +767,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
> int ret;
>
> resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> - hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> - r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> + hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> + r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
> hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>
> if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 770f2bf98462..e639069f871a 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1425,7 +1425,7 @@ unsigned int rdtgroup_cbm_to_size(struct rdt_resource *r,
> }
> }
>
> - return size;
> + return size / snc_nodes_per_l3_cache;
> }
>
> /*
> @@ -2293,7 +2293,8 @@ static bool supports_mba_mbps(void)
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> return (is_mbm_local_enabled() &&
> - r->alloc_capable && is_mba_linear());
> + r->alloc_capable && is_mba_linear() &&
> + snc_nodes_per_l3_cache == 1);
> }
>
> /*
--
Thanks
Babu Moger
On Fri, Feb 09, 2024 at 09:29:16AM -0600, Moger, Babu wrote:
> >
> > +extern unsigned int snc_nodes_per_l3_cache;
>
> I feel this can be part of rdt_resource instead of global.
Mixed emotions about that. It would be another field that appears
in every instance of rdt_resource, but only used by the RDT_RESOURCE_L3_MON
copy.
>
> > +
> > enum resctrl_res_level {
> > RDT_RESOURCE_L3_MON,
> > RDT_RESOURCE_L3,
> > diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
> > index b741cbf61843..dc886d2c9a33 100644
> > --- a/arch/x86/kernel/cpu/resctrl/core.c
> > +++ b/arch/x86/kernel/cpu/resctrl/core.c
> > @@ -48,6 +48,12 @@ int max_name_width, max_data_width;
> > */
> > bool rdt_alloc_capable;
> >
> > +/*
> > + * Number of SNC nodes that share each L3 cache. Default is 1 for
> > + * systems that do not support SNC, or have SNC disabled.
> > + */
> > +unsigned int snc_nodes_per_l3_cache = 1;
> > +
> > static void
> > mba_wrmsr_intel(struct rdt_domain *d, struct msr_param *m,
> > struct rdt_resource *r);
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 080cad0d7288..357919bbadbe 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -148,8 +148,18 @@ static inline struct rmid_entry *__rmid_entry(u32 rmid)
> >
> > static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> > {
> > + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>
> RDT_RESOURCE_L3_MON?
Second good catch.
>
> > + int cpu = smp_processor_id();
> > + int rmid_offset = 0;
> > u64 msr_val;
> >
> > + /*
> > + * When SNC mode is on, need to compute the offset to read the
> > + * physical RMID counter for the node to which this CPU belongs.
> > + */
> > + if (snc_nodes_per_l3_cache > 1)
> > + rmid_offset = (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
>
> Not sure if you have tested or not. r->num_rmid is initialized for the
> resource RDT_RESOURCE_L3_MON. For other resource it is always 0.
I hadn't got time on the SNC machine to try this out. Thanks
for catching this one, I'd have been scratching my head for a
while to track the symptoms of this problem back to this mistake.
Thanks
-Tony
Hi Tony,

On 2/9/24 13:35, Tony Luck wrote:
> On Fri, Feb 09, 2024 at 09:29:16AM -0600, Moger, Babu wrote:
>>>
>>> +extern unsigned int snc_nodes_per_l3_cache;
>>
>> I feel this can be part of rdt_resource instead of global.
>
> Mixed emotions about that. It would be another field that appears
> in every instance of rdt_resource, but only used by the RDT_RESOURCE_L3_MON
> copy.

That was the comment I got earlier for my patches (search for "global"):

https://lore.kernel.org/lkml/47f870e0-ad1a-4a4d-9d9e-4c05bf431858@intel.com/

--
Thanks
Babu Moger
There isn't a simple hardware bit that indicates whether a CPU is
running in Sub NUMA Cluster (SNC) mode. Infer the state by comparing
the ratio of NUMA nodes to L3 cache instances.
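
A worked example of that inference (assumed topology, not taken from any
specific machine): a two-socket system with SNC2 enabled reports four
CPU-bearing NUMA nodes but only two distinct L3 cache ids, so the ratio is
4 / 2 = 2 SNC nodes per L3 cache, while a ratio of 1 means SNC is
unsupported or disabled:

static int snc_nodes_per_cache(int cpu_nodes, int num_l3_caches)
{
	/* Inconsistent topology: be conservative and assume no SNC. */
	if (num_l3_caches == 0 || cpu_nodes % num_l3_caches)
		return 1;
	return cpu_nodes / num_l3_caches;	/* 4 / 2 == 2 in the example */
}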
When SNC mode is detected, reconfigure the RMID counters by updating
the MSR_RMID_SNC_CONFIG MSR on each socket as CPUs are seen. Update
the scope of the RDT_RESOURCE_L3_MON resource to NODE.
Clearing bit zero of the MSR divides the RMIDs and renumbers the ones
on the second SNC node to start from zero.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/resctrl/core.c | 119 +++++++++++++++++++++++++++++
2 files changed, 120 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index f1bd7b91b3c6..f6ba7d0397b8 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1119,6 +1119,7 @@
#define MSR_IA32_QM_CTR 0xc8e
#define MSR_IA32_PQR_ASSOC 0xc8f
#define MSR_IA32_L3_CBM_BASE 0xc90
+#define MSR_RMID_SNC_CONFIG 0xca0
#define MSR_IA32_L2_CBM_BASE 0xd10
#define MSR_IA32_MBA_THRTL_BASE 0xd50
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
index dc886d2c9a33..84c36e10241f 100644
--- a/arch/x86/kernel/cpu/resctrl/core.c
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -16,11 +16,14 @@
#define pr_fmt(fmt) "resctrl: " fmt
+#include <linux/cpu.h>
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/cacheinfo.h>
#include <linux/cpuhotplug.h>
+#include <linux/mod_devicetable.h>
+#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/resctrl.h>
#include "internal.h"
@@ -651,11 +654,42 @@ static void clear_closid_rmid(int cpu)
wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
}
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in Sub NUMA Cluster mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * __rmid_read() for code that does this.
+ */
+static void snc_remap_rmids(int cpu)
+{
+ u64 val;
+
+ /* Only need to enable once per package. */
+ if (cpumask_first(topology_core_cpumask(cpu)) != cpu)
+ return;
+
+ rdmsrl(MSR_RMID_SNC_CONFIG, val);
+ val &= ~BIT_ULL(0);
+ wrmsrl(MSR_RMID_SNC_CONFIG, val);
+}
+
static int resctrl_online_cpu(unsigned int cpu)
{
struct rdt_resource *r;
mutex_lock(&rdtgroup_mutex);
+
+ if (snc_nodes_per_l3_cache > 1)
+ snc_remap_rmids(cpu);
+
for_each_capable_rdt_resource(r)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
@@ -910,11 +944,96 @@ static __init bool get_rdt_resources(void)
return (rdt_mon_capable || rdt_alloc_capable);
}
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, 0),
+ X86_MATCH_INTEL_FAM6_MODEL(GRANITERAPIDS_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * ratio of NUMA nodes to L3 cache instances.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+ unsigned long *node_caches;
+ int mem_only_nodes = 0;
+ int cpu, node, ret;
+ int num_l3_caches;
+ int cache_id;
+
+ if (!x86_match_cpu(snc_cpu_ids))
+ return 1;
+
+ node_caches = bitmap_zalloc(num_possible_cpus(), GFP_KERNEL);
+ if (!node_caches)
+ return 1;
+
+ cpus_read_lock();
+
+ if (num_online_cpus() != num_present_cpus())
+ pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+
+ for_each_node(node) {
+ cpu = cpumask_first(cpumask_of_node(node));
+ if (cpu < nr_cpu_ids) {
+ cache_id = get_cpu_cacheinfo_id(cpu, 3);
+ if (cache_id != -1)
+ set_bit(cache_id, node_caches);
+ } else {
+ mem_only_nodes++;
+ }
+ }
+ cpus_read_unlock();
+
+ num_l3_caches = bitmap_weight(node_caches, num_possible_cpus());
+ kfree(node_caches);
+
+ if (!num_l3_caches)
+ goto insane;
+
+ /* sanity check #1: Number of CPU nodes must be multiple of num_l3_caches */
+ if ((nr_node_ids - mem_only_nodes) % num_l3_caches)
+ goto insane;
+
+ ret = (nr_node_ids - mem_only_nodes) / num_l3_caches;
+
+ /* sanity check #2: Only valid results are 1, 2, 3, 4 */
+ switch (ret) {
+ case 1:
+ break;
+ case 2:
+ case 3:
+ case 4:
+ rdt_resources_all[RDT_RESOURCE_L3_MON].r_resctrl.scope = RESCTRL_NODE;
+ pr_info("Sub-NUMA Cluster: %d nodes per L3 cache\n", ret);
+ break;
+ default:
+ goto insane;
+ }
+
+ return ret;
+insane:
+ pr_warn("SNC insanity: CPU nodes = %d num_l3_caches = %d\n",
+ (nr_node_ids - mem_only_nodes), num_l3_caches);
+ return 1;
+}
+
static __init void rdt_init_res_defs_intel(void)
{
struct rdt_hw_resource *hw_res;
struct rdt_resource *r;
+ snc_nodes_per_l3_cache = snc_get_config();
+
for_each_rdt_resource(r) {
hw_res = resctrl_to_arch_res(r);
--
2.43.0
With Sub-NUMA Cluster mode enabled the scope of monitoring resources is
per-NODE instead of per-L3 cache. Suffixes of directories with "L3" in
their name refer to Sub-NUMA nodes instead of L3 cache ids.
Users should be aware that SNC mode also affects the amount of L3 cache
available for allocation within each SNC node.
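
As a hedged userspace sketch of the "read all SNC nodes" guidance (this
assumes the usual /sys/fs/resctrl mount point and the mon_L3_* directory
naming described in this patch; error handling is abbreviated):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Sum "mbm_total_bytes" across every mon_L3_* domain directory. */
static unsigned long long total_mbm_bytes(const char *mon_data_dir)
{
	unsigned long long sum = 0, val;
	struct dirent *de;
	char path[512];
	FILE *f;
	DIR *d = opendir(mon_data_dir);

	if (!d)
		return 0;
	while ((de = readdir(d)) != NULL) {
		if (strncmp(de->d_name, "mon_L3_", 7) != 0)
			continue;
		snprintf(path, sizeof(path), "%s/%s/mbm_total_bytes",
			 mon_data_dir, de->d_name);
		f = fopen(path, "r");
		if (f) {
			if (fscanf(f, "%llu", &val) == 1)
				sum += val;
			fclose(f);
		}
	}
	closedir(d);
	return sum;
}

Calling total_mbm_bytes() on a group's "mon_data" directory (e.g.
"/sys/fs/resctrl/mon_data" for the default group) combines the per-SNC-node
counts into the whole-socket figure described above.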
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Documentation/arch/x86/resctrl.rst | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..15f1cff6ee76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -366,10 +366,10 @@ When control is enabled all CTRL_MON groups will also contain:
When monitoring is enabled all MON groups will also contain:
"mon_data":
- This contains a set of files organized by L3 domain and by
- RDT event. E.g. on a system with two L3 domains there will
- be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
- directories have one file per event (e.g. "llc_occupancy",
+ This contains a set of files organized by L3 domain or by NUMA
+ node (depending on whether Sub-NUMA Cluster (SNC) mode is disabled
+ or enabled respectively) and by RDT event. Each of these
+ directories has one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
all tasks in the group. In CTRL_MON groups these files provide
@@ -478,6 +478,23 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes. Users who do not bind tasks to the CPUs of a
+specific Sub-NUMA node must read the "llc_occupancy", "mbm_total_bytes",
+and "mbm_local_bytes" for all Sub-NUMA nodes where the tasks may execute
+to get the full view of traffic for which the tasks were the source.
+
+The cache allocation feature still provides the same number of
+bits in a mask to control allocation into the L3 cache, but each
+of those ways has its capacity reduced because the cache is divided
+between the SNC nodes. The values reported in the resctrl
+"size" files are adjusted accordingly.
+
Memory bandwidth Allocation and monitoring
==========================================
--
2.43.0