[PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems

Tony Luck posted 9 patches 1 year, 11 months ago
There is a newer version of this series
Documentation/arch/x86/resctrl.rst        |  44 ++-
include/linux/resctrl.h                   |  85 +++--
arch/x86/include/asm/msr-index.h          |   1 +
arch/x86/kernel/cpu/resctrl/internal.h    |  67 ++--
arch/x86/kernel/cpu/resctrl/core.c        | 428 ++++++++++++++++++----
arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  56 +--
arch/x86/kernel/cpu/resctrl/monitor.c     |  70 ++--
arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 213 +++++++----
9 files changed, 718 insertions(+), 272 deletions(-)
[PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Tony Luck 1 year, 11 months ago
The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
that share an L3 cache into two or more sets. This plays havoc with the
Resource Director Technology (RDT) monitoring features.  Prior to this
patch Intel has advised that SNC and RDT are incompatible.

Some of these CPU support an MSR that can partition the RMID counters in
the same way. This allows monitoring features to be used. With the caveat
that users must be aware that Linux may migrate tasks more frequently
between SNC nodes than between "regular" NUMA nodes, so reading counters
from all SNC nodes may be needed to get a complete picture of activity
for tasks.

Cache and memory bandwidth allocation features continue to operate at
the scope of the L3 cache.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
Changes since v15: Link: https://lore.kernel.org/all/20240228112935.8087-tony.luck@intel.com/

0) Note that v14 Reviewed/Testing tags have been removed because of the
   extent of refactoring to catch up with upstream. But nothing
   fundamental changed, so everything should look familiar.

1) Refactor to apply on top of Link: https://lore.kernel.org/all/20240308213846.77075-1-tony.luck@intel.com/
   [So base commit is either tip x86/cache, or upstream current merge PLUS
    the two patches in that series]

2) Add patch 9 which adds files showing mappings from domains to CPUs
   Reinette suggested this, James thinks it duplicates information
   that can be gathered from /sys/devices/system/
   Discussion here: Link: https://lore.kernel.org/all/ZetcM9GO2PH6SC0j@agluck-desk3/
   This part is a nice-to-have. I'm fine if just the first eight patches
   are applied without this while the discussion continues.

Tony Luck (9):
  x86/resctrl: Prepare for new domain scope
  x86/resctrl: Prepare to split rdt_domain structure
  x86/resctrl: Prepare for different scope for control/monitor
    operations
  x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
  x86/resctrl: Add node-scope to the options for feature scope
  x86/resctrl: Introduce snc_nodes_per_l3_cache
  x86/resctrl: Sub NUMA Cluster detection and enable
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  x86/resctrl: Add info files to show mappings from domains to lists of
    cpus

 Documentation/arch/x86/resctrl.rst        |  44 ++-
 include/linux/resctrl.h                   |  85 +++--
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  67 ++--
 arch/x86/kernel/cpu/resctrl/core.c        | 428 ++++++++++++++++++----
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  56 +--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  70 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |  26 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 213 +++++++----
 9 files changed, 718 insertions(+), 272 deletions(-)

-- 
2.44.0
Re: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Reinette Chatre 1 year, 10 months ago
(+Maciej)

Hi Tony,

(Please add x86/resctrl to Subject prefix of cover letter)

On 3/12/2024 2:42 PM, Tony Luck wrote:
> The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
> that share an L3 cache into two or more sets. This plays havoc with the
> Resource Director Technology (RDT) monitoring features.  Prior to this
> patch Intel has advised that SNC and RDT are incompatible.
> 
> Some of these CPU support an MSR that can partition the RMID counters in
> the same way. This allows monitoring features to be used. With the caveat
> that users must be aware that Linux may migrate tasks more frequently
> between SNC nodes than between "regular" NUMA nodes, so reading counters
> from all SNC nodes may be needed to get a complete picture of activity
> for tasks.
> 
> Cache and memory bandwidth allocation features continue to operate at
> the scope of the L3 cache.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> 
> ---
> Changes since v15: Link: https://lore.kernel.org/all/20240228112935.8087-tony.luck@intel.com/
> 
> 0) Note that v14 Reviewed/Testing tags have been removed because of the
>    extent of refactoring to catch up with upstream. But nothing
>    fundamental changed, so everything should look familiar.
> 
> 1) Refactor to apply on top of Link: https://lore.kernel.org/all/20240308213846.77075-1-tony.luck@intel.com/
>    [So base commit is either tip x86/cache, or upstream current merge PLUS
>     the two patches in that series]
> 
> 2) Add patch 9 which adds files showing mappings from domains to CPUs
>    Reinette suggested this, James thinks it duplicates information
>    that can be gathered from /sys/devices/system/
>    Discussion here: Link: https://lore.kernel.org/all/ZetcM9GO2PH6SC0j@agluck-desk3/
>    This part is a nice-to-have. I'm fine if just the first eight patches
>    are applied without this while the discussion continues.

I agree to drop patch #9. 

The core support for SNC continue to look good to me (I just had a few nitpicks).

What remains is the user interface that continues to gather opinions [3]. These new
discussions were prompted by user space needing a way to determine if resctrl supports
SNC. This started by using the "size" file but thinking about it more user space could
also look at whether the number of L3 control domains are different from the number
of L3 monitoring domains? I am adding Maciej for his opinion (please also include him
in future versions of this series). 

Apart from the user space requirement to know if SNC is supported by resctrl there
is also the interface with which user space obtains the monitoring data.
James highlighted [1] that the interface used in this series uses existing files to
represent different content, and can thus be considered as "broken". It is not obvious
to me how to "fix" this. Should we continue to explore interfaces like [2] that
attempts to add SNC support into resctrl or should the message continue to be
that SNC "plays havoc with the RDT monitoring features" and users wanting to use
SNC and RDT at the same time are expected to adapt to the peculiar interface ...
or is the preference that after this series "SNC and RDT are compatible" and
thus presented with an intuitive interface?

Reinette

[1] https://lore.kernel.org/lkml/88430722-67b3-4f7d-8db2-95ee52b6f0b0@arm.com/
[2] https://lore.kernel.org/lkml/SJ1PR11MB608309F47C00F964E16205D6FC2D2@SJ1PR11MB6083.namprd11.prod.outlook.com/
[3] https://lore.kernel.org/lkml/SJ1PR11MB608310C72D7189C139EA6302FC212@SJ1PR11MB6083.namprd11.prod.outlook.com/
RE: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Luck, Tony 1 year, 10 months ago
> I agree to drop patch #9.
>
> The core support for SNC continue to look good to me (I just had a few nitpicks).
>
> What remains is the user interface that continues to gather opinions [3]. These new
> discussions were prompted by user space needing a way to determine if resctrl supports
> SNC. This started by using the "size" file but thinking about it more user space could
> also look at whether the number of L3 control domains are different from the number
> of L3 monitoring domains? I am adding Maciej for his opinion (please also include him
> in future versions of this series).

Implementing the revised user interface will make significant changes to the
patches needed to support SNC. Working on them now.

Minor stuff is the that "size" file won't need to change (because summing across
all SNC domains mean that all of the cache is counted in mon_data/mon_L3_xx/llc_occupancy.
The new mon_data/mon_L3_xx/mon_NODE_xxllc_occupancy files are the ones limited
to 1/snc_ways of L3 capacity.

Major stuff is that we now need both the L3-scoped domain list as well as the NODE
scoped domain list. So no longer just changing the "mon_scope" field in the L3
rdt_resource from one value to another.

I'll maybe have a draft set of patches in a day or two.

> Apart from the user space requirement to know if SNC is supported by resctrl there
> is also the interface with which user space obtains the monitoring data.
> James highlighted [1] that the interface used in this series uses existing files to
> represent different content, and can thus be considered as "broken". It is not obvious
> to me how to "fix" this. Should we continue to explore interfaces like [2] that
> attempts to add SNC support into resctrl or should the message continue to be
> that SNC "plays havoc with the RDT monitoring features" and users wanting to use
> SNC and RDT at the same time are expected to adapt to the peculiar interface ...
> or is the preference that after this series "SNC and RDT are compatible" and
> thus presented with an intuitive interface?
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/88430722-67b3-4f7d-8db2-95ee52b6f0b0@arm.com/
> [2] https://lore.kernel.org/lkml/SJ1PR11MB608309F47C00F964E16205D6FC2D2@SJ1PR11MB6083.namprd11.prod.outlook.com/
> [3] https://lore.kernel.org/lkml/SJ1PR11MB608310C72D7189C139EA6302FC212@SJ1PR11MB6083.namprd11.prod.outlook.com/
Re: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Reinette Chatre 1 year, 10 months ago
Hi Tony,

On 3/19/2024 1:31 PM, Luck, Tony wrote:
>> I agree to drop patch #9.
>>
>> The core support for SNC continue to look good to me (I just had a few nitpicks).
>>
>> What remains is the user interface that continues to gather opinions [3]. These new
>> discussions were prompted by user space needing a way to determine if resctrl supports
>> SNC. This started by using the "size" file but thinking about it more user space could
>> also look at whether the number of L3 control domains are different from the number
>> of L3 monitoring domains? I am adding Maciej for his opinion (please also include him
>> in future versions of this series).
> 
> Implementing the revised user interface will make significant changes to the
> patches needed to support SNC. Working on them now.
> 
> Minor stuff is the that "size" file won't need to change (because summing across
> all SNC domains mean that all of the cache is counted in mon_data/mon_L3_xx/llc_occupancy.
> The new mon_data/mon_L3_xx/mon_NODE_xxllc_occupancy files are the ones limited
> to 1/snc_ways of L3 capacity.

If the "size" file is impacted by this change then I think there is something
wrong with the current series because "size" is intended to mirror the schemata
file that represents the size in bytes that correspond to the bits from schemata -
this monitoring change does not change this relationship.

> 
> Major stuff is that we now need both the L3-scoped domain list as well as the NODE
> scoped domain list. So no longer just changing the "mon_scope" field in the L3
> rdt_resource from one value to another.

Here may be opportunity to isolate the SNC layout from the customs. For example, the
extra layers of files only need to be added when knowing SNC is active and the
cache<->node relationships can be determined dynamically on file create and
read instead of creating new lists and data structures to try and describe it
as part of the architecture.

> 
> I'll maybe have a draft set of patches in a day or two.

I was actually hoping to get a discussion started about what folks really want
to see instead of you implementing yet another new feature that folks may
disagree with.

Reinette
Re: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Maciej Wieczor-Retman 1 year, 10 months ago
Hi Reinette,

On 2024-03-19 at 10:51:14 -0700, Reinette Chatre wrote:
>What remains is the user interface that continues to gather opinions [3]. These new
>discussions were prompted by user space needing a way to determine if resctrl supports
>SNC. This started by using the "size" file but thinking about it more user space could
>also look at whether the number of L3 control domains are different from the number
>of L3 monitoring domains? I am adding Maciej for his opinion (please also include him
>in future versions of this series). 

By this do you mean comparing the contents of main "schemata" file with the
number of mon_L3_* files?

>Apart from the user space requirement to know if SNC is supported by resctrl there
>is also the interface with which user space obtains the monitoring data.
>James highlighted [1] that the interface used in this series uses existing files to
>represent different content, and can thus be considered as "broken". It is not obvious
>to me how to "fix" this. Should we continue to explore interfaces like [2] that
>attempts to add SNC support into resctrl or should the message continue to be
>that SNC "plays havoc with the RDT monitoring features" and users wanting to use
>SNC and RDT at the same time are expected to adapt to the peculiar interface ...
>or is the preference that after this series "SNC and RDT are compatible" and
>thus presented with an intuitive interface?

I kind of liked this idea [1]. Hiding SNC related information behind some not
obvious text parsing and size comparisons might eliminate any ease of use for
userspace applications. But I agree with you [2] that it's hard to predict the
future for this interface and any potential problems with setting up this
file structure.

[1] https://lore.kernel.org/all/SJ1PR11MB608309F47C00F964E16205D6FC2D2@SJ1PR11MB6083.namprd11.prod.outlook.com/
[2] https://lore.kernel.org/all/7f15a700-f23a-48f9-b335-13ea1735ad84@intel.com/

-- 
Kind regards
Maciej Wieczór-Retman
Re: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Reinette Chatre 1 year, 10 months ago
Hi Maciej,

On 3/20/2024 8:21 AM, Maciej Wieczor-Retman wrote:
> Hi Reinette,
> 
> On 2024-03-19 at 10:51:14 -0700, Reinette Chatre wrote:
>> What remains is the user interface that continues to gather opinions [3]. These new
>> discussions were prompted by user space needing a way to determine if resctrl supports
>> SNC. This started by using the "size" file but thinking about it more user space could
>> also look at whether the number of L3 control domains are different from the number
>> of L3 monitoring domains? I am adding Maciej for his opinion (please also include him
>> in future versions of this series). 
> 
> By this do you mean comparing the contents of main "schemata" file with the
> number of mon_L3_* files?

(assuming you mean mon_L3_* directories)

Yes, the "schemata" file can be used. There is also the "bit_usage" file in
the info directory that indicates how many control domains there are.
Do you think doing so also falls into the "not obvious text parsing and size
comparison" category?

> 
>> Apart from the user space requirement to know if SNC is supported by resctrl there
>> is also the interface with which user space obtains the monitoring data.
>> James highlighted [1] that the interface used in this series uses existing files to
>> represent different content, and can thus be considered as "broken". It is not obvious
>> to me how to "fix" this. Should we continue to explore interfaces like [2] that
>> attempts to add SNC support into resctrl or should the message continue to be
>> that SNC "plays havoc with the RDT monitoring features" and users wanting to use
>> SNC and RDT at the same time are expected to adapt to the peculiar interface ...
>> or is the preference that after this series "SNC and RDT are compatible" and
>> thus presented with an intuitive interface?
> 
> I kind of liked this idea [1]. Hiding SNC related information behind some not
> obvious text parsing and size comparisons might eliminate any ease of use for
> userspace applications. But I agree with you [2] that it's hard to predict the
> future for this interface and any potential problems with setting up this
> file structure.

Thanks to you for trying this out from user space side and highlighting the
difficulty trying to do so.

Reinette

> 
> [1] https://lore.kernel.org/all/SJ1PR11MB608309F47C00F964E16205D6FC2D2@SJ1PR11MB6083.namprd11.prod.outlook.com/
> [2] https://lore.kernel.org/all/7f15a700-f23a-48f9-b335-13ea1735ad84@intel.com/
>
Re: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Maciej Wieczor-Retman 1 year, 10 months ago
On 2024-03-20 at 08:50:51 -0700, Reinette Chatre wrote:
>Hi Maciej,
>
>On 3/20/2024 8:21 AM, Maciej Wieczor-Retman wrote:
>> Hi Reinette,
>> 
>> On 2024-03-19 at 10:51:14 -0700, Reinette Chatre wrote:
>>> What remains is the user interface that continues to gather opinions [3]. These new
>>> discussions were prompted by user space needing a way to determine if resctrl supports
>>> SNC. This started by using the "size" file but thinking about it more user space could
>>> also look at whether the number of L3 control domains are different from the number
>>> of L3 monitoring domains? I am adding Maciej for his opinion (please also include him
>>> in future versions of this series). 
>> 
>> By this do you mean comparing the contents of main "schemata" file with the
>> number of mon_L3_* files?
>
>(assuming you mean mon_L3_* directories)
>
>Yes, the "schemata" file can be used. There is also the "bit_usage" file in
>the info directory that indicates how many control domains there are.

I just did a test on my IceLake server that has SNC enabled and you're right.
This can indicate kernel support for SNC. It also seems more reliable in
determining the ratio of nodes per socket (since the ratio of cpus per
cache/node can potentially fail with offline cpus).

>Do you think doing so also falls into the "not obvious text parsing and size
>comparison" category?

Ideally [1] seems just the most user friendly in my opinion. Comparing schemata
with mon_L3_* directories feels like a good next solution though.

>>> Apart from the user space requirement to know if SNC is supported by resctrl there
>>> is also the interface with which user space obtains the monitoring data.
>>> James highlighted [1] that the interface used in this series uses existing files to
>>> represent different content, and can thus be considered as "broken". It is not obvious
>>> to me how to "fix" this. Should we continue to explore interfaces like [2] that
>>> attempts to add SNC support into resctrl or should the message continue to be
>>> that SNC "plays havoc with the RDT monitoring features" and users wanting to use
>>> SNC and RDT at the same time are expected to adapt to the peculiar interface ...
>>> or is the preference that after this series "SNC and RDT are compatible" and
>>> thus presented with an intuitive interface?
>> 
>> I kind of liked this idea [1]. Hiding SNC related information behind some not
>> obvious text parsing and size comparisons might eliminate any ease of use for
>> userspace applications. But I agree with you [2] that it's hard to predict the
>> future for this interface and any potential problems with setting up this
>> file structure.
>
>Thanks to you for trying this out from user space side and highlighting the
>difficulty trying to do so.

I suppose it wasn't that difficult in execution but it required a lot of
thinking about what is a reliable way for checking kernel support, what can fail
in checking the ratio of nodes to sockets etc. Maybe once we have this figured
out all it will take is a good documentation and userspace applications won't
have a hard time with SNC.

>
>Reinette
>

[1] https://lore.kernel.org/all/SJ1PR11MB608309F47C00F964E16205D6FC2D2@SJ1PR11MB6083.namprd11.prod.outlook.com/
[2] https://lore.kernel.org/all/7f15a700-f23a-48f9-b335-13ea1735ad84@intel.com/

-- 
Kind regards
Maciej Wieczór-Retman
Re: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Maciej Wieczor-Retman 1 year, 10 months ago
Hi Tony,

On 2024-03-12 at 14:42:38 -0700, Tony Luck wrote:
>The Sub-NUMA cluster feature on some Intel processors partitions the CPUs
>that share an L3 cache into two or more sets. This plays havoc with the
>Resource Director Technology (RDT) monitoring features.  Prior to this
>patch Intel has advised that SNC and RDT are incompatible.
>
>Some of these CPU support an MSR that can partition the RMID counters in
>the same way. This allows monitoring features to be used. With the caveat
>that users must be aware that Linux may migrate tasks more frequently
>between SNC nodes than between "regular" NUMA nodes, so reading counters
>from all SNC nodes may be needed to get a complete picture of activity
>for tasks.
>
>Cache and memory bandwidth allocation features continue to operate at
>the scope of the L3 cache.
>
>Signed-off-by: Tony Luck <tony.luck@intel.com>
>
>---
>Changes since v15: Link: https://lore.kernel.org/all/20240228112935.8087-tony.luck@intel.com/
>
>0) Note that v14 Reviewed/Testing tags have been removed because of the
>   extent of refactoring to catch up with upstream. But nothing
>   fundamental changed, so everything should look familiar.
>
>1) Refactor to apply on top of Link: https://lore.kernel.org/all/20240308213846.77075-1-tony.luck@intel.com/
>   [So base commit is either tip x86/cache, or upstream current merge PLUS
>    the two patches in that series]

I'm having problems with cleanly applying this series. Here are the steps I
took:

- I pulled this series with b4.
- I pulled two patches from the "Pass domain to target CPU" series with b4.
- Then I hard reseted my branch to tip x86/cache
- Tried applying first the two patches and then this series.

And here I'm getting conflicts on patch 0002 "Prepare to split rdt_domain
structure". I also tried with tip master and applying James Morse's series
before your patches but I got the same problem. Am I doing something wrong?

I wanted to find what causes the conflict so I can give more details but three
way merging doesn't seem to work and git am doesn't leave any conflict markers,
just fails.

-- 
Kind regards
Maciej Wieczór-Retman
RE: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Luck, Tony 1 year, 10 months ago
> I wanted to find what causes the conflict so I can give more details but three
> way merging doesn't seem to work and git am doesn't leave any conflict markers,
> just fails.

Maybe there's something else touching resctrl in either TIP or now in upstream?

In my local tree I built on upstream after Linus pulled TIP x86/cache (and a bunch
or other stuff). So my base commit is:

b29f377119f6 Merge tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/ti

Then the two patch cleanup:

b0eebe2ee45d x86/resctrl: Simplify call convention for MSR update functions
edf14f3a6093 x86/resctrl: Pass domain to target CPU

Then this v16 SNC series.

Reinette pointed out a couple of things to fix in v16, I'll handle those and post
v17 next week (based on Linus v6.9-rc1). I'll fix up any additional merge issues
while rebasing.

-Tony
Re: [PATCH v16 0/9] Add support for Sub-NUMA cluster (SNC) systems
Posted by Maciej Wieczor-Retman 1 year, 10 months ago
On 2024-03-21 at 16:51:51 +0000, Luck, Tony wrote:
>> I wanted to find what causes the conflict so I can give more details but three
>> way merging doesn't seem to work and git am doesn't leave any conflict markers,
>> just fails.
>
>Maybe there's something else touching resctrl in either TIP or now in upstream?
>
>In my local tree I built on upstream after Linus pulled TIP x86/cache (and a bunch
>or other stuff). So my base commit is:
>
>b29f377119f6 Merge tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/ti

Thanks, I'll try using this one.

>
>Then the two patch cleanup:
>
>b0eebe2ee45d x86/resctrl: Simplify call convention for MSR update functions
>edf14f3a6093 x86/resctrl: Pass domain to target CPU
>
>Then this v16 SNC series.
>
>Reinette pointed out a couple of things to fix in v16, I'll handle those and post
>v17 next week (based on Linus v6.9-rc1). I'll fix up any additional merge issues
>while rebasing.
>
>-Tony

-- 
Kind regards
Maciej Wieczór-Retman