[PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems

Tony Luck posted 7 patches 2 years, 7 months ago
There is a newer version of this series
[PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
Posted by Tony Luck 2 years, 7 months ago
There isn't a simple hardware enumeration to indicate to software that
a system is running with Sub-NUMA Clustering enabled.

Compare the number of NUMA nodes with the number of L3 caches to calculate
the number of Sub-NUMA nodes per L3 cache.
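
As a rough illustration of the calculation described above (a sketch only, not the code from this series; the helper name and the fallback to 1 are assumptions made for the example):

```c
#include <assert.h>

/*
 * Hypothetical helper: estimate how many Sub-NUMA nodes share each
 * L3 cache by dividing the number of NUMA nodes that have CPUs by
 * the number of L3 caches.
 *
 * Assumption for this sketch: if the counts do not divide evenly,
 * return 1 and treat the system as if SNC were disabled.
 */
static int snc_nodes_per_l3(int num_cpu_nodes, int num_l3_caches)
{
	assert(num_cpu_nodes > 0 && num_l3_caches > 0);

	if (num_cpu_nodes % num_l3_caches != 0)
		return 1;

	return num_cpu_nodes / num_l3_caches;
}
```

For example, a two-socket system that reports four NUMA nodes and two L3 caches would give 2 here, while the same system with SNC disabled (two nodes, two caches) would give 1.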

When Sub-NUMA clustering mode is enabled in BIOS setup, the RMID counters
are distributed equally between the SNC nodes within each socket.

E.g. if there are 400 RMID counters, and the system is configured with
two SNC nodes per socket, then RMID counters 0..199 are used on SNC node
0 on the socket, and RMID counters 200..399 on SNC node 1.
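
The split in the example above can be written out as a small calculation. This is only a sketch of the arithmetic; the struct and function names are hypothetical and do not come from the patches:

```c
#include <assert.h>

/* Inclusive range of hardware RMID counters used by one SNC node. */
struct rmid_range {
	unsigned int first;
	unsigned int last;
};

/*
 * Hypothetical helper: with num_rmids counters shared equally by
 * snc_nodes SNC nodes on a socket, return the counters that belong
 * to the node with index node_idx (0-based within the socket).
 */
static struct rmid_range snc_rmid_range(unsigned int num_rmids,
					unsigned int snc_nodes,
					unsigned int node_idx)
{
	struct rmid_range r;
	unsigned int per_node;

	assert(snc_nodes > 0 && node_idx < snc_nodes);
	per_node = num_rmids / snc_nodes;
	assert(per_node > 0);

	r.first = node_idx * per_node;
	r.last = r.first + per_node - 1;
	return r;
}
```

With 400 counters and two SNC nodes this yields 0..199 for node 0 and 200..399 for node 1, matching the example.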

A model-specific MSR (0xca0) can change the configuration of the RMIDs
when SNC mode is enabled.

The MSR controls the interpretation of the RMID field in the
IA32_PQR_ASSOC MSR so that the appropriate hardware counters
within the SNC node are updated. If reconfigured from default, RMIDs
are divided evenly across clusters.

Also initialize a per-cpu RMID offset value. Use this
to calculate the value to write to the IA32_QM_EVTSEL MSR when
reading RMID event values.
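
A minimal user-space sketch of that calculation, assuming the architectural IA32_QM_EVTSEL layout (event ID in bits 7:0, RMID starting at bit 32). The function name and the rmid_offset parameter are illustrative stand-ins for the per-cpu value, not the code in this series:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build the 64-bit value for IA32_QM_EVTSEL.
 * The per-CPU offset is added to the RMID so that the counter
 * belonging to the CPU's own SNC node is selected.
 */
static uint64_t qm_evtsel_value(uint32_t event_id, uint32_t rmid,
				uint32_t rmid_offset)
{
	assert(event_id <= 0xff);

	return ((uint64_t)(rmid + rmid_offset) << 32) | event_id;
}
```

Taking the earlier 400-counter example, a CPU on SNC node 1 would have an offset of 200, so RMID 5 with event ID 1 selects hardware counter 205.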

N.B. this works well for well-behaved NUMA applications that access
memory predominantly from the local memory node. For applications that
access memory across multiple nodes it may be necessary for the user
to read counters for all SNC nodes on a socket and add the values to
get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
all that different from applications that span across multiple sockets
in a legacy system.
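
For illustration, the user-side addition amounts to something like the following sketch. The function name is hypothetical, and the per-node values are assumed to have already been read for each SNC node on the socket:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: add up the values read for each SNC node on a
 * socket to get the socket-wide LLC occupancy or memory bandwidth.
 */
static uint64_t sum_snc_counters(const uint64_t *per_node, size_t snc_nodes)
{
	uint64_t total = 0;
	size_t i;

	assert(per_node != NULL);

	for (i = 0; i < snc_nodes; i++)
		total += per_node[i];

	return total;
}
```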

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v1:

Re-based to tip/master (on June 21, 2023)

Fenghua:
+ Better comment for l3_mon_evt_init()
+ Don't need .fflags = RFTYPE_RES_MB for node resource. Use .fflags = 0

James:
+ Add helper function to choose resource based on snc_ways
+ Drop the info/snc_ways file. No current justification for it.
+ Typos s/Suffices/Suffixes/, s/Sun-NUMA/Sub-NUMA/
+ Expand SNC acronym on first use in Documentation/arch/x86/resctrl.rst

Peter:
+ Add checks for cpu-less nodes.

Tony Luck (7):
  x86/resctrl: Refactor in preparation for node-scoped resources
  x86/resctrl: Remove hard code of RDT_RESOURCE_L3 in monitor.c
  x86/resctrl: Add a new node-scoped resource to rdt_resources_all[]
  x86/resctrl: Add code to setup monitoring at L3 or NODE scope.
  x86/resctrl: Add package scoped resource
  x86/resctrl: Update documentation with Sub-NUMA cluster changes
  x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

 Documentation/arch/x86/resctrl.rst        |  10 +-
 include/linux/resctrl.h                   |   5 +-
 arch/x86/include/asm/resctrl.h            |   2 +
 arch/x86/kernel/cpu/resctrl/internal.h    |  20 +++
 arch/x86/kernel/cpu/resctrl/core.c        | 154 ++++++++++++++++++++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  24 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   2 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |   4 +-
 8 files changed, 191 insertions(+), 30 deletions(-)


base-commit: 746d03317c1175666aad909ecc45384da42218aa
-- 
2.40.1
RE: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
Posted by Shaopeng Tan (Fujitsu) 2 years, 7 months ago
Hi Tony,

I ran the resctrl selftests in my environment, and
the CMT test failed when I enabled Sub-NUMA Clustering.

I don't know yet why it failed;
the test results are pasted below.

Processor in my environment:
Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz

$ sudo make -C tools/testing/selftests/resctrl run_tests
# # Starting CMT test ...
# # Mounting resctrl to "/sys/fs/resctrl"
# # Mounting resctrl to "/sys/fs/resctrl"
# # Cache size :25952256
# # Benchmark PID: 8638
# # Writing benchmark parameters to resctrl FS
# # Checking for pass/fail
# # Fail: Check cache miss rate within 15%
# # Percent diff=21
# # Number of bits: 5
# # Average LLC val: 9216000
# # Cache span (bytes): 11796480
# not ok 3 CMT: test

Best regards,
Shaopeng TAN
RE: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
Posted by Luck, Tony 2 years, 7 months ago
> I ran the resctrl selftests in my environment, and
> the CMT test failed when I enabled Sub-NUMA Clustering.
>
> I don't know yet why it failed;
> the test results are pasted below.
>
> Processor in my environment:
> Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
>
> $ sudo make -C tools/testing/selftests/resctrl run_tests
> # # Starting CMT test ...
> # # Mounting resctrl to "/sys/fs/resctrl"
> # # Mounting resctrl to "/sys/fs/resctrl"
> # # Cache size :25952256
> # # Benchmark PID: 8638
> # # Writing benchmark parameters to resctrl FS
> # # Checking for pass/fail
> # # Fail: Check cache miss rate within 15%
> # # Percent diff=21
> # # Number of bits: 5
> # # Average LLC val: 9216000
> # # Cache span (bytes): 11796480
> # not ok 3 CMT: test

This is expected. When SNC is enabled, CAT still supports the same number of
bits in the allocation cache mask. But each bit represents half as much cache.

Think of the cache as a 2-D matrix with the cache-ways (bits in the CAT mask)
as the columns, and the rows are the hashed index of the physical address.
When SNC is turned on, the hash function for physical addresses from one
of the SNC nodes will only pick half of those rows (and the other
SNC node gets the other half of the rows).
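
To put numbers on this, here is a sketch of the arithmetic using the values from the test log. The helper names are hypothetical, and the total of 11 mask bits is an assumption inferred from the log (25952256 * 5 / 11 = 11796480); this is not the selftest's actual code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of cache covered by mask_bits out of
 * total_bits in the CAT mask, when each bit only covers the rows
 * belonging to one of snc_nodes SNC nodes.
 */
static uint64_t cache_span(uint64_t cache_size, unsigned int mask_bits,
			   unsigned int total_bits, unsigned int snc_nodes)
{
	assert(total_bits > 0 && snc_nodes > 0 && mask_bits <= total_bits);

	return cache_size * mask_bits / total_bits / snc_nodes;
}

/* Difference between expected and measured values, as a whole percentage. */
static int percent_diff(uint64_t expected, uint64_t measured)
{
	uint64_t diff;

	assert(expected > 0);
	diff = expected > measured ? expected - measured : measured - expected;

	return (int)(diff * 100 / expected);
}
```

Without any SNC adjustment, cache_span(25952256, 5, 11, 1) gives the 11796480 bytes shown in the log, and comparing that with the measured 9216000 gives the reported 21 percent. With two SNC nodes the same five bits would cover 5898240 bytes.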

-Tony
Re: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
Posted by Reinette Chatre 2 years, 7 months ago
Hi Tony,

On 6/29/2023 9:05 AM, Luck, Tony wrote:
>> I ran the resctrl selftests in my environment, and
>> the CMT test failed when I enabled Sub-NUMA Clustering.
>>
>> I don't know yet why it failed;
>> the test results are pasted below.
>>
>> Processor in my environment:
>> Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
>>
>> $ sudo make -C tools/testing/selftests/resctrl run_tests
>> # # Starting CMT test ...
>> # # Mounting resctrl to "/sys/fs/resctrl"
>> # # Mounting resctrl to "/sys/fs/resctrl"
>> # # Cache size :25952256
>> # # Benchmark PID: 8638
>> # # Writing benchmark parameters to resctrl FS
>> # # Checking for pass/fail
>> # # Fail: Check cache miss rate within 15%
>> # # Percent diff=21
>> # # Number of bits: 5
>> # # Average LLC val: 9216000
>> # # Cache span (bytes): 11796480
>> # not ok 3 CMT: test
> 
> This is expected. When SNC is enabled, CAT still supports the same number of
> bits in the allocation cache mask. But each bit represents half as much cache.
> 
> Think of the cache as a 2-D matrix with the cache-ways (bits in the CAT mask)
> as the columns, and the rows are the hashed index of the physical address.
> When SNC is turned on, the hash function for physical addresses from one
> of the SNC nodes will only pick half of those rows (and the other
> SNC node gets the other half of the rows).

If a test is expected to fail in a particular scenario then I think
the test failure should be communicated as a "pass". If not, this will
reduce confidence in the accuracy of the tests. Even so, from the description
it sounds as though this test can be made more accurate to indeed pass
in the scenario when SNC is enabled?

Reinette
Re: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
Posted by Tony Luck 2 years, 7 months ago
On Tue, Jul 11, 2023 at 01:50:02PM -0700, Reinette Chatre wrote:
> Hi Tony,
> > This is expected. When SNC is enabled, CAT still supports the same number of
> > bits in the allocation cache mask. But each bit represents half as much cache.
> > 
> > Think of the cache as a 2-D matrix with the cache-ways (bits in the CAT mask)
> > as the columns, and the rows are the hashed index of the physical address.
> > When SNC is turned on, the hash function for physical addresses from one
> > of the SNC nodes will only pick half of those rows (and the other
> > SNC node gets the other half of the rows).
> 
> If a test is expected to fail in a particular scenario then I think
> the test failure should be communicated as a "pass". If not, this will
> reduce confidence in the accuracy of the tests. Even so, from the description
> it sounds as though this test can be made more accurate to indeed pass
> in the scenario when SNC is enabled?

Hi Reinette,

Yes. This could be done. The resctrl tests would need to determine
if SNC mode is enabled. But I think that is possible by comparing the
output of sysfs files. E.g. with SNC disabled, the list of CPUs in a node
and the list of CPUs sharing the L3 cache of a CPU on that node will match,
like this:

$ cat /sys/devices/system/node/node0/cpulist
0-35,72-107
$ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
0-35,72-107

but with SNC enabled, the CPUs sharing a cache will be divided across
two or four nodes.
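
A sketch of how a test might make that comparison, working on the contents of the two sysfs files as strings. The helper names are hypothetical, and the simple integer division assumes all CPUs are online:

```c
#include <assert.h>
#include <stdlib.h>

/* Count the CPUs in a sysfs-style list such as "0-35,72-107". */
static int count_cpus_in_list(const char *list)
{
	const char *p = list;
	int count = 0;

	while (*p) {
		char *end;
		long first, last;

		first = strtol(p, &end, 10);
		if (end == p)
			break;		/* no number here: stop parsing */
		last = first;
		if (*end == '-') {
			p = end + 1;
			last = strtol(p, &end, 10);
			if (end == p)
				break;	/* malformed range */
		}
		if (last >= first)
			count += (int)(last - first + 1);
		p = end;
		while (*p == ',' || *p == ' ' || *p == '\n')
			p++;
	}

	return count;
}

/*
 * Hypothetical helper: number of NUMA nodes sharing one L3 cache,
 * given a node's cpulist and the cache's shared_cpu_list.
 * Returns 0 if the node list contains no CPUs.
 */
static int snc_nodes_per_cache(const char *node_cpulist,
			       const char *cache_cpulist)
{
	int node_cpus, cache_cpus;

	assert(node_cpulist && cache_cpulist);
	node_cpus = count_cpus_in_list(node_cpulist);
	cache_cpus = count_cpus_in_list(cache_cpulist);

	return node_cpus ? cache_cpus / node_cpus : 0;
}
```

With the lists shown above this returns 1. If node0 instead listed only half of those CPUs (for example a made-up "0-17,72-89"), it would return 2.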

It looks like the existing tests may print a warning. I see
this code in:

tools/testing/selftests/resctrl/resctrl_tests.c

        res = cmt_resctrl_val(cpu_no, 5, benchmark_cmd);
        ksft_test_result(!res, "CMT: test\n");
        if ((get_vendor() == ARCH_INTEL) && res)
                ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");

but at first glance that warning doesn't appear to try and
check if SNC was the actual problem.

-Tony
Re: [PATCH v2 0/7] x86/resctrl: Add support for Sub-NUMA cluster (SNC) systems
Posted by Reinette Chatre 2 years, 7 months ago
Hi Tony,

On 7/11/2023 2:23 PM, Tony Luck wrote:
> On Tue, Jul 11, 2023 at 01:50:02PM -0700, Reinette Chatre wrote:
>> Hi Tony,
>>> This is expected. When SNC is enabled, CAT still supports the same number of
>>> bits in the allocation cache mask. But each bit represents half as much cache.
>>>
>>> Think of the cache as a 2-D matrix with the cache-ways (bits in the CAT mask)
>>> as the columns, and the rows are the hashed index of the physical address.
>>> When SNC is turned on, the hash function for physical addresses from one
>>> of the SNC nodes will only pick half of those rows (and the other
>>> SNC node gets the other half of the rows).
>>
>> If a test is expected to fail in a particular scenario then I think
>> the test failure should be communicated as a "pass". If not, this will
>> reduce confidence in the accuracy of the tests. Even so, from the description
>> it sounds as though this test can be made more accurate to indeed pass
>> in the scenario when SNC is enabled?
> 
> Hi Reinette,
> 
> Yes. This could be done. The resctrl tests would need to determine
> if SNC mode is enabled. But I think that is possible by comparing the
> output of sysfs files. E.g. with SNC disabled, the list of CPUs in a node
> and the list of CPUs sharing the L3 cache of a CPU on that node will match,
> like this:
> 
> $ cat /sys/devices/system/node/node0/cpulist
> 0-35,72-107
> $ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
> 0-35,72-107
> 
> but with SNC enabled, the CPUs sharing a cache will be divided across
> two or four nodes.
> 
> It looks like the existing tests may print a warning. I see
> this code in:
> 
> tools/testing/selftests/resctrl/resctrl_tests.c
> 
>         res = cmt_resctrl_val(cpu_no, 5, benchmark_cmd);
>         ksft_test_result(!res, "CMT: test\n");
>         if ((get_vendor() == ARCH_INTEL) && res)
>                 ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
> 
> but at first glance that warning doesn't appear to try and
> check if SNC was the actual problem.

Your first glance is accurate. This message was added after finding
tests fail on SNC systems but not finding the correct way to enumerate
whether SNC is enabled. At that time it was still recommended that
SNC not be enabled and thus test failures continued to be accurate.
This work changes that.

Reinette