[PATCH v5 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

James Morse posted 21 patches 3 years, 10 months ago
There is a newer version of this series
arch/x86/include/asm/resctrl.h            |   9 +
arch/x86/kernel/cpu/resctrl/core.c        | 117 ++++-------
arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  75 ++++---
arch/x86/kernel/cpu/resctrl/internal.h    |  61 +++---
arch/x86/kernel/cpu/resctrl/monitor.c     | 232 ++++++++++++++--------
arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   2 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 216 ++++++++++++++++----
include/linux/resctrl.h                   |  64 +++++-
8 files changed, 514 insertions(+), 262 deletions(-)
[PATCH v5 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
Posted by James Morse 3 years, 10 months ago
Changes in this version?
 * Use supports_mba_mbps() in resctrl_{on,off}line_domain()
 * Remove some error handling for errors that can't happen.
 * Restored mbps_val[] reset code to set_mba_sc().
 * Moved mbps_val[] reset in rdtgroup_init_mba() to reduce noise.
 * Added resctrl_arch_round_mon_val() to fix the user provided value.

---
The aim of this series is to insert a split between the parts of the monitor
code that the architecture must implement, and those that are part of the
resctrl filesystem. The eventual aim is to move all filesystem parts out
to live in /fs/resctrl, so that resctrl can be wired up for MPAM.

What's MPAM? See the cover letter of a previous series. [1]

The series adds domain online/offline callbacks to allow the filesystem to
manage some of its structures itself, then moves all the 'mba_sc' behaviour
to be part of the filesystem.
This means another architecture doesn't need to provide an mbps_val array.
As it's all software, the resctrl filesystem should be able to do this without
any help from the architecture code.

Finally __rmid_read() is refactored to be the API call that the architecture
provides to read a counter value. All the hardware specific overflow detection,
scaling and value correction should occur behind this helper.


This series is based on v5.19-rc1, and can be retrieved from:
git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_monitors_in_bytes/v5

[0] git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_merge_cdp/v7
[1] https://lore.kernel.org/lkml/20210728170637.25610-1-james.morse@arm.com/

[v1] https://lore.kernel.org/lkml/20210729223610.29373-1-james.morse@arm.com/
[v2] https://lore.kernel.org/lkml/20211001160302.31189-1-james.morse@arm.com/
[v3] https://lore.kernel.org/lkml/20220217182110.7176-1-james.morse@arm.com/
[v4] https://lore.kernel.org/lkml/20220412124419.30689-1-james.morse@arm.com/

James Morse (21):
  x86/resctrl: Kill off alloc_enabled
  x86/resctrl: Merge mon_capable and mon_enabled
  x86/resctrl: Add domain online callback for resctrl work
  x86/resctrl: Group struct rdt_hw_domain cleanup
  x86/resctrl: Add domain offline callback for resctrl work
  x86/resctrl: Remove set_mba_sc()s control array re-initialisation
  x86/resctrl: Abstract and use supports_mba_mbps()
  x86/resctrl: Create mba_sc configuration in the rdt_domain
  x86/resctrl: Switch over to the resctrl mbps_val list
  x86/resctrl: Remove architecture copy of mbps_val
  x86/resctrl: Allow update_mba_bw() to update controls directly
  x86/resctrl: Calculate bandwidth from the previous __mon_event_count()
    chunks
  x86/resctrl: Add per-rmid arch private storage for overflow and chunks
  x86/resctrl: Allow per-rmid arch private storage to be reset
  x86/resctrl: Abstract __rmid_read()
  x86/resctrl: Pass the required parameters into
    resctrl_arch_rmid_read()
  x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()
  x86/resctrl: Move get_corrected_mbm_count() into
    resctrl_arch_rmid_read()
  x86/resctrl: Rename and change the units of resctrl_cqm_threshold
  x86/resctrl: Add resctrl_rmid_realloc_limit to abstract x86's
    boot_cpu_data
  x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes

-- 
2.30.2
Re: [PATCH v5 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
Posted by Reinette Chatre 3 years, 7 months ago
Hi James,

On 6/22/2022 9:46 AM, James Morse wrote:
> The aim of this series is to insert a split between the parts of the monitor
> code that the architecture must implement, and those that are part of the
> resctrl filesystem. The eventual aim is to move all filesystem parts out
> to live in /fs/resctrl, so that resctrl can be wired up for MPAM.
> 
> What's MPAM? See the cover letter of a previous series. [1]
> 
> The series adds domain online/offline callbacks to allow the filesystem to
> manage some of its structures itself, then moves all the 'mba_sc' behaviour
> to be part of the filesystem.
> This means another architecture doesn't need to provide an mbps_val array.
> As it's all software, the resctrl filesystem should be able to do this without
> any help from the architecture code.
> 
> Finally __rmid_read() is refactored to be the API call that the architecture
> provides to read a counter value. All the hardware specific overflow detection,
> scaling and value correction should occur behind this helper.
> 

Thank you for your patience as I was offline for a while. 

This series looks good to me. I have one remaining comment that I provided
in reply to "[07/21] x86/resctrl: Abstract and use supports_mba_mbps()" where
it seems to me that an existing issue could easily be addressed in the new
function. 

I do not have tests for the software controller and only did basic sanity
checks. It would be great if the folks using this feature could test this
series.

Thank you very much. From my side it looks good:

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette
Re: [PATCH v5 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
Posted by James Morse 3 years, 7 months ago
Hi Reinette,

On 23/08/2022 18:20, Reinette Chatre wrote:
> On 6/22/2022 9:46 AM, James Morse wrote:
>> The aim of this series is to insert a split between the parts of the monitor
>> code that the architecture must implement, and those that are part of the
>> resctrl filesystem. The eventual aim is to move all filesystem parts out
>> to live in /fs/resctrl, so that resctrl can be wired up for MPAM.
>>
>> What's MPAM? See the cover letter of a previous series. [1]
>>
>> The series adds domain online/offline callbacks to allow the filesystem to
>> manage some of its structures itself, then moves all the 'mba_sc' behaviour
>> to be part of the filesystem.
>> This means another architecture doesn't need to provide an mbps_val array.
>> As it's all software, the resctrl filesystem should be able to do this without
>> any help from the architecture code.
>>
>> Finally __rmid_read() is refactored to be the API call that the architecture
>> provides to read a counter value. All the hardware specific overflow detection,
>> scaling and value correction should occur behind this helper.
>>
> 
> Thank you for your patience as I was offline for a while. 

No problem,


> This series looks good to me. I have one remaining comment that I provided
> in reply to "[07/21] x86/resctrl: Abstract and use supports_mba_mbps()" where
> it seems to me that an existing issue could easily be addressed in the new
> function. 

Yup, that made sense to me.


> I do not have tests for the software controller and only did basic sanity
> checks. It would be great if the folks using this feature could test this
> series.
> 
> Thank you very much. From my side it looks good:
> 
> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>


Thanks!

James
Re: [PATCH v5 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
Posted by Xin Hao 3 years, 9 months ago
Hi James,

I have reviewed all of the patches and they look good to me. I also
tested them again, and I am a little confused by the results:

# mkdir p1
# echo "L3:0=001;1=001" > p1/schemata
[root@iZbp1bu26qv0j3ddyusot3Z p1]# cat schemata
     MB:0=100;1=100
     L3:0=001;1=001
# memhog -r1000000 1000m > /mnt/log &
[1] 53023
[root@iZbp1bu26qv0j3ddyusot3Z p1]# echo 53023 > tasks
[root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
3833856
[root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
3620864
[root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
3727360
[root@iZbp1bu26qv0j3ddyusot3Z p1]# cat size
     MB:0=100;1=100
     L3:0=3407872;1=3407872

The llc_occupancy readings are larger than the 3407872 bytes the schemata allows, so the value appears to have overflowed. Can you explain why?

My machine environment is:

Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
node 0 size: 191813 MB
node 0 free: 189340 MB
node 1 cpus: 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103
node 1 size: 193522 MB
node 1 free: 192332 MB
node distances:
node   0   1
   0:  10  21
   1:  21  10

On 6/23/22 12:46 AM, James Morse wrote:
> Changes in this version?
>   * Use supports_mba_mbps() in resctrl_{on,off}line_domain()
>   * Remove some error handling for errors that can't happen.
>   * Restored mbps_val[] reset code to set_mba_sc().
>   * Moved mbps_val[] reset in rdtgroup_init_mba() to reduce noise.
>   * Added resctrl_arch_round_mon_val() to fix the user provided value.
>
> ---
> The aim of this series is to insert a split between the parts of the monitor
> code that the architecture must implement, and those that are part of the
> resctrl filesystem. The eventual aim is to move all filesystem parts out
> to live in /fs/resctrl, so that resctrl can be wired up for MPAM.
>
> What's MPAM? See the cover letter of a previous series. [1]
>
> The series adds domain online/offline callbacks to allow the filesystem to
> manage some of its structures itself, then moves all the 'mba_sc' behaviour
> to be part of the filesystem.
> This means another architecture doesn't need to provide an mbps_val array.
> As it's all software, the resctrl filesystem should be able to do this without
> any help from the architecture code.
>
> Finally __rmid_read() is refactored to be the API call that the architecture
> provides to read a counter value. All the hardware specific overflow detection,
> scaling and value correction should occur behind this helper.
>
>
> This series is based on v5.19-rc1, and can be retrieved from:
> git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_monitors_in_bytes/v5
>
> [0] git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/resctrl_merge_cdp/v7
> [1] https://lore.kernel.org/lkml/20210728170637.25610-1-james.morse@arm.com/
>
> [v1] https://lore.kernel.org/lkml/20210729223610.29373-1-james.morse@arm.com/
> [v2] https://lore.kernel.org/lkml/20211001160302.31189-1-james.morse@arm.com/
> [v3] https://lore.kernel.org/lkml/20220217182110.7176-1-james.morse@arm.com/
> [v4] https://lore.kernel.org/lkml/20220412124419.30689-1-james.morse@arm.com/
>
> James Morse (21):
>    x86/resctrl: Kill off alloc_enabled
>    x86/resctrl: Merge mon_capable and mon_enabled
>    x86/resctrl: Add domain online callback for resctrl work
>    x86/resctrl: Group struct rdt_hw_domain cleanup
>    x86/resctrl: Add domain offline callback for resctrl work
>    x86/resctrl: Remove set_mba_sc()s control array re-initialisation
>    x86/resctrl: Abstract and use supports_mba_mbps()
>    x86/resctrl: Create mba_sc configuration in the rdt_domain
>    x86/resctrl: Switch over to the resctrl mbps_val list
>    x86/resctrl: Remove architecture copy of mbps_val
>    x86/resctrl: Allow update_mba_bw() to update controls directly
>    x86/resctrl: Calculate bandwidth from the previous __mon_event_count()
>      chunks
>    x86/resctrl: Add per-rmid arch private storage for overflow and chunks
>    x86/resctrl: Allow per-rmid arch private storage to be reset
>    x86/resctrl: Abstract __rmid_read()
>    x86/resctrl: Pass the required parameters into
>      resctrl_arch_rmid_read()
>    x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()
>    x86/resctrl: Move get_corrected_mbm_count() into
>      resctrl_arch_rmid_read()
>    x86/resctrl: Rename and change the units of resctrl_cqm_threshold
>    x86/resctrl: Add resctrl_rmid_realloc_limit to abstract x86's
>      boot_cpu_data
>    x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
>
>   arch/x86/include/asm/resctrl.h            |   9 +
>   arch/x86/kernel/cpu/resctrl/core.c        | 117 ++++-------
>   arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  75 ++++---
>   arch/x86/kernel/cpu/resctrl/internal.h    |  61 +++---
>   arch/x86/kernel/cpu/resctrl/monitor.c     | 232 ++++++++++++++--------
>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   2 +-
>   arch/x86/kernel/cpu/resctrl/rdtgroup.c    | 216 ++++++++++++++++----
>   include/linux/resctrl.h                   |  64 +++++-
>   8 files changed, 514 insertions(+), 262 deletions(-)
>
-- 
Best Regards!
Xin Hao

Re: [PATCH v5 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
Posted by Reinette Chatre 3 years, 7 months ago
Hi,

On 7/3/2022 8:54 AM, Xin Hao wrote:
> Hi James,
> 
> I have reviewed all of the patches and they look good to me. I also tested them again, and I am a little confused by the results:
> 
> # mkdir p1
> 
> # echo "L3:0=001;1=001" > p1/schemata
> 
> # [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat schemata
>     MB:0=100;1=100
>     L3:0=001;1=001
> 
> # memhog -r1000000 1000m > /mnt/log &
> 
> [1] 53023
> [root@iZbp1bu26qv0j3ddyusot3Z p1]# echo 53023 > tasks
> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
> 3833856
> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
> 3620864
> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
> 3727360
> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat size
>     MB:0=100;1=100
>     L3:0=3407872;1=3407872
> 
> The llc_occupancy readings are larger than the 3407872 bytes the schemata allows, so the value appears to have overflowed. Can you explain why?

Are you seeing different behavior before and after you apply this
series?

I do not think the conclusion should immediately be that there is an
overflow issue. Have you perhaps run into the scenario "Notes on
cache occupancy monitoring and control" described in
Documentation/x86/resctrl.rst?

When "memhog" starts it can allocate to the entire L3 for a while
before it is moved to the constrained resource group. Its cache
lines are not evicted as part of this move so it is not unusual for
it to have more lines in L3 than it is allowed to allocate into.

Understanding the occupancy values requires understanding the workload
as well as the system environment.

The L3 occupancy should go down after a while to match the space the
workload can allocate into, but how quickly depends on the workload's
data usage (for example, whether it keeps loading new data) and on other
workloads on the system (whether other load evicts the lines owned by
the workload). Note that if the workload keeps loading the same data,
and that data is already present in an area of the cache the workload
cannot allocate into, each read still results in a cache hit for the
workload; the data is not moved into the area the workload can allocate
into.

Reinette
Re: [PATCH v5 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
Posted by haoxin 3 years, 7 months ago
On 2022/8/24 1:09 AM, Reinette Chatre wrote:
> Hi,
>
> On 7/3/2022 8:54 AM, Xin Hao wrote:
>> Hi James,
>>
>> I have reviewed all of the patches and they look good to me. I also tested them again, and I am a little confused by the results:
>>
>> # mkdir p1
>>
>> # echo "L3:0=001;1=001" > p1/schemata
>>
>> # [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat schemata
>>      MB:0=100;1=100
>>      L3:0=001;1=001
>>
>> # memhog -r1000000 1000m > /mnt/log &
>>
>> [1] 53023
>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# echo 53023 > tasks
>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
>> 3833856
>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
>> 3620864
>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
>> 3727360
>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat size
>>      MB:0=100;1=100
>>      L3:0=3407872;1=3407872
>>
>> The llc_occupancy readings are larger than the 3407872 bytes the schemata allows, so the value appears to have overflowed. Can you explain why?
> Are you seeing different behavior before and after you apply this
> series?
No, they have the same test result.
>
> I do not think the conclusion should immediately be that there is an
> overflow issue. Have you perhaps run into the scenario "Notes on
> cache occupancy monitoring and control" described in
> Documentation/x86/resctrl.rst?
>
> When "memhog" starts it can allocate to the entire L3 for a while
> before it is moved to the constrained resource group. Its cache
> lines are not evicted as part of this move so it is not unusual for
> it to have more lines in L3 than it is allowed to allocate into.

Yes, as you said, mon_data/mon_L3_00/llc_occupancy does not immediately
drop below the size set by the schemata; it may take a few minutes to
shrink to that size.

I don't quite understand why it takes so long for llc_occupancy to
decrease.

>
> Understanding the occupancy values requires understanding the workload
> as well as the system environment.
>
> The L3 occupancy should go down after a while to match the space the
> workload can allocate into, but how quickly depends on the workload's
> data usage (for example, whether it keeps loading new data) and on other
> workloads on the system (whether other load evicts the lines owned by
> the workload). Note that if the workload keeps loading the same data,
> and that data is already present in an area of the cache the workload
> cannot allocate into, each read still results in a cache hit for the
> workload; the data is not moved into the area the workload can allocate
> into.
>
> Reinette
Re: [PATCH v5 00/21] x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
Posted by James Morse 3 years, 7 months ago
Hi Hao Xin,

On 07/09/2022 06:46, haoxin wrote:
> On 2022/8/24 1:09 AM, Reinette Chatre wrote:
>> On 7/3/2022 8:54 AM, Xin Hao wrote:
>>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
>>> 3833856
>>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
>>> 3620864
>>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat mon_data/mon_L3_00/llc_occupancy
>>> 3727360
>>> [root@iZbp1bu26qv0j3ddyusot3Z p1]# cat size
>>>      MB:0=100;1=100
>>>      L3:0=3407872;1=3407872
>>>
>>> The llc_occupancy readings are larger than the 3407872 bytes the schemata allows, so the value appears to have overflowed. Can you explain why?

>> I do not think the conclusion should immediately be that there is an
>> overflow issue. Have you perhaps run into the scenario "Notes on
>> cache occupancy monitoring and control" described in
>> Documentation/x86/resctrl.rst?
>>
>> When "memhog" starts it can allocate to the entire L3 for a while
>> before it is moved to the constrained resource group. It's cache
>> lines are not evicted as part of this move so it is not unusual for
>> it to have more lines in L3 than it is allowed to allocate into.
> 
> Yes, as you said, mon_data/mon_L3_00/llc_occupancy does not immediately drop below the
> size set by the schemata; it may take a few minutes to shrink to that size.
> 
> I don't quite understand why it takes so long for llc_occupancy to decrease.

Do you have workloads in other control groups causing cache allocations?

One of the ways this can be built is for the cache to use the RDT policy to choose which
lines to evict. The cache may already have some LRU or line-state preferences when it
comes to eviction, so the RDT policy may not be its first choice.

If there is no cache pressure from outside the control group, does it matter how long
the policy takes to apply?


Thanks,

James