This series adds support for Assignable Bandwidth Monitoring Counters
(ABMC), also called the QoS RMID Pinning feature.
The series is structured so that it is easier to add support for similar
assignable monitoring features from other vendors.
The feature details are documented in the APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
The patches are based on top of commit
d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
# Introduction
Users can create as many monitor groups as there are RMIDs supported by the
hardware. However, the bandwidth monitoring feature on AMD systems only
guarantees that RMIDs currently assigned to a processor will be tracked by
the hardware. The counters of any other RMIDs which are no longer being
tracked are reset to zero, and the MBM event counters return "Unavailable"
for RMIDs that are not tracked by the hardware. So only a limited number of
groups can provide guaranteed monitoring numbers. With ever-changing
configurations there is no way to know definitively which of these groups
are being tracked at a given point in time. Users have no option to monitor
a group, or set of groups, for a certain period of time without worrying
about the counters being reset in between.
The ABMC feature provides an option for the user to assign a hardware
counter to an RMID, event pair and monitor the bandwidth for as long as it
is assigned. The assigned RMID will be tracked by the hardware until the
user unassigns it manually, so there is no need to worry about the counter
being reset during this period. Additionally, the user can specify a bitmask
identifying the specific bandwidth types from the given source to track with
the counter.
Without ABMC enabled, monitoring works in the current 'default' mode without
the assignment option.
# Linux Implementation
Create a generic interface aimed at supporting user-space assignment of the
scarce counters used for monitoring. The first user of the interface is
ABMC, with the option to expand usage to "soft-ABMC" and MPAM counters in
the future.
The feature adds the following interface files:
/sys/fs/resctrl/info/L3_MON/mbm_assign_mode: Reports the list of assignable
monitoring features supported. The enclosed brackets indicate which
feature is enabled.
/sys/fs/resctrl/info/L3_MON/num_mbm_cntrs: Reports the number of monitoring
counters available for assignment.
/sys/fs/resctrl/info/L3_MON/available_mbm_cntrs: Reports the number of monitoring
counters free in each domain.
/sys/fs/resctrl/info/L3_MON/mbm_assign_control: Reports the monitoring
assignment state of each resctrl group. The assignment state can be updated
by writing to the interface.
# Examples
a. Check if ABMC support is available
# mount -t resctrl resctrl /sys/fs/resctrl/
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
[mbm_cntr_assign]
default
The ABMC feature is detected and enabled.
b. Check how many ABMC counters are available.
# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
32
c. Check how many ABMC counters are available in each domain.
# cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
0=30;1=30
d. Create a few resctrl groups.
# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
to list and modify the monitoring state of any group. The file provides a
single place to list the monitoring states of all resctrl groups, making it
easier for user space to learn about the counters in use without needing to
traverse all the groups, thus reducing the number of file system calls.
The list uses the following format:
"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
Format for specific type of groups:
* Default CTRL_MON group:
"//<domain_id>=<flags>"
* Non-default CTRL_MON group:
"<CTRL_MON group>//<domain_id>=<flags>"
* Child MON group of default CTRL_MON group:
"/<MON group>/<domain_id>=<flags>"
* Child MON group of non-default CTRL_MON group:
"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
Flags can be one of the following:
t MBM total event is enabled.
l MBM local event is enabled.
tl Both total and local MBM events are enabled.
_ None of the MBM events are enabled.
Examples:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=tl;1=tl
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
//0=tl;1=tl
/child_default_mon_grp/0=tl;1=tl
There are four groups, and all of them have both the local and total
events enabled on domains 0 and 1.
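For scripting, the listing can be split on '/' and ';' to recover the
per-group, per-domain state. Below is a minimal sh sketch (not part of this
series; the output layout is only an illustration) that assumes nothing
beyond the "<CTRL_MON group>/<MON group>/<domain_id>=<flags>" format above:

    # Print one "<ctrl_mon> <mon> <domain> <flags>" line per entry.
    # An empty CTRL_MON field means the default group; an empty MON field
    # means the CTRL_MON group itself.
    while IFS=/ read -r ctrl mon doms; do
            for pair in $(printf '%s' "$doms" | tr ';' ' '); do
                    dom=${pair%%=*}
                    flags=${pair#*=}
                    printf '%s %s %s %s\n' "${ctrl:-default}" "${mon:--}" "$dom" "$flags"
            done
    done < /sys/fs/resctrl/info/L3_MON/mbm_assign_control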
f. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control.
The write format is similar to the list format above, with the addition of
an opcode for the assignment operation.
"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
* Default CTRL_MON group:
"//<domain_id><opcode><flags>"
* Non-default CTRL_MON group:
"<CTRL_MON group>//<domain_id><opcode><flags>"
* Child MON group of default CTRL_MON group:
"/<MON group>/<domain_id><opcode><flags>"
* Child MON group of non-default CTRL_MON group:
"<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
Opcode can be one of the following:
= Update the assignment to match the flags.
+ Assign a new MBM event without impacting existing assignments.
- Unassign an MBM event from the currently assigned events.
Flags can be one of the following:
t MBM total event.
l MBM local event.
tl Both total and local MBM events.
_ None of the MBM events. Only works with '=' opcode. This flag cannot be combined with other flags.
Initial group status:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=tl;1=tl
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
//0=tl;1=tl
/child_default_mon_grp/0=tl;1=tl
To update the default group to enable only the total event on domain 0:
# echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=tl;1=tl
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
//0=t;1=tl
/child_default_mon_grp/0=tl;1=tl
To update the MON group child_default_mon_grp to remove the total event on domain 1:
# echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
Assignment status after the update:
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=tl;1=tl
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
//0=t;1=tl
/child_default_mon_grp/0=tl;1=l
To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
remove both local and total events on domain 1:
# echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
/sys/fs/resctrl/info/L3_MON/mbm_assign_control
Assignment status after the update:
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=tl;1=tl
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
//0=t;1=tl
/child_default_mon_grp/0=tl;1=l
To update the default group to add the local event on domain 0:
# echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=tl;1=tl
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
//0=tl;1=tl
/child_default_mon_grp/0=tl;1=l
To update the non-default CTRL_MON group non_default_ctrl_mon_grp to unassign all
the MBM events on all domains:
# echo "non_default_ctrl_mon_grp//*=_" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
Assignment status after the update:
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
non_default_ctrl_mon_grp//0=_;1=_
non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_
//0=tl;1=tl
/child_default_mon_grp/0=tl;1=l
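The write examples above can also be wrapped in a small helper for scripts.
A sketch follows; the assign_mbm() function name is hypothetical and the
helper only composes the "<CTRL_MON group>/<MON group>/<domain_id><opcode><flags>"
string described earlier:

    # Usage: assign_mbm <CTRL_MON group> <MON group> <domain_id|*> <opcode><flags>
    # e.g.   assign_mbm "" child_default_mon_grp 1 -t
    #        assign_mbm non_default_ctrl_mon_grp "" "*" =_
    assign_mbm() {
            ctrl=$1 mon=$2 dom=$3 op_flags=$4
            if ! echo "${ctrl}/${mon}/${dom}${op_flags}" > \
                            /sys/fs/resctrl/info/L3_MON/mbm_assign_control; then
                    echo "assignment '${ctrl}/${mon}/${dom}${op_flags}' failed" >&2
                    return 1
            fi
    }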
g. Read the events mbm_total_bytes and mbm_local_bytes of the default group.
There is no change in how the events are read with ABMC. If the event is
unassigned when it is read, the read returns "Unassigned".
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
779247936
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
765207488
h. Check the bandwidth configuration for the group. Note that the bandwidth
configuration has domain scope. The total event defaults to 0x7F (to count
all the event types) and the local event defaults to 0x15 (to count all the
local NUMA event types). The event bitmap decoding is documented at
https://www.kernel.org/doc/Documentation/x86/resctrl.rst
in the sections "mbm_total_bytes_config" and "mbm_local_bytes_config":
# cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x7f
# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x15;1=0x15
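The hex values can be decoded against the per-bit event types documented in
the resctrl.rst sections named above. The sketch below is only illustrative:
the bit descriptions in the comments reflect my reading of that
documentation and should be verified against it before relying on them:

    # Print which event-type bits are set in an mbm_*_bytes_config value.
    decode_mbm_cfg() {
            val=$(( $1 ))
            bit=0
            for desc in \
                    "Reads to memory in the local NUMA domain" \
                    "Reads to memory in the non-local NUMA domain" \
                    "Non-temporal writes to the local NUMA domain" \
                    "Non-temporal writes to the non-local NUMA domain" \
                    "Reads to slow memory in the local NUMA domain" \
                    "Reads to slow memory in the non-local NUMA domain" \
                    "Dirty victims from the QOS domain"; do
                    [ $(( (val >> bit) & 1 )) -eq 1 ] && echo "bit $bit: $desc"
                    bit=$(( bit + 1 ))
            done
    }
    decode_mbm_cfg 0x15    # expected to show the three "local" bits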
i. Change the bandwidth source for domain 0 for the total event to count only reads.
Note that this change affects the total event on domain 0 only.
# echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
# cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x33;1=0x7F
j. Now read the total event again. The first read may return "Unavailable".
Subsequent reads of mbm_total_bytes will display only the read events.
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
Unavailable
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
314101
k. Users have the option to go back to the 'default' mbm_assign_mode if
required. This can be done using the following command. Note that switching
the mbm_assign_mode will reset all the MBM counters of all resctrl groups.
# echo "default" > /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
mbm_cntr_assign
[default]
l. Unmount resctrl.
# umount /sys/fs/resctrl/
---
v11:
The commit 2937f9c361f7a ("x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags")
is already merged. Removed from the series.
Resolved minor conflicts due to code displacement in latest code.
Moved the monitoring related calls to monitor.c file when possible.
Moved some of the changes from include/linux/resctrl.h to arch/x86/kernel/cpu/resctrl/internal.h
as requested by Reinette. These changes will be moved back when the arch and non-arch code are separated.
Renamed rdtgroup_mbm_assign_mode_show() to resctrl_mbm_assign_mode_show().
Renamed rdtgroup_num_mbm_cntrs_show() to resctrl_num_mbm_cntrs_show().
Moved the mon_config_info structure definition to internal.h.
Moved resctrl_arch_mon_event_config_get() and resctrl_arch_mon_event_config_set()
to monitor.c file.
Moved resctrl_arch_assign_cntr() and resctrl_abmc_config_one_amd() to monitor.c.
Added the code to reset the arch state in resctrl_arch_assign_cntr().
Also removed resctrl_arch_reset_rmid() inside IPI as the counters are reset from the callers.
Renamed rdtgroup_assign_cntr_event() to resctrl_assign_cntr_event().
Refactored the resctrl_assign_cntr_event().
Added functionality to exit on the first error during assignment.
Simplified mbm_cntr_free().
Removed the function mbm_cntr_assigned(). Will be using mbm_cntr_get() to
figure out if the counter is assigned or not.
Renamed rdtgroup_unassign_cntr_event() to resctrl_unassign_cntr_event().
Refactored the resctrl_unassign_cntr_event().
Moved mbm_cntr_reset() to monitor.c.
Added code to reset non-architectural state in mbm_cntr_reset().
Added missing rdtgroup_unassign_cntrs() calls on failure path.
The domain can be NULL with SNC support, so moved the unassign check into rdtgroup_mondata_show().
Renamed rdtgroup_mbm_assign_mode_write() to resctrl_mbm_assign_mode_write().
Added more details in resctrl.rst about mbm_cntr_assign mode.
Re-arranged the text in resctrl.rst file in section mbm_cntr_assign.
Moved resctrl_arch_mbm_cntr_assign_set_one() to monitor.c
Added non-arch RMID reset in mbm_config_write_domain().
Removed resctrl_arch_reset_rmid() call in resctrl_abmc_config_one_amd(). Not required,
as the reset of arch and non-arch RMID counters is done by the callers. This simplifies the IPI code.
Fixed printing the separator after each domain while listing the group assignments.
Renamed rdtgroup_mbm_assign_control_show to resctrl_mbm_assign_control_show().
Fixed the static check warning by initializing dom_id in resctrl_process_flags().
Added change log in each patch for specific changes.
v10:
Major change is related to domain specific assignment.
Added struct mbm_cntr_cfg inside mon domains. This will handle
the domain specific assignments as discussed in the thread below:
https://lore.kernel.org/lkml/CALPaoCj+zWq1vkHVbXYP0znJbe6Ke3PXPWjtri5AFgD9cQDCUg@mail.gmail.com/
I did not see the need to add cntr_id in mbm_state structure. Not used in the code.
The following patches take care of these changes:
patches 12, 13, 15, 16, 17 and 18.
Added __init attribute to cache_alloc_hsw_probe(). Followed function
prototype rules (preferred order is storage class before return type).
Moved the mon_config_info structure definition to resctrl.h
Added call resctrl_arch_reset_rmid() to reset the RMID in the domain inside IPI call
resctrl_abmc_config_one_amd.
SMP and non-SMP call support is not required in resctrl_arch_config_cntr with new
domain specific assign approach/data structure.
Assigned the counter before exposing the event files.
Moved the call rdtgroup_assign_cntrs() inside mkdir_rdt_prepare_rmid_alloc().
This is called for both CTRL_MON and MON group creation.
Call mbm_cntr_reset() when unmounted to clear all the assignments.
Fixed the issue with finding the domain in multiple iterations in rdtgroup_process_flags().
Printed full error message with domain information when assign fails.
Taken care of other text comments in all the patches. Patch specific changes are in each patch.
If I missed something, please point it out; it is not intentional.
v9:
Patch 14 is a new addition.
Major change in patch 24.
Moved the fix patch to address the __init attribute to the beginning of the series.
Fixed all the call sequences. Added additional Fixed tags.
Added Reviewed-by where applicable.
Took care of a couple of minor merge conflicts with the latest code.
Re-ordered the MSRs in a couple of instances.
Added available_mbm_cntrs (patch 14) to print the number of counters available in a domain.
Used MBM_EVENT_ARRAY_INDEX macro to get the event index.
Introduced rdtgroup_cntr_id_init() to initialize the cntr_id
Introduced new function resctrl_config_cntr to assign the counter, update
the bitmap and reset the architectural state.
Taken care of error handling (freeing the counter) when assignment fails.
Changed rdtgroup_assign_cntrs() and rdtgroup_unassign_cntrs() to return void.
Updated a couple of rdtgroup_unassign_cntrs() calls properly.
Fixed a problem with changing to mbm_cntr_assign mode when it is
not supported. Added extra checks to detect whether the system supports it.
https://lore.kernel.org/lkml/03b278b5-6c15-4d09-9ab7-3317e84a409e@intel.com/
As discussed in the comment above, introduced resctrl_mon_event_config_set() to
handle the IPI. But sending another IPI from inside an IPI causes problems; the
kernel reports an SMP warning. So, introduced resctrl_arch_update_cntr() to send the
command directly.
Fixed handling of the special cases '//0=' and '//'.
Removed extra strstr() call in rdtgroup_mbm_assign_control_write().
Added generic failure text when assignment operation fails.
Corrected user documentation format texts.
v8:
Patches are getting into final stages.
Couple of changes Patch 8, Patch 19 and Patch 23.
Most of the other changes are related to rename and text message updates.
Details are in each patch. Here is the summary.
Added __init attribute to dom_data_init() in patch 8/25.
Moved the mbm_cntrs_init() and mbm_cntrs_exit() functionality inside
dom_data_init() and dom_data_exit() respectively.
Renamed resctrl_mbm_evt_config_init() to arch_mbm_evt_config_init()
Renamed resctrl_arch_event_config_get() to resctrl_arch_mon_event_config_get().
resctrl_arch_event_config_set() to resctrl_arch_mon_event_config_set().
Rename resctrl_arch_assign_cntr to resctrl_arch_config_cntr.
Renamed rdtgroup_assign_cntr() to rdtgroup_assign_cntr_event().
Added the code to return the error if rdtgroup_assign_cntr_event fails.
Moved definition of MBM_EVENT_ARRAY_INDEX to resctrl/internal.h.
Renamed rdtgroup_mbm_cntr_is_assigned to mbm_cntr_assigned_to_domain
Added return error handling in resctrl_arch_config_cntr().
Renamed rdtgroup_assign_grp to rdtgroup_assign_cntrs.
Renamed rdtgroup_unassign_grp to rdtgroup_unassign_cntrs.
Fixed the problem with unassigning the child MON groups of CTRL_MON group.
Reset the internal counters after mbm_cntr_assign mode is changed.
Renamed rdtgroup_mbm_cntr_reset() to mbm_cntr_reset()
Renamed resctrl_arch_mbm_cntr_assign_configure to
resctrl_arch_mbm_cntr_assign_set_one.
Used the same IPI as event update to modify the assignment.
Could not do it the way we discussed in the thread:
https://lore.kernel.org/lkml/f77737ac-d3f6-3e4b-3565-564f79c86ca8@amd.com/
Needed to figure out the event type to update the configuration.
During assignment modification, the unassign is done first, then the assign.
Assigning none ("_") takes priority and cannot be mixed with other flags.
Updated the documentation and .rst file format. htmldoc looks ok.
v7:
Major changes are related to FS and arch codes separation.
Changed few interface names based on feedback.
Here are the summary and each patch contains changes specific the patch.
Removed WARN_ON for num_mbm_cntrs. Decided to dynamically allocate the bitmap.
WARN_ON is not required anymore.
Renamed the function resctrl_arch_get_abmc_enabled() to resctrl_arch_mbm_cntr_assign_enabled().
Merged resctrl_arch_mbm_cntr_assign_enable() and resctrl_arch_mbm_cntr_assign_disable()
and renamed the result to resctrl_arch_mbm_cntr_assign_set(). Passed struct rdt_resource
to these functions.
Removed resctrl_arch_reset_rmid_all() from arch code. This will be done from the FS caller.
Updated the descriptions/commit log in resctrl.rst to generic text. Removed ABMC references.
Renamed mbm_mode to mbm_assign_mode.
Renamed mbm_control to mbm_assign_control.
Introduced mutex lock in rdtgroup_mbm_mode_show().
The 'legacy' mode is called 'default' mode.
Removed the static allocation and now allocating bitmap mbm_cntr_free_map dynamically.
Merged rdtgroup_assign_cntr(), rdtgroup_alloc_cntr() into one.
Merged rdtgroup_unassign_cntr(), rdtgroup_free_cntr() into one.
Added struct rdt_resource to the interface functions resctrl_arch_assign_cntr()
and resctrl_arch_unassign_cntr().
Rename rdtgroup_abmc_cfg() to resctrl_abmc_config_one_amd().
Added a new patch to fix counter assignment on event config changes.
Removed the references of ABMC from user interfaces.
Simplified the parsing (strsep(&token, "//")) in rdtgroup_mbm_assign_control_write().
Added mutex lock in rdtgroup_mbm_assign_control_write() while processing.
Thomas Gleixner asked us to update https://gitlab.com/x86-cpuid.org/x86-cpuid-db.
It needs internal approval. We are working on it.
v6:
We still need to finalize a few interface details on mbm_assign_mode and mbm_assign_control
in case of ABMC and Soft-ABMC. We can continue the discussion with this series.
Added support for domain-id '*' to update all the domains at once.
Fixed the assign interface to allocate the counter if the counter is
not already assigned.
Fixed the unassign interface to free the counter if the counter is not
assigned in any of the domains.
Renamed abmc_capable to mbm_cntr_assignable.
Renamed abmc_enabled to mbm_cntr_assign_enabled.
Used msr_set_bit and msr_clear_bit for msr updates.
Renamed resctrl_arch_abmc_enable() to resctrl_arch_mbm_cntr_assign_enable().
Renamed resctrl_arch_abmc_disable() to resctrl_arch_mbm_cntr_assign_disable().
Changed the display name from num_cntrs to num_mbm_cntrs.
Removed the variable mbm_cntrs_free_map_len. This is not required.
Removed the call mbm_cntrs_init() in arch code. This needs to be done at higher level.
Used DECLARE_BITMAP to initialize mbm_cntrs_free_map.
Removed unused config value definitions.
Introduced mbm_cntr_map to track counters at domain level. With this
we don't need to send an MSR read to read the counter configuration.
Separated all the counter id management to upper level in FS code.
Added checks to detect "Unassigned" before reading the RMID.
More details in each patch.
v5:
Rebase changes (because of SNC support)
Interface changes.
/sys/fs/resctrl/mbm_assign to /sys/fs/resctrl/mbm_assign_mode.
/sys/fs/resctrl/mbm_assign_control to /sys/fs/resctrl/mbm_assign_control.
Added a few arch-specific routines:
resctrl_arch_get_abmc_enabled.
resctrl_arch_abmc_enable.
resctrl_arch_abmc_disable.
A few renames:
num_cntrs_free_map -> mbm_cntrs_free_map
num_cntrs_init -> mbm_cntrs_init
arch_domain_mbm_evt_config -> resctrl_arch_mbm_evt_config
Introduced resctrl_arch_event_config_get() and
resctrl_arch_event_config_set() to update the event configuration.
Removed the mon_state field from mongroup. Added MON_CNTR_UNSET to initialize counters.
Renamed ctr_id to cntr_id for the hardware counter.
Report "Unassigned" in case the user attempts to read the events without assigning the counter.
ABMC is enabled during boot. It can be enabled or disabled later.
Fixed the opcode and flags combinations:
'=_' is valid.
'-_' and '+_' are not valid.
Added all the comments as far as I know. If I missed something, it is not intentional.
v4:
Main change is domain specific event assignment.
Kept the ABMC feature as a default.
Dynamic switching between ABMC and mbm_legacy is still allowed.
We are still not clear about mount option.
Moved the monitoring-related data into the resctrl_mon structure from rdt_resource.
Fixed the display of legacy and ABMC mode.
Used bitmap APIs when possible.
Removed the event configuration read from MSRs. We can use the
internally saved data (patch 12).
Added more comments about L3_QOS_ABMC_CFG MSR.
Added IPIs to read the assignment status for each domain (patches 18 and 19).
More details in each patch.
v3:
This series adds the support for the global assignment mode discussed in
the thread: https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
Removed the individual assignment mode and included the global assignment interface.
Added following interface files.
a. /sys/fs/resctrl/info/L3_MON/mbm_assign
Used for displaying the current assignment mode and switching between
ABMC and legacy mode.
b. /sys/fs/resctrl/info/L3_MON/mbm_assign_control
Used for listing the groups' assignment states and modifying them.
c. Most of the changes are related to the new interface.
d. Addressed the comments from Reinette, James and Peter.
e. I hope I have addressed most of the major feedback discussed. If I missed
something, it is not intentional. Please feel free to comment.
f. Sending this as an RFC as per Reinette's comment. So, this is still open
for discussion.
v2:
a. The major change is the way ABMC is enabled. Earlier, the user needed to remount
with -o abmc to enable the ABMC feature. That option has been removed.
Now users can enable ABMC with "echo 1 > /sys/fs/resctrl/info/L3_MON/mbm_assign_enable".
b. Added new word 21 to x86/cpufeatures.h.
c. Display "unsupported" if the user attempts to read the events when ABMC is enabled
and the event is not assigned.
d. Display monitor_state as "Unsupported" when ABMC is disabled.
e. Text updates and rebase to latest tip tree (as of Jan 18).
f. This series is still work in progress. I am yet to hear from ARM developers.
--------------------------------------------------------------------------------------
Previous revisions:
v10: https://lore.kernel.org/lkml/cover.1734034524.git.babu.moger@amd.com/
v9: https://lore.kernel.org/lkml/cover.1730244116.git.babu.moger@amd.com/
v8: https://lore.kernel.org/lkml/cover.1728495588.git.babu.moger@amd.com/
v7: https://lore.kernel.org/lkml/cover.1725488488.git.babu.moger@amd.com/
v6: https://lore.kernel.org/lkml/cover.1722981659.git.babu.moger@amd.com/
v5: https://lore.kernel.org/lkml/cover.1720043311.git.babu.moger@amd.com/
v4: https://lore.kernel.org/lkml/cover.1716552602.git.babu.moger@amd.com/
v3: https://lore.kernel.org/lkml/cover.1711674410.git.babu.moger@amd.com/
v2: https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
v1: https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
Babu Moger (23):
x86/resctrl: Add __init attribute to functions called from
resctrl_late_init()
x86/cpufeatures: Add support for Assignable Bandwidth Monitoring
Counters (ABMC)
x86/resctrl: Add ABMC feature in the command line options
x86/resctrl: Consolidate monitoring related data from rdt_resource
x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
x86/resctrl: Add support to enable/disable AMD ABMC feature
x86/resctrl: Introduce the interface to display monitor mode
x86/resctrl: Introduce interface to display number of monitoring
counters
x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg in struct
rdt_hw_mon_domain
x86/resctrl: Remove MSR reading of event configuration value
x86/resctrl: Introduce mbm_cntr_cfg to track assignable counters at
domain
x86/resctrl: Introduce interface to display number of free counters
x86/resctrl: Add data structures and definitions for ABMC assignment
x86/resctrl: Implement resctrl_arch_config_cntr() to assign a counter
with ABMC
x86/resctrl: Add the functionality to assign MBM events
x86/resctrl: Add the functionality to unassign MBM events
x86/resctrl: Auto assign/unassign counters when mbm_cntr_assign is
enabled
x86/resctrl: Report "Unassigned" for MBM events in mbm_cntr_assign
mode
x86/resctrl: Introduce the interface to switch between monitor modes
x86/resctrl: Configure mbm_cntr_assign mode if supported
x86/resctrl: Update assignments on event configuration changes
x86/resctrl: Introduce interface to list assignment states of all the
groups
x86/resctrl: Introduce interface to modify assignment states of the
groups
.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/arch/x86/resctrl.rst | 242 +++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/kernel/cpu/cpuid-deps.c | 3 +
arch/x86/kernel/cpu/resctrl/core.c | 23 +-
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 13 +
arch/x86/kernel/cpu/resctrl/internal.h | 91 ++-
arch/x86/kernel/cpu/resctrl/monitor.c | 402 +++++++++++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 620 ++++++++++++++++--
arch/x86/kernel/cpu/scattered.c | 1 +
include/linux/resctrl.h | 34 +-
12 files changed, 1350 insertions(+), 84 deletions(-)
--
2.34.1
Hi Babu,

On 22/01/2025 20:20, Babu Moger wrote:
> This series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
[...]
> The patches are based on top of commit
> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'

I've rebased the MPAM tree on top of this v11, here:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/abmc/v11

Hopefully this is sufficient evidence that this interface works for MPAM.

It would be convenient for MPAM platforms to not have to support a 'default'
mode if they are emulating ABMC - this was something that was never
supported, and it's not a problem that can be solved. (Comments on the
relevant patches.)

Thanks,
James
Hi there,

On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> This series adds the support for Assignable Bandwidth Monitoring Counters
> (ABMC). It is also called QoS RMID Pinning feature
[...]
> # Examples
>
> a. Check if ABMC support is available
>
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> [mbm_cntr_assign]
> default

(Nit: can this be called "mbm_counter_assign"? The name is already long, so
I wonder whether anything is gained by using a cryptic abbreviation for
"counter". Same with all the "cntrs" elsewhere. This is purely cosmetic,
though -- the interface works either way.)

> ABMC feature is detected and it is enabled.
>
> b. Check how many ABMC counters are available.
>
> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> 32

Is this file needed?

With MPAM, it is more difficult to promise that the same number of counters
will be available everywhere.

Rather than lie, or report a "safe" value here that may waste some counters,
can we just allow the number of counters to be discovered per domain via
available_mbm_cntrs?

num_closids and num_rmids are already problematic for MPAM, so it would be
good to avoid any more parameters of this sort from being reported to
userspace unless there is a clear understanding of why they are needed.

Reporting the number of counters per monitoring domain is a more natural fit
for MPAM, as below:

> c. Check how many ABMC counters are available in each domain.
>
> # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
> 0=30;1=30

For MPAM, this seems supportable. Each monitoring domain will have some
counters, and a well-defined number of them will be available for allocation
at any one time.

> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> to list and modify any group's monitoring states.
[...]
> Flags can be one of the following:
>
> t MBM total event is enabled.
> l MBM local event is enabled.
> tl Both total and local MBM events are enabled.
> _ None of the MBM events are enabled
>
> Examples:
[...]

I think that this basically works for MPAM.

The local/total distinction doesn't map in a consistent way onto MPAM, but
this problem is not specific to ABMC. It feels sensible for ABMC to be built
around the same concepts that resctrl already has elsewhere in the
interface. MPAM will do its best to fit (as already).

Regarding Peter's use case of assigning multiple counters to a monitoring
group [1], I feel that it's probably good enough to make sure that the ABMC
interface can be extended in future in a backwards compatible way so as to
support this, without trying to support it immediately.

[1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/

For example, if we added new generic "letters" -- say, "0" to "9", combined
with new counter files in resctrlfs, that feels like a possible approach.
ABMC (as in this series) should just reject such assignments, and the new
counter files wouldn't exist.

Availability of this feature could also be reported as a distinct mode in
mbm_assign_mode, say "mbm_cntr_generic", or whatever.

A _sketch_ of this follows. This is NOT a proposal -- the key question is
whether we are confident that we can extend the interface in this way in the
future without breaking anything.

If "yes", then the ABMC interface (as proposed by this series) works as a
foundation to build on.

--8<-- [artist's impression]

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
mbm_cntr_generic
[mbm_cntr_assign]
default

# echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
# echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
# echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type
# echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type
# echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type
# echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type

...

# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes

etc.

-->8--

Any thoughts on this, Peter?

[...]

Cheers
---Dave
Hi Dave,

On 2/12/25 9:46 AM, Dave Martin wrote:
[...]
>> b. Check how many ABMC counters are available.
>>
>> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>> 32
>
> Is this file needed?
>
> With MPAM, it is more difficult to promise that the same number of
> counters will be available everywhere.
>
> Rather than lie, or report a "safe" value here that may waste some
> counters, can we just allow the number of counters to be discovered
> per domain via available_mbm_cntrs?

This sounds reasonable to me. I think us having trouble with the user
documentation of this file so late in development should also have been a
sign to rethink its value.

For a user to discover the number of counters supported via
available_mbm_cntrs would require the file's contents to be captured right
after mount. Since we've had scenarios where new userspace needs to discover
an up-and-running system's configuration this may not be possible. I thus
wonder, instead of removing num_mbm_cntrs, could it be modified to return
the per-domain supported counters instead of a single value?

> num_closids and num_rmids are already problematic for MPAM, so it would
> be good to avoid any more parameters of this sort from being reported
> to userspace unless there is a clear understanding of why they are
> needed.

Yes. Appreciate your help in identifying what could be problematic for MPAM.

[...]
> Regarding Peter's use case of assigning multiple counters to a
> monitoring group [1], I feel that it's probably good enough to make
> sure that the ABMC interface can be extended in future in a backwards
> compatible way so as to support this, without trying to support it
> immediately.
>
> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/

I do not think that resctrl's current support of the mbm_total_bytes and
mbm_local_bytes events should be considered as the "only" two available
"slots" into which all possible events should be forced. "mon_features"
exists to guide user space to which events are supported, and as I see it
new events can be listed there to inform user space of their availability,
with their associated event files available in the resource groups.

> For example, if we added new generic "letters" -- say, "0" to "9",
> combined with new counter files in resctrlfs, that feels like a
> possible approach.
[...]
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
>
> etc.

It is not clear to me what additional features such an interface enables. It
also looks like user space will need to track and manage counter IDs?

It sounds to me as though the issue starts with your statement "The
local/total distinction doesn't map in a consistent way onto MPAM". To
address this I expect that an MPAM system will not support nor list
mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)?
Instead, it would list the events that are appropriate to the system?
Trying to match with what Peter said [1] in the message you refer to, this
may be possible:

# cat /sys/fs/resctrl/info/L3_MON/mon_features
mbm_local_read_bytes
mbm_local_write_bytes
mbm_local_bytes

(*) I am including mbm_local_bytes since it could be an event that can be
software defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes
when they are both counted.

I see the support for MPAM events as distinct from the support of assignable
counters. Once the MPAM events are sorted, I think that they can be assigned
with the existing interface. Please help me understand if you see it
differently.

Doing so would need to come up with alphabetical letters for these events,
which seems to be needed for your proposal also? If we use possible flags
of:

mbm_local_read_bytes a
mbm_local_write_bytes b

Then mbm_assign_control can be used as:

# echo '//0=ab;1=b' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
<value>
# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
<sum of mbm_local_read_bytes and mbm_local_write_bytes>

One issue would be when resctrl needs to support more than 26 events (no
more flags available), assuming that upper case would be used for "shared"
counters (unless this interface is defined differently and only a few
uppercase letters are used for it). Would this be too low of a limit?

Reinette

[1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
> I do not think that resctrl's current support of the mbm_total_bytes and
> mbm_local_bytes should be considered as the "only" two available "slots"
> into which all possible events should be forced into. "mon_features" exists
> to guide user space to which events are supported and as I see it new events
> can be listed here to inform user space of their availability, with their
> associated event files available in the resource groups.

100% I have a number of "events" in the pipeline that do not fit these
names. I'm planning on new files with descriptive[1] names for the events
they report.

-Tony

[1] When these are ready to post we can discuss the names I chose and change
them if there are better names that work across architectures.
Hi Tony,

On Thu, Feb 13, 2025 at 12:11:13AM +0000, Luck, Tony wrote:
[...]
> 100% I have a number of "events" in the pipeline that do not fit these
> names. I'm planning on new files with descriptive[1] names for the events
> they report.
>
> -Tony
>
> [1] When these are ready to post we can discuss the names I chose and
> change them if there are better names that work across architectures.

Do any of the approaches discussed in [2] look viable for this?

(Ideally, reply over there.)

Cheers
---Dave

[2] https://lore.kernel.org/lkml/Z64tw2NbJXbKpLrH@e133380.arm.com/
-Fenghua (his email address does not work anymore)

On 2/12/25 3:33 PM, Reinette Chatre wrote:
> Hi Dave,
>
> On 2/12/25 9:46 AM, Dave Martin wrote:
[...]
Hi Reinette,
On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> Hi Dave,
>
> On 2/12/25 9:46 AM, Dave Martin wrote:
> > Hi there,
> >
> > On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>
> >> This series adds the support for Assignable Bandwidth Monitoring Counters
> >> (ABMC). It is also called QoS RMID Pinning feature
> >>
> >> Series is written such that it is easier to support other assignable
> >> features supported from different vendors.
> >>
> >> The feature details are documented in the APM listed below [1].
> >> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
> >> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
> >> Monitoring (ABMC). The documentation is available at
> >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
> >>
> >> The patches are based on top of commit
> >> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
[...]
> >> b. Check how many ABMC counters are available.
> >>
> >> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
> >> 32
> >
> > Is this file needed?
> >
> > With MPAM, it is more difficult to promise that the same number of
> > counters will be available everywhere.
> >
> > Rather than lie, or report a "safe" value here that may waste some
> > counters, can we just allow the number of counters to be be discovered
> > per domain via available_mbm_cntrs?
>
> This sounds reasonable to me. I think us having trouble with the
> user documentation of this file so late in development should also have been
> a sign to rethink its value.
>
> For a user to discover the number of counters supported via available_mbm_cntrs
> would require the file's contents to be captured right after mount. Since we've
> had scenarios where new userspace needs to discover an up-and-running system's
> configuration this may not be possible. I thus wonder instead of removing
> num_mbm_cntrs, it could be modified to return the per-domain supported counters
> instead of a single value?
Is it actually useful to be able to discover the number of counters
that exist? A counter that exists but is not available cannot be used,
so perhaps it is not useful to know about it in the first place.
But if we keep this file but make it report the number of counters for
each domain (similarly to available_mbm_cntrs), then I think the MPAM
driver should be able to work with that.
> > num_closids and num_rmids are already problematic for MPAM, so it would
> > be good to avoid any more parameters of this sort from being reported
> > to userspace unless there is a clear understanding of why they are
> > needed.
>
> Yes. Appreciate your help in identifying what could be problematic for MPAM.
For clarity: this is a background issue, mostly orthogonal to this
series.
If this series is merged as-is, with a global per-resource
num_mbm_cntrs property, then this not really worse than the current
situation -- it's just a bit annoying from the MPAM perspective.
In a nutshell, the num_closids / num_rmids parameters seem to expose
RDT-specific hardware semantics to userspace, implying a specific
allocation model for control group and monitoring group identifiers.
The guarantees that userspace is entitled to assume when resctrl
reports particular values do not seem to be well described and are hard
to map onto the nearest-equivalent MPAM implementation. A combination
of control and monitoring groups that can be created on x86 may not be
creatable on MPAM, even when the number of supportable control and
monitoring partitions is the same.
Even with the ABMC series, we may still be constrained on what we can
report for num_rmids: we can't know in advance whether or not the user
is going to use mbm_cntr_assign mode -- if not, we can't promise to
create more monitoring groups than the number of counters in the
hardware.
It seems natural for the counts reported by "available_mbm_cntrs" to
change dynamically when the ABMC assignment mode is changed, but I
think userspace is likely to expect the global "num_rmids" parameter
to be fixed for the lifetime of the resctrl mount (and possibly fixed
for all time on a given hardware platform -- at least, modulo CDP).
I think it might be possible to tighten up the documentation of
num_closids in particular in a way that doesn't conflict with x86 and
may make it easier for MPAM to fit in with, but that feels like a
separate conversation.
None of this should be considered a blocker for this series, either way.
> >
> > Reporting number of counters per monitoring domain is a more natural
> > fit for MPAM, as below:
> >
> >> c. Check how many ABMC counters are available in each domain.
> >>
> >> # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
> >> 0=30;1=30
> >
> > For MPAM, this seems supportable. Each monitoring domain will have
> > some counters, and a well-defined number of them will be available for
> > allocation at any one time.
[...]
> >> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
[...]
> >> Flags can be one of the following:
> >>
> >> t MBM total event is enabled.
> >> l MBM local event is enabled.
> >> tl Both total and local MBM events are enabled.
> >> _ None of the MBM events are enabled
> >>
> >> Examples:
> >
> > [...]
> >
> > I think that this basically works for MPAM.
> >
> > The local/total distinction doesn't map in a consistent way onto MPAM,
> > but this problem is not specific to ABMC. It feels sensible for ABMC
> > to be built around the same concepts that resctrl already has elsewhere
> > in the interface. MPAM will do its best to fit (as already).
> >
> > Regarding Peter's use case of assigning multiple counters to a
> > monitoring group [1], I feel that it's probably good enough to make
> > sure that the ABMC interface can be extended in future in a backwards
> > compatible way so as to support this, without trying to support it
> > immediately.
> >
> > [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
> >
>
> I do not think that resctrl's current support of the mbm_total_bytes and
> mbm_local_bytes should be considered as the "only" two available "slots"
> into which all possible events should be forced into. "mon_features" exists
> to guide user space to which events are supported and as I see it new events
> can be listed here to inform user space of their availability, with their
> associated event files available in the resource groups.
That's fair. I wasn't currently sure how (or if) the set of countable
events was expected to grow / evolve via this route.
Either way, I think this confirms that there is at least one viable way
to enable more counters for a single control group, on top of this
series.
(If there is more than one way, that seems fine?)
> >
> > For example, if we added new generic "letters" -- say, "0" to "9",
> > combined with new counter files in resctrlfs, that feels like a
> > possible approach. ABMC (as in this series) should just reject such
> > assignments, and the new counter files wouldn't exist.
> >
> > Availability of this feature could also be reported as a distinct mode
> > in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
> >
> >
> > A _sketch_ of this follows. This is NOT a proposal -- the key
> > question is whether we are confident that we can extend the interface
> > in this way in the future without breaking anything.
> >
> > If "yes", then the ABMC interface (as proposed by this series) works as
> > a foundation to build on.
> >
> > --8<--
> >
> > [artists's impression]
> >
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> > mbm_cntr_generic
> > [mbm_cntr_assign]
> > default
> >
> > # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
> > # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type
> > # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type
> > # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type
> > # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type
> >
> > ...
> >
> > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
> >
> > etc.
> >
>
> It is not clear to me what additional features such an interface enables. It
> also looks like user space will need to track and manage counter IDs?
My idea was that for these generic counters, new files could be exposed
to configure what they actually count (the ..._type files shown above;
or possibly via the ..._config files that already exist).
The "IDs" were inteded as abstract; the number only relates the
assignments in mbm_assign_control to the files created elsewhere. This
wouldn't be related to IDs assigned by the hardware.
If there are multiple resctrl users then using numeric IDs might be
problematic; though if we go eventually in the direction of making
resctrlfs multi-mountable then each mount could have its own namespace.
Allowing counters to be named and configured with a mkdir()-style
interface might be possible too; that might make it easier for users to
coexist within a single resctrl mount (if we think that's important
enough).
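A purely hypothetical sketch of what that might look like (none of these
paths or file names exist in this series; they are made up only to
illustrate the idea):

# mkdir /sys/fs/resctrl/info/L3_MON/counters/reads
# echo t > /sys/fs/resctrl/info/L3_MON/counters/reads/type
# cat /sys/fs/resctrl/mon_data/mon_L3_00/reads_bytes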
> It sounds to me as though the issue starts with your statement
> "The local/total distinction doesn't map in a consistent way onto MPAM". To
> address this I expect that an MPAM system will not support nor list
> mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
> it would list the events that are appropriate to the system? Trying to match
> with what Peter said [1] in the message you refer to, this may be possible:
>
> # cat /sys/fs/resctrl/info/L3_MON/mon_features
> mbm_local_read_bytes
> mbm_local_write_bytes
> mbm_local_bytes
>
> (*) I am including mbm_local_bytes since it could be an event that can be software
> defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
> counted.
>
> I see the support for MPAM events distinct from the support of assignable counters.
> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> Please help me understand if you see it differently.
>
> Doing so would need to come up with alphabetical letters for these events,
> which seems to be needed for your proposal also? If we use possible flags of:
>
> mbm_local_read_bytes a
> mbm_local_write_bytes b
>
> Then mbm_assign_control can be used as:
> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> <value>
> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>
> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> assuming that upper case would be used for "shared" counters (unless this interface is defined
> differently and only few uppercase letters used for it). Would this be too low of a limit?
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
That approach would also work where an MPAM system has events that are not
a reasonable approximation of the generic "total" or "local".
For now we would probably stick with "total" and "local" anyway though,
because the MPAM architecture doesn't natively allow the mapping onto
the memory system topology to be discovered, and the information in
ACPI / device tree is insufficient to tell us everything we'd need to
know. But I guess what counts as "local" in particular will be quite
hardware and topology dependent even on x86, so perhaps we shouldn't
worry about having the behaviour match exactly (?)
Regarding the code letters, my idea was that the event type might be
configured by a separate file, instead of in mbm_assign_control
directly, in which case running out of letters wouldn't be a problem.
Alternatively, if we want to be able to expand beyond single letters,
could we reserve one or more characters for extension purposes?
If braces are forbidden by the syntax today, could we add support for
something like the following later on, without breaking anything?
# echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
For now, my main concern would be whether this series prevents that
sort of thing being added in a backwards compatible way later.
I don't really see anything that is a blocker.
What do you think?
Cheers
---Dave
Hi Dave,
On 2/13/25 9:37 AM, Dave Martin wrote:
> Hi Reinette,
>
> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>> Hi there,
>>>
>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>
>>>> This series adds the support for Assignable Bandwidth Monitoring Counters
>>>> (ABMC). It is also called QoS RMID Pinning feature
>>>>
>>>> Series is written such that it is easier to support other assignable
>>>> features supported from different vendors.
>>>>
>>>> The feature details are documented in the APM listed below [1].
>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>> Monitoring (ABMC). The documentation is available at
>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>>
>>>> The patches are based on top of commit
>>>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
>
> [...]
>
>>>> b. Check how many ABMC counters are available.
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>>>> 32
>>>
>>> Is this file needed?
>>>
>>> With MPAM, it is more difficult to promise that the same number of
>>> counters will be available everywhere.
>>>
>>> Rather than lie, or report a "safe" value here that may waste some
>>> counters, can we just allow the number of counters to be discovered
>>> per domain via available_mbm_cntrs?
>>
>> This sounds reasonable to me. I think us having trouble with the
>> user documentation of this file so late in development should also have been
>> a sign to rethink its value.
>>
>> For a user to discover the number of counters supported via available_mbm_cntrs
>> would require the file's contents to be captured right after mount. Since we've
>> had scenarios where new userspace needs to discover an up-and-running system's
>> configuration this may not be possible. I thus wonder instead of removing
>> num_mbm_cntrs, it could be modified to return the per-domain supported counters
>> instead of a single value?
>
> Is it actually useful to be able to discover the number of counters
> that exist? A counter that exists but is not available cannot be used,
> so perhaps it is not useful to know about it in the first place.
An alternative perspective of what "available" means is "how many counters
could I possibly get to do this new monitoring task". A user may be willing
to re-assign counters if the new monitoring task is important. Knowing
how many counters are already free and available for assignment would be
easy from available_mbm_cntrs but to get an idea of how many counters
could be re-assigned to help out with the new task would require
some intricate parsing of mbm_assign_control.
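For example, with mbm_assign_control contents along the lines of this
series (the group paths below are made up for illustration), finding
counters that could be re-assigned means walking every group's per-domain
flags:

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=tl;1=tl
/g1//0=t;1=_
/g1/mon1/0=_;1=l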
> But if we keep this file and make it report the number of counters for
> each domain (similarly to available_mbm_cntrs), then I think the MPAM
> driver should be able to work with that.
>
>>> num_closids and num_rmids are already problematic for MPAM, so it would
>>> be good to avoid any more parameters of this sort from being reported
>>> to userspace unless there is a clear understanding of why they are
>>> needed.
>>
>> Yes. Appreciate your help in identifying what could be problematic for MPAM.
>
> For clarity: this is a background issue, mostly orthogonal to this
> series.
>
> If this series is merged as-is, with a global per-resource
> num_mbm_cntrs property, then this is not really worse than the current
> situation -- it's just a bit annoying from the MPAM perspective.
>
>
> In a nutshell, the num_closids / num_rmids parameters seem to expose
> RDT-specific hardware semantics to userspace, implying a specific
> allocation model for control group and monitoring group identifiers.
>
> The guarantees that userspace is entitled to assume when resctrl
> reports particular values do not seem to be well described and are hard
> to map onto the nearest-equivalent MPAM implementation. A combination
> of control and monitoring groups that can be created on x86 may not be
> creatable on MPAM, even when the number of supportable control and
> monitoring partitions is the same.
I understand. This interface was created almost a decade ago. It would have been
wonderful if the user interface could have been created with a clear vision
of all the use cases it would end up needing to support. I am trying to be
very careful with this new user interface as I try to consider all the things I
learned while working on resctrl. All help to get this new interface right is
greatly appreciated.
Since you specifically mention issues that MPAM has with num_rmids, please
note that we have been trying (see [1], but maybe start reading thread at [2])
to find ways to make this work with MPAM but no word from MPAM side.
I see that you were not cc'd on the discussion so this is not a criticism of
you personally but I would like to highlight that we do try to make things
work well for MPAM but so far this work seems ignored, yet criticized
for not being done. I expect the more use cases are thrown at an interface
as it is developed the better it would get and I would gladly work with MPAM
folks to improve things.
> Even with the ABMC series, we may still be constrained on what we can
> report for num_rmids: we can't know in advance whether or not the user
> is going to use mbm_cntr_assign mode -- if not, we can't promise to
> create more monitoring groups than the number of counters in the
> hardware.
It is the architecture that decides which modes are supported and
which is the default.
> It seems natural for the counts reported by "available_mbm_cntrs" to
> change dynamically when the ABMC assignment mode is changed, but I
> think userspace is likely to expect the global "num_rmids" parameter
> to be fixed for the lifetime of the resctrl mount (and possibly fixed
> for all time on a given hardware platform -- at least, modulo CDP).
>
>
> I think it might be possible to tighten up the documentation of
> num_closids in particular in a way that doesn't conflict with x86 and
> may make it easier for MPAM to fit in with, but that feels like a
> separate conversation.
>
> None of this should be considered a blocker for this series, either way.
>
>>>
>>> Reporting number of counters per monitoring domain is a more natural
>>> fit for MPAM, as below:
>>>
>>>> c. Check how many ABMC counters are available in each domain.
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
>>>> 0=30;1=30
>>>
>>> For MPAM, this seems supportable. Each monitoring domain will have
>>> some counters, and a well-defined number of them will be available for
>>> allocation at any one time.
>
> [...]
>
>>>> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> [...]
>
>>>> Flags can be one of the following:
>>>>
>>>> t MBM total event is enabled.
>>>> l MBM local event is enabled.
>>>> tl Both total and local MBM events are enabled.
>>>> _ None of the MBM events are enabled
>>>>
>>>> Examples:
>>>
>>> [...]
>>>
>>> I think that this basically works for MPAM.
>>>
>>> The local/total distinction doesn't map in a consistent way onto MPAM,
>>> but this problem is not specific to ABMC. It feels sensible for ABMC
>>> to be built around the same concepts that resctrl already has elsewhere
>>> in the interface. MPAM will do its best to fit (as already).
>>>
>>> Regarding Peter's use case of assigning multiple counters to a
>>> monitoring group [1], I feel that it's probably good enough to make
>>> sure that the ABMC interface can be extended in future in a backwards
>>> compatible way so as to support this, without trying to support it
>>> immediately.
>>>
>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>>>
>>
>> I do not think that resctrl's current support of the mbm_total_bytes and
>> mbm_local_bytes should be considered as the "only" two available "slots"
>> into which all possible events should be forced into. "mon_features" exists
>> to guide user space to which events are supported and as I see it new events
>> can be listed here to inform user space of their availability, with their
>> associated event files available in the resource groups.
>
> That's fair. I wasn't currently sure how (or if) the set of countable
> events was expected to grow / evolve via this route.
>
> Either way, I think this confirms that there is at least one viable way
> to enable more counters for a single control group, on top of this
> series.
>
> (If there is more than one way, that seems fine?)
>
>>>
>>> For example, if we added new generic "letters" -- say, "0" to "9",
>>> combined with new counter files in resctrlfs, that feels like a
>>> possible approach. ABMC (as in this series) should just reject such
>>> assignments, and the new counter files wouldn't exist.
>>>
>>> Availability of this feature could also be reported as a distinct mode
>>> in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
>>>
>>>
>>> A _sketch_ of this follows. This is NOT a proposal -- the key
>>> question is whether we are confident that we can extend the interface
>>> in this way in the future without breaking anything.
>>>
>>> If "yes", then the ABMC interface (as proposed by this series) works as
>>> a foundation to build on.
>>>
>>> --8<--
>>>
>>> [artists's impression]
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>> mbm_cntr_generic
>>> [mbm_cntr_assign]
>>> default
>>>
>>> # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>> # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type
>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type
>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type
>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type
>>>
>>> ...
>>>
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
>>>
>>> etc.
>>>
>>
>> It is not clear to me what additional features such an interface enables. It
>> also looks like user space will need to track and manage counter IDs?
>
> My idea was that for these generic counters, new files could be exposed
> to configure what they actually count (the ..._type files shown above;
> or possibly via the ..._config files that already exist).
>
> The "IDs" were inteded as abstract; the number only relates the
> assignments in mbm_assign_control to the files created elsewhere. This
> wouldn't be related to IDs assigned by the hardware.
I see. Yes, this sounds related to and a generalization of the AMD
configurable event feature.
>
> If there are multiple resctrl users then using numeric IDs might be
> problematic; though if we go eventually in the direction of making
> resctrlfs multi-mountable then each mount could have its own namespace.
I am not aware of a "multi-mountable" direction.
>
> Allowing counters to be named and configured with a mkdir()-style
> interface might be possible too; that might make it easier for users to
> coexist within a single resctrl mount (if we think that's important
> enough).
>
>> It sounds to me as though the issue starts with your statement
>> "The local/total distinction doesn't map in a consistent way onto MPAM". To
>> address this I expect that an MPAM system will not support nor list
>> mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
>> it would list the events that are appropriate to the system? Trying to match
>> with what Peter said [1] in the message you refer to, this may be possible:
>>
>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>> mbm_local_read_bytes
>> mbm_local_write_bytes
>> mbm_local_bytes
>>
>> (*) I am including mbm_local_bytes since it could be an event that can be software
>> defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
>> counted.
>>
>> I see the support for MPAM events distinct from the support of assignable counters.
>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>> Please help me understand if you see it differently.
>>
>> Doing so would need to come up with alphabetical letters for these events,
>> which seems to be needed for your proposal also? If we use possible flags of:
>>
>> mbm_local_read_bytes a
>> mbm_local_write_bytes b
>>
>> Then mbm_assign_control can be used as:
>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>> <value>
>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>
>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>
>> Reinette
>>
>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>
> That approach would also work where an MPAM system has events that are not
> a reasonable approximation of the generic "total" or "local".
>
> For now we would probably stick with "total" and "local" anyway though,
> because the MPAM architecture doesn't natively allow the mapping onto
> the memory system topology to be discovered, and the information in
> ACPI / device tree is insufficient to tell us everything we'd need to
> know. But I guess what counts as "local" in particular will be quite
> hardware and topology dependent even on x86, so perhaps we shouldn't
> worry about having the behaviour match exactly (?)
>
> Regarding the code letters, my idea was that the event type might be
> configured by a separate file, instead of in mbm_assign_control
> directly, in which case running out of letters wouldn't be a problem.
This work started with individual files for counters but the issue was
raised that this will require a large number of filesystem calls when, for
example, a user wants to move a group of counters associated with the events
of one set of monitoring groups to another set of monitoring groups. This
is for the use case where there are a significant number of monitor groups
for which there are not sufficient counters. With mbm_assign_control this
can be done in a single write and such a monitoring transition can thus
be accomplished more efficiently.
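As a rough illustration (the group names are made up, and this assumes
that multiple newline-separated entries are accepted in a single write, as
in this series), such a transition could be a single write:

# printf '/old_grp//0=_;1=_\n/new_grp//0=tl;1=tl\n' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control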
>
> Alternatively, if we want to be able to expand beyond single letters,
> could we reserve one or more characters for extension purposes?
>
> If braces are forbidden by the syntax today, could we add support for
> something like the following later on, without breaking anything?
>
> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
Thank you for the suggestion. I think we may need something like this.
Babu, what do you think?
>
> For now, my main concern would be whether this series prevents that
> sort of thing being added in a backwards compatible way later.
>
> I don't really see anything that is a blocker.
>
> What do you think?
I do not fully understand the MPAM counter feature. It almost sounds like
every counter could be configured independently with the expectation to
configure and assign each counter independently to a domain. As I understand
these capabilities match AMD's ABMC feature, but the planned implementation
to support ABMC first configures events per-domain and then assign these
events to counters. hmmm ... but in your example a file like
"mbm_counter0_bytes_type" is global. Could you please elaborate how in
your example writing a single letter to that file will be interpreted?
Reinette
[1] https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
[2] https://lore.kernel.org/lkml/CALPaoChad6=xqz+BQQd=dB915xhj1gusmcrS9ya+T2GyhTQc5Q@mail.gmail.com/
Hi Dave/Reinette,
On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> Hi Dave,
>
> On 2/13/25 9:37 AM, Dave Martin wrote:
>> Hi Reinette,
>>
>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>> Hi Dave,
>>>
>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>> Hi there,
>>>>
>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>
>>>>> This series adds the support for Assignable Bandwidth Monitoring Counters
>>>>> (ABMC). It is also called QoS RMID Pinning feature
>>>>>
>>>>> Series is written such that it is easier to support other assignable
>>>>> features supported from different vendors.
>>>>>
>>>>> The feature details are documented in the APM listed below [1].
>>>>> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
>>>>> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
>>>>> Monitoring (ABMC). The documentation is available at
>>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>>>>>
>>>>> The patches are based on top of commit
>>>>> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx'
>>
>> [...]
>>
>>>>> b. Check how many ABMC counters are available.
>>>>>
>>>>> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
>>>>> 32
>>>>
>>>> Is this file needed?
>>>>
>>>> With MPAM, it is more difficult to promise that the same number of
>>>> counters will be available everywhere.
>>>>
>>>> Rather than lie, or report a "safe" value here that may waste some
>>>> counters, can we just allow the number of counters to be discovered
>>>> per domain via available_mbm_cntrs?
>>>
>>> This sounds reasonable to me. I think us having trouble with the
>>> user documentation of this file so late in development should also have been
>>> a sign to rethink its value.
>>>
>>> For a user to discover the number of counters supported via available_mbm_cntrs
>>> would require the file's contents to be captured right after mount. Since we've
>>> had scenarios where new userspace needs to discover an up-and-running system's
>>> configuration this may not be possible. I thus wonder instead of removing
>>> num_mbm_cntrs, it could be modified to return the per-domain supported counters
>>> instead of a single value?
>>
>> Is it actually useful to be able to discover the number of counters
>> that exist? A counter that exists but is not available cannot be used,
>> so perhaps it is not useful to know about it in the first place.
>
> An alternative perspective of what "available" means is "how many counters
> could I possibly get to do this new monitoring task". A user may be willing
> to re-assign counters if the new monitoring task is important. Knowing
> how many counters are already free and available for assignment would be
> easy from available_mbm_cntrs but to get an idea of how many counters
> could be re-assigned to help out with the new task would require
> some intricate parsing of mbm_assign_control.
>
>
>> But if we keep this file and make it report the number of counters for
>> each domain (similarly to available_mbm_cntrs), then I think the MPAM
>> driver should be able to work with that.
>>
>>>> num_closids and num_rmids are already problematic for MPAM, so it would
>>>> be good to avoid any more parameters of this sort from being reported
>>>> to userspace unless there is a clear understanding of why they are
>>>> needed.
>>>
>>> Yes. Appreciate your help in identifying what could be problematic for MPAM.
>>
>> For clarity: this is a background issue, mostly orthogonal to this
>> series.
>>
>> If this series is merged as-is, with a global per-resource
>> num_mbm_cntrs property, then this is not really worse than the current
>> situation -- it's just a bit annoying from the MPAM perspective.
>>
>>
>> In a nutshell, the num_closids / num_rmids parameters seem to expose
>> RDT-specific hardware semantics to userspace, implying a specific
>> allocation model for control group and monitoring group identifiers.
>>
>> The guarantees that userspace is entitled to assume when resctrl
>> reports particular values do not seem to be well described and are hard
>> to map onto the nearest-equivalent MPAM implementation. A combination
>> of control and monitoring groups that can be created on x86 may not be
>> creatable on MPAM, even when the number of supportable control and
>> monitoring partitions is the same.
>
> I understand. This interface was created almost a decade ago. It would have been
> wonderful if the user interface could have been created with a clear vision
> of all the use cases it would end up needing to support. I am trying to be
> very careful with this new user interface as I try to consider all the things I
> learned while working on resctrl. All help to get this new interface right is
> greatly appreciated.
>
> Since you specifically mention issues that MPAM has with num_rmids, please
> note that we have been trying (see [1], but maybe start reading thread at [2])
> to find ways to make this work with MPAM but no word from MPAM side.
> I see that you were not cc'd on the discussion so this is not a criticism of
> you personally but I would like to highlight that we do try to make things
> work well for MPAM but so far this work seems ignored, yet criticized
> for not being done. I expect the more use cases are thrown at an interface
> as it is developed the better it would get and I would gladly work with MPAM
> folks to improve things.
>
>> Even with the ABMC series, we may still be constrained on what we can
>> report for num_rmids: we can't know in advance whether or not the user
>> is going to use mbm_cntr_assign mode -- if not, we can't promise to
>> create more monitoring groups than the number of counters in the
>> hardware.
>
> It is the architecture that decides which modes are supported and
> which is the default.
>
>> It seems natural for the counts reported by "available_mbm_cntrs" to
>> change dynamically when the ABMC assignment mode is changed, but I
>> think userspace is likely to expect the global "num_rmids" parameter
>> to be fixed for the lifetime of the resctrl mount (and possibly fixed
>> for all time on a given hardware platform -- at least, modulo CDP).
>>
>>
>> I think it might be possible to tighten up the documentation of
>> num_closids in particular in a way that doesn't conflict with x86 and
>> may make it easier for MPAM to fit in with, but that feels like a
>> separate conversation.
>>
>> None of this should be considered a blocker for this series, either way.
>>
>>>>
>>>> Reporting number of counters per monitoring domain is a more natural
>>>> fit for MPAM, as below:
>>>>
>>>>> c. Check how many ABMC counters are available in each domain.
>>>>>
>>>>> # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
>>>>> 0=30;1=30
>>>>
>>>> For MPAM, this seems supportable. Each monitoring domain will have
>>>> some counters, and a well-defined number of them will be available for
>>>> allocation at any one time.
>>
>> [...]
>>
>>>>> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> [...]
>>
>>>>> Flags can be one of the following:
>>>>>
>>>>> t MBM total event is enabled.
>>>>> l MBM local event is enabled.
>>>>> tl Both total and local MBM events are enabled.
>>>>> _ None of the MBM events are enabled
>>>>>
>>>>> Examples:
>>>>
>>>> [...]
>>>>
>>>> I think that this basically works for MPAM.
>>>>
>>>> The local/total distinction doesn't map in a consistent way onto MPAM,
>>>> but this problem is not specific to ABMC. It feels sensible for ABMC
>>>> to be built around the same concepts that resctrl already has elsewhere
>>>> in the interface. MPAM will do its best to fit (as already).
>>>>
>>>> Regarding Peter's use case of assigning multiple counters to a
>>>> monitoring group [1], I feel that it's probably good enough to make
>>>> sure that the ABMC interface can be extended in future in a backwards
>>>> compatible way so as to support this, without trying to support it
>>>> immediately.
>>>>
>>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>>>>
>>>
>>> I do not think that resctrl's current support of the mbm_total_bytes and
>>> mbm_local_bytes should be considered as the "only" two available "slots"
>>> into which all possible events should be forced into. "mon_features" exists
>>> to guide user space to which events are supported and as I see it new events
>>> can be listed here to inform user space of their availability, with their
>>> associated event files available in the resource groups.
>>
>> That's fair. I wasn't currently sure how (or if) the set of countable
>> events was expected to grow / evolve via this route.
>>
>> Either way, I think this confirms that there is at least one viable way
>> to enable more counters for a single control group, on top of this
>> series.
>>
>> (If there is more than one way, that seems fine?)
>>
>>>>
>>>> For example, if we added new generic "letters" -- say, "0" to "9",
>>>> combined with new counter files in resctrlfs, that feels like a
>>>> possible approach. ABMC (as in this series) should just reject such
>>>> assignments, and the new counter files wouldn't exist.
>>>>
>>>> Availability of this feature could also be reported as a distinct mode
>>>> in mbm_assign_mode, say "mbm_cntr_generic", or whatever.
>>>>
>>>>
>>>> A _sketch_ of this follows. This is NOT a proposal -- the key
>>>> question is whether we are confident that we can extend the interface
>>>> in this way in the future without breaking anything.
>>>>
>>>> If "yes", then the ABMC interface (as proposed by this series) works as
>>>> a foundation to build on.
>>>>
>>>> --8<--
>>>>
>>>> [artists's impression]
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>> mbm_cntr_generic
>>>> [mbm_cntr_assign]
>>>> default
>>>>
>>>> # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode
>>>> # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type
>>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type
>>>> # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type
>>>> # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type
>>>>
>>>> ...
>>>>
>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes
>>>>
>>>> etc.
>>>>
>>>
>>> It is not clear to me what additional features such an interface enables. It
>>> also looks like user space will need to track and manage counter IDs?
>>
>> My idea was that for these generic counters, new files could be exposed
>> to configure what they actually count (the ..._type files shown above;
>> or possibly via the ..._config files that already exist).
>>
>> The "IDs" were inteded as abstract; the number only relates the
>> assignments in mbm_assign_control to the files created elsewhere. This
>> wouldn't be related to IDs assigned by the hardware.
>
> I see. Yes, this sounds related to and a generalization of the AMD
> configurable event feature.
>
>>
>> If there are multiple resctrl users then using numeric IDs might be
>> problematic; though if we go eventually in the direction of making
>> resctrlfs multi-mountable then each mount could have its own namespace.
>
> I am not aware of a "multi-mountable" direction.
>
>>
>> Allowing counters to be named and configured with a mkdir()-style
>> interface might be possible too; that might make it easier for users to
>> coexist within a single resctrl mount (if we think that's important
>> enough).
>>
>>> It sounds to me as though the issue starts with your statement
>>> "The local/total distinction doesn't map in a consistent way onto MPAM". To
>>> address this I expect that an MPAM system will not support nor list
>>> mbm_total_bytes and/or mbm_local_bytes in its mon_features file (*)? Instead,
>>> it would list the events that are appropriate to the system? Trying to match
>>> with what Peter said [1] in the message you refer to, this may be possible:
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>>> mbm_local_read_bytes
>>> mbm_local_write_bytes
>>> mbm_local_bytes
>>>
>>> (*) I am including mbm_local_bytes since it could be an event that can be software
>>> defined as a sum of mbm_local_read_bytes and mbm_local_write_bytes when they are both
>>> counted.
>>>
>>> I see the support for MPAM events distinct from the support of assignable counters.
>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>> Please help me understand if you see it differently.
>>>
>>> Doing so would need to come up with alphabetical letters for these events,
>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>
>>> mbm_local_read_bytes a
>>> mbm_local_write_bytes b
>>>
>>> Then mbm_assign_control can be used as:
>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>> <value>
>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>
>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>
>>> Reinette
>>>
>>> [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/
>>
>> That approach would also work where an MPAM system has events that are not
>> a reasonable approximation of the generic "total" or "local".
>>
>> For now we would probably stick with "total" and "local" anyway though,
>> because the MPAM architecture doesn't natively allow the mapping onto
>> the memory system topology to be discovered, and the information in
>> ACPI / device tree is insufficient to tell us everything we'd need to
>> know. But I guess what counts as "local" in particular will be quite
>> hardware and topology dependent even on x86, so perhaps we shouldn't
>> worry about having the behaviour match exactly (?)
>>
>> Regarding the code letters, my idea was that the event type might be
>> configured by a separate file, instead of in mbm_assign_control
>> directly, in which case running out of letters wouldn't be a problem.
>
> This work started with individual files for counters but the issue was
> raised that this will require a large number of filesystem calls when, for
> example, a user wants to move a group of counters associated with the events
> of one set of monitoring groups to another set of monitoring groups. This
> is for the use case where there are a significant number of monitor groups
> for which there are not sufficient counters. With mbm_assign_control this
> can be done in a single write and such a monitoring transition can thus
> be accomplished more efficiently.
>
>>
>> Alternatively, if we want to be able to expand beyond single letters,
>> could we reserve one or more characters for extension purposes?
>>
>> If braces are forbidden by the syntax today, could we add support for
>> something like the following later on, without breaking anything?
>>
>> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>
> Thank you for the suggestion. I think we may need something like this.
> Babu, what do you think?
I'm not quite clear on this. Do we know what 'foo' and 'bar' refer to?
Is it random text?
In his example from
https://lore.kernel.org/lkml/Z643WdXYARTADSBy@e133380.arm.com/
--------------------------------------------------------------
The numbers are not supposed to have any hardware significance.
'//0=6'
just "means assign some unused counter for domain 0, and create files
in resctrl so I can configure and read it".
The "6" is really just a tag for labelling the resulting resctrl
file names so that the user can tell them apart. It's not supposed
to imply any specific hardware counter or event.
------------------------------------------------------------------
It seems that 'foo' and 'bar' are tags used to create files in
/sys/fs/resctrl/info/L3_MON/.
Given that, it looks like we're discussing entirely different things.
>
>>
>> For now, my main concern would be whether this series prevents that
>> sort of thing being added in a backwards compatible way later.
>>
>> I don't really see anything that is a blocker.
>>
>> What do you think?
>
> I do not fully understand the MPAM counter feature. It almost sounds like
> every counter could be configured independently with the expectation to
> configure and assign each counter independently to a domain. As I understand
> it, these capabilities match AMD's ABMC feature, but the planned implementation
> to support ABMC first configures events per-domain and then assigns these
> events to counters. hmmm ... but in your example a file like
> "mbm_counter0_bytes_type" is global. Could you please elaborate how in
> your example writing a single letter to that file will be interpreted?
>
>
> Reinette
>
> [1] https://lore.kernel.org/lkml/46767ca7-1f1b-48e8-8ce6-be4b00d129f9@intel.com/
> [2] https://lore.kernel.org/lkml/CALPaoChad6=xqz+BQQd=dB915xhj1gusmcrS9ya+T2GyhTQc5Q@mail.gmail.com/
>
Hi Babu,
On 2/14/25 10:31 AM, Moger, Babu wrote:
> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
(quoting relevant parts with the goal of focusing discussion on new possible syntax)
>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>> Please help me understand if you see it differently.
>>>>
>>>> Doing so would need to come up with alphabetical letters for these events,
>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>
>>>> mbm_local_read_bytes a
>>>> mbm_local_write_bytes b
>>>>
>>>> Then mbm_assign_control can be used as:
>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>> <value>
>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>
>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
As mentioned above, one possible issue with the existing interface is that
it is limited to 26 events (assuming only lower case letters are used). The limit
is low enough to be of concern.
....
>>>
>>> Alternatively, if we want to be able to expand beyond single letters,
>>> could we reserve one or more characters for extension purposes?
>>>
>>> If braces are forbidden by the syntax today, could we add support for
>>> something like the following later on, without breaking anything?
>>>
>>> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>
Dave proposed a change in syntax that can (a) support unlimited events,
(b) be more intuitive than the one-letter flags that may be hard to match
to the events they correspond to.
>> Thank you for the suggestion. I think we may need something like this.
>> Babu, what do you think?
>
> I'm not quite clear on this. Do we know what 'foo' and 'bar' refer to?
> Is it random text?
Not random text. It refers to the events.
I do not know if braces are what will be settled on, but a slight change to the
example to make it match your series could be:
# echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
With syntax like the above there is no concern that we will run out of
flags, and the assigned events are clear without needing to map one-letter
flags back to event names. For a system with a lot of events and domains
this will become quite a lot to parse though.
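To make the concern concrete, even the default group's entry on a
four-domain system with only these two events would already read
something like:

//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes}

and that is before adding more groups or events.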
>
> In his example from
> https://lore.kernel.org/lkml/Z643WdXYARTADSBy@e133380.arm.com/
> --------------------------------------------------------------
> The numbers are not supposed to have any hardware significance.
>
> '//0=6'
>
> just "means assign some unused counter for domain 0, and create files
> in resctrl so I can configure and read it".
Thanks for pointing this out. I missed that the idea was that the
configuration files are dynamically created.
>
> The "6" is really just a tag for labelling the resulting resctrl
> file names so that the user can tell them apart. It's not supposed
> to imply any specific hardware counter or event.
Right.
> ------------------------------------------------------------------
>
> It seems that 'foo' and 'bar' are tags used to create files in /sys/fs/resctrl/info/L3_MON/.
>
> Given that, it looks like we're discussing entirely different things.
I am still trying to understand how MPAM counters can be supported.
Reinette
Hi Reinette,

On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
>
> Hi Babu,
>
> On 2/14/25 10:31 AM, Moger, Babu wrote:

[...]

> As mentioned above, one possible issue with the existing interface is that
> it is limited to 26 events (assuming only lower case letters are used). The limit
> is low enough to be of concern.

The events which can be monitored by a single counter on ABMC and MPAM
so far are combinable, so 26 counters per group today means it limits
breaking down MBM traffic for each group 26 ways. If a user complained
that a 26-way breakdown of a group's MBM traffic was limiting their
investigation, I would question whether they know what they're looking
for.

-Peter
Hi Peter,

On 2/17/25 2:26 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:

[...]

> The events which can be monitored by a single counter on ABMC and MPAM
> so far are combinable, so 26 counters per group today means it limits
> breaking down MBM traffic for each group 26 ways. If a user complained
> that a 26-way breakdown of a group's MBM traffic was limiting their
> investigation, I would question whether they know what they're looking
> for.

The key here is "so far" as well as the focus on MBM only.

It is impossible for me to predict what we will see in a couple of years
from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on the resctrl
interface to support their users. Just looking at the Intel RDT spec, the
event register has space for 32 events for each "CPU agent" resource. That
does not take into account the "non-CPU agents" that are enumerated via
ACPI. Tony already mentioned that he is working on patches [1] that will
add new events and shared the idea that we may be trending towards
supporting "perf"-like events associated with an RMID. I expect AMD PQoS
and Arm MPAM to provide related enhancements to support their customers.
This all makes me think that resctrl should be ready to support more
events than 26.

My goal is for resctrl to have a user interface that can as much as
possible be ready for whatever may be required from it years down the
line. Of course, I may be wrong and resctrl would never need to support
more than 26 events per resource (*). The risk is that resctrl *may* need
to support more than 26 events, and how could resctrl support that?

What is the risk of supporting more than 26 events? As I highlighted
earlier, the interface I used as demonstration may become unwieldy to
parse on a system with many domains that supports many events. This is a
concern for me. Any suggestions will be appreciated, especially from you
since I know that you are very familiar with issues related to large scale
use of resctrl interfaces.

Reinette

[1] https://lore.kernel.org/lkml/SJ1PR11MB6083759CCE59FF2FE931471EFCFF2@SJ1PR11MB6083.namprd11.prod.outlook.com/

(*) There is also the scenario where, combined between resources, there
may be more than 26 events supported, which will require the same
one-letter flag to be used for different events of different resources.
This may potentially be confusing.
Hi Reinette,

On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre <reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 2/17/25 2:26 AM, Peter Newman wrote:

[...]

> This all makes me think that resctrl should be ready to support more
> events than 26.

I was thinking of the letters as representing a reusable, user-defined
event-set for applying to a single counter rather than as individual
events, since MPAM and ABMC allow us to choose the set of events each
one counts. Wherever we define the letters, we could use more symbolic
event names.

In the letters-as-events model, choosing the events assigned to a
group wouldn't be enough information, since we would want to control
which events should share a counter and which should be counted by
separate counters. I think the amount of information that would need
to be encoded into mbm_assign_control to represent the level of
configurability supported by hardware would quickly get out of hand.

Maybe as an example, one counter for all reads, one counter for all
writes in ABMC would look like... (L3_QOS_ABMC_CFG.BwType field names
below)

(per domain)
group 0:
  counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 1: VictimBW,LclNTWr,RmtNTWr
group 1:
  counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
  counter 3: VictimBW,LclNTWr,RmtNTWr
...

I assume packing all of this info for a group's desired counter
configuration into a single line (with 32 domains per line on many
dual-socket AMD configurations I see) would be difficult to look at,
even if we could settle on a single letter to represent each
universally.

> My goal is for resctrl to have a user interface that can as much as
> possible be ready for whatever may be required from it years down the
> line.

[...]

> What is the risk of supporting more than 26 events? As I highlighted
> earlier, the interface I used as demonstration may become unwieldy to
> parse on a system with many domains that supports many events. This is a
> concern for me. Any suggestions will be appreciated, especially from you
> since I know that you are very familiar with issues related to large scale
> use of resctrl interfaces.

It's mainly just the unwieldiness of all the information in one file.
It's already at the limit of what I can visually look through.

I believe that shared assignments will take care of all the
high-frequency and performance-intensive batch configuration updates I
was originally concerned about, so I no longer see much benefit in
finding ways to textually encode all this information in a single file
when it would be more manageable to distribute it around the filesystem
hierarchy.

-Peter
Hi Peter, On 2/19/25 3:28 AM, Peter Newman wrote: > Hi Reinette, > > On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> >> Hi Peter, >> >> On 2/17/25 2:26 AM, Peter Newman wrote: >>> Hi Reinette, >>> >>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre >>> <reinette.chatre@intel.com> wrote: >>>> >>>> Hi Babu, >>>> >>>> On 2/14/25 10:31 AM, Moger, Babu wrote: >>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: >>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: >>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: >>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: >>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: >>>> >>>> (quoting relevant parts with goal to focus discussion on new possible syntax) >>>> >>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. >>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. >>>>>>>> Please help me understand if you see it differently. >>>>>>>> >>>>>>>> Doing so would need to come up with alphabetical letters for these events, >>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: >>>>>>>> >>>>>>>> mbm_local_read_bytes a >>>>>>>> mbm_local_write_bytes b >>>>>>>> >>>>>>>> Then mbm_assign_control can be used as: >>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control >>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes >>>>>>>> <value> >>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> >>>>>>>> >>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), >>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined >>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? >>>> >>>> As mentioned above, one possible issue with existing interface is that >>>> it is limited to 26 events (assuming only lower case letters are used). The limit >>>> is low enough to be of concern. >>> >>> The events which can be monitored by a single counter on ABMC and MPAM >>> so far are combinable, so 26 counters per group today means it limits >>> breaking down MBM traffic for each group 26 ways. If a user complained >>> that a 26-way breakdown of a group's MBM traffic was limiting their >>> investigation, I would question whether they know what they're looking >>> for. >> >> The key here is "so far" as well as the focus on MBM only. >> >> It is impossible for me to predict what we will see in a couple of years >> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface >> to support their users. Just looking at the Intel RDT spec the event register >> has space for 32 events for each "CPU agent" resource. That does not take into >> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned >> that he is working on patches [1] that will add new events and shared the idea >> that we may be trending to support "perf" like events associated with RMID. I >> expect AMD PQoS and Arm MPAM to provide related enhancements to support their >> customers. >> This all makes me think that resctrl should be ready to support more events than 26. 
> > I was thinking of the letters as representing a reusable, user-defined > event-set for applying to a single counter rather than as individual > events, since MPAM and ABMC allow us to choose the set of events each > one counts. Wherever we define the letters, we could use more symbolic > event names. Thank you for clarifying. > > In the letters as events model, choosing the events assigned to a > group wouldn't be enough information, since we would want to control > which events should share a counter and which should be counted by > separate counters. I think the amount of information that would need > to be encoded into mbm_assign_control to represent the level of > configurability supported by hardware would quickly get out of hand. > > Maybe as an example, one counter for all reads, one counter for all > writes in ABMC would look like... > > (L3_QOS_ABMC_CFG.BwType field names below) > > (per domain) > group 0: > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 1: VictimBW,LclNTWr,RmtNTWr > group 1: > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 3: VictimBW,LclNTWr,RmtNTWr > ... > I think this may also be what Dave was heading towards in [2] but in that example and above the counter configuration appears to be global. You do mention "configurability supported by hardware" so I wonder if per-domain counter configuration is a requirement? Until now I viewed counter configuration separate from counter assignment, similar to how AMD's counters can be configured via mbm_total_bytes_config and mbm_local_bytes_config before they are assigned. That is still per-domain counter configuration though, not per-counter. > I assume packing all of this info for a group's desired counter > configuration into a single line (with 32 domains per line on many > dual-socket AMD configurations I see) would be difficult to look at, > even if we could settle on a single letter to represent each > universally. > >> >> My goal is for resctrl to have a user interface that can as much as possible >> be ready for whatever may be required from it years down the line. Of course, >> I may be wrong and resctrl would never need to support more than 26 events per >> resource (*). The risk is that resctrl *may* need to support more than 26 events >> and how could resctrl support that? >> >> What is the risk of supporting more than 26 events? As I highlighted earlier >> the interface I used as demonstration may become unwieldy to parse on a system >> with many domains that supports many events. This is a concern for me. Any suggestions >> will be appreciated, especially from you since I know that you are very familiar with >> issues related to large scale use of resctrl interfaces. > > It's mainly just the unwieldiness of all the information in one file. > It's already at the limit of what I can visually look through. I agree. > > I believe that shared assignments will take care of all the > high-frequency and performance-intensive batch configuration updates I > was originally concerned about, so I no longer see much benefit in > finding ways to textually encode all this information in a single file > when it would be more manageable to distribute it around the > filesystem hierarchy. This is significant. The motivation for the single file was to support the "high-frequency and performance-intensive" usage. Would "shared assignments" not also depend on the same files that, if distributed, will require many filesystem operations? 
Having the files distributed will be significantly simpler while also avoiding the file size issue that Dave Martin exposed. Reinette >> [1] https://lore.kernel.org/lkml/SJ1PR11MB6083759CCE59FF2FE931471EFCFF2@SJ1PR11MB6083.namprd11.prod.outlook.com/ >> >> (*) There is also the scenario where combined between resources there may be >> more than 26 events supported that will require the same one letter flag to be >> used for different events of different resources. This may potentially be >> confusing. [2] https://lore.kernel.org/lkml/Z6zeXby8ajh0ax6i@e133380.arm.com/
Hi, On Wed, Feb 19, 2025 at 09:56:29AM -0800, Reinette Chatre wrote: > Hi Peter, > > On 2/19/25 3:28 AM, Peter Newman wrote: [...] > > In the letters as events model, choosing the events assigned to a > > group wouldn't be enough information, since we would want to control > > which events should share a counter and which should be counted by > > separate counters. I think the amount of information that would need > > to be encoded into mbm_assign_control to represent the level of > > configurability supported by hardware would quickly get out of hand. > > > > Maybe as an example, one counter for all reads, one counter for all > > writes in ABMC would look like... > > > > (L3_QOS_ABMC_CFG.BwType field names below) > > > > (per domain) > > group 0: > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > > counter 1: VictimBW,LclNTWr,RmtNTWr > > group 1: > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > > counter 3: VictimBW,LclNTWr,RmtNTWr > > ... > > > > I think this may also be what Dave was heading towards in [2] but in that > example and above the counter configuration appears to be global. You do mention > "configurability supported by hardware" so I wonder if per-domain counter > configuration is a requirement? > > Until now I viewed counter configuration separate from counter assignment, > similar to how AMD's counters can be configured via mbm_total_bytes_config and > mbm_local_bytes_config before they are assigned. That is still per-domain > counter configuration though, not per-counter. I hadn't tried to work the design through in any detail: it wasn't intended as a suggestion for something we should definitely do right now; rather, it was just an incomplete sketch of one possible future evolution of the interface. Either way these feel like future concerns, if the first iteration of ABMC is just to provide the basics so that ABMC hardware can implement resctrl without userspace seeing counters randomly stopping and resetting... Peter, can you give a view on whether the ABMC as proposed in this series is a useful stepping-stone? Or are there things that you need that you feel could not be added as a later extension without ABI breakage? [...] > > I believe that shared assignments will take care of all the > > high-frequency and performance-intensive batch configuration updates I > > was originally concerned about, so I no longer see much benefit in > > finding ways to textually encode all this information in a single file > > when it would be more manageable to distribute it around the > > filesystem hierarchy. > > This is significant. The motivation for the single file was to support > the "high-frequency and performance-intensive" usage. Would "shared assignments" > not also depend on the same files that, if distributed, will require many > filesystem operations? > Having the files distributed will be significantly simpler while also > avoiding the file size issue that Dave Martin exposed. > > Reinette I still haven't fully understood the "shared assignments" proposal; I need to go back and look at it. If we split the file, it will be more closely aligned with the design of the rest of the resctrlfs interface. OTOH, the current interface seems workable and I think the file size issue can be addressed without major re-engineering. So, from my side, I would not consider the current interface design a blocker. [...] Cheers ---Dave
Hi again, On Thu, Feb 20, 2025 at 04:46:40PM +0000, Dave Martin wrote: > Hi, > > On Wed, Feb 19, 2025 at 09:56:29AM -0800, Reinette Chatre wrote: > > Hi Peter, > > > > On 2/19/25 3:28 AM, Peter Newman wrote: > > [...] > > > > In the letters as events model, choosing the events assigned to a > > > group wouldn't be enough information, since we would want to control > > > which events should share a counter and which should be counted by > > > separate counters. I think the amount of information that would need > > > to be encoded into mbm_assign_control to represent the level of > > > configurability supported by hardware would quickly get out of hand. > > > > > > Maybe as an example, one counter for all reads, one counter for all > > > writes in ABMC would look like... > > > > > > (L3_QOS_ABMC_CFG.BwType field names below) > > > > > > (per domain) > > > group 0: > > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > > > counter 1: VictimBW,LclNTWr,RmtNTWr > > > group 1: > > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > > > counter 3: VictimBW,LclNTWr,RmtNTWr > > > ... > > > > > > > I think this may also be what Dave was heading towards in [2] but in that > > example and above the counter configuration appears to be global. You do mention > > "configurability supported by hardware" so I wonder if per-domain counter > > configuration is a requirement? > > > > Until now I viewed counter configuration separate from counter assignment, > > similar to how AMD's counters can be configured via mbm_total_bytes_config and > > mbm_local_bytes_config before they are assigned. That is still per-domain > > counter configuration though, not per-counter. > > I hadn't tried to work the design through in any detail: it wasn't > intended as a suggestion for something we should definitely do right > now; rather, it was just an incomplete sketch of one possible future > evolution of the interface. > > Either way these feel like future concerns, if the first iteration of > ABMC is just to provide the basics so that ABMC hardware can implement > resctrl without userspace seeing counters randomly stopping and > resetting... > > Peter, can you give a view on whether the ABMC as proposed in this series > is a useful stepping-stone? Or are there things that you need that you > feel could not be added as a later extension without ABI breakage? > > [...] > > > > I believe that shared assignments will take care of all the > > > high-frequency and performance-intensive batch configuration updates I > > > was originally concerned about, so I no longer see much benefit in > > > finding ways to textually encode all this information in a single file > > > when it would be more manageable to distribute it around the > > > filesystem hierarchy. > > > > This is significant. The motivation for the single file was to support > > the "high-frequency and performance-intensive" usage. Would "shared assignments" > > not also depend on the same files that, if distributed, will require many > > filesystem operations? > > Having the files distributed will be significantly simpler while also > > avoiding the file size issue that Dave Martin exposed. > > > > Reinette > > I still haven't fully understood the "shared assignments" proposal; > I need to go back and look at it. Having taken a quick look at that now, this all seems to duplicate perf's design journey (again). "rate" events make some sense. 
The perf equivalent is to keep an accumulated count of the amount of time a counter has been assigned to an event, and another accumulated count of the events counted by the counter during assignment. Only userspace knows what it wants to do with this information: perf exposes the raw accumulated counts. Perf events can be also pinned so that they are prioritised for assignment to counters; that sounds a lot like the regular, non-shared resctrl counters. Playing devil's advocate: It does feel like we are doomed to reinvent perf if we go too far down this road... > If we split the file, it will be more closely aligned with the design > of the rest of the resctrlfs interface. > > OTOH, the current interface seems workable and I think the file size > issue can be addressed without major re-engineering. > > So, from my side, I would not consider the current interface design > a blocker. ...so, drawing a hard line around the use cases that we intend to address with this interface and avoiding feature creep seems desirable. resctrlfs is already in the wild, so providing reasonable baseline compatiblity with that interface for ABMC hardware is a sensible goal. The current series does that. But I wonder how much additional functionality we should really be adding via the mbm_assign_control interface, once this series is settled. Cheers ---Dave
Hi Dave, On 2/20/25 9:46 AM, Dave Martin wrote: > Hi again, > > On Thu, Feb 20, 2025 at 04:46:40PM +0000, Dave Martin wrote: >> Hi, >> >> On Wed, Feb 19, 2025 at 09:56:29AM -0800, Reinette Chatre wrote: >>> Hi Peter, >>> >>> On 2/19/25 3:28 AM, Peter Newman wrote: >> >> [...] >> >>>> In the letters as events model, choosing the events assigned to a >>>> group wouldn't be enough information, since we would want to control >>>> which events should share a counter and which should be counted by >>>> separate counters. I think the amount of information that would need >>>> to be encoded into mbm_assign_control to represent the level of >>>> configurability supported by hardware would quickly get out of hand. >>>> >>>> Maybe as an example, one counter for all reads, one counter for all >>>> writes in ABMC would look like... >>>> >>>> (L3_QOS_ABMC_CFG.BwType field names below) >>>> >>>> (per domain) >>>> group 0: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> group 1: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> ... >>>> >>> >>> I think this may also be what Dave was heading towards in [2] but in that >>> example and above the counter configuration appears to be global. You do mention >>> "configurability supported by hardware" so I wonder if per-domain counter >>> configuration is a requirement? >>> >>> Until now I viewed counter configuration separate from counter assignment, >>> similar to how AMD's counters can be configured via mbm_total_bytes_config and >>> mbm_local_bytes_config before they are assigned. That is still per-domain >>> counter configuration though, not per-counter. >> >> I hadn't tried to work the design through in any detail: it wasn't >> intended as a suggestion for something we should definitely do right >> now; rather, it was just an incomplete sketch of one possible future >> evolution of the interface. >> >> Either way these feel like future concerns, if the first iteration of >> ABMC is just to provide the basics so that ABMC hardware can implement >> resctrl without userspace seeing counters randomly stopping and >> resetting... >> >> Peter, can you give a view on whether the ABMC as proposed in this series >> is a useful stepping-stone? Or are there things that you need that you >> feel could not be added as a later extension without ABI breakage? >> >> [...] >> >>>> I believe that shared assignments will take care of all the >>>> high-frequency and performance-intensive batch configuration updates I >>>> was originally concerned about, so I no longer see much benefit in >>>> finding ways to textually encode all this information in a single file >>>> when it would be more manageable to distribute it around the >>>> filesystem hierarchy. >>> >>> This is significant. The motivation for the single file was to support >>> the "high-frequency and performance-intensive" usage. Would "shared assignments" >>> not also depend on the same files that, if distributed, will require many >>> filesystem operations? >>> Having the files distributed will be significantly simpler while also >>> avoiding the file size issue that Dave Martin exposed. >>> >>> Reinette >> >> I still haven't fully understood the "shared assignments" proposal; >> I need to go back and look at it. > > Having taken a quick look at that now, this all seems to duplicate > perf's design journey (again). > > "rate" events make some sense. 
The perf equivalent is to keep an > accumulated count of the amount of time a counter has been assigned to > an event, and another accumulated count of the events counted by the > counter during assignment. Only userspace knows what it wants to do > with this information: perf exposes the raw accumulated counts. > > Perf events can also be pinned so that they are prioritised for > assignment to counters; that sounds a lot like the regular, non-shared > resctrl counters. > > > Playing devil's advocate: > > It does feel like we are doomed to reinvent perf if we go too far down > this road... > >> If we split the file, it will be more closely aligned with the design > >> of the rest of the resctrlfs interface. > >> > >> OTOH, the current interface seems workable and I think the file size > >> issue can be addressed without major re-engineering. > >> > >> So, from my side, I would not consider the current interface design > >> a blocker. > > > > ...so, drawing a hard line around the use cases that we intend to > > address with this interface and avoiding feature creep seems desirable. This is exactly what I am trying to do ... to understand what use cases the interface is expected to support. You have mentioned a couple of times now that this interface is sufficient but at the same time you hinted at some features from MPAM that I do not see possible to accommodate with this interface. > resctrlfs is already in the wild, so providing reasonable baseline > compatibility with that interface for ABMC hardware is a sensible goal. > The current series does that. > > But I wonder how much additional functionality we should really be > adding via the mbm_assign_control interface, once this series is > settled. Are you speculating that MPAM counters may not make use of this interface? Reinette
Hi Reinette, On Thu, Feb 20, 2025 at 10:36:18AM -0800, Reinette Chatre wrote: > Hi Dave, > > On 2/20/25 9:46 AM, Dave Martin wrote: > > Hi again, > > > > On Thu, Feb 20, 2025 at 04:46:40PM +0000, Dave Martin wrote: [...] > > Having taken a quick look at that now, this all seems to duplicate > > perf's design journey (again). > > > > "rate" events make some sense. The perf equivalent is to keep an > > accumulated count of the amount of time a counter has been assigned to > > an event, and another accumulated count of the events counted by the > > counter during assignment. Only userspace knows what it wants to do > > with this information: perf exposes the raw accumulated counts. > > > > Perf events can also be pinned so that they are prioritised for > > assignment to counters; that sounds a lot like the regular, non-shared > > resctrl counters. > > > > > > Playing devil's advocate: > > > > It does feel like we are doomed to reinvent perf if we go too far down > > this road... > > > >> If we split the file, it will be more closely aligned with the design > >> of the rest of the resctrlfs interface. > >> > >> OTOH, the current interface seems workable and I think the file size > >> issue can be addressed without major re-engineering. > >> > >> So, from my side, I would not consider the current interface design > >> a blocker. > > > > ...so, drawing a hard line around the use cases that we intend to > > address with this interface and avoiding feature creep seems desirable. > > This is exactly what I am trying to do ... to understand what use cases > the interface is expected to support. > > You have mentioned a couple of times now that this interface is sufficient but > at the same time you hinted at some features from MPAM that I do not see > possible to accommodate with this interface. It's kind of both. I think the interface is sufficient to be useful, and therefore has value. The problem being addressed here (shortage of counters) is fully relevant to MPAM (at least on some hardware). Any architecture may define new metrics and types of event that can be counted, and they're not going to match up exactly between arches -- so I don't think we can expect everything to fit perfectly within a generic interface. But having a generic interface is still useful for making common features convenient to use. So the interface is useful but not universal, but that doesn't feel like a bug. Hopefully that makes my position a bit clearer. > > resctrlfs is already in the wild, so providing reasonable baseline > > compatibility with that interface for ABMC hardware is a sensible goal. > > The current series does that. > > > > But I wonder how much additional functionality we should really be > > adding via the mbm_assign_control interface, once this series is > > settled. > > Are you speculating that MPAM counters may not make use of this interface? > > Reinette No, I think it makes sense for MPAM to follow this interface, at least as far as what has been proposed so far here. I think James got his updated rebase working. [1] perf support would be for the future if we do it, but the ABMC interface may be a useful starting point anyway, because it allows counters to be assigned explicitly -- that provides a natural way to hand over some counters to perf, either because that interface may be a more natural fit for what the user is trying to do, or perhaps to count weird, platform-specific event types that do not merit the effort of integration into resctrlfs proper. Does that make sense? 
Cheers ---Dave [1] https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/abmc/v11
Hi Dave, On 2/21/25 8:47 AM, Dave Martin wrote: > Hi Reinette, > > On Thu, Feb 20, 2025 at 10:36:18AM -0800, Reinette Chatre wrote: >> Hi Dave, >> >> On 2/20/25 9:46 AM, Dave Martin wrote: >>> Hi again, >>> >>> On Thu, Feb 20, 2025 at 04:46:40PM +0000, Dave Martin wrote: > > [...] > >>> Having taken a quick look at that now, this all seems to duplicate >>> perf's design journey (again). >>> >>> "rate" events make some sense. The perf equivalent is to keep an >>> accumulated count of the amount of time a counter has been assigned to >>> an event, and another accumulated count of the events counted by the >>> counter during assignment. Only userspace knows what it wants to do >>> with this information: perf exposes the raw accumulated counts. >>> >>> Perf events can also be pinned so that they are prioritised for >>> assignment to counters; that sounds a lot like the regular, non-shared >>> resctrl counters. >>> >>> >>> Playing devil's advocate: >>> >>> It does feel like we are doomed to reinvent perf if we go too far down >>> this road... >>> >>>> If we split the file, it will be more closely aligned with the design >>>> of the rest of the resctrlfs interface. >>>> >>>> OTOH, the current interface seems workable and I think the file size >>>> issue can be addressed without major re-engineering. >>>> >>>> So, from my side, I would not consider the current interface design >>>> a blocker. >>> >>> ...so, drawing a hard line around the use cases that we intend to >>> address with this interface and avoiding feature creep seems desirable. >> >> This is exactly what I am trying to do ... to understand what use cases >> the interface is expected to support. >> >> You have mentioned a couple of times now that this interface is sufficient but >> at the same time you hinted at some features from MPAM that I do not see >> possible to accommodate with this interface. > > It's kind of both. > > I think the interface is sufficient to be useful, and therefore has > value. > > The problem being addressed here (shortage of counters) is fully > relevant to MPAM (at least on some hardware). > > Any architecture may define new metrics and types of event that can be > counted, and they're not going to match up exactly between arches -- so > I don't think we can expect everything to fit perfectly within a > generic interface. But having a generic interface is still useful for > making common features convenient to use. > > So the interface is useful but not universal, but that doesn't feel > like a bug. > > Hopefully that makes my position a bit clearer. > >>> resctrlfs is already in the wild, so providing reasonable baseline >>> compatibility with that interface for ABMC hardware is a sensible goal. >>> The current series does that. >>> >>> But I wonder how much additional functionality we should really be >>> adding via the mbm_assign_control interface, once this series is >>> settled. >> >> Are you speculating that MPAM counters may not make use of this interface? >> >> Reinette > > No, I think it makes sense for MPAM to follow this interface, at least > as far as what has been proposed so far here. > > I think James got his updated rebase working. 
[1] > > > perf support would be for the future if we do it, but the ABMC > interface may be a useful starting point anyway, because it allows > counters to be assigned explicitly -- that provides a natural way to > hand over some counters to perf, either because that interface may be a > more natural fit for what the user is trying to do, or perhaps to count > weird, platform-specific event types that do not merit the effort of > integration into resctrlfs proper. > > Does that make sense? > This is reasonable. You did state earlier that we should aim to draw hard lines around the use cases we aim to address and I think one way this work is doing this is by being explicit in user interface that this is all about "memory bandwidth monitoring". This is not intended to be a fully generic interface for all possible counters for all possible resources. Apart from that time will tell how many blind spots there were while creating this interface. Thank you very much for all your very valuable insights. Reinette
Hi Reinette, On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > > Hi Peter, > > On 2/19/25 3:28 AM, Peter Newman wrote: > > Hi Reinette, > > > > On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre > > <reinette.chatre@intel.com> wrote: > >> > >> Hi Peter, > >> > >> On 2/17/25 2:26 AM, Peter Newman wrote: > >>> Hi Reinette, > >>> > >>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre > >>> <reinette.chatre@intel.com> wrote: > >>>> > >>>> Hi Babu, > >>>> > >>>> On 2/14/25 10:31 AM, Moger, Babu wrote: > >>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: > >>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: > >>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: > >>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: > >>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: > >>>> > >>>> (quoting relevant parts with goal to focus discussion on new possible syntax) > >>>> > >>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. > >>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. > >>>>>>>> Please help me understand if you see it differently. > >>>>>>>> > >>>>>>>> Doing so would need to come up with alphabetical letters for these events, > >>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: > >>>>>>>> > >>>>>>>> mbm_local_read_bytes a > >>>>>>>> mbm_local_write_bytes b > >>>>>>>> > >>>>>>>> Then mbm_assign_control can be used as: > >>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control > >>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes > >>>>>>>> <value> > >>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > >>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> > >>>>>>>> > >>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), > >>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined > >>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? > >>>> > >>>> As mentioned above, one possible issue with existing interface is that > >>>> it is limited to 26 events (assuming only lower case letters are used). The limit > >>>> is low enough to be of concern. > >>> > >>> The events which can be monitored by a single counter on ABMC and MPAM > >>> so far are combinable, so 26 counters per group today means it limits > >>> breaking down MBM traffic for each group 26 ways. If a user complained > >>> that a 26-way breakdown of a group's MBM traffic was limiting their > >>> investigation, I would question whether they know what they're looking > >>> for. > >> > >> The key here is "so far" as well as the focus on MBM only. > >> > >> It is impossible for me to predict what we will see in a couple of years > >> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface > >> to support their users. Just looking at the Intel RDT spec the event register > >> has space for 32 events for each "CPU agent" resource. That does not take into > >> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned > >> that he is working on patches [1] that will add new events and shared the idea > >> that we may be trending to support "perf" like events associated with RMID. I > >> expect AMD PQoS and Arm MPAM to provide related enhancements to support their > >> customers. 
> >> This all makes me think that resctrl should be ready to support more events than 26. > > > > I was thinking of the letters as representing a reusable, user-defined > > event-set for applying to a single counter rather than as individual > > events, since MPAM and ABMC allow us to choose the set of events each > > one counts. Wherever we define the letters, we could use more symbolic > > event names. > > Thank you for clarifying. > > > > > In the letters as events model, choosing the events assigned to a > > group wouldn't be enough information, since we would want to control > > which events should share a counter and which should be counted by > > separate counters. I think the amount of information that would need > > to be encoded into mbm_assign_control to represent the level of > > configurability supported by hardware would quickly get out of hand. > > > > Maybe as an example, one counter for all reads, one counter for all > > writes in ABMC would look like... > > > > (L3_QOS_ABMC_CFG.BwType field names below) > > > > (per domain) > > group 0: > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > > counter 1: VictimBW,LclNTWr,RmtNTWr > > group 1: > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > > counter 3: VictimBW,LclNTWr,RmtNTWr > > ... > > > > I think this may also be what Dave was heading towards in [2] but in that > example and above the counter configuration appears to be global. You do mention > "configurability supported by hardware" so I wonder if per-domain counter > configuration is a requirement? If it's global and we want a particular group to be watched by more counters, I wouldn't want this to result in allocating more counters for that group in all domains, or allocating counters in domains where they're not needed. I want to encourage my users to avoid allocating monitoring resources in domains where a job is not allowed to run so there's less pressure on the counters. In Dave's proposal it looks like global configuration means globally-defined "named counter configurations", which works because it's really per-domain assignment of the configurations to however many counters the group needs in each domain. > > Until now I viewed counter configuration separate from counter assignment, > similar to how AMD's counters can be configured via mbm_total_bytes_config and > mbm_local_bytes_config before they are assigned. That is still per-domain > counter configuration though, not per-counter. > > > I assume packing all of this info for a group's desired counter > > configuration into a single line (with 32 domains per line on many > > dual-socket AMD configurations I see) would be difficult to look at, > > even if we could settle on a single letter to represent each > > universally. > > > >> > >> My goal is for resctrl to have a user interface that can as much as possible > >> be ready for whatever may be required from it years down the line. Of course, > >> I may be wrong and resctrl would never need to support more than 26 events per > >> resource (*). The risk is that resctrl *may* need to support more than 26 events > >> and how could resctrl support that? > >> > >> What is the risk of supporting more than 26 events? As I highlighted earlier > >> the interface I used as demonstration may become unwieldy to parse on a system > >> with many domains that supports many events. This is a concern for me. 
Any suggestions > >> will be appreciated, especially from you since I know that you are very familiar with > >> issues related to large scale use of resctrl interfaces. > > > > It's mainly just the unwieldiness of all the information in one file. > > It's already at the limit of what I can visually look through. > > I agree. > > > > > I believe that shared assignments will take care of all the > > high-frequency and performance-intensive batch configuration updates I > > was originally concerned about, so I no longer see much benefit in > > finding ways to textually encode all this information in a single file > > when it would be more manageable to distribute it around the > > filesystem hierarchy. > > This is significant. The motivation for the single file was to support > the "high-frequency and performance-intensive" usage. Would "shared assignments" > not also depend on the same files that, if distributed, will require many > filesystem operations? > Having the files distributed will be significantly simpler while also > avoiding the file size issue that Dave Martin exposed. The remaining filesystem operations will be assigning or removing shared counter assignments in the applicable domains, which would normally correspond to mkdir/rmdir of groups or changing their CPU affinity. The shared assignments are more "program and forget", while the exclusive assignment approach requires updates for every counter (in every domain) every few seconds to cover a large number of groups. When they want to pay extra attention to a particular group, I expect they'll ask for exclusive counters and leave them assigned for a while as they collect extra data. -Peter
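As a concrete sketch of the churn Peter is contrasting this with (illustrative only: the group names and the 5 second period are invented, and 't' and '_' are assumed to be the "total event" and "no counter" flags of this series' mbm_assign_control syntax), time-multiplexing one exclusive counter per domain across two groups means a monitoring agent keeps looping over writes such as:

  while true; do
      echo '/job_a/0=_' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
      echo '/job_b/0=t' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
      sleep 5
      echo '/job_b/0=_' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
      echo '/job_a/0=t' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
      sleep 5
  done

scaled up to every domain and every group under observation, whereas a shared assignment (whatever its final syntax) would be written once when the group is created or its CPU affinity changes.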
Hi Peter, On 2/20/25 6:53 AM, Peter Newman wrote: > Hi Reinette, > > On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> >> Hi Peter, >> >> On 2/19/25 3:28 AM, Peter Newman wrote: >>> Hi Reinette, >>> >>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre >>> <reinette.chatre@intel.com> wrote: >>>> >>>> Hi Peter, >>>> >>>> On 2/17/25 2:26 AM, Peter Newman wrote: >>>>> Hi Reinette, >>>>> >>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre >>>>> <reinette.chatre@intel.com> wrote: >>>>>> >>>>>> Hi Babu, >>>>>> >>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote: >>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: >>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: >>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: >>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: >>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: >>>>>> >>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax) >>>>>> >>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. >>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. >>>>>>>>>> Please help me understand if you see it differently. >>>>>>>>>> >>>>>>>>>> Doing so would need to come up with alphabetical letters for these events, >>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: >>>>>>>>>> >>>>>>>>>> mbm_local_read_bytes a >>>>>>>>>> mbm_local_write_bytes b >>>>>>>>>> >>>>>>>>>> Then mbm_assign_control can be used as: >>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control >>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes >>>>>>>>>> <value> >>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> >>>>>>>>>> >>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), >>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined >>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? >>>>>> >>>>>> As mentioned above, one possible issue with existing interface is that >>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit >>>>>> is low enough to be of concern. >>>>> >>>>> The events which can be monitored by a single counter on ABMC and MPAM >>>>> so far are combinable, so 26 counters per group today means it limits >>>>> breaking down MBM traffic for each group 26 ways. If a user complained >>>>> that a 26-way breakdown of a group's MBM traffic was limiting their >>>>> investigation, I would question whether they know what they're looking >>>>> for. >>>> >>>> The key here is "so far" as well as the focus on MBM only. >>>> >>>> It is impossible for me to predict what we will see in a couple of years >>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface >>>> to support their users. Just looking at the Intel RDT spec the event register >>>> has space for 32 events for each "CPU agent" resource. That does not take into >>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned >>>> that he is working on patches [1] that will add new events and shared the idea >>>> that we may be trending to support "perf" like events associated with RMID. 
I >>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their >>>> customers. >>>> This all makes me think that resctrl should be ready to support more events than 26. >>> >>> I was thinking of the letters as representing a reusable, user-defined >>> event-set for applying to a single counter rather than as individual >>> events, since MPAM and ABMC allow us to choose the set of events each >>> one counts. Wherever we define the letters, we could use more symbolic >>> event names. >> >> Thank you for clarifying. >> >>> >>> In the letters as events model, choosing the events assigned to a >>> group wouldn't be enough information, since we would want to control >>> which events should share a counter and which should be counted by >>> separate counters. I think the amount of information that would need >>> to be encoded into mbm_assign_control to represent the level of >>> configurability supported by hardware would quickly get out of hand. >>> >>> Maybe as an example, one counter for all reads, one counter for all >>> writes in ABMC would look like... >>> >>> (L3_QOS_ABMC_CFG.BwType field names below) >>> >>> (per domain) >>> group 0: >>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>> counter 1: VictimBW,LclNTWr,RmtNTWr >>> group 1: >>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>> counter 3: VictimBW,LclNTWr,RmtNTWr >>> ... >>> >> >> I think this may also be what Dave was heading towards in [2] but in that >> example and above the counter configuration appears to be global. You do mention >> "configurability supported by hardware" so I wonder if per-domain counter >> configuration is a requirement? > > If it's global and we want a particular group to be watched by more > counters, I wouldn't want this to result in allocating more counters > for that group in all domains, or allocating counters in domains where > they're not needed. I want to encourage my users to avoid allocating > monitoring resources in domains where a job is not allowed to run so > there's less pressure on the counters. > > In Dave's proposal it looks like global configuration means > globally-defined "named counter configurations", which works because > it's really per-domain assignment of the configurations to however > many counters the group needs in each domain. I think I am becoming lost. Would a global configuration not break your view of "event-set applied to a single counter"? If a counter is configured globally then it would not make it possible to support the full configurability of the hardware. Before I add more confusion, let me try with an example that builds on your earlier example copied below: >>> (per domain) >>> group 0: >>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>> counter 1: VictimBW,LclNTWr,RmtNTWr >>> group 1: >>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>> counter 3: VictimBW,LclNTWr,RmtNTWr >>> ... Since the above states "per domain" I rewrite the example to highlight that as I understand it: group 0: domain 0: counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 1: VictimBW,LclNTWr,RmtNTWr domain 1: counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 1: VictimBW,LclNTWr,RmtNTWr group 1: domain 0: counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 3: VictimBW,LclNTWr,RmtNTWr domain 1: counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 3: VictimBW,LclNTWr,RmtNTWr You mention that you do not want counters to be allocated in domains that they are not needed in. 
So, let's say group 0 does not need counter 0 and counter 1 in domain 1, resulting in: group 0: domain 0: counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 1: VictimBW,LclNTWr,RmtNTWr group 1: domain 0: counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 3: VictimBW,LclNTWr,RmtNTWr domain 1: counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 3: VictimBW,LclNTWr,RmtNTWr With counter 0 and counter 1 available in domain 1, these counters could theoretically be configured to give group 1 more data in domain 1: group 0: domain 0: counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 1: VictimBW,LclNTWr,RmtNTWr group 1: domain 0: counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill counter 3: VictimBW,LclNTWr,RmtNTWr domain 1: counter 0: LclFill,RmtFill counter 1: LclNTWr,RmtNTWr counter 2: LclSlowFill,RmtSlowFill counter 3: VictimBW The counters are shown with different per-domain configurations that seem to match the earlier goals of (a) choose events counted by each counter and (b) do not allocate counters in domains where they are not needed. As I understand it, the above does contradict global counter configuration though. Or do you mean that only the *name* of the counter is global and then that it is reconfigured as part of every assignment? >> Until now I viewed counter configuration separate from counter assignment, >> similar to how AMD's counters can be configured via mbm_total_bytes_config and >> mbm_local_bytes_config before they are assigned. That is still per-domain >> counter configuration though, not per-counter. >> >>> I assume packing all of this info for a group's desired counter >>> configuration into a single line (with 32 domains per line on many >>> dual-socket AMD configurations I see) would be difficult to look at, >>> even if we could settle on a single letter to represent each >>> universally. >>> >>>> >>>> My goal is for resctrl to have a user interface that can as much as possible >>>> be ready for whatever may be required from it years down the line. Of course, >>>> I may be wrong and resctrl would never need to support more than 26 events per >>>> resource (*). The risk is that resctrl *may* need to support more than 26 events >>>> and how could resctrl support that? >>>> >>>> What is the risk of supporting more than 26 events? As I highlighted earlier >>>> the interface I used as demonstration may become unwieldy to parse on a system >>>> with many domains that supports many events. This is a concern for me. Any suggestions >>>> will be appreciated, especially from you since I know that you are very familiar with >>>> issues related to large scale use of resctrl interfaces. >>> >>> It's mainly just the unwieldiness of all the information in one file. >>> It's already at the limit of what I can visually look through. >> >> I agree. >> >>> >>> I believe that shared assignments will take care of all the >>> high-frequency and performance-intensive batch configuration updates I >>> was originally concerned about, so I no longer see much benefit in >>> finding ways to textually encode all this information in a single file >>> when it would be more manageable to distribute it around the >>> filesystem hierarchy. >> >> This is significant. The motivation for the single file was to support >> the "high-frequency and performance-intensive" usage. Would "shared assignments" >> not also depend on the same files that, if distributed, will require many >> filesystem operations? 
>> Having the files distributed will be significantly simpler while also >> avoiding the file size issue that Dave Martin exposed. > > The remaining filesystem operations will be assigning or removing > shared counter assignments in the applicable domains, which would > normally correspond to mkdir/rmdir of groups or changing their CPU > affinity. The shared assignments are more "program and forget", while > the exclusive assignment approach requires updates for every counter > (in every domain) every few seconds to cover a large number of groups. > > When they want to pay extra attention to a particular group, I expect > they'll ask for exclusive counters and leave them assigned for a while > as they collect extra data. The single file approach is already unwieldy. The demands that will be placed on it to support the usages currently being discussed would make this interface even harder to use and manage. If the single file is not required then I think we should go back to smaller files distributed in resctrl. This may not even be an either/or argument. One way to view mbm_assign_control could be as a way for the user to interact with the distributed counter-related files with a single file system operation. Although, without knowing how counter configuration is expected to work, this remains unclear. Reinette
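One way to picture Reinette's closing point (the per-group file below is purely hypothetical and is not part of this series): the same assignment state could in principle be reached either through distributed per-group files,

  # echo '0=t' > /sys/fs/resctrl/group1/mbm_assign          (hypothetical file)
  # echo '1=_' > /sys/fs/resctrl/group1/mbm_assign          (hypothetical file)

or through one batched write to the file proposed here,

  # echo '/group1/0=t;1=_' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control

with mbm_assign_control then acting as a bulk accessor over the same underlying per-group, per-domain state rather than as the only way to reach it.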
Hi Reinette, On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > > Hi Peter, > > On 2/20/25 6:53 AM, Peter Newman wrote: > > Hi Reinette, > > > > On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre > > <reinette.chatre@intel.com> wrote: > >> > >> Hi Peter, > >> > >> On 2/19/25 3:28 AM, Peter Newman wrote: > >>> Hi Reinette, > >>> > >>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre > >>> <reinette.chatre@intel.com> wrote: > >>>> > >>>> Hi Peter, > >>>> > >>>> On 2/17/25 2:26 AM, Peter Newman wrote: > >>>>> Hi Reinette, > >>>>> > >>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre > >>>>> <reinette.chatre@intel.com> wrote: > >>>>>> > >>>>>> Hi Babu, > >>>>>> > >>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote: > >>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: > >>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: > >>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: > >>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: > >>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: > >>>>>> > >>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax) > >>>>>> > >>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. > >>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. > >>>>>>>>>> Please help me understand if you see it differently. > >>>>>>>>>> > >>>>>>>>>> Doing so would need to come up with alphabetical letters for these events, > >>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: > >>>>>>>>>> > >>>>>>>>>> mbm_local_read_bytes a > >>>>>>>>>> mbm_local_write_bytes b > >>>>>>>>>> > >>>>>>>>>> Then mbm_assign_control can be used as: > >>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control > >>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes > >>>>>>>>>> <value> > >>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > >>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> > >>>>>>>>>> > >>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), > >>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined > >>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? > >>>>>> > >>>>>> As mentioned above, one possible issue with existing interface is that > >>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit > >>>>>> is low enough to be of concern. > >>>>> > >>>>> The events which can be monitored by a single counter on ABMC and MPAM > >>>>> so far are combinable, so 26 counters per group today means it limits > >>>>> breaking down MBM traffic for each group 26 ways. If a user complained > >>>>> that a 26-way breakdown of a group's MBM traffic was limiting their > >>>>> investigation, I would question whether they know what they're looking > >>>>> for. > >>>> > >>>> The key here is "so far" as well as the focus on MBM only. > >>>> > >>>> It is impossible for me to predict what we will see in a couple of years > >>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface > >>>> to support their users. Just looking at the Intel RDT spec the event register > >>>> has space for 32 events for each "CPU agent" resource. That does not take into > >>>> account the "non-CPU agents" that are enumerated via ACPI. 
Tony already mentioned > >>>> that he is working on patches [1] that will add new events and shared the idea > >>>> that we may be trending to support "perf" like events associated with RMID. I > >>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their > >>>> customers. > >>>> This all makes me think that resctrl should be ready to support more events than 26. > >>> > >>> I was thinking of the letters as representing a reusable, user-defined > >>> event-set for applying to a single counter rather than as individual > >>> events, since MPAM and ABMC allow us to choose the set of events each > >>> one counts. Wherever we define the letters, we could use more symbolic > >>> event names. > >> > >> Thank you for clarifying. > >> > >>> > >>> In the letters as events model, choosing the events assigned to a > >>> group wouldn't be enough information, since we would want to control > >>> which events should share a counter and which should be counted by > >>> separate counters. I think the amount of information that would need > >>> to be encoded into mbm_assign_control to represent the level of > >>> configurability supported by hardware would quickly get out of hand. > >>> > >>> Maybe as an example, one counter for all reads, one counter for all > >>> writes in ABMC would look like... > >>> > >>> (L3_QOS_ABMC_CFG.BwType field names below) > >>> > >>> (per domain) > >>> group 0: > >>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>> group 1: > >>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>> ... > >>> > >> > >> I think this may also be what Dave was heading towards in [2] but in that > >> example and above the counter configuration appears to be global. You do mention > >> "configurability supported by hardware" so I wonder if per-domain counter > >> configuration is a requirement? > > > > If it's global and we want a particular group to be watched by more > > counters, I wouldn't want this to result in allocating more counters > > for that group in all domains, or allocating counters in domains where > > they're not needed. I want to encourage my users to avoid allocating > > monitoring resources in domains where a job is not allowed to run so > > there's less pressure on the counters. > > > > In Dave's proposal it looks like global configuration means > > globally-defined "named counter configurations", which works because > > it's really per-domain assignment of the configurations to however > > many counters the group needs in each domain. > > I think I am becoming lost. Would a global configuration not break your > view of "event-set applied to a single counter"? If a counter is configured > globally then it would not make it possible to support the full configurability > of the hardware. > Before I add more confusion, let me try with an example that builds on your > earlier example copied below: > > >>> (per domain) > >>> group 0: > >>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>> group 1: > >>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>> ... 
> > Since the above states "per domain" I rewrite the example to highlight that as > I understand it: > > group 0: > domain 0: > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 1: VictimBW,LclNTWr,RmtNTWr > domain 1: > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 1: VictimBW,LclNTWr,RmtNTWr > group 1: > domain 0: > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 3: VictimBW,LclNTWr,RmtNTWr > domain 1: > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 3: VictimBW,LclNTWr,RmtNTWr > > You mention that you do not want counters to be allocated in domains that they > are not needed in. So, let's say group 0 does not need counter 0 and counter 1 > in domain 1, resulting in: > > group 0: > domain 0: > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 1: VictimBW,LclNTWr,RmtNTWr > group 1: > domain 0: > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 3: VictimBW,LclNTWr,RmtNTWr > domain 1: > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 3: VictimBW,LclNTWr,RmtNTWr > > With counter 0 and counter 1 available in domain 1, these counters could > theoretically be configured to give group 1 more data in domain 1: > > group 0: > domain 0: > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 1: VictimBW,LclNTWr,RmtNTWr > group 1: > domain 0: > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 3: VictimBW,LclNTWr,RmtNTWr > domain 1: > counter 0: LclFill,RmtFill > counter 1: LclNTWr,RmtNTWr > counter 2: LclSlowFill,RmtSlowFill > counter 3: VictimBW > > The counters are shown with different per-domain configurations that seems to > match with earlier goals of (a) choose events counted by each counter and > (b) do not allocate counters in domains where they are not needed. As I > understand the above does contradict global counter configuration though. > Or do you mean that only the *name* of the counter is global and then > that it is reconfigured as part of every assignment? Yes, I meant only the *name* is global. I assume based on a particular system configuration, the user will settle on a handful of useful groupings to count. Perhaps mbm_assign_control syntax is the clearest way to express an example... # define global configurations (in ABMC terms), not necessarily in this # syntax and probably not in the mbm_assign_control file. r=LclFill,RmtFill,LclSlowFill,RmtSlowFill w=VictimBW,LclNTWr,RmtNTWr # legacy "total" configuration, effectively r+w t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr /group0/0=t;1=t /group1/0=t;1=t /group2/0=_;1=t /group3/0=rw;1=_ - group2 is restricted to domain 0 - group3 is restricted to domain 1 - the rest are unrestricted - In group3, we decided we need to separate read and write traffic This consumes 4 counters in domain 0 and 3 counters in domain 1. > > >> Until now I viewed counter configuration separate from counter assignment, > >> similar to how AMD's counters can be configured via mbm_total_bytes_config and > >> mbm_local_bytes_config before they are assigned. That is still per-domain > >> counter configuration though, not per-counter. > >> > >>> I assume packing all of this info for a group's desired counter > >>> configuration into a single line (with 32 domains per line on many > >>> dual-socket AMD configurations I see) would be difficult to look at, > >>> even if we could settle on a single letter to represent each > >>> universally. 
> >>> > >>>> > >>>> My goal is for resctrl to have a user interface that can as much as possible > >>>> be ready for whatever may be required from it years down the line. Of course, > >>>> I may be wrong and resctrl would never need to support more than 26 events per > >>>> resource (*). The risk is that resctrl *may* need to support more than 26 events > >>>> and how could resctrl support that? > >>>> > >>>> What is the risk of supporting more than 26 events? As I highlighted earlier > >>>> the interface I used as demonstration may become unwieldy to parse on a system > >>>> with many domains that supports many events. This is a concern for me. Any suggestions > >>>> will be appreciated, especially from you since I know that you are very familiar with > >>>> issues related to large scale use of resctrl interfaces. > >>> > >>> It's mainly just the unwieldiness of all the information in one file. > >>> It's already at the limit of what I can visually look through. > >> > >> I agree. > >> > >>> > >>> I believe that shared assignments will take care of all the > >>> high-frequency and performance-intensive batch configuration updates I > >>> was originally concerned about, so I no longer see much benefit in > >>> finding ways to textually encode all this information in a single file > >>> when it would be more manageable to distribute it around the > >>> filesystem hierarchy. > >> > >> This is significant. The motivation for the single file was to support > >> the "high-frequency and performance-intensive" usage. Would "shared assignments" > >> not also depend on the same files that, if distributed, will require many > >> filesystem operations? > >> Having the files distributed will be significantly simpler while also > >> avoiding the file size issue that Dave Martin exposed. > > > > The remaining filesystem operations will be assigning or removing > > shared counter assignments in the applicable domains, which would > > normally correspond to mkdir/rmdir of groups or changing their CPU > > affinity. The shared assignments are more "program and forget", while > > the exclusive assignment approach requires updates for every counter > > (in every domain) every few seconds to cover a large number of groups. > > > > When they want to pay extra attention to a particular group, I expect > > they'll ask for exclusive counters and leave them assigned for a while > > as they collect extra data. > > The single file approach is already unwieldy. The demands that will be > placed on it to support the usages currently being discussed would make this > interface even harder to use and manage. If the single file is not required > then I think we should go back to smaller files distributed in resctrl. > This may not even be an either/or argument. One way to view mbm_assign_control > could be as a way for user to interact with the distributed counter > related files with a single file system operation. Although, without > knowing how counter configuration is expected to work this remains unclear. If we do both interfaces and the multi-file model gives us more capability to express configurations, we could find situations where there are configurations we cannot represent when reading back from mbm_assign_control, or updates through mbm_assign_control have ambiguous effects on existing configurations which were created with other files. 
However, the example I gave above seems to be adequately represented by a minor extension to mbm_assign_control and we all seem to understand it now, so maybe it's not broken yet. It's unfortunate that work went into a requirement that's no longer relevant, but I don't think that on its own is a blocker. -Peter
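For concreteness, a rough sketch (not part of the thread) of how the domain restrictions in the example above could be written with the assignment flags this series already defines, 't' for the total event and '_' for no assignment, reusing the path notation from Peter's example. The 'r'/'w' split in group3 would need the hypothetical named configurations and cannot be expressed with these flags:

# assign the total-event counter only in the domain where each group
# needs it; no counter is consumed in the other domain
# echo '/group2/0=_;1=t' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
# echo '/group3/0=t;1=_' > /sys/fs/resctrl/info/L3_MON/mbm_assign_control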
Hi Peter, On 2/21/25 5:12 AM, Peter Newman wrote: > On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> On 2/20/25 6:53 AM, Peter Newman wrote: >>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre >>> <reinette.chatre@intel.com> wrote: >>>> On 2/19/25 3:28 AM, Peter Newman wrote: >>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre >>>>> <reinette.chatre@intel.com> wrote: >>>>>> On 2/17/25 2:26 AM, Peter Newman wrote: >>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre >>>>>>> <reinette.chatre@intel.com> wrote: >>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote: >>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: >>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: >>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: >>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: >>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: >>>>>>>> >>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax) >>>>>>>> >>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. >>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. >>>>>>>>>>>> Please help me understand if you see it differently. >>>>>>>>>>>> >>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events, >>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: >>>>>>>>>>>> >>>>>>>>>>>> mbm_local_read_bytes a >>>>>>>>>>>> mbm_local_write_bytes b >>>>>>>>>>>> >>>>>>>>>>>> Then mbm_assign_control can be used as: >>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes >>>>>>>>>>>> <value> >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> >>>>>>>>>>>> >>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), >>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined >>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? >>>>>>>> >>>>>>>> As mentioned above, one possible issue with existing interface is that >>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit >>>>>>>> is low enough to be of concern. >>>>>>> >>>>>>> The events which can be monitored by a single counter on ABMC and MPAM >>>>>>> so far are combinable, so 26 counters per group today means it limits >>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained >>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their >>>>>>> investigation, I would question whether they know what they're looking >>>>>>> for. >>>>>> >>>>>> The key here is "so far" as well as the focus on MBM only. >>>>>> >>>>>> It is impossible for me to predict what we will see in a couple of years >>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface >>>>>> to support their users. Just looking at the Intel RDT spec the event register >>>>>> has space for 32 events for each "CPU agent" resource. That does not take into >>>>>> account the "non-CPU agents" that are enumerated via ACPI. 
Tony already mentioned >>>>>> that he is working on patches [1] that will add new events and shared the idea >>>>>> that we may be trending to support "perf" like events associated with RMID. I >>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their >>>>>> customers. >>>>>> This all makes me think that resctrl should be ready to support more events than 26. >>>>> >>>>> I was thinking of the letters as representing a reusable, user-defined >>>>> event-set for applying to a single counter rather than as individual >>>>> events, since MPAM and ABMC allow us to choose the set of events each >>>>> one counts. Wherever we define the letters, we could use more symbolic >>>>> event names. >>>> >>>> Thank you for clarifying. >>>> >>>>> >>>>> In the letters as events model, choosing the events assigned to a >>>>> group wouldn't be enough information, since we would want to control >>>>> which events should share a counter and which should be counted by >>>>> separate counters. I think the amount of information that would need >>>>> to be encoded into mbm_assign_control to represent the level of >>>>> configurability supported by hardware would quickly get out of hand. >>>>> >>>>> Maybe as an example, one counter for all reads, one counter for all >>>>> writes in ABMC would look like... >>>>> >>>>> (L3_QOS_ABMC_CFG.BwType field names below) >>>>> >>>>> (per domain) >>>>> group 0: >>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>> group 1: >>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>> ... >>>>> >>>> >>>> I think this may also be what Dave was heading towards in [2] but in that >>>> example and above the counter configuration appears to be global. You do mention >>>> "configurability supported by hardware" so I wonder if per-domain counter >>>> configuration is a requirement? >>> >>> If it's global and we want a particular group to be watched by more >>> counters, I wouldn't want this to result in allocating more counters >>> for that group in all domains, or allocating counters in domains where >>> they're not needed. I want to encourage my users to avoid allocating >>> monitoring resources in domains where a job is not allowed to run so >>> there's less pressure on the counters. >>> >>> In Dave's proposal it looks like global configuration means >>> globally-defined "named counter configurations", which works because >>> it's really per-domain assignment of the configurations to however >>> many counters the group needs in each domain. >> >> I think I am becoming lost. Would a global configuration not break your >> view of "event-set applied to a single counter"? If a counter is configured >> globally then it would not make it possible to support the full configurability >> of the hardware. >> Before I add more confusion, let me try with an example that builds on your >> earlier example copied below: >> >>>>> (per domain) >>>>> group 0: >>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>> group 1: >>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>> ... 
>> >> Since the above states "per domain" I rewrite the example to highlight that as >> I understand it: >> >> group 0: >> domain 0: >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 1: VictimBW,LclNTWr,RmtNTWr >> domain 1: >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 1: VictimBW,LclNTWr,RmtNTWr >> group 1: >> domain 0: >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 3: VictimBW,LclNTWr,RmtNTWr >> domain 1: >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 3: VictimBW,LclNTWr,RmtNTWr >> >> You mention that you do not want counters to be allocated in domains that they >> are not needed in. So, let's say group 0 does not need counter 0 and counter 1 >> in domain 1, resulting in: >> >> group 0: >> domain 0: >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 1: VictimBW,LclNTWr,RmtNTWr >> group 1: >> domain 0: >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 3: VictimBW,LclNTWr,RmtNTWr >> domain 1: >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 3: VictimBW,LclNTWr,RmtNTWr >> >> With counter 0 and counter 1 available in domain 1, these counters could >> theoretically be configured to give group 1 more data in domain 1: >> >> group 0: >> domain 0: >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 1: VictimBW,LclNTWr,RmtNTWr >> group 1: >> domain 0: >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >> counter 3: VictimBW,LclNTWr,RmtNTWr >> domain 1: >> counter 0: LclFill,RmtFill >> counter 1: LclNTWr,RmtNTWr >> counter 2: LclSlowFill,RmtSlowFill >> counter 3: VictimBW >> >> The counters are shown with different per-domain configurations that seems to >> match with earlier goals of (a) choose events counted by each counter and >> (b) do not allocate counters in domains where they are not needed. As I >> understand the above does contradict global counter configuration though. >> Or do you mean that only the *name* of the counter is global and then >> that it is reconfigured as part of every assignment? > > Yes, I meant only the *name* is global. I assume based on a particular > system configuration, the user will settle on a handful of useful > groupings to count. > > Perhaps mbm_assign_control syntax is the clearest way to express an example... > > # define global configurations (in ABMC terms), not necessarily in this > # syntax and probably not in the mbm_assign_control file. > > r=LclFill,RmtFill,LclSlowFill,RmtSlowFill > w=VictimBW,LclNTWr,RmtNTWr > > # legacy "total" configuration, effectively r+w > t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr > > /group0/0=t;1=t > /group1/0=t;1=t > /group2/0=_;1=t > /group3/0=rw;1=_ > > - group2 is restricted to domain 0 > - group3 is restricted to domain 1 > - the rest are unrestricted > - In group3, we decided we need to separate read and write traffic > > This consumes 4 counters in domain 0 and 3 counters in domain 1. > I see. Thank you for the example. resctrl supports per-domain configurations with the following possible when using mbm_total_bytes_config and mbm_local_bytes_config: t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr /group0/0=t;1=t /group1/0=t;1=t Even though the flags are identical in all domains, the assigned counters will be configured differently in each domain. 
With this supported by hardware and currently also supported by resctrl it seems reasonable to carry this forward to what will be supported next. >> >>>> Until now I viewed counter configuration separate from counter assignment, >>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and >>>> mbm_local_bytes_config before they are assigned. That is still per-domain >>>> counter configuration though, not per-counter. >>>> >>>>> I assume packing all of this info for a group's desired counter >>>>> configuration into a single line (with 32 domains per line on many >>>>> dual-socket AMD configurations I see) would be difficult to look at, >>>>> even if we could settle on a single letter to represent each >>>>> universally. >>>>> >>>>>> >>>>>> My goal is for resctrl to have a user interface that can as much as possible >>>>>> be ready for whatever may be required from it years down the line. Of course, >>>>>> I may be wrong and resctrl would never need to support more than 26 events per >>>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events >>>>>> and how could resctrl support that? >>>>>> >>>>>> What is the risk of supporting more than 26 events? As I highlighted earlier >>>>>> the interface I used as demonstration may become unwieldy to parse on a system >>>>>> with many domains that supports many events. This is a concern for me. Any suggestions >>>>>> will be appreciated, especially from you since I know that you are very familiar with >>>>>> issues related to large scale use of resctrl interfaces. >>>>> >>>>> It's mainly just the unwieldiness of all the information in one file. >>>>> It's already at the limit of what I can visually look through. >>>> >>>> I agree. >>>> >>>>> >>>>> I believe that shared assignments will take care of all the >>>>> high-frequency and performance-intensive batch configuration updates I >>>>> was originally concerned about, so I no longer see much benefit in >>>>> finding ways to textually encode all this information in a single file >>>>> when it would be more manageable to distribute it around the >>>>> filesystem hierarchy. >>>> >>>> This is significant. The motivation for the single file was to support >>>> the "high-frequency and performance-intensive" usage. Would "shared assignments" >>>> not also depend on the same files that, if distributed, will require many >>>> filesystem operations? >>>> Having the files distributed will be significantly simpler while also >>>> avoiding the file size issue that Dave Martin exposed. >>> >>> The remaining filesystem operations will be assigning or removing >>> shared counter assignments in the applicable domains, which would >>> normally correspond to mkdir/rmdir of groups or changing their CPU >>> affinity. The shared assignments are more "program and forget", while >>> the exclusive assignment approach requires updates for every counter >>> (in every domain) every few seconds to cover a large number of groups. >>> >>> When they want to pay extra attention to a particular group, I expect >>> they'll ask for exclusive counters and leave them assigned for a while >>> as they collect extra data. >> >> The single file approach is already unwieldy. The demands that will be >> placed on it to support the usages currently being discussed would make this >> interface even harder to use and manage. If the single file is not required >> then I think we should go back to smaller files distributed in resctrl. >> This may not even be an either/or argument. 
One way to view mbm_assign_control >> could be as a way for user to interact with the distributed counter >> related files with a single file system operation. Although, without >> knowing how counter configuration is expected to work this remains unclear. > > If we do both interfaces and the multi-file model gives us more > capability to express configurations, we could find situations where > there are configurations we cannot represent when reading back from > mbm_assign_control, or updates through mbm_assign_control have > ambiguous effects on existing configurations which were created with > other files. Right. My assumption was that the syntax would be identical. > > > > However, the example I gave above seems to be adequately represented > > by a minor extension to mbm_assign_control and we all seem to To confirm what you mean by "minor extension to mbm_assign_control", is this where the flags are associated with counter configurations? At this time this is done separately from mbm_assign_control with the hardcoded "t" and "l" flags configured via mbm_total_bytes_config and mbm_local_bytes_config respectively. I think it would be simpler to keep these configurations separate from mbm_assign_control. How it would look without better understanding of MPAM is not clear to me at this time, unless the requirement is to enhance support for ABMC and BMEC. I do see that this can be added later to build on what is supported by mbm_assign_control with the syntax in this version. > understand it now, so maybe it's not broken yet. It's unfortunate that > work went into a requirement that's no longer relevant, but I don't > think that on its own is a blocker. I understand that requirements may change as we get new information. Digesting it now is significantly easier than trying to adapt after the user interface is merged and essentially set in stone. Reinette
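A sketch (not from the thread) of how the per-domain 't' configuration in Reinette's example could be expressed through the existing BMEC files on a two-domain system. The bit values assume the BwType-to-bit mapping documented for mbm_total_bytes_config (bit 0 local reads, bit 1 remote reads, bits 2/3 non-temporal writes, bits 4/5 slow-memory reads, bit 6 dirty victims), so 0x7f selects all seven bandwidth types and 0x4f drops the two slow-memory fill types in domain 1:

# cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x7f
# reconfigure the "total" event in domain 1 only
# echo '1=0x4f' > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
# cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x4f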
Hi Reinette, On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre <reinette.chatre@intel.com> wrote: > > Hi Peter, > > On 2/21/25 5:12 AM, Peter Newman wrote: > > On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre > > <reinette.chatre@intel.com> wrote: > >> On 2/20/25 6:53 AM, Peter Newman wrote: > >>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre > >>> <reinette.chatre@intel.com> wrote: > >>>> On 2/19/25 3:28 AM, Peter Newman wrote: > >>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre > >>>>> <reinette.chatre@intel.com> wrote: > >>>>>> On 2/17/25 2:26 AM, Peter Newman wrote: > >>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre > >>>>>>> <reinette.chatre@intel.com> wrote: > >>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote: > >>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: > >>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: > >>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: > >>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: > >>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: > >>>>>>>> > >>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax) > >>>>>>>> > >>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. > >>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. > >>>>>>>>>>>> Please help me understand if you see it differently. > >>>>>>>>>>>> > >>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events, > >>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: > >>>>>>>>>>>> > >>>>>>>>>>>> mbm_local_read_bytes a > >>>>>>>>>>>> mbm_local_write_bytes b > >>>>>>>>>>>> > >>>>>>>>>>>> Then mbm_assign_control can be used as: > >>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control > >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes > >>>>>>>>>>>> <value> > >>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > >>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> > >>>>>>>>>>>> > >>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), > >>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined > >>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? > >>>>>>>> > >>>>>>>> As mentioned above, one possible issue with existing interface is that > >>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit > >>>>>>>> is low enough to be of concern. > >>>>>>> > >>>>>>> The events which can be monitored by a single counter on ABMC and MPAM > >>>>>>> so far are combinable, so 26 counters per group today means it limits > >>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained > >>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their > >>>>>>> investigation, I would question whether they know what they're looking > >>>>>>> for. > >>>>>> > >>>>>> The key here is "so far" as well as the focus on MBM only. > >>>>>> > >>>>>> It is impossible for me to predict what we will see in a couple of years > >>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface > >>>>>> to support their users. Just looking at the Intel RDT spec the event register > >>>>>> has space for 32 events for each "CPU agent" resource. 
That does not take into > >>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned > >>>>>> that he is working on patches [1] that will add new events and shared the idea > >>>>>> that we may be trending to support "perf" like events associated with RMID. I > >>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their > >>>>>> customers. > >>>>>> This all makes me think that resctrl should be ready to support more events than 26. > >>>>> > >>>>> I was thinking of the letters as representing a reusable, user-defined > >>>>> event-set for applying to a single counter rather than as individual > >>>>> events, since MPAM and ABMC allow us to choose the set of events each > >>>>> one counts. Wherever we define the letters, we could use more symbolic > >>>>> event names. > >>>> > >>>> Thank you for clarifying. > >>>> > >>>>> > >>>>> In the letters as events model, choosing the events assigned to a > >>>>> group wouldn't be enough information, since we would want to control > >>>>> which events should share a counter and which should be counted by > >>>>> separate counters. I think the amount of information that would need > >>>>> to be encoded into mbm_assign_control to represent the level of > >>>>> configurability supported by hardware would quickly get out of hand. > >>>>> > >>>>> Maybe as an example, one counter for all reads, one counter for all > >>>>> writes in ABMC would look like... > >>>>> > >>>>> (L3_QOS_ABMC_CFG.BwType field names below) > >>>>> > >>>>> (per domain) > >>>>> group 0: > >>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>>>> group 1: > >>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>>> ... > >>>>> > >>>> > >>>> I think this may also be what Dave was heading towards in [2] but in that > >>>> example and above the counter configuration appears to be global. You do mention > >>>> "configurability supported by hardware" so I wonder if per-domain counter > >>>> configuration is a requirement? > >>> > >>> If it's global and we want a particular group to be watched by more > >>> counters, I wouldn't want this to result in allocating more counters > >>> for that group in all domains, or allocating counters in domains where > >>> they're not needed. I want to encourage my users to avoid allocating > >>> monitoring resources in domains where a job is not allowed to run so > >>> there's less pressure on the counters. > >>> > >>> In Dave's proposal it looks like global configuration means > >>> globally-defined "named counter configurations", which works because > >>> it's really per-domain assignment of the configurations to however > >>> many counters the group needs in each domain. > >> > >> I think I am becoming lost. Would a global configuration not break your > >> view of "event-set applied to a single counter"? If a counter is configured > >> globally then it would not make it possible to support the full configurability > >> of the hardware. > >> Before I add more confusion, let me try with an example that builds on your > >> earlier example copied below: > >> > >>>>> (per domain) > >>>>> group 0: > >>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>>>> group 1: > >>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>>> ... 
> >> > >> Since the above states "per domain" I rewrite the example to highlight that as > >> I understand it: > >> > >> group 0: > >> domain 0: > >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 1: VictimBW,LclNTWr,RmtNTWr > >> domain 1: > >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 1: VictimBW,LclNTWr,RmtNTWr > >> group 1: > >> domain 0: > >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 3: VictimBW,LclNTWr,RmtNTWr > >> domain 1: > >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 3: VictimBW,LclNTWr,RmtNTWr > >> > >> You mention that you do not want counters to be allocated in domains that they > >> are not needed in. So, let's say group 0 does not need counter 0 and counter 1 > >> in domain 1, resulting in: > >> > >> group 0: > >> domain 0: > >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 1: VictimBW,LclNTWr,RmtNTWr > >> group 1: > >> domain 0: > >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 3: VictimBW,LclNTWr,RmtNTWr > >> domain 1: > >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 3: VictimBW,LclNTWr,RmtNTWr > >> > >> With counter 0 and counter 1 available in domain 1, these counters could > >> theoretically be configured to give group 1 more data in domain 1: > >> > >> group 0: > >> domain 0: > >> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 1: VictimBW,LclNTWr,RmtNTWr > >> group 1: > >> domain 0: > >> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >> counter 3: VictimBW,LclNTWr,RmtNTWr > >> domain 1: > >> counter 0: LclFill,RmtFill > >> counter 1: LclNTWr,RmtNTWr > >> counter 2: LclSlowFill,RmtSlowFill > >> counter 3: VictimBW > >> > >> The counters are shown with different per-domain configurations that seems to > >> match with earlier goals of (a) choose events counted by each counter and > >> (b) do not allocate counters in domains where they are not needed. As I > >> understand the above does contradict global counter configuration though. > >> Or do you mean that only the *name* of the counter is global and then > >> that it is reconfigured as part of every assignment? > > > > Yes, I meant only the *name* is global. I assume based on a particular > > system configuration, the user will settle on a handful of useful > > groupings to count. > > > > Perhaps mbm_assign_control syntax is the clearest way to express an example... > > > > # define global configurations (in ABMC terms), not necessarily in this > > # syntax and probably not in the mbm_assign_control file. > > > > r=LclFill,RmtFill,LclSlowFill,RmtSlowFill > > w=VictimBW,LclNTWr,RmtNTWr > > > > # legacy "total" configuration, effectively r+w > > t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr > > > > /group0/0=t;1=t > > /group1/0=t;1=t > > /group2/0=_;1=t > > /group3/0=rw;1=_ > > > > - group2 is restricted to domain 0 > > - group3 is restricted to domain 1 > > - the rest are unrestricted > > - In group3, we decided we need to separate read and write traffic > > > > This consumes 4 counters in domain 0 and 3 counters in domain 1. > > > > I see. Thank you for the example. 
> > resctrl supports per-domain configurations with the following possible when > using mbm_total_bytes_config and mbm_local_bytes_config: > > t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr > t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr > > /group0/0=t;1=t > /group1/0=t;1=t > > Even though the flags are identical in all domains, the assigned counters will > be configured differently in each domain. > > With this supported by hardware and currently also supported by resctrl it seems > reasonable to carry this forward to what will be supported next. The hardware supports both a per-domain mode, where all groups in a domain use the same configurations and are limited to two events per group and a per-group mode where every group can be configured and assigned freely. This series is using the legacy counter access mode where only counters whose BwType matches an instance of QOS_EVT_CFG_n in the domain can be read. If we chose to read the assigned counter directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC) rather than asking the hardware to find the counter by RMID, we would not be limited to 2 counters per group/domain and the hardware would have the same flexibility as on MPAM. (I might have said something confusing in my last messages because I had forgotten that I switched to the extended assignment mode when prototyping with soft-ABMC and MPAM.) Forcing all groups on a domain to share the same 2 counter configurations would not be acceptable for us, as the example I gave earlier is one I've already been asked about. I'm worried about requiring support for domain-level mbm_total_bytes_config and mbm_local_bytes_config files to be carried forward, because this conflicts with the configuration being per group/domain. (i.e., what would be read back from the domain files if per-group customizations have already been applied?) > > >> > >>>> Until now I viewed counter configuration separate from counter assignment, > >>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and > >>>> mbm_local_bytes_config before they are assigned. That is still per-domain > >>>> counter configuration though, not per-counter. > >>>> > >>>>> I assume packing all of this info for a group's desired counter > >>>>> configuration into a single line (with 32 domains per line on many > >>>>> dual-socket AMD configurations I see) would be difficult to look at, > >>>>> even if we could settle on a single letter to represent each > >>>>> universally. > >>>>> > >>>>>> > >>>>>> My goal is for resctrl to have a user interface that can as much as possible > >>>>>> be ready for whatever may be required from it years down the line. Of course, > >>>>>> I may be wrong and resctrl would never need to support more than 26 events per > >>>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events > >>>>>> and how could resctrl support that? > >>>>>> > >>>>>> What is the risk of supporting more than 26 events? As I highlighted earlier > >>>>>> the interface I used as demonstration may become unwieldy to parse on a system > >>>>>> with many domains that supports many events. This is a concern for me. Any suggestions > >>>>>> will be appreciated, especially from you since I know that you are very familiar with > >>>>>> issues related to large scale use of resctrl interfaces. > >>>>> > >>>>> It's mainly just the unwieldiness of all the information in one file. > >>>>> It's already at the limit of what I can visually look through. 
> >>>> > >>>> I agree. > >>>> > >>>>> > >>>>> I believe that shared assignments will take care of all the > >>>>> high-frequency and performance-intensive batch configuration updates I > >>>>> was originally concerned about, so I no longer see much benefit in > >>>>> finding ways to textually encode all this information in a single file > >>>>> when it would be more manageable to distribute it around the > >>>>> filesystem hierarchy. > >>>> > >>>> This is significant. The motivation for the single file was to support > >>>> the "high-frequency and performance-intensive" usage. Would "shared assignments" > >>>> not also depend on the same files that, if distributed, will require many > >>>> filesystem operations? > >>>> Having the files distributed will be significantly simpler while also > >>>> avoiding the file size issue that Dave Martin exposed. > >>> > >>> The remaining filesystem operations will be assigning or removing > >>> shared counter assignments in the applicable domains, which would > >>> normally correspond to mkdir/rmdir of groups or changing their CPU > >>> affinity. The shared assignments are more "program and forget", while > >>> the exclusive assignment approach requires updates for every counter > >>> (in every domain) every few seconds to cover a large number of groups. > >>> > >>> When they want to pay extra attention to a particular group, I expect > >>> they'll ask for exclusive counters and leave them assigned for a while > >>> as they collect extra data. > >> > >> The single file approach is already unwieldy. The demands that will be > >> placed on it to support the usages currently being discussed would make this > >> interface even harder to use and manage. If the single file is not required > >> then I think we should go back to smaller files distributed in resctrl. > >> This may not even be an either/or argument. One way to view mbm_assign_control > >> could be as a way for user to interact with the distributed counter > >> related files with a single file system operation. Although, without > >> knowing how counter configuration is expected to work this remains unclear. > > > > If we do both interfaces and the multi-file model gives us more > > capability to express configurations, we could find situations where > > there are configurations we cannot represent when reading back from > > mbm_assign_control, or updates through mbm_assign_control have > > ambiguous effects on existing configurations which were created with > > other files. > > Right. My assumption was that the syntax would be identical. > > > > > However, the example I gave above seems to be adequately represented > > by a minor extension to mbm_assign_control and we all seem to > > To confirm what you mean with "minor extension to mbm_assign_control", > is this where the flags are associated with counter configurations? At this > time this is done separately from mbm_assign_control with the hardcoded "t" > and "l" flags configured via mbm_total_bytes_config and mbm_local_bytes > respectively. I think it would be simpler to keep these configurations > separate from mbm_assign_control. How it would look without better > understanding of MPAM is not clear to me at this time, unless if the > requirement is to enhance support for ABMC and BMEC. I do see that > this can be added later to build on what is supported by mbm_assign_control > with the syntax in this version. As I explained above, I was looking at this from the perspective of the extended event assignment mode. Thanks, -Peter
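To make the extended-mode read Peter describes concrete, a rough sketch using msr-tools rather than anything this series exposes. The register numbers and field positions are assumptions taken from the APM description referenced later in the thread: QM_EVTSEL at MSR 0xc8d with EvtID in bits 7:0, ExtendedEvtID in bit 31 and the counter ID carried in the RMID field at bits 43:32, and QM_CTR at MSR 0xc8e:

# read assigned counter 2 directly on CPU 0
# (ExtendedEvtID=1, EvtID=1 i.e. L3CacheABMC, RMID field = counter ID)
# wrmsr -p 0 0xc8d 0x280000001
# rdmsr -p 0 0xc8e
# QM_CTR bit 63 flags an error, bit 62 "Unavailable"; the count is in bits 61:0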
Hi Peter, On 2/25/25 9:11 AM, Peter Newman wrote: > Hi Reinette, > > On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> >> Hi Peter, >> >> On 2/21/25 5:12 AM, Peter Newman wrote: >>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre >>> <reinette.chatre@intel.com> wrote: >>>> On 2/20/25 6:53 AM, Peter Newman wrote: >>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre >>>>> <reinette.chatre@intel.com> wrote: >>>>>> On 2/19/25 3:28 AM, Peter Newman wrote: >>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre >>>>>>> <reinette.chatre@intel.com> wrote: >>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote: >>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre >>>>>>>>> <reinette.chatre@intel.com> wrote: >>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote: >>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: >>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: >>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: >>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: >>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: >>>>>>>>>> >>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax) >>>>>>>>>> >>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. >>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. >>>>>>>>>>>>>> Please help me understand if you see it differently. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events, >>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: >>>>>>>>>>>>>> >>>>>>>>>>>>>> mbm_local_read_bytes a >>>>>>>>>>>>>> mbm_local_write_bytes b >>>>>>>>>>>>>> >>>>>>>>>>>>>> Then mbm_assign_control can be used as: >>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control >>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes >>>>>>>>>>>>>> <value> >>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> >>>>>>>>>>>>>> >>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), >>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined >>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? >>>>>>>>>> >>>>>>>>>> As mentioned above, one possible issue with existing interface is that >>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit >>>>>>>>>> is low enough to be of concern. >>>>>>>>> >>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM >>>>>>>>> so far are combinable, so 26 counters per group today means it limits >>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained >>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their >>>>>>>>> investigation, I would question whether they know what they're looking >>>>>>>>> for. >>>>>>>> >>>>>>>> The key here is "so far" as well as the focus on MBM only. >>>>>>>> >>>>>>>> It is impossible for me to predict what we will see in a couple of years >>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface >>>>>>>> to support their users. 
Just looking at the Intel RDT spec the event register >>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into >>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned >>>>>>>> that he is working on patches [1] that will add new events and shared the idea >>>>>>>> that we may be trending to support "perf" like events associated with RMID. I >>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their >>>>>>>> customers. >>>>>>>> This all makes me think that resctrl should be ready to support more events than 26. >>>>>>> >>>>>>> I was thinking of the letters as representing a reusable, user-defined >>>>>>> event-set for applying to a single counter rather than as individual >>>>>>> events, since MPAM and ABMC allow us to choose the set of events each >>>>>>> one counts. Wherever we define the letters, we could use more symbolic >>>>>>> event names. >>>>>> >>>>>> Thank you for clarifying. >>>>>> >>>>>>> >>>>>>> In the letters as events model, choosing the events assigned to a >>>>>>> group wouldn't be enough information, since we would want to control >>>>>>> which events should share a counter and which should be counted by >>>>>>> separate counters. I think the amount of information that would need >>>>>>> to be encoded into mbm_assign_control to represent the level of >>>>>>> configurability supported by hardware would quickly get out of hand. >>>>>>> >>>>>>> Maybe as an example, one counter for all reads, one counter for all >>>>>>> writes in ABMC would look like... >>>>>>> >>>>>>> (L3_QOS_ABMC_CFG.BwType field names below) >>>>>>> >>>>>>> (per domain) >>>>>>> group 0: >>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>>> group 1: >>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>>> ... >>>>>>> >>>>>> >>>>>> I think this may also be what Dave was heading towards in [2] but in that >>>>>> example and above the counter configuration appears to be global. You do mention >>>>>> "configurability supported by hardware" so I wonder if per-domain counter >>>>>> configuration is a requirement? >>>>> >>>>> If it's global and we want a particular group to be watched by more >>>>> counters, I wouldn't want this to result in allocating more counters >>>>> for that group in all domains, or allocating counters in domains where >>>>> they're not needed. I want to encourage my users to avoid allocating >>>>> monitoring resources in domains where a job is not allowed to run so >>>>> there's less pressure on the counters. >>>>> >>>>> In Dave's proposal it looks like global configuration means >>>>> globally-defined "named counter configurations", which works because >>>>> it's really per-domain assignment of the configurations to however >>>>> many counters the group needs in each domain. >>>> >>>> I think I am becoming lost. Would a global configuration not break your >>>> view of "event-set applied to a single counter"? If a counter is configured >>>> globally then it would not make it possible to support the full configurability >>>> of the hardware. 
>>>> Before I add more confusion, let me try with an example that builds on your >>>> earlier example copied below: >>>> >>>>>>> (per domain) >>>>>>> group 0: >>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>>> group 1: >>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>>> ... >>>> >>>> Since the above states "per domain" I rewrite the example to highlight that as >>>> I understand it: >>>> >>>> group 0: >>>> domain 0: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> domain 1: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> group 1: >>>> domain 0: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> domain 1: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> >>>> You mention that you do not want counters to be allocated in domains that they >>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1 >>>> in domain 1, resulting in: >>>> >>>> group 0: >>>> domain 0: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> group 1: >>>> domain 0: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> domain 1: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> >>>> With counter 0 and counter 1 available in domain 1, these counters could >>>> theoretically be configured to give group 1 more data in domain 1: >>>> >>>> group 0: >>>> domain 0: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> group 1: >>>> domain 0: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> domain 1: >>>> counter 0: LclFill,RmtFill >>>> counter 1: LclNTWr,RmtNTWr >>>> counter 2: LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW >>>> >>>> The counters are shown with different per-domain configurations that seems to >>>> match with earlier goals of (a) choose events counted by each counter and >>>> (b) do not allocate counters in domains where they are not needed. As I >>>> understand the above does contradict global counter configuration though. >>>> Or do you mean that only the *name* of the counter is global and then >>>> that it is reconfigured as part of every assignment? >>> >>> Yes, I meant only the *name* is global. I assume based on a particular >>> system configuration, the user will settle on a handful of useful >>> groupings to count. >>> >>> Perhaps mbm_assign_control syntax is the clearest way to express an example... >>> >>> # define global configurations (in ABMC terms), not necessarily in this >>> # syntax and probably not in the mbm_assign_control file. >>> >>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill >>> w=VictimBW,LclNTWr,RmtNTWr >>> >>> # legacy "total" configuration, effectively r+w >>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr >>> >>> /group0/0=t;1=t >>> /group1/0=t;1=t >>> /group2/0=_;1=t >>> /group3/0=rw;1=_ >>> >>> - group2 is restricted to domain 0 >>> - group3 is restricted to domain 1 >>> - the rest are unrestricted >>> - In group3, we decided we need to separate read and write traffic >>> >>> This consumes 4 counters in domain 0 and 3 counters in domain 1. >>> >> >> I see. 
Thank you for the example. >> >> resctrl supports per-domain configurations with the following possible when >> using mbm_total_bytes_config and mbm_local_bytes_config: >> >> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr >> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr >> >> /group0/0=t;1=t >> /group1/0=t;1=t >> >> Even though the flags are identical in all domains, the assigned counters will >> be configured differently in each domain. >> >> With this supported by hardware and currently also supported by resctrl it seems >> reasonable to carry this forward to what will be supported next. > > The hardware supports both a per-domain mode, where all groups in a > domain use the same configurations and are limited to two events per > group and a per-group mode where every group can be configured and > assigned freely. This series is using the legacy counter access mode > where only counters whose BwType matches an instance of QOS_EVT_CFG_n > in the domain can be read. If we chose to read the assigned counter > directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC) > rather than asking the hardware to find the counter by RMID, we would > not be limited to 2 counters per group/domain and the hardware would > have the same flexibility as on MPAM. > > (I might have said something confusing in my last messages because I > had forgotten that I switched to the extended assignment mode when > prototyping with soft-ABMC and MPAM.) > > Forcing all groups on a domain to share the same 2 counter > configurations would not be acceptable for us, as the example I gave > earlier is one I've already been asked about. I am surprised to hear this at this point in this work. Sounds like we need to go back a couple of steps to determine how to best support user requirements that now include per-group counter assignment. Have you perhaps looked into how users access the counter data as part of your prototyping? Reinette
Hi Peter, On 2/25/25 11:11, Peter Newman wrote: > Hi Reinette, > > On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> >> Hi Peter, >> >> On 2/21/25 5:12 AM, Peter Newman wrote: >>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre >>> <reinette.chatre@intel.com> wrote: >>>> On 2/20/25 6:53 AM, Peter Newman wrote: >>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre >>>>> <reinette.chatre@intel.com> wrote: >>>>>> On 2/19/25 3:28 AM, Peter Newman wrote: >>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre >>>>>>> <reinette.chatre@intel.com> wrote: >>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote: >>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre >>>>>>>>> <reinette.chatre@intel.com> wrote: >>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote: >>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: >>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: >>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: >>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: >>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: >>>>>>>>>> >>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax) >>>>>>>>>> >>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. >>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. >>>>>>>>>>>>>> Please help me understand if you see it differently. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events, >>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: >>>>>>>>>>>>>> >>>>>>>>>>>>>> mbm_local_read_bytes a >>>>>>>>>>>>>> mbm_local_write_bytes b >>>>>>>>>>>>>> >>>>>>>>>>>>>> Then mbm_assign_control can be used as: >>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control >>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes >>>>>>>>>>>>>> <value> >>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> >>>>>>>>>>>>>> >>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), >>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined >>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? >>>>>>>>>> >>>>>>>>>> As mentioned above, one possible issue with existing interface is that >>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit >>>>>>>>>> is low enough to be of concern. >>>>>>>>> >>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM >>>>>>>>> so far are combinable, so 26 counters per group today means it limits >>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained >>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their >>>>>>>>> investigation, I would question whether they know what they're looking >>>>>>>>> for. >>>>>>>> >>>>>>>> The key here is "so far" as well as the focus on MBM only. >>>>>>>> >>>>>>>> It is impossible for me to predict what we will see in a couple of years >>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface >>>>>>>> to support their users. 
Just looking at the Intel RDT spec the event register >>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into >>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned >>>>>>>> that he is working on patches [1] that will add new events and shared the idea >>>>>>>> that we may be trending to support "perf" like events associated with RMID. I >>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their >>>>>>>> customers. >>>>>>>> This all makes me think that resctrl should be ready to support more events than 26. >>>>>>> >>>>>>> I was thinking of the letters as representing a reusable, user-defined >>>>>>> event-set for applying to a single counter rather than as individual >>>>>>> events, since MPAM and ABMC allow us to choose the set of events each >>>>>>> one counts. Wherever we define the letters, we could use more symbolic >>>>>>> event names. >>>>>> >>>>>> Thank you for clarifying. >>>>>> >>>>>>> >>>>>>> In the letters as events model, choosing the events assigned to a >>>>>>> group wouldn't be enough information, since we would want to control >>>>>>> which events should share a counter and which should be counted by >>>>>>> separate counters. I think the amount of information that would need >>>>>>> to be encoded into mbm_assign_control to represent the level of >>>>>>> configurability supported by hardware would quickly get out of hand. >>>>>>> >>>>>>> Maybe as an example, one counter for all reads, one counter for all >>>>>>> writes in ABMC would look like... >>>>>>> >>>>>>> (L3_QOS_ABMC_CFG.BwType field names below) >>>>>>> >>>>>>> (per domain) >>>>>>> group 0: >>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>>> group 1: >>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>>> ... >>>>>>> >>>>>> >>>>>> I think this may also be what Dave was heading towards in [2] but in that >>>>>> example and above the counter configuration appears to be global. You do mention >>>>>> "configurability supported by hardware" so I wonder if per-domain counter >>>>>> configuration is a requirement? >>>>> >>>>> If it's global and we want a particular group to be watched by more >>>>> counters, I wouldn't want this to result in allocating more counters >>>>> for that group in all domains, or allocating counters in domains where >>>>> they're not needed. I want to encourage my users to avoid allocating >>>>> monitoring resources in domains where a job is not allowed to run so >>>>> there's less pressure on the counters. >>>>> >>>>> In Dave's proposal it looks like global configuration means >>>>> globally-defined "named counter configurations", which works because >>>>> it's really per-domain assignment of the configurations to however >>>>> many counters the group needs in each domain. >>>> >>>> I think I am becoming lost. Would a global configuration not break your >>>> view of "event-set applied to a single counter"? If a counter is configured >>>> globally then it would not make it possible to support the full configurability >>>> of the hardware. 
>>>> Before I add more confusion, let me try with an example that builds on your >>>> earlier example copied below: >>>> >>>>>>> (per domain) >>>>>>> group 0: >>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>>> group 1: >>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>>> ... >>>> >>>> Since the above states "per domain" I rewrite the example to highlight that as >>>> I understand it: >>>> >>>> group 0: >>>> domain 0: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> domain 1: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> group 1: >>>> domain 0: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> domain 1: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> >>>> You mention that you do not want counters to be allocated in domains that they >>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1 >>>> in domain 1, resulting in: >>>> >>>> group 0: >>>> domain 0: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> group 1: >>>> domain 0: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> domain 1: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> >>>> With counter 0 and counter 1 available in domain 1, these counters could >>>> theoretically be configured to give group 1 more data in domain 1: >>>> >>>> group 0: >>>> domain 0: >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>> group 1: >>>> domain 0: >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>> domain 1: >>>> counter 0: LclFill,RmtFill >>>> counter 1: LclNTWr,RmtNTWr >>>> counter 2: LclSlowFill,RmtSlowFill >>>> counter 3: VictimBW >>>> >>>> The counters are shown with different per-domain configurations that seems to >>>> match with earlier goals of (a) choose events counted by each counter and >>>> (b) do not allocate counters in domains where they are not needed. As I >>>> understand the above does contradict global counter configuration though. >>>> Or do you mean that only the *name* of the counter is global and then >>>> that it is reconfigured as part of every assignment? >>> >>> Yes, I meant only the *name* is global. I assume based on a particular >>> system configuration, the user will settle on a handful of useful >>> groupings to count. >>> >>> Perhaps mbm_assign_control syntax is the clearest way to express an example... >>> >>> # define global configurations (in ABMC terms), not necessarily in this >>> # syntax and probably not in the mbm_assign_control file. >>> >>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill >>> w=VictimBW,LclNTWr,RmtNTWr >>> >>> # legacy "total" configuration, effectively r+w >>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr >>> >>> /group0/0=t;1=t >>> /group1/0=t;1=t >>> /group2/0=_;1=t >>> /group3/0=rw;1=_ >>> >>> - group2 is restricted to domain 0 >>> - group3 is restricted to domain 1 >>> - the rest are unrestricted >>> - In group3, we decided we need to separate read and write traffic >>> >>> This consumes 4 counters in domain 0 and 3 counters in domain 1. >>> >> >> I see. 
Thank you for the example. >> >> resctrl supports per-domain configurations with the following possible when >> using mbm_total_bytes_config and mbm_local_bytes_config: >> >> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr >> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr >> >> /group0/0=t;1=t >> /group1/0=t;1=t >> >> Even though the flags are identical in all domains, the assigned counters will >> be configured differently in each domain. >> >> With this supported by hardware and currently also supported by resctrl it seems >> reasonable to carry this forward to what will be supported next. > > The hardware supports both a per-domain mode, where all groups in a > domain use the same configurations and are limited to two events per > group and a per-group mode where every group can be configured and > assigned freely. This series is using the legacy counter access mode > where only counters whose BwType matches an instance of QOS_EVT_CFG_n > in the domain can be read. If we chose to read the assigned counter > directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC) > rather than asking the hardware to find the counter by RMID, we would > not be limited to 2 counters per group/domain and the hardware would > have the same flexibility as on MPAM. In extended mode, the contents of a specific counter can be read by setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1, [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading QM_CTR will then return the contents of the specified counter. It is documented below. https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC) We previously discussed this with you (off the public list) and I initially proposed the extended assignment mode. Yes, the extended mode allows greater flexibility by enabling multiple counters to be assigned to the same group, rather than being limited to just two. However, the challenge is that we currently lack the necessary interfaces to configure multiple events per group. Without these interfaces, the extended mode is not practical at this time. Therefore, we ultimately agreed to use the legacy mode, as it does not require modifications to the existing interface, allowing us to continue using it as is. > > (I might have said something confusing in my last messages because I > had forgotten that I switched to the extended assignment mode when > prototyping with soft-ABMC and MPAM.) > > Forcing all groups on a domain to share the same 2 counter > configurations would not be acceptable for us, as the example I gave > earlier is one I've already been asked about. I don’t see this as a blocker. It should be considered an extension to the current ABMC series. We can easily build on top of this series once we finalize how to configure the multiple event interface for each group. > > I'm worried about requiring support for domain-level > mbm_total_bytes_config and mbm_local_bytes_config files to be carried > forward, because this conflicts with the configuration being per > group/domain. (i.e., what would be read back from the domain files if > per-group customizations have already been applied?) > >> >>>> >>>>>> Until now I viewed counter configuration separate from counter assignment, >>>>>> similar to how AMD's counters can be configured via mbm_total_bytes_config and >>>>>> mbm_local_bytes_config before they are assigned. 
That is still per-domain >>>>>> counter configuration though, not per-counter. >>>>>> >>>>>>> I assume packing all of this info for a group's desired counter >>>>>>> configuration into a single line (with 32 domains per line on many >>>>>>> dual-socket AMD configurations I see) would be difficult to look at, >>>>>>> even if we could settle on a single letter to represent each >>>>>>> universally. >>>>>>> >>>>>>>> >>>>>>>> My goal is for resctrl to have a user interface that can as much as possible >>>>>>>> be ready for whatever may be required from it years down the line. Of course, >>>>>>>> I may be wrong and resctrl would never need to support more than 26 events per >>>>>>>> resource (*). The risk is that resctrl *may* need to support more than 26 events >>>>>>>> and how could resctrl support that? >>>>>>>> >>>>>>>> What is the risk of supporting more than 26 events? As I highlighted earlier >>>>>>>> the interface I used as demonstration may become unwieldy to parse on a system >>>>>>>> with many domains that supports many events. This is a concern for me. Any suggestions >>>>>>>> will be appreciated, especially from you since I know that you are very familiar with >>>>>>>> issues related to large scale use of resctrl interfaces. >>>>>>> >>>>>>> It's mainly just the unwieldiness of all the information in one file. >>>>>>> It's already at the limit of what I can visually look through. >>>>>> >>>>>> I agree. >>>>>> >>>>>>> >>>>>>> I believe that shared assignments will take care of all the >>>>>>> high-frequency and performance-intensive batch configuration updates I >>>>>>> was originally concerned about, so I no longer see much benefit in >>>>>>> finding ways to textually encode all this information in a single file >>>>>>> when it would be more manageable to distribute it around the >>>>>>> filesystem hierarchy. >>>>>> >>>>>> This is significant. The motivation for the single file was to support >>>>>> the "high-frequency and performance-intensive" usage. Would "shared assignments" >>>>>> not also depend on the same files that, if distributed, will require many >>>>>> filesystem operations? >>>>>> Having the files distributed will be significantly simpler while also >>>>>> avoiding the file size issue that Dave Martin exposed. >>>>> >>>>> The remaining filesystem operations will be assigning or removing >>>>> shared counter assignments in the applicable domains, which would >>>>> normally correspond to mkdir/rmdir of groups or changing their CPU >>>>> affinity. The shared assignments are more "program and forget", while >>>>> the exclusive assignment approach requires updates for every counter >>>>> (in every domain) every few seconds to cover a large number of groups. >>>>> >>>>> When they want to pay extra attention to a particular group, I expect >>>>> they'll ask for exclusive counters and leave them assigned for a while >>>>> as they collect extra data. >>>> >>>> The single file approach is already unwieldy. The demands that will be >>>> placed on it to support the usages currently being discussed would make this >>>> interface even harder to use and manage. If the single file is not required >>>> then I think we should go back to smaller files distributed in resctrl. >>>> This may not even be an either/or argument. One way to view mbm_assign_control >>>> could be as a way for user to interact with the distributed counter >>>> related files with a single file system operation. Although, without >>>> knowing how counter configuration is expected to work this remains unclear. 
>>> >>> If we do both interfaces and the multi-file model gives us more >>> capability to express configurations, we could find situations where >>> there are configurations we cannot represent when reading back from >>> mbm_assign_control, or updates through mbm_assign_control have >>> ambiguous effects on existing configurations which were created with >>> other files. >> >> Right. My assumption was that the syntax would be identical. >> >>> >>> However, the example I gave above seems to be adequately represented >>> by a minor extension to mbm_assign_control and we all seem to >> >> To confirm what you mean with "minor extension to mbm_assign_control", >> is this where the flags are associated with counter configurations? At this >> time this is done separately from mbm_assign_control with the hardcoded "t" >> and "l" flags configured via mbm_total_bytes_config and mbm_local_bytes >> respectively. I think it would be simpler to keep these configurations >> separate from mbm_assign_control. How it would look without better >> understanding of MPAM is not clear to me at this time, unless if the >> requirement is to enhance support for ABMC and BMEC. I do see that >> this can be added later to build on what is supported by mbm_assign_control >> with the syntax in this version. > > As I explained above, I was looking at this from the perspective of > the extended event assignment mode. > > Thanks, > -Peter > -- Thanks Babu Moger
Hi Babu, On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote: > > Hi Peter, > > On 2/25/25 11:11, Peter Newman wrote: > > Hi Reinette, > > > > On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre > > <reinette.chatre@intel.com> wrote: > >> > >> Hi Peter, > >> > >> On 2/21/25 5:12 AM, Peter Newman wrote: > >>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre > >>> <reinette.chatre@intel.com> wrote: > >>>> On 2/20/25 6:53 AM, Peter Newman wrote: > >>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre > >>>>> <reinette.chatre@intel.com> wrote: > >>>>>> On 2/19/25 3:28 AM, Peter Newman wrote: > >>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre > >>>>>>> <reinette.chatre@intel.com> wrote: > >>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote: > >>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre > >>>>>>>>> <reinette.chatre@intel.com> wrote: > >>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote: > >>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: > >>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: > >>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: > >>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: > >>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: > >>>>>>>>>> > >>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax) > >>>>>>>>>> > >>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. > >>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. > >>>>>>>>>>>>>> Please help me understand if you see it differently. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events, > >>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> mbm_local_read_bytes a > >>>>>>>>>>>>>> mbm_local_write_bytes b > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Then mbm_assign_control can be used as: > >>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control > >>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes > >>>>>>>>>>>>>> <value> > >>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > >>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), > >>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined > >>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? > >>>>>>>>>> > >>>>>>>>>> As mentioned above, one possible issue with existing interface is that > >>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit > >>>>>>>>>> is low enough to be of concern. > >>>>>>>>> > >>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM > >>>>>>>>> so far are combinable, so 26 counters per group today means it limits > >>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained > >>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their > >>>>>>>>> investigation, I would question whether they know what they're looking > >>>>>>>>> for. > >>>>>>>> > >>>>>>>> The key here is "so far" as well as the focus on MBM only. 
> >>>>>>>> > >>>>>>>> It is impossible for me to predict what we will see in a couple of years > >>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface > >>>>>>>> to support their users. Just looking at the Intel RDT spec the event register > >>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into > >>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned > >>>>>>>> that he is working on patches [1] that will add new events and shared the idea > >>>>>>>> that we may be trending to support "perf" like events associated with RMID. I > >>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their > >>>>>>>> customers. > >>>>>>>> This all makes me think that resctrl should be ready to support more events than 26. > >>>>>>> > >>>>>>> I was thinking of the letters as representing a reusable, user-defined > >>>>>>> event-set for applying to a single counter rather than as individual > >>>>>>> events, since MPAM and ABMC allow us to choose the set of events each > >>>>>>> one counts. Wherever we define the letters, we could use more symbolic > >>>>>>> event names. > >>>>>> > >>>>>> Thank you for clarifying. > >>>>>> > >>>>>>> > >>>>>>> In the letters as events model, choosing the events assigned to a > >>>>>>> group wouldn't be enough information, since we would want to control > >>>>>>> which events should share a counter and which should be counted by > >>>>>>> separate counters. I think the amount of information that would need > >>>>>>> to be encoded into mbm_assign_control to represent the level of > >>>>>>> configurability supported by hardware would quickly get out of hand. > >>>>>>> > >>>>>>> Maybe as an example, one counter for all reads, one counter for all > >>>>>>> writes in ABMC would look like... > >>>>>>> > >>>>>>> (L3_QOS_ABMC_CFG.BwType field names below) > >>>>>>> > >>>>>>> (per domain) > >>>>>>> group 0: > >>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>>>>>> group 1: > >>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>>>>> ... > >>>>>>> > >>>>>> > >>>>>> I think this may also be what Dave was heading towards in [2] but in that > >>>>>> example and above the counter configuration appears to be global. You do mention > >>>>>> "configurability supported by hardware" so I wonder if per-domain counter > >>>>>> configuration is a requirement? > >>>>> > >>>>> If it's global and we want a particular group to be watched by more > >>>>> counters, I wouldn't want this to result in allocating more counters > >>>>> for that group in all domains, or allocating counters in domains where > >>>>> they're not needed. I want to encourage my users to avoid allocating > >>>>> monitoring resources in domains where a job is not allowed to run so > >>>>> there's less pressure on the counters. > >>>>> > >>>>> In Dave's proposal it looks like global configuration means > >>>>> globally-defined "named counter configurations", which works because > >>>>> it's really per-domain assignment of the configurations to however > >>>>> many counters the group needs in each domain. > >>>> > >>>> I think I am becoming lost. Would a global configuration not break your > >>>> view of "event-set applied to a single counter"? If a counter is configured > >>>> globally then it would not make it possible to support the full configurability > >>>> of the hardware. 
> >>>> Before I add more confusion, let me try with an example that builds on your > >>>> earlier example copied below: > >>>> > >>>>>>> (per domain) > >>>>>>> group 0: > >>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>>>>>> group 1: > >>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>>>>> ... > >>>> > >>>> Since the above states "per domain" I rewrite the example to highlight that as > >>>> I understand it: > >>>> > >>>> group 0: > >>>> domain 0: > >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>>> domain 1: > >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>>> group 1: > >>>> domain 0: > >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>> domain 1: > >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>> > >>>> You mention that you do not want counters to be allocated in domains that they > >>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1 > >>>> in domain 1, resulting in: > >>>> > >>>> group 0: > >>>> domain 0: > >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>>> group 1: > >>>> domain 0: > >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>> domain 1: > >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>> > >>>> With counter 0 and counter 1 available in domain 1, these counters could > >>>> theoretically be configured to give group 1 more data in domain 1: > >>>> > >>>> group 0: > >>>> domain 0: > >>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 1: VictimBW,LclNTWr,RmtNTWr > >>>> group 1: > >>>> domain 0: > >>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>>> counter 3: VictimBW,LclNTWr,RmtNTWr > >>>> domain 1: > >>>> counter 0: LclFill,RmtFill > >>>> counter 1: LclNTWr,RmtNTWr > >>>> counter 2: LclSlowFill,RmtSlowFill > >>>> counter 3: VictimBW > >>>> > >>>> The counters are shown with different per-domain configurations that seems to > >>>> match with earlier goals of (a) choose events counted by each counter and > >>>> (b) do not allocate counters in domains where they are not needed. As I > >>>> understand the above does contradict global counter configuration though. > >>>> Or do you mean that only the *name* of the counter is global and then > >>>> that it is reconfigured as part of every assignment? > >>> > >>> Yes, I meant only the *name* is global. I assume based on a particular > >>> system configuration, the user will settle on a handful of useful > >>> groupings to count. > >>> > >>> Perhaps mbm_assign_control syntax is the clearest way to express an example... > >>> > >>> # define global configurations (in ABMC terms), not necessarily in this > >>> # syntax and probably not in the mbm_assign_control file. 
> >>> > >>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill > >>> w=VictimBW,LclNTWr,RmtNTWr > >>> > >>> # legacy "total" configuration, effectively r+w > >>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr > >>> > >>> /group0/0=t;1=t > >>> /group1/0=t;1=t > >>> /group2/0=_;1=t > >>> /group3/0=rw;1=_ > >>> > >>> - group2 is restricted to domain 0 > >>> - group3 is restricted to domain 1 > >>> - the rest are unrestricted > >>> - In group3, we decided we need to separate read and write traffic > >>> > >>> This consumes 4 counters in domain 0 and 3 counters in domain 1. > >>> > >> > >> I see. Thank you for the example. > >> > >> resctrl supports per-domain configurations with the following possible when > >> using mbm_total_bytes_config and mbm_local_bytes_config: > >> > >> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr > >> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr > >> > >> /group0/0=t;1=t > >> /group1/0=t;1=t > >> > >> Even though the flags are identical in all domains, the assigned counters will > >> be configured differently in each domain. > >> > >> With this supported by hardware and currently also supported by resctrl it seems > >> reasonable to carry this forward to what will be supported next. > > > > The hardware supports both a per-domain mode, where all groups in a > > domain use the same configurations and are limited to two events per > > group and a per-group mode where every group can be configured and > > assigned freely. This series is using the legacy counter access mode > > where only counters whose BwType matches an instance of QOS_EVT_CFG_n > > in the domain can be read. If we chose to read the assigned counter > > directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC) > > rather than asking the hardware to find the counter by RMID, we would > > not be limited to 2 counters per group/domain and the hardware would > > have the same flexibility as on MPAM. > > In extended mode, the contents of a specific counter can be read by > setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1, > [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading > QM_CTR will then return the contents of the specified counter. > > It is documented below. > https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf > Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC) > > We previously discussed this with you (off the public list) and I > initially proposed the extended assignment mode. > > Yes, the extended mode allows greater flexibility by enabling multiple > counters to be assigned to the same group, rather than being limited to > just two. > > However, the challenge is that we currently lack the necessary interfaces > to configure multiple events per group. Without these interfaces, the > extended mode is not practical at this time. > > Therefore, we ultimately agreed to use the legacy mode, as it does not > require modifications to the existing interface, allowing us to continue > using it as is. > > > > > (I might have said something confusing in my last messages because I > > had forgotten that I switched to the extended assignment mode when > > prototyping with soft-ABMC and MPAM.) > > > > Forcing all groups on a domain to share the same 2 counter > > configurations would not be acceptable for us, as the example I gave > > earlier is one I've already been asked about. > > I don’t see this as a blocker. 
It should be considered an extension to the > current ABMC series. We can easily build on top of this series once we > finalize how to configure the multiple event interface for each group. I don't think it is, either. Only being able to use ABMC to assign counters is fine for our use as an incremental step. My longer-term concern is the domain-scoped mbm_total_bytes_config and mbm_local_bytes_config files, but they were introduced with BMEC, so there's already an expectation that the files are present when BMEC is supported. On ABMC hardware that also supports BMEC, I'm concerned about enabling ABMC when only the BMEC-style event configuration interface exists. The scope of my issue is just whether enabling "full" ABMC support will require an additional opt-in, since that could remove the BMEC interface. If it does, it's something we can live with. -Peter
Hi Peter/Reinette, On 2/26/25 07:27, Peter Newman wrote: > Hi Babu, > > On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote: >> >> Hi Peter, >> >> On 2/25/25 11:11, Peter Newman wrote: >>> Hi Reinette, >>> >>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre >>> <reinette.chatre@intel.com> wrote: >>>> >>>> Hi Peter, >>>> >>>> On 2/21/25 5:12 AM, Peter Newman wrote: >>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre >>>>> <reinette.chatre@intel.com> wrote: >>>>>> On 2/20/25 6:53 AM, Peter Newman wrote: >>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre >>>>>>> <reinette.chatre@intel.com> wrote: >>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote: >>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre >>>>>>>>> <reinette.chatre@intel.com> wrote: >>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote: >>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre >>>>>>>>>>> <reinette.chatre@intel.com> wrote: >>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote: >>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: >>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote: >>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: >>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: >>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: >>>>>>>>>>>> >>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax) >>>>>>>>>>>> >>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters. >>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. >>>>>>>>>>>>>>>> Please help me understand if you see it differently. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events, >>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> mbm_local_read_bytes a >>>>>>>>>>>>>>>> mbm_local_write_bytes b >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Then mbm_assign_control can be used as: >>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control >>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes >>>>>>>>>>>>>>>> <value> >>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), >>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined >>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? >>>>>>>>>>>> >>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that >>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit >>>>>>>>>>>> is low enough to be of concern. >>>>>>>>>>> >>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM >>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits >>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained >>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their >>>>>>>>>>> investigation, I would question whether they know what they're looking >>>>>>>>>>> for. >>>>>>>>>> >>>>>>>>>> The key here is "so far" as well as the focus on MBM only. 
>>>>>>>>>> >>>>>>>>>> It is impossible for me to predict what we will see in a couple of years >>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface >>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register >>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into >>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned >>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea >>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I >>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their >>>>>>>>>> customers. >>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26. >>>>>>>>> >>>>>>>>> I was thinking of the letters as representing a reusable, user-defined >>>>>>>>> event-set for applying to a single counter rather than as individual >>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each >>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic >>>>>>>>> event names. >>>>>>>> >>>>>>>> Thank you for clarifying. >>>>>>>> >>>>>>>>> >>>>>>>>> In the letters as events model, choosing the events assigned to a >>>>>>>>> group wouldn't be enough information, since we would want to control >>>>>>>>> which events should share a counter and which should be counted by >>>>>>>>> separate counters. I think the amount of information that would need >>>>>>>>> to be encoded into mbm_assign_control to represent the level of >>>>>>>>> configurability supported by hardware would quickly get out of hand. >>>>>>>>> >>>>>>>>> Maybe as an example, one counter for all reads, one counter for all >>>>>>>>> writes in ABMC would look like... >>>>>>>>> >>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below) >>>>>>>>> >>>>>>>>> (per domain) >>>>>>>>> group 0: >>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>>>>> group 1: >>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>>>>> ... >>>>>>>>> >>>>>>>> >>>>>>>> I think this may also be what Dave was heading towards in [2] but in that >>>>>>>> example and above the counter configuration appears to be global. You do mention >>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter >>>>>>>> configuration is a requirement? >>>>>>> >>>>>>> If it's global and we want a particular group to be watched by more >>>>>>> counters, I wouldn't want this to result in allocating more counters >>>>>>> for that group in all domains, or allocating counters in domains where >>>>>>> they're not needed. I want to encourage my users to avoid allocating >>>>>>> monitoring resources in domains where a job is not allowed to run so >>>>>>> there's less pressure on the counters. >>>>>>> >>>>>>> In Dave's proposal it looks like global configuration means >>>>>>> globally-defined "named counter configurations", which works because >>>>>>> it's really per-domain assignment of the configurations to however >>>>>>> many counters the group needs in each domain. >>>>>> >>>>>> I think I am becoming lost. Would a global configuration not break your >>>>>> view of "event-set applied to a single counter"? If a counter is configured >>>>>> globally then it would not make it possible to support the full configurability >>>>>> of the hardware. 
>>>>>> Before I add more confusion, let me try with an example that builds on your >>>>>> earlier example copied below: >>>>>> >>>>>>>>> (per domain) >>>>>>>>> group 0: >>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>>>>> group 1: >>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>>>>> ... >>>>>> >>>>>> Since the above states "per domain" I rewrite the example to highlight that as >>>>>> I understand it: >>>>>> >>>>>> group 0: >>>>>> domain 0: >>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>> domain 1: >>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>> group 1: >>>>>> domain 0: >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>> domain 1: >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>> >>>>>> You mention that you do not want counters to be allocated in domains that they >>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1 >>>>>> in domain 1, resulting in: >>>>>> >>>>>> group 0: >>>>>> domain 0: >>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>> group 1: >>>>>> domain 0: >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>> domain 1: >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>> >>>>>> With counter 0 and counter 1 available in domain 1, these counters could >>>>>> theoretically be configured to give group 1 more data in domain 1: >>>>>> >>>>>> group 0: >>>>>> domain 0: >>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr >>>>>> group 1: >>>>>> domain 0: >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr >>>>>> domain 1: >>>>>> counter 0: LclFill,RmtFill >>>>>> counter 1: LclNTWr,RmtNTWr >>>>>> counter 2: LclSlowFill,RmtSlowFill >>>>>> counter 3: VictimBW >>>>>> >>>>>> The counters are shown with different per-domain configurations that seems to >>>>>> match with earlier goals of (a) choose events counted by each counter and >>>>>> (b) do not allocate counters in domains where they are not needed. As I >>>>>> understand the above does contradict global counter configuration though. >>>>>> Or do you mean that only the *name* of the counter is global and then >>>>>> that it is reconfigured as part of every assignment? >>>>> >>>>> Yes, I meant only the *name* is global. I assume based on a particular >>>>> system configuration, the user will settle on a handful of useful >>>>> groupings to count. >>>>> >>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example... >>>>> >>>>> # define global configurations (in ABMC terms), not necessarily in this >>>>> # syntax and probably not in the mbm_assign_control file. 
>>>>> >>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill >>>>> w=VictimBW,LclNTWr,RmtNTWr >>>>> >>>>> # legacy "total" configuration, effectively r+w >>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr >>>>> >>>>> /group0/0=t;1=t >>>>> /group1/0=t;1=t >>>>> /group2/0=_;1=t >>>>> /group3/0=rw;1=_ >>>>> >>>>> - group2 is restricted to domain 0 >>>>> - group3 is restricted to domain 1 >>>>> - the rest are unrestricted >>>>> - In group3, we decided we need to separate read and write traffic >>>>> >>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1. >>>>> >>>> >>>> I see. Thank you for the example. >>>> >>>> resctrl supports per-domain configurations with the following possible when >>>> using mbm_total_bytes_config and mbm_local_bytes_config: >>>> >>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr >>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr >>>> >>>> /group0/0=t;1=t >>>> /group1/0=t;1=t >>>> >>>> Even though the flags are identical in all domains, the assigned counters will >>>> be configured differently in each domain. >>>> >>>> With this supported by hardware and currently also supported by resctrl it seems >>>> reasonable to carry this forward to what will be supported next. >>> >>> The hardware supports both a per-domain mode, where all groups in a >>> domain use the same configurations and are limited to two events per >>> group and a per-group mode where every group can be configured and >>> assigned freely. This series is using the legacy counter access mode >>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n >>> in the domain can be read. If we chose to read the assigned counter >>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC) >>> rather than asking the hardware to find the counter by RMID, we would >>> not be limited to 2 counters per group/domain and the hardware would >>> have the same flexibility as on MPAM. >> >> In extended mode, the contents of a specific counter can be read by >> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1, >> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading >> QM_CTR will then return the contents of the specified counter. >> >> It is documented below. >> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf >> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC) >> >> We previously discussed this with you (off the public list) and I >> initially proposed the extended assignment mode. >> >> Yes, the extended mode allows greater flexibility by enabling multiple >> counters to be assigned to the same group, rather than being limited to >> just two. >> >> However, the challenge is that we currently lack the necessary interfaces >> to configure multiple events per group. Without these interfaces, the >> extended mode is not practical at this time. >> >> Therefore, we ultimately agreed to use the legacy mode, as it does not >> require modifications to the existing interface, allowing us to continue >> using it as is. >> >>> >>> (I might have said something confusing in my last messages because I >>> had forgotten that I switched to the extended assignment mode when >>> prototyping with soft-ABMC and MPAM.) >>> >>> Forcing all groups on a domain to share the same 2 counter >>> configurations would not be acceptable for us, as the example I gave >>> earlier is one I've already been asked about. >> >> I don’t see this as a blocker. 
>> It should be considered an extension to the current ABMC series. We
>> can easily build on top of this series once we finalize how to
>> configure the multiple event interface for each group.
>
> I don't think it is, either. Only being able to use ABMC to assign
> counters is fine for our use as an incremental step. My longer-term
> concern is the domain-scoped mbm_total_bytes_config and
> mbm_local_bytes_config files, but they were introduced with BMEC, so
> there's already an expectation that the files are present when BMEC is
> supported.
>
> On ABMC hardware that also supports BMEC, I'm concerned about enabling
> ABMC when only the BMEC-style event configuration interface exists.
> The scope of my issue is just whether enabling "full" ABMC support
> will require an additional opt-in, since that could remove the BMEC
> interface. If it does, it's something we can live with.

As you know, this series is currently blocked without further feedback.

I’d like to begin reworking these patches to incorporate Peter’s feedback.
Any input or suggestions would be appreciated.

Here’s what we’ve learned so far:

1. Assignments should be independent of BMEC.
2. We should be able to specify multiple event types to a counter (e.g.,
   read, write, VictimBW, etc.). This is also called a shared counter.
3. There should be an option to assign events per domain.
4. Currently, only two counters can be assigned per group, but the design
   should allow flexibility to assign more in the future as the interface
   evolves.
5. Utilize the extended RMID read mode.

Here is my proposal using Peter's earlier example:

# define event configurations

====  ========================================================
Bits  Mnemonics    Description
====  ========================================================
6     VictimBW     Dirty Victims from all types of memory
5     RmtSlowFill  Reads to slow memory in the non-local NUMA domain
4     LclSlowFill  Reads to slow memory in the local NUMA domain
3     RmtNTWr      Non-temporal writes to non-local NUMA domain
2     LclNTWr      Non-temporal writes to local NUMA domain
1     RmtFill      Reads to memory in the non-local NUMA domain
0     LclFill      Reads to memory in the local NUMA domain
====  ========================================================

# Define flags based on combinations of the above event types.

t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
l = LclFill, LclNTWr, LclSlowFill
r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
w = VictimBW,LclNTWr,RmtNTWr
v = VictimBW

Peter suggested the following format earlier:

/group0/0=t;1=t
/group1/0=t;1=t
/group2/0=_;1=t
/group3/0=rw;1=_

Interpretation:
/group0/0=t;1=t : Assign a counter with event configuration 't' to
domain 0 and domain 1 of the resctrl group0.

This format does not indicate which index should be used for the
assignment. Based on the index, we can read the events from either
mbm_total_bytes or mbm_local_bytes. Currently, we can assign two
counters to a group and the events can be read from
mon_data/mon_L3_00/mbm_total_bytes (index 0) and
mon_data/mon_L3_00/mbm_local_bytes (index 1).

To address this, we need to include the index in some form. One
approach is to incorporate this information into the group's name,
like below:

/group0:0/0=t;1=t
/group0:1/0=l;1=l
/group1:0/0=t;1=t
/group2:1/0=_;1=t
/group3:0/0=rw;1=_

Interpretation:
/group0:0/0=t;1=t : Assign a counter with event configuration 't' to
domain 0 and domain 1 of the resctrl group0 and use index 0. The events
can be read from group0/mon_data/mon_L3_00/mbm_total_bytes and
group0/mon_data/mon_L3_01/mbm_total_bytes.

/group0:1/0=l;1=l : Assign a counter with event configuration 'l' to
domain 0 and domain 1 of the resctrl group0 and use index 1. The events
can be read from group0/mon_data/mon_L3_00/mbm_local_bytes and
group0/mon_data/mon_L3_01/mbm_local_bytes.

What are your thoughts?

--
Thanks
Babu Moger
On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>
> Hi Peter/Reinette,
>
> On 2/26/25 07:27, Peter Newman wrote:
> > Hi Babu,
> >
> > On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> On 2/25/25 11:11, Peter Newman wrote:
> >>> Hi Reinette,
> >>>
> >>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> >>> <reinette.chatre@intel.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> On 2/21/25 5:12 AM, Peter Newman wrote:
> >>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> >>>>> <reinette.chatre@intel.com> wrote:
> >>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>>>>>> <value>
> >>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>>>>>
> >>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>>>>>> is low enough to be of concern.
> >>>>>>>>>>>
> >>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>>>>>> for.
> >>>>>>>>>>
> >>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>>>>>
> >>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>>>>>> customers.
> >>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>>>>>
> >>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>>>>>> event-set for applying to a single counter rather than as individual
> >>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>>>>>> event names.
> >>>>>>>>
> >>>>>>>> Thank you for clarifying.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> In the letters as events model, choosing the events assigned to a
> >>>>>>>>> group wouldn't be enough information, since we would want to control
> >>>>>>>>> which events should share a counter and which should be counted by
> >>>>>>>>> separate counters. I think the amount of information that would need
> >>>>>>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>>>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>>>>>
> >>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>>>>>> writes in ABMC would look like...
> >>>>>>>>>
> >>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>>>>>
> >>>>>>>>> (per domain)
> >>>>>>>>> group 0:
> >>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> group 1:
> >>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>>>>>> example and above the counter configuration appears to be global. You do mention
> >>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>>>>>> configuration is a requirement?
> >>>>>>>
> >>>>>>> If it's global and we want a particular group to be watched by more
> >>>>>>> counters, I wouldn't want this to result in allocating more counters
> >>>>>>> for that group in all domains, or allocating counters in domains where
> >>>>>>> they're not needed. I want to encourage my users to avoid allocating
> >>>>>>> monitoring resources in domains where a job is not allowed to run so
> >>>>>>> there's less pressure on the counters.
> >>>>>>>
> >>>>>>> In Dave's proposal it looks like global configuration means
> >>>>>>> globally-defined "named counter configurations", which works because
> >>>>>>> it's really per-domain assignment of the configurations to however
> >>>>>>> many counters the group needs in each domain.
> >>>>>>
> >>>>>> I think I am becoming lost. Would a global configuration not break your
> >>>>>> view of "event-set applied to a single counter"? If a counter is configured
> >>>>>> globally then it would not make it possible to support the full configurability
> >>>>>> of the hardware.
> >>>>>> Before I add more confusion, let me try with an example that builds on your
> >>>>>> earlier example copied below:
> >>>>>>
> >>>>>>>>> (per domain)
> >>>>>>>>> group 0:
> >>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> group 1:
> >>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> ...
> >>>>>>
> >>>>>> Since the above states "per domain" I rewrite the example to highlight that as
> >>>>>> I understand it:
> >>>>>>
> >>>>>> group 0:
> >>>>>> domain 0:
> >>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>> domain 1:
> >>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>> group 1:
> >>>>>> domain 0:
> >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>> domain 1:
> >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>
> >>>>>> You mention that you do not want counters to be allocated in domains that they
> >>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >>>>>> in domain 1, resulting in:
> >>>>>>
> >>>>>> group 0:
> >>>>>> domain 0:
> >>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>> group 1:
> >>>>>> domain 0:
> >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>> domain 1:
> >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>
> >>>>>> With counter 0 and counter 1 available in domain 1, these counters could
> >>>>>> theoretically be configured to give group 1 more data in domain 1:
> >>>>>>
> >>>>>> group 0:
> >>>>>> domain 0:
> >>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>> group 1:
> >>>>>> domain 0:
> >>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>> domain 1:
> >>>>>> counter 0: LclFill,RmtFill
> >>>>>> counter 1: LclNTWr,RmtNTWr
> >>>>>> counter 2: LclSlowFill,RmtSlowFill
> >>>>>> counter 3: VictimBW
> >>>>>>
> >>>>>> The counters are shown with different per-domain configurations that seems to
> >>>>>> match with earlier goals of (a) choose events counted by each counter and
> >>>>>> (b) do not allocate counters in domains where they are not needed. As I
> >>>>>> understand the above does contradict global counter configuration though.
> >>>>>> Or do you mean that only the *name* of the counter is global and then
> >>>>>> that it is reconfigured as part of every assignment?
> >>>>>
> >>>>> Yes, I meant only the *name* is global. I assume based on a particular
> >>>>> system configuration, the user will settle on a handful of useful
> >>>>> groupings to count.
> >>>>>
> >>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >>>>>
> >>>>> # define global configurations (in ABMC terms), not necessarily in this
> >>>>> # syntax and probably not in the mbm_assign_control file.
> >>>>>
> >>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>> w=VictimBW,LclNTWr,RmtNTWr
> >>>>>
> >>>>> # legacy "total" configuration, effectively r+w
> >>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>
> >>>>> /group0/0=t;1=t
> >>>>> /group1/0=t;1=t
> >>>>> /group2/0=_;1=t
> >>>>> /group3/0=rw;1=_
> >>>>>
> >>>>> - group2 is restricted to domain 0
> >>>>> - group3 is restricted to domain 1
> >>>>> - the rest are unrestricted
> >>>>> - In group3, we decided we need to separate read and write traffic
> >>>>>
> >>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >>>>>
> >>>>
> >>>> I see. Thank you for the example.
> >>>>
> >>>> resctrl supports per-domain configurations with the following possible when
> >>>> using mbm_total_bytes_config and mbm_local_bytes_config:
> >>>>
> >>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> >>>>
> >>>> /group0/0=t;1=t
> >>>> /group1/0=t;1=t
> >>>>
> >>>> Even though the flags are identical in all domains, the assigned counters will
> >>>> be configured differently in each domain.
> >>>>
> >>>> With this supported by hardware and currently also supported by resctrl it seems
> >>>> reasonable to carry this forward to what will be supported next.
> >>>
> >>> The hardware supports both a per-domain mode, where all groups in a
> >>> domain use the same configurations and are limited to two events per
> >>> group and a per-group mode where every group can be configured and
> >>> assigned freely. This series is using the legacy counter access mode
> >>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> >>> in the domain can be read. If we chose to read the assigned counter
> >>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> >>> rather than asking the hardware to find the counter by RMID, we would
> >>> not be limited to 2 counters per group/domain and the hardware would
> >>> have the same flexibility as on MPAM.
> >>
> >> In extended mode, the contents of a specific counter can be read by
> >> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> >> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> >> QM_CTR will then return the contents of the specified counter.
> >>
> >> It is documented below.
> >> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
> >> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
> >>
> >> We previously discussed this with you (off the public list) and I
> >> initially proposed the extended assignment mode.
> >>
> >> Yes, the extended mode allows greater flexibility by enabling multiple
> >> counters to be assigned to the same group, rather than being limited to
> >> just two.
> >>
> >> However, the challenge is that we currently lack the necessary interfaces
> >> to configure multiple events per group. Without these interfaces, the
> >> extended mode is not practical at this time.
> >>
> >> Therefore, we ultimately agreed to use the legacy mode, as it does not
> >> require modifications to the existing interface, allowing us to continue
> >> using it as is.
> >>
> >>>
> >>> (I might have said something confusing in my last messages because I
> >>> had forgotten that I switched to the extended assignment mode when
> >>> prototyping with soft-ABMC and MPAM.)
> >>>
> >>> Forcing all groups on a domain to share the same 2 counter
> >>> configurations would not be acceptable for us, as the example I gave
> >>> earlier is one I've already been asked about.
> >>
> >> I don’t see this as a blocker. It should be considered an extension to the
> >> current ABMC series. We can easily build on top of this series once we
> >> finalize how to configure the multiple event interface for each group.
> >
> > I don't think it is, either. Only being able to use ABMC to assign
> > counters is fine for our use as an incremental step. My longer-term
> > concern is the domain-scoped mbm_total_bytes_config and
> > mbm_local_bytes_config files, but they were introduced with BMEC, so
> > there's already an expectation that the files are present when BMEC is
> > supported.
> >
> > On ABMC hardware that also supports BMEC, I'm concerned about enabling
> > ABMC when only the BMEC-style event configuration interface exists.
> > The scope of my issue is just whether enabling "full" ABMC support
> > will require an additional opt-in, since that could remove the BMEC
> > interface. If it does, it's something we can live with.
>
> As you know, this series is currently blocked without further feedback.
>
> I’d like to begin reworking these patches to incorporate Peter’s feedback.
> Any input or suggestions would be appreciated.
>
> Here’s what we’ve learned so far:
>
> 1. Assignments should be independent of BMEC.
> 2. We should be able to specify multiple event types to a counter (e.g.,
> read, write, VictimBW, etc.). This is also called a shared counter.
> 3. There should be an option to assign events per domain.
> 4. Currently, only two counters can be assigned per group, but the design
> should allow flexibility to assign more in the future as the interface
> evolves.
> 5. Utilize the extended RMID read mode.
>
>
> Here is my proposal using Peter's earlier example:
>
> # define event configurations
>
> ========================================================
> Bits Mnemonics Description
> ==== ========================================================
> 6 VictimBW Dirty Victims from all types of memory
> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
> 4 LclSlowFill Reads to slow memory in the local NUMA domain
> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
> 2 LclNTWr Non-temporal writes to local NUMA domain
> 1 RmtFill Reads to memory in the non-local NUMA domain
> 0 LclFill Reads to memory in the local NUMA domain
> ==== ========================================================
>
> #Define flags based on combination of above event types.
>
> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> l = LclFill, LclNTWr, LclSlowFill
> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
> w = VictimBW,LclNTWr,RmtNTWr
> v = VictimBW
>
> Peter suggested the following format earlier :
>
> /group0/0=t;1=t
> /group1/0=t;1=t
> /group2/0=_;1=t
> /group3/0=rw;1=_
After some inquiries within Google, it sounds like nobody has invested
much into the current mbm_assign_control format yet, so it would be
best to drop it and distribute the configuration around the filesystem
hierarchy[1], which should allow us to produce something more flexible
and cleaner to implement.
Roughly what I had in mind:
Use mkdir in a info/<resource>_MON subdirectory to create free-form
names for the assignable configurations rather than being restricted
to single letters. In the resulting directory, populate a file where
we can specify the set of events the config should represent. I think
we should use symbolic names for the events rather than raw BMEC field
values. Moving forward we could come up with portable names for common
events and only support the BMEC names on AMD machines for users who
want specific events and don't care about portability.
Next, put assignment-control file nodes in per-domain directories
(i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
counter-configuration name into the file would then allocate a counter
in the domain, apply the named configuration, and monitor the parent
group-directory. We can also put a group/resource-scoped assign_* file
higher in the hierarchy to make it easier for users who want to
configure all domains the same for a group.
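As a sketch of that group-scoped variant (the placement and name of this
file are assumptions on top of the proposal, not an existing interface),
configuring every domain the same way for a group could be a single write:

# echo mbm_local_bytes > test/assign_exclusive

which would behave like writing mbm_local_bytes into each of the group's
mon_data/mon_L3_*/assign_exclusive files.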
The configuration names listed in assign_* would result in files of
the same name in the appropriate mon_data domain directories from
which the count values can be read.
# mkdir info/L3_MON/counter_configs/mbm_local_bytes
# echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill
LclNTWr
LclSlowFill
Note that we could also pre-populate info/L3_MON/counter_configs with
the expected configuration for mbm_local_bytes and mbm_total_bytes for
backwards compatibility.
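For example, the pre-populated total-bandwidth configuration might read
back as follows (a sketch assuming the legacy mbm_total_bytes event maps
onto all of the BwType events, matching the "t"/legacy total definition
discussed earlier in the thread):

# cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter
LclFill
RmtFill
LclSlowFill
RmtSlowFill
VictimBW
LclNTWr
RmtNTWr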
To manually allocate counters for "mbm_local_bytes":
# mkdir test
# echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
# echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive
# echo mbm_local_bytes > test/mon_data/mon_L3_02/assign_exclusive
[..]
Which would result in the creation of test/mon_data/mon_L3_*/mbm_local_bytes
For unassignment, we can just make an "unassign" node alongside
"assign_exclusive" and "assign_shared". These should provide enough
context to form resctrl_arch_config_cntr() calls.
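For example (again, these file names are part of the proposal being
sketched here rather than an existing interface):

# echo mbm_local_bytes > test/mon_data/mon_L3_00/unassign

would free the counter in domain 0 and remove
test/mon_data/mon_L3_00/mbm_local_bytes.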
-Peter
[1] https://lore.kernel.org/lkml/CALPaoCj1TH+GN6+dFnt5xuN406u=tB-8mj+UuMRSm5KWPJW2wg@mail.gmail.com/
Hi Peter,
On 3/4/25 10:44, Peter Newman wrote:
> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>
>> Hi Peter/Reinette,
>>
>> On 2/26/25 07:27, Peter Newman wrote:
>>> Hi Babu,
>>>
>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>> Hi Reinette,
>>>>>
>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>> for.
>>>>>>>>>>>>
>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>
>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>> customers.
>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>
>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>> event names.
>>>>>>>>>>
>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>
>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>
>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>
>>>>>>>>>>> (per domain)
>>>>>>>>>>> group 0:
>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> group 1:
>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>> configuration is a requirement?
>>>>>>>>>
>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>
>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>
>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>> of the hardware.
>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>> earlier example copied below:
>>>>>>>>
>>>>>>>>>>> (per domain)
>>>>>>>>>>> group 0:
>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> group 1:
>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> ...
>>>>>>>>
>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>> I understand it:
>>>>>>>>
>>>>>>>> group 0:
>>>>>>>> domain 0:
>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> domain 1:
>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> group 1:
>>>>>>>> domain 0:
>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> domain 1:
>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>
>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>> in domain 1, resulting in:
>>>>>>>>
>>>>>>>> group 0:
>>>>>>>> domain 0:
>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> group 1:
>>>>>>>> domain 0:
>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> domain 1:
>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>
>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>
>>>>>>>> group 0:
>>>>>>>> domain 0:
>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> group 1:
>>>>>>>> domain 0:
>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>> domain 1:
>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>> counter 3: VictimBW
>>>>>>>>
>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>
>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>> groupings to count.
>>>>>>>
>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>
>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>
>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>> /group0/0=t;1=t
>>>>>>> /group1/0=t;1=t
>>>>>>> /group2/0=_;1=t
>>>>>>> /group3/0=rw;1=_
>>>>>>>
>>>>>>> - group2 is restricted to domain 0
>>>>>>> - group3 is restricted to domain 1
>>>>>>> - the rest are unrestricted
>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>
>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>
>>>>>>
>>>>>> I see. Thank you for the example.
>>>>>>
>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>
>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>> /group0/0=t;1=t
>>>>>> /group1/0=t;1=t
>>>>>>
>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>> be configured differently in each domain.
>>>>>>
>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>
>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>> domain use the same configurations and are limited to two events per
>>>>> group and a per-group mode where every group can be configured and
>>>>> assigned freely. This series is using the legacy counter access mode
>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>> have the same flexibility as on MPAM.
>>>>
>>>> In extended mode, the contents of a specific counter can be read by
>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>> QM_CTR will then return the contents of the specified counter.
>>>>
>>>> It is documented below.
>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>
>>>> We previously discussed this with you (off the public list) and I
>>>> initially proposed the extended assignment mode.
>>>>
>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>> counters to be assigned to the same group, rather than being limited to
>>>> just two.
>>>>
>>>> However, the challenge is that we currently lack the necessary interfaces
>>>> to configure multiple events per group. Without these interfaces, the
>>>> extended mode is not practical at this time.
>>>>
>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>> require modifications to the existing interface, allowing us to continue
>>>> using it as is.
>>>>
>>>>>
>>>>> (I might have said something confusing in my last messages because I
>>>>> had forgotten that I switched to the extended assignment mode when
>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>
>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>> configurations would not be acceptable for us, as the example I gave
>>>>> earlier is one I've already been asked about.
>>>>
>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>> current ABMC series. We can easily build on top of this series once we
>>>> finalize how to configure the multiple event interface for each group.
>>>
>>> I don't think it is, either. Only being able to use ABMC to assign
>>> counters is fine for our use as an incremental step. My longer-term
>>> concern is the domain-scoped mbm_total_bytes_config and
>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>> there's already an expectation that the files are present when BMEC is
>>> supported.
>>>
>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>> ABMC when only the BMEC-style event configuration interface exists.
>>> The scope of my issue is just whether enabling "full" ABMC support
>>> will require an additional opt-in, since that could remove the BMEC
>>> interface. If it does, it's something we can live with.
>>
>> As you know, this series is currently blocked without further feedback.
>>
>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>> Any input or suggestions would be appreciated.
>>
>> Here’s what we’ve learned so far:
>>
>> 1. Assignments should be independent of BMEC.
>> 2. We should be able to specify multiple event types to a counter (e.g.,
>> read, write, VictimBW, etc.). This is also called a shared counter.
>> 3. There should be an option to assign events per domain.
>> 4. Currently, only two counters can be assigned per group, but the design
>> should allow flexibility to assign more in the future as the interface
>> evolves.
>> 5. Utilize the extended RMID read mode.
>>
>>
>> Here is my proposal using Peter's earlier example:
>>
>> # define event configurations
>>
>> ========================================================
>> Bits Mnemonics Description
>> ==== ========================================================
>> 6 VictimBW Dirty Victims from all types of memory
>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>> 2 LclNTWr Non-temporal writes to local NUMA domain
>> 1 RmtFill Reads to memory in the non-local NUMA domain
>> 0 LclFill Reads to memory in the local NUMA domain
>> ==== ========================================================
>>
>> #Define flags based on combination of above event types.
>>
>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>> l = LclFill, LclNTWr, LclSlowFill
>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>> w = VictimBW,LclNTWr,RmtNTWr
>> v = VictimBW
>>
>> Peter suggested the following format earlier :
>>
>> /group0/0=t;1=t
>> /group1/0=t;1=t
>> /group2/0=_;1=t
>> /group3/0=rw;1=_
>
> After some inquiries within Google, it sounds like nobody has invested
> much into the current mbm_assign_control format yet, so it would be
> best to drop it and distribute the configuration around the filesystem
> hierarchy[1], which should allow us to produce something more flexible
> and cleaner to implement.
>
> Roughly what I had in mind:
>
> Use mkdir in a info/<resource>_MON subdirectory to create free-form
> names for the assignable configurations rather than being restricted
> to single letters. In the resulting directory, populate a file where
> we can specify the set of events the config should represent. I think
> we should use symbolic names for the events rather than raw BMEC field
> values. Moving forward we could come up with portable names for common
> events and only support the BMEC names on AMD machines for users who
> want specific events and don't care about portability.
I’m still processing this. Let me start with some initial questions.
So, we are creating event configurations here, which seems reasonable.
Yes, we should use portable names and are not limited to BMEC names.
How many configurations should we allow? Do we know?
>
> Next, put assignment-control file nodes in per-domain directories
> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
> counter-configuration name into the file would then allocate a counter
> in the domain, apply the named configuration, and monitor the parent
> group-directory. We can also put a group/resource-scoped assign_* file
> higher in the hierarchy to make it easier for users who want to
> configure all domains the same for a group.
What is the difference between shared and exclusive?
Having three files (assign_shared, assign_exclusive, and unassign) for each
domain seems excessive. On a system with 32 groups and 12 domains, this
results in 32 × 12 × 3 = 1,152 files, which is quite large.
There should be a more efficient way to handle this.
Initially, we started with a group-level file for this interface, but it
was rejected because the high number of sysfs calls made it inefficient.
Additionally, how can we list all assignments with a single sysfs call?
That was another problem we needed to address.
>
> The configuration names listed in assign_* would result in files of
> the same name in the appropriate mon_data domain directories from
> which the count values can be read.
>
> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclFill
> LclNTWr
> LclSlowFill
I feel we can just have the configs; the event_filter file is not required.
# cat info/L3_MON/counter_configs/mbm_local_bytes
LclFill <- rename these to generic names.
LclNTWr
LclSlowFill
>
> Note that we could also pre-populate info/L3_MON/counter_configs with
> the expected configuration for mbm_local_bytes and mbm_total_bytes for
> backwards compatibility.
>
> To manually allocate counters for "mbm_local_bytes":
>
> # mkdir test
> # echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
> # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive
> # echo mbm_local_bytes > test/mon_data/mon_L3_02/assign_exclusive
> [..]
>
> Which would result in the creation of test/mon_data/mon_L3_*/mbm_local_bytes
>
> For unassignment, we can just make an "unassign" node alongside
> "assign_exclusive" and "assign_shared". These should provide enough
> context to form resctrl_arch_config_cntr() calls.
>
> -Peter
>
> [1] https://lore.kernel.org/lkml/CALPaoCj1TH+GN6+dFnt5xuN406u=tB-8mj+UuMRSm5KWPJW2wg@mail.gmail.com/
>
Let's keep discussing.
--
Thanks
Babu Moger
Hi Babu,
On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>
> Hi Peter,
>
> On 3/4/25 10:44, Peter Newman wrote:
> > On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>
> >> Hi Peter/Reinette,
> >>
> >> On 2/26/25 07:27, Peter Newman wrote:
> >>> Hi Babu,
> >>>
> >>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> On 2/25/25 11:11, Peter Newman wrote:
> >>>>> Hi Reinette,
> >>>>>
> >>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> >>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>
> >>>>>> Hi Peter,
> >>>>>>
> >>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
> >>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> >>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>>>>>>>> <value>
> >>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>>>>>>>> is low enough to be of concern.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>>>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>>>>>>>> for.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>>>>>>>> customers.
> >>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>>>>>>>
> >>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>>>>>>>> event-set for applying to a single counter rather than as individual
> >>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>>>>>>>> event names.
> >>>>>>>>>>
> >>>>>>>>>> Thank you for clarifying.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> In the letters as events model, choosing the events assigned to a
> >>>>>>>>>>> group wouldn't be enough information, since we would want to control
> >>>>>>>>>>> which events should share a counter and which should be counted by
> >>>>>>>>>>> separate counters. I think the amount of information that would need
> >>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>>>>>>>
> >>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>>>>>>>> writes in ABMC would look like...
> >>>>>>>>>>>
> >>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>>>>>>>
> >>>>>>>>>>> (per domain)
> >>>>>>>>>>> group 0:
> >>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>> group 1:
> >>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>>>>>>>> example and above the counter configuration appears to be global. You do mention
> >>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>>>>>>>> configuration is a requirement?
> >>>>>>>>>
> >>>>>>>>> If it's global and we want a particular group to be watched by more
> >>>>>>>>> counters, I wouldn't want this to result in allocating more counters
> >>>>>>>>> for that group in all domains, or allocating counters in domains where
> >>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
> >>>>>>>>> monitoring resources in domains where a job is not allowed to run so
> >>>>>>>>> there's less pressure on the counters.
> >>>>>>>>>
> >>>>>>>>> In Dave's proposal it looks like global configuration means
> >>>>>>>>> globally-defined "named counter configurations", which works because
> >>>>>>>>> it's really per-domain assignment of the configurations to however
> >>>>>>>>> many counters the group needs in each domain.
> >>>>>>>>
> >>>>>>>> I think I am becoming lost. Would a global configuration not break your
> >>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
> >>>>>>>> globally then it would not make it possible to support the full configurability
> >>>>>>>> of the hardware.
> >>>>>>>> Before I add more confusion, let me try with an example that builds on your
> >>>>>>>> earlier example copied below:
> >>>>>>>>
> >>>>>>>>>>> (per domain)
> >>>>>>>>>>> group 0:
> >>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>> group 1:
> >>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>> ...
> >>>>>>>>
> >>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
> >>>>>>>> I understand it:
> >>>>>>>>
> >>>>>>>> group 0:
> >>>>>>>> domain 0:
> >>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> domain 1:
> >>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> group 1:
> >>>>>>>> domain 0:
> >>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> domain 1:
> >>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>
> >>>>>>>> You mention that you do not want counters to be allocated in domains that they
> >>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >>>>>>>> in domain 1, resulting in:
> >>>>>>>>
> >>>>>>>> group 0:
> >>>>>>>> domain 0:
> >>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> group 1:
> >>>>>>>> domain 0:
> >>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> domain 1:
> >>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>
> >>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
> >>>>>>>> theoretically be configured to give group 1 more data in domain 1:
> >>>>>>>>
> >>>>>>>> group 0:
> >>>>>>>> domain 0:
> >>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> group 1:
> >>>>>>>> domain 0:
> >>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>> domain 1:
> >>>>>>>> counter 0: LclFill,RmtFill
> >>>>>>>> counter 1: LclNTWr,RmtNTWr
> >>>>>>>> counter 2: LclSlowFill,RmtSlowFill
> >>>>>>>> counter 3: VictimBW
> >>>>>>>>
> >>>>>>>> The counters are shown with different per-domain configurations that seems to
> >>>>>>>> match with earlier goals of (a) choose events counted by each counter and
> >>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
> >>>>>>>> understand the above does contradict global counter configuration though.
> >>>>>>>> Or do you mean that only the *name* of the counter is global and then
> >>>>>>>> that it is reconfigured as part of every assignment?
> >>>>>>>
> >>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
> >>>>>>> system configuration, the user will settle on a handful of useful
> >>>>>>> groupings to count.
> >>>>>>>
> >>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >>>>>>>
> >>>>>>> # define global configurations (in ABMC terms), not necessarily in this
> >>>>>>> # syntax and probably not in the mbm_assign_control file.
> >>>>>>>
> >>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>> w=VictimBW,LclNTWr,RmtNTWr
> >>>>>>>
> >>>>>>> # legacy "total" configuration, effectively r+w
> >>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>
> >>>>>>> /group0/0=t;1=t
> >>>>>>> /group1/0=t;1=t
> >>>>>>> /group2/0=_;1=t
> >>>>>>> /group3/0=rw;1=_
> >>>>>>>
> >>>>>>> - group2 is restricted to domain 0
> >>>>>>> - group3 is restricted to domain 1
> >>>>>>> - the rest are unrestricted
> >>>>>>> - In group3, we decided we need to separate read and write traffic
> >>>>>>>
> >>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >>>>>>>
> >>>>>>
> >>>>>> I see. Thank you for the example.
> >>>>>>
> >>>>>> resctrl supports per-domain configurations with the following possible when
> >>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
> >>>>>>
> >>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>
> >>>>>> /group0/0=t;1=t
> >>>>>> /group1/0=t;1=t
> >>>>>>
> >>>>>> Even though the flags are identical in all domains, the assigned counters will
> >>>>>> be configured differently in each domain.
> >>>>>>
> >>>>>> With this supported by hardware and currently also supported by resctrl it seems
> >>>>>> reasonable to carry this forward to what will be supported next.
> >>>>>
> >>>>> The hardware supports both a per-domain mode, where all groups in a
> >>>>> domain use the same configurations and are limited to two events per
> >>>>> group and a per-group mode where every group can be configured and
> >>>>> assigned freely. This series is using the legacy counter access mode
> >>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> >>>>> in the domain can be read. If we chose to read the assigned counter
> >>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> >>>>> rather than asking the hardware to find the counter by RMID, we would
> >>>>> not be limited to 2 counters per group/domain and the hardware would
> >>>>> have the same flexibility as on MPAM.
> >>>>
> >>>> In extended mode, the contents of a specific counter can be read by
> >>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> >>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> >>>> QM_CTR will then return the contents of the specified counter.
> >>>>
> >>>> It is documented below.
> >>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
> >>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
> >>>>
> >>>> We previously discussed this with you (off the public list) and I
> >>>> initially proposed the extended assignment mode.
> >>>>
> >>>> Yes, the extended mode allows greater flexibility by enabling multiple
> >>>> counters to be assigned to the same group, rather than being limited to
> >>>> just two.
> >>>>
> >>>> However, the challenge is that we currently lack the necessary interfaces
> >>>> to configure multiple events per group. Without these interfaces, the
> >>>> extended mode is not practical at this time.
> >>>>
> >>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
> >>>> require modifications to the existing interface, allowing us to continue
> >>>> using it as is.
> >>>>
> >>>>>
> >>>>> (I might have said something confusing in my last messages because I
> >>>>> had forgotten that I switched to the extended assignment mode when
> >>>>> prototyping with soft-ABMC and MPAM.)
> >>>>>
> >>>>> Forcing all groups on a domain to share the same 2 counter
> >>>>> configurations would not be acceptable for us, as the example I gave
> >>>>> earlier is one I've already been asked about.
> >>>>
> >>>> I don’t see this as a blocker. It should be considered an extension to the
> >>>> current ABMC series. We can easily build on top of this series once we
> >>>> finalize how to configure the multiple event interface for each group.
> >>>
> >>> I don't think it is, either. Only being able to use ABMC to assign
> >>> counters is fine for our use as an incremental step. My longer-term
> >>> concern is the domain-scoped mbm_total_bytes_config and
> >>> mbm_local_bytes_config files, but they were introduced with BMEC, so
> >>> there's already an expectation that the files are present when BMEC is
> >>> supported.
> >>>
> >>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
> >>> ABMC when only the BMEC-style event configuration interface exists.
> >>> The scope of my issue is just whether enabling "full" ABMC support
> >>> will require an additional opt-in, since that could remove the BMEC
> >>> interface. If it does, it's something we can live with.
> >>
> >> As you know, this series is currently blocked without further feedback.
> >>
> >> I’d like to begin reworking these patches to incorporate Peter’s feedback.
> >> Any input or suggestions would be appreciated.
> >>
> >> Here’s what we’ve learned so far:
> >>
> >> 1. Assignments should be independent of BMEC.
> >> 2. We should be able to specify multiple event types to a counter (e.g.,
> >> read, write, VictimBW, etc.). This is also called a shared counter.
> >> 3. There should be an option to assign events per domain.
> >> 4. Currently, only two counters can be assigned per group, but the design
> >> should allow flexibility to assign more in the future as the interface
> >> evolves.
> >> 5. Utilize the extended RMID read mode.
> >>
> >>
> >> Here is my proposal using Peter's earlier example:
> >>
> >> # define event configurations
> >>
> >> ========================================================
> >> Bits Mnemonics Description
> >> ==== ========================================================
> >> 6 VictimBW Dirty Victims from all types of memory
> >> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
> >> 4 LclSlowFill Reads to slow memory in the local NUMA domain
> >> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
> >> 2 LclNTWr Non-temporal writes to local NUMA domain
> >> 1 RmtFill Reads to memory in the non-local NUMA domain
> >> 0 LclFill Reads to memory in the local NUMA domain
> >> ==== ========================================================
> >>
> >> #Define flags based on combination of above event types.
> >>
> >> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >> l = LclFill, LclNTWr, LclSlowFill
> >> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >> w = VictimBW,LclNTWr,RmtNTWr
> >> v = VictimBW
> >>
> >> Peter suggested the following format earlier :
> >>
> >> /group0/0=t;1=t
> >> /group1/0=t;1=t
> >> /group2/0=_;1=t
> >> /group3/0=rw;1=_
> >
> > After some inquiries within Google, it sounds like nobody has invested
> > much into the current mbm_assign_control format yet, so it would be
> > best to drop it and distribute the configuration around the filesystem
> > hierarchy[1], which should allow us to produce something more flexible
> > and cleaner to implement.
> >
> > Roughly what I had in mind:
> >
> > Use mkdir in a info/<resource>_MON subdirectory to create free-form
> > names for the assignable configurations rather than being restricted
> > to single letters. In the resulting directory, populate a file where
> > we can specify the set of events the config should represent. I think
> > we should use symbolic names for the events rather than raw BMEC field
> > values. Moving forward we could come up with portable names for common
> > events and only support the BMEC names on AMD machines for users who
> > want specific events and don't care about portability.
>
>
> I’m still processing this. Let me start with some initial questions.
>
> So, we are creating event configurations here, which seems reasonable.
>
> Yes, we should use portable names and are not limited to BMEC names.
>
> How many configurations should we allow? Do we know?
Do we need an upper limit?
>
> >
> > Next, put assignment-control file nodes in per-domain directories
> > (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
> > counter-configuration name into the file would then allocate a counter
> > in the domain, apply the named configuration, and monitor the parent
> > group-directory. We can also put a group/resource-scoped assign_* file
> > higher in the hierarchy to make it easier for users who want to
> > configure all domains the same for a group.
>
> What is the difference between shared and exclusive?
Shared assignment[1] means that non-exclusively-assigned counters in
each domain will be scheduled round-robin to the groups requesting
shared access to a counter. In my tests, I assigned the counters long
enough to produce a single 1-second MB/s sample for the per-domain
aggregation files[2].
These do not need to be implemented immediately, but knowing that they
work addresses the overhead and scalability concerns of reassigning
counters and reading their values.
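A rough sketch of how that could look with the files proposed earlier
(the aggregated-rate file name mbm_total_bytes_MBps is hypothetical):

# echo mbm_total_bytes > g1/mon_data/mon_L3_00/assign_shared
# echo mbm_total_bytes > g2/mon_data/mon_L3_00/assign_shared
# cat g1/mon_data/mon_L3_00/mbm_total_bytes_MBps
<MB/s computed from the most recent interval during which the shared
counter was assigned to g1>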
>
> Having three files—assign_shared, assign_exclusive, and unassign—for each
> domain seems excessive. In a system with 32 groups and 12 domains, this
> results in 32 × 12 × 3 files, which is quite large.
>
> There should be a more efficient way to handle this.
>
> Initially, we started with a group-level file for this interface, but it
> was rejected due to the high number of sysfs calls, making it inefficient.
I had rejected it due to the high frequency of access to a large
number of files, which has since been addressed by shared assignment
(or automatic reassignment) and aggregated mbps files.
>
> Additionally, how can we list all assignments with a single sysfs call?
>
> That was another problem we need to address.
This is not a requirement I was aware of. If the user forgot where
they assigned counters (or forgot to disable auto-assignment), they
can read multiple sysfs nodes to remind themselves.
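For example, assuming the per-domain assign_* files also report their
current assignments when read (an assumption about the proposal, not
something already specified), a single pass over them would list
everything:

# grep . /sys/fs/resctrl/*/mon_data/mon_L3_*/assign_*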
>
>
> >
> > The configuration names listed in assign_* would result in files of
> > the same name in the appropriate mon_data domain directories from
> > which the count values can be read.
> >
> > # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> > # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > LclFill
> > LclNTWr
> > LclSlowFill
>
> I feel we can just have the configs. event_filter file is not required.
That's right, I forgot that we can implement kernfs_ops::open(). I was
only looking at struct kernfs_syscall_ops.
>
> #cat info/L3_MON/counter_configs/mbm_local_bytes
> LclFill <-rename these to generic names.
> LclNTWr
> LclSlowFill
>
I think portable and non-portable event names should both be available
as options. Simple bandwidth measurements will be applied in the
general case, but when they turn up an issue, that often leads to a
more focused investigation requiring more precise events.
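A sketch of how both spellings could coexist with the counter_configs
directories proposed earlier (the portable event name local_read_fill
and the config directory names are made up here for illustration):

# echo local_read_fill > info/L3_MON/counter_configs/reads/event_filter

versus, for a user who wants the precise AMD BwType events:

# echo LclFill > info/L3_MON/counter_configs/reads_amd/event_filter
# echo LclSlowFill > info/L3_MON/counter_configs/reads_amd/event_filter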
-Peter
Hi Peter,
On 3/5/25 04:40, Peter Newman wrote:
> Hi Babu,
>
> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>
>> Hi Peter,
>>
>> On 3/4/25 10:44, Peter Newman wrote:
>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>
>>>> Hi Peter/Reinette,
>>>>
>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>> Hi Reinette,
>>>>>>>
>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>> event names.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>
>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>
>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>
>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>
>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>
>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>> of the hardware.
>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>> earlier example copied below:
>>>>>>>>>>
>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>> I understand it:
>>>>>>>>>>
>>>>>>>>>> group 0:
>>>>>>>>>> domain 0:
>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> domain 1:
>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>> domain 0:
>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> domain 1:
>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>
>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>
>>>>>>>>>> group 0:
>>>>>>>>>> domain 0:
>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>> domain 0:
>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> domain 1:
>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>
>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>
>>>>>>>>>> group 0:
>>>>>>>>>> domain 0:
>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>> domain 0:
>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> domain 1:
>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>
>>>>>>>>>> The counters are shown with different per-domain configurations, which seems to
>>>>>>>>>> match the earlier goals of (a) choose events counted by each counter and
>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>> understand it, the above does contradict global counter configuration though.
>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>
>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>> groupings to count.
>>>>>>>>>
>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>
>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>
>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>
>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>
>>>>>>>>> /group0/0=t;1=t
>>>>>>>>> /group1/0=t;1=t
>>>>>>>>> /group2/0=_;1=t
>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>
>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>> - the rest are unrestricted
>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>
>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I see. Thank you for the example.
>>>>>>>>
>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>
>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>
>>>>>>>> /group0/0=t;1=t
>>>>>>>> /group1/0=t;1=t
>>>>>>>>
>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>> be configured differently in each domain.
>>>>>>>>
>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>
>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>> group, and a per-group mode where every group can be configured and
>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>> have the same flexibility as on MPAM.
>>>>>>
>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>
>>>>>> It is documented below.
>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
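
For illustration, a minimal kernel-style sketch of that read sequence follows.
The QM_EVTSEL/QM_CTR MSR numbers are the architectural ones; the ExtendedEvtID
bit position and the L3CacheABMC event encoding below are placeholders that
would have to be taken from the APM section cited above, not values from this
series.

#define MSR_IA32_QM_EVTSEL		0x00000c8d
#define MSR_IA32_QM_CTR			0x00000c8e

/* Placeholders: the exact encodings come from APM section 19.3.3.3. */
#define QM_EVTSEL_EXT_EVT_ID		BIT_ULL(31)	/* ExtendedEvtID */
#define QM_EVTSEL_EVT_L3CACHEABMC	0x1		/* EvtID = L3CacheABMC */

#define RMID_VAL_ERROR			BIT_ULL(63)
#define RMID_VAL_UNAVAIL		BIT_ULL(62)

/* Read the assigned counter 'cntr_id' directly, bypassing the RMID lookup. */
static int abmc_read_cntr(u32 cntr_id, u64 *val)
{
	u64 msr_val;

	/* The RMID field (high half of QM_EVTSEL) carries the counter ID. */
	wrmsr(MSR_IA32_QM_EVTSEL,
	      QM_EVTSEL_EXT_EVT_ID | QM_EVTSEL_EVT_L3CACHEABMC, cntr_id);
	rdmsrl(MSR_IA32_QM_CTR, msr_val);

	if (msr_val & RMID_VAL_ERROR)
		return -EIO;
	if (msr_val & RMID_VAL_UNAVAIL)
		return -EINVAL;

	*val = msr_val & ~(RMID_VAL_ERROR | RMID_VAL_UNAVAIL);
	return 0;
}

With a read like this, the per-group limit would come only from the number of
counters assigned, not from how many QOS_EVT_CFG_n slots the domain exposes.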
>>>>>>
>>>>>> We previously discussed this with you (off the public list) and I
>>>>>> initially proposed the extended assignment mode.
>>>>>>
>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>> just two.
>>>>>>
>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>> extended mode is not practical at this time.
>>>>>>
>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>> using it as is.
>>>>>>
>>>>>>>
>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>
>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>> earlier is one I've already been asked about.
>>>>>>
>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>
>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>> there's already an expectation that the files are present when BMEC is
>>>>> supported.
>>>>>
>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>> interface. If it does, it's something we can live with.
>>>>
>>>> As you know, this series is currently blocked without further feedback.
>>>>
>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>> Any input or suggestions would be appreciated.
>>>>
>>>> Here’s what we’ve learned so far:
>>>>
>>>> 1. Assignments should be independent of BMEC.
>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>> 3. There should be an option to assign events per domain.
>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>> should allow flexibility to assign more in the future as the interface
>>>> evolves.
>>>> 5. Utilize the extended RMID read mode.
>>>>
>>>>
>>>> Here is my proposal using Peter's earlier example:
>>>>
>>>> # define event configurations
>>>>
>>>> ==== =========== ==================================================
>>>> Bits Mnemonics   Description
>>>> ==== =========== ==================================================
>>>> 6    VictimBW    Dirty Victims from all types of memory
>>>> 5    RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>> 4    LclSlowFill Reads to slow memory in the local NUMA domain
>>>> 3    RmtNTWr     Non-temporal writes to non-local NUMA domain
>>>> 2    LclNTWr     Non-temporal writes to local NUMA domain
>>>> 1    RmtFill     Reads to memory in the non-local NUMA domain
>>>> 0    LclFill     Reads to memory in the local NUMA domain
>>>> ==== =========== ==================================================
>>>>
>>>> # Define flags based on combinations of the above event types.
>>>>
>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>> l = LclFill, LclNTWr, LclSlowFill
>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>> v = VictimBW
>>>>
>>>> Peter suggested the following format earlier:
>>>>
>>>> /group0/0=t;1=t
>>>> /group1/0=t;1=t
>>>> /group2/0=_;1=t
>>>> /group3/0=rw;1=_
>>>
>>> After some inquiries within Google, it sounds like nobody has invested
>>> much into the current mbm_assign_control format yet, so it would be
>>> best to drop it and distribute the configuration around the filesystem
>>> hierarchy[1], which should allow us to produce something more flexible
>>> and cleaner to implement.
>>>
>>> Roughly what I had in mind:
>>>
>>> Use mkdir in an info/<resource>_MON subdirectory to create free-form
>>> names for the assignable configurations rather than being restricted
>>> to single letters. In the resulting directory, populate a file where
>>> we can specify the set of events the config should represent. I think
>>> we should use symbolic names for the events rather than raw BMEC field
>>> values. Moving forward we could come up with portable names for common
>>> events and only support the BMEC names on AMD machines for users who
>>> want specific events and don't care about portability.
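
As a purely illustrative sketch (names are hypothetical, not from the series),
such a mkdir-created configuration could be backed by something like:

/*
 * Hypothetical backing structure for a named, user-created counter
 * configuration. The event mask holds the selected events (BwType bits
 * on AMD, or portable event identifiers once those are defined).
 */
struct mon_cntr_config {
	struct list_head	list;		/* all configs of a resource */
	char			name[64];	/* directory name from mkdir */
	u64			evt_mask;	/* events this config counts */
};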
>>
>>
>> I’m still processing this. Let me start with some initial questions.
>>
>> So, we are creating event configurations here, which seems reasonable.
>>
>> Yes, we should use portable names and are not limited to BMEC names.
>>
>> How many configurations should we allow? Do we know?
>
> Do we need an upper limit?
I think so. This needs to be maintained in some data structure. We can
start with 2 default configurations for now.
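
For instance, the two defaults could be seeded from the BwType bit numbers in
the table quoted above, using a structure like the sketch further up (the
macro names here are made up purely for illustration):

#define BW_LCLFILL	BIT(0)	/* Reads to local NUMA memory */
#define BW_RMTFILL	BIT(1)	/* Reads to non-local NUMA memory */
#define BW_LCLNTWR	BIT(2)	/* Non-temporal writes, local */
#define BW_RMTNTWR	BIT(3)	/* Non-temporal writes, non-local */
#define BW_LCLSLOWFILL	BIT(4)	/* Reads to local slow memory */
#define BW_RMTSLOWFILL	BIT(5)	/* Reads to non-local slow memory */
#define BW_VICTIMBW	BIT(6)	/* Dirty victims */

/* Hypothetical defaults, mirroring today's total/local event sets. */
static struct mon_cntr_config default_cntr_configs[] = {
	{
		.name	  = "mbm_total_bytes",
		.evt_mask = BW_LCLFILL | BW_RMTFILL | BW_LCLSLOWFILL |
			    BW_RMTSLOWFILL | BW_VICTIMBW | BW_LCLNTWR |
			    BW_RMTNTWR,
	},
	{
		.name	  = "mbm_local_bytes",
		.evt_mask = BW_LCLFILL | BW_LCLNTWR | BW_LCLSLOWFILL,
	},
};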
>
>>
>>>
>>> Next, put assignment-control file nodes in per-domain directories
>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>> counter-configuration name into the file would then allocate a counter
>>> in the domain, apply the named configuration, and monitor the parent
>>> group-directory. We can also put a group/resource-scoped assign_* file
>>> higher in the hierarchy to make it easier for users who want to
>>> configure all domains the same for a group.
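
As a rough sketch of those write-side semantics (every helper and type name
below is hypothetical or approximate; only the general shape is intended):

/*
 * Hypothetical handler for writing a configuration name into a per-domain
 * assign_exclusive file: look up the named configuration, allocate a free
 * counter in this domain, program it with the configuration's events and
 * the parent group's RMID, and remember the assignment so the matching
 * mon_data file can be created and read.
 */
static int assign_exclusive_write(struct rdtgroup *rdtgrp,
				  struct rdt_mon_domain *d, const char *name)
{
	struct mon_cntr_config *cfg;
	int cntr_id;

	cfg = lookup_cntr_config(name);			/* hypothetical */
	if (!cfg)
		return -EINVAL;

	cntr_id = alloc_domain_cntr(d);			/* hypothetical */
	if (cntr_id < 0)
		return -ENOSPC;

	program_cntr(d, cntr_id, rdtgrp->mon.rmid, cfg->evt_mask);	/* hypothetical */
	record_assignment(rdtgrp, d, cntr_id, cfg);			/* hypothetical */

	return 0;
}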
>>
>> What is the difference between shared and exclusive?
>
> Shared assignment[1] means that non-exclusively-assigned counters in
> each domain will be scheduled round-robin to the groups requesting
> shared access to a counter. In my tests, I assigned the counters long
> enough to produce a single 1-second MB/s sample for the per-domain
> aggregation files[2].
>
> These do not need to be implemented immediately, but knowing that they
> work addresses the overhead and scalability concerns of reassigning
> counters and reading their values.
OK. Let's focus on exclusive assignments for now.
>
>>
>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>> domain seems excessive. In a system with 32 groups and 12 domains, this
>> results in 32 × 12 × 3 files, which is quite large.
>>
>> There should be a more efficient way to handle this.
>>
>> Initially, we started with a group-level file for this interface, but it
>> was rejected due to the high number of sysfs calls, making it inefficient.
>
> I had rejected it due to the high frequency of access of a large
> number of files, which has since been addressed by shared assignment
> (or automatic reassignment) and aggregated mbps files.
I think we should address this as well. Creating three extra files for
each group isn’t ideal when there are more efficient alternatives.
>
>>
>> Additionally, how can we list all assignments with a single sysfs call?
>>
>> That was another problem we need to address.
>
> This is not a requirement I was aware of. If the user forgot where
> they assigned counters (or forgot to disable auto-assignment), they
> can read multiple sysfs nodes to remind themselves.
I suggest we provide users with an option to list the assignments
of all groups in a single command. As the number of groups increases, it
becomes cumbersome to query each group individually.
To achieve this, we can reuse the existing mbm_assign_control interface.
More details on this below.
>>
>>
>>>
>>> The configuration names listed in assign_* would result in files of
>>> the same name in the appropriate mon_data domain directories from
>>> which the count values can be read.
>>>
>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> LclFill
>>> LclNTWr
>>> LclSlowFill
>>
>> I feel we can just have the configs. The event_filter file is not required.
>
> That's right, I forgot that we can implement kernfs_ops::open(). I was
> only looking at struct kernfs_syscall_ops.
>
>>
>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>> LclFill <- rename these to generic names.
>> LclNTWr
>> LclSlowFill
>>
>
> I think portable and non-portable event names should both be available
> as options. There are simple bandwidth measurement mechanisms that
> will be applied in general, but when they turn up an issue, it can
> often lead to a more focused investigation, requiring more precise
> events.
I agree. We should provide both portable and non-portable event names.
Here is my draft proposal based on the discussion so far, reusing some
of the current interface. The idea is to start with a basic assignment
feature, with options to enhance it in the future. Feel free to
comment/suggest.
1. Event configurations will be in
/sys/fs/resctrl/info/L3_MON/counter_configs/.
There will be two pre-defined configurations by default.
#cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
#cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
LclFill, LclNTWr, LclSlowFill
2. Users will have options to update these configurations.
#echo "LclFill, LclNTWr, RmtFill" >
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
# cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
LclFill, LclNTWr, RmtFill
3. The default configurations will be used when the user mounts resctrl.
mount -t resctrl resctrl /sys/fs/resctrl/
mkdir /sys/fs/resctrl/test/
4. The resctrl group/domains can be in one of these assignment states:
e: Exclusive
s: Shared
u: Unassigned
Exclusive mode is supported now. Shared mode will be supported in the
future.
5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
to list the assignment state of all the groups.
Format (a parsing sketch of this format follows item 8 below):
"<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
test//mbm_total_bytes:0=e;1=e
test//mbm_local_bytes:0=e;1=e
//mbm_total_bytes:0=e;1=e
//mbm_local_bytes:0=e;1=e
6. Users can modify the assignment state by writing to mbm_assign_control.
Format:
"<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
#echo "test//mbm_local_bytes:0=e;1=e" >
/sys/fs/resctrl/info/L3_MON/mbm_assign_control
#echo "test//mbm_local_bytes:0=u;1=u" >
/sys/fs/resctrl/info/L3_MON/mbm_assign_control
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
test//mbm_total_bytes:0=u;1=u
test//mbm_local_bytes:0=u;1=u
//mbm_total_bytes:0=e;1=e
//mbm_local_bytes:0=e;1=e
The corresponding event counts can then be read from:
/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
/sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
/sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
/sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
/sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
/sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
/sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
/sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
7. In the first stage, only two configurations (mbm_total_bytes and
mbm_local_bytes) will be supported.
8. In the future, there will be options to create multiple configurations,
and a corresponding directory will be created in
/sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
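
To make the mbm_assign_control line format in items 5 and 6 concrete, here is
a small stand-alone user-space C sketch (illustrative only, not code from the
series) that splits one such line into its fields:

#define _DEFAULT_SOURCE			/* for strsep() */
#include <stdio.h>
#include <string.h>

/* Decompose e.g. "test//mbm_local_bytes:0=e;1=e" into its fields. */
static void parse_assign_line(char *line)
{
	char *ctrl_grp, *mon_grp, *config, *tok;

	ctrl_grp = strsep(&line, "/");	/* CTRL_MON group, "" for root  */
	mon_grp  = strsep(&line, "/");	/* MON group, "" if none        */
	config   = strsep(&line, ":");	/* configuration name           */

	printf("group='%s' mon='%s' config='%s'\n", ctrl_grp, mon_grp, config);

	/* Remaining "0=e;1=e": per-domain assignment states. */
	while ((tok = strsep(&line, ";"))) {
		unsigned int dom;
		char state;

		if (sscanf(tok, "%u=%c", &dom, &state) == 2)
			printf("  domain %u -> '%c'\n", dom, state);
	}
}

int main(void)
{
	char line[] = "test//mbm_local_bytes:0=e;1=e";

	parse_assign_line(line);
	return 0;
}

Running it on the line above prints the group, the configuration name, and the
per-domain assignment states.
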
--
Thanks
Babu Moger
Hi All,
On 3/5/2025 1:34 PM, Moger, Babu wrote:
> [...]
I know you are all busy with multiple series going on in parallel. I am
still waiting for input on this. It would be great if you could spend
some time on this to see if we can find common ground on the interface.
Thanks
Babu
On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
> Hi All,
>
> On 3/5/2025 1:34 PM, Moger, Babu wrote:
> > Hi Peter,
> >
> > On 3/5/25 04:40, Peter Newman wrote:
> > > Hi Babu,
> > >
> > > On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
> > > >
> > > > Hi Peter,
> > > >
> > > > On 3/4/25 10:44, Peter Newman wrote:
> > > > > On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
> > > > > >
> > > > > > Hi Peter/Reinette,
> > > > > >
> > > > > > On 2/26/25 07:27, Peter Newman wrote:
> > > > > > > Hi Babu,
> > > > > > >
> > > > > > > On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
> > > > > > > >
> > > > > > > > Hi Peter,
> > > > > > > >
> > > > > > > > On 2/25/25 11:11, Peter Newman wrote:
> > > > > > > > > Hi Reinette,
> > > > > > > > >
> > > > > > > > > On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Peter,
> > > > > > > > > >
> > > > > > > > > > On 2/21/25 5:12 AM, Peter Newman wrote:
> > > > > > > > > > > On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> > > > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > > > On 2/20/25 6:53 AM, Peter Newman wrote:
> > > > > > > > > > > > > On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> > > > > > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > > > > > On 2/19/25 3:28 AM, Peter Newman wrote:
> > > > > > > > > > > > > > > On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> > > > > > > > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > > > > > > > On 2/17/25 2:26 AM, Peter Newman wrote:
> > > > > > > > > > > > > > > > > On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> > > > > > > > > > > > > > > > > <reinette.chatre@intel.com> wrote:
> > > > > > > > > > > > > > > > > > On 2/14/25 10:31 AM, Moger, Babu wrote:
> > > > > > > > > > > > > > > > > > > On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> > > > > > > > > > > > > > > > > > > > On 2/13/25 9:37 AM, Dave Martin wrote:
> > > > > > > > > > > > > > > > > > > > > On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> > > > > > > > > > > > > > > > > > > > > > On 2/12/25 9:46 AM, Dave Martin wrote:
> > > > > > > > > > > > > > > > > > > > > > > On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > (quoting relevant parts with goal to focus discussion on new possible syntax)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > I see the support for MPAM events distinct from the support of assignable counters.
> > > > > > > > > > > > > > > > > > > > > > Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> > > > > > > > > > > > > > > > > > > > > > Please help me understand if you see it differently.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Doing so would need to come up with alphabetical letters for these events,
> > > > > > > > > > > > > > > > > > > > > > which seems to be needed for your proposal also? If we use possible flags of:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > mbm_local_read_bytes a
> > > > > > > > > > > > > > > > > > > > > > mbm_local_write_bytes b
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Then mbm_assign_control can be used as:
> > > > > > > > > > > > > > > > > > > > > > # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > > > > > > > > > > > > > > > > > > > > > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> > > > > > > > > > > > > > > > > > > > > > <value>
> > > > > > > > > > > > > > > > > > > > > > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> > > > > > > > > > > > > > > > > > > > > > <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > One issue would be when resctrl needs to support more than 26 events (no more flags available),
> > > > > > > > > > > > > > > > > > > > > > assuming that upper case would be used for "shared" counters (unless this interface is defined
> > > > > > > > > > > > > > > > > > > > > > differently and only few uppercase letters used for it). Would this be too low of a limit?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > As mentioned above, one possible issue with existing interface is that
> > > > > > > > > > > > > > > > > > it is limited to 26 events (assuming only lower case letters are used). The limit
> > > > > > > > > > > > > > > > > > is low enough to be of concern.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The events which can be monitored by a single counter on ABMC and MPAM
> > > > > > > > > > > > > > > > > so far are combinable, so 26 counters per group today means it limits
> > > > > > > > > > > > > > > > > breaking down MBM traffic for each group 26 ways. If a user complained
> > > > > > > > > > > > > > > > > that a 26-way breakdown of a group's MBM traffic was limiting their
> > > > > > > > > > > > > > > > > investigation, I would question whether they know what they're looking
> > > > > > > > > > > > > > > > > for.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The key here is "so far" as well as the focus on MBM only.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It is impossible for me to predict what we will see in a couple of years
> > > > > > > > > > > > > > > > from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> > > > > > > > > > > > > > > > to support their users. Just looking at the Intel RDT spec the event register
> > > > > > > > > > > > > > > > has space for 32 events for each "CPU agent" resource. That does not take into
> > > > > > > > > > > > > > > > account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> > > > > > > > > > > > > > > > that he is working on patches [1] that will add new events and shared the idea
> > > > > > > > > > > > > > > > that we may be trending to support "perf" like events associated with RMID. I
> > > > > > > > > > > > > > > > expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> > > > > > > > > > > > > > > > customers.
> > > > > > > > > > > > > > > > This all makes me think that resctrl should be ready to support more events than 26.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I was thinking of the letters as representing a reusable, user-defined
> > > > > > > > > > > > > > > event-set for applying to a single counter rather than as individual
> > > > > > > > > > > > > > > events, since MPAM and ABMC allow us to choose the set of events each
> > > > > > > > > > > > > > > one counts. Wherever we define the letters, we could use more symbolic
> > > > > > > > > > > > > > > event names.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you for clarifying.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In the letters as events model, choosing the events assigned to a
> > > > > > > > > > > > > > > group wouldn't be enough information, since we would want to control
> > > > > > > > > > > > > > > which events should share a counter and which should be counted by
> > > > > > > > > > > > > > > separate counters. I think the amount of information that would need
> > > > > > > > > > > > > > > to be encoded into mbm_assign_control to represent the level of
> > > > > > > > > > > > > > > configurability supported by hardware would quickly get out of hand.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Maybe as an example, one counter for all reads, one counter for all
> > > > > > > > > > > > > > > writes in ABMC would look like...
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (L3_QOS_ABMC_CFG.BwType field names below)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (per domain)
> > > > > > > > > > > > > > > group 0:
> > > > > > > > > > > > > > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > > > > counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > > > > group 1:
> > > > > > > > > > > > > > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > > > > counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think this may also be what Dave was heading towards in [2] but in that
> > > > > > > > > > > > > > example and above the counter configuration appears to be global. You do mention
> > > > > > > > > > > > > > "configurability supported by hardware" so I wonder if per-domain counter
> > > > > > > > > > > > > > configuration is a requirement?
> > > > > > > > > > > > >
> > > > > > > > > > > > > If it's global and we want a particular group to be watched by more
> > > > > > > > > > > > > counters, I wouldn't want this to result in allocating more counters
> > > > > > > > > > > > > for that group in all domains, or allocating counters in domains where
> > > > > > > > > > > > > they're not needed. I want to encourage my users to avoid allocating
> > > > > > > > > > > > > monitoring resources in domains where a job is not allowed to run so
> > > > > > > > > > > > > there's less pressure on the counters.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In Dave's proposal it looks like global configuration means
> > > > > > > > > > > > > globally-defined "named counter configurations", which works because
> > > > > > > > > > > > > it's really per-domain assignment of the configurations to however
> > > > > > > > > > > > > many counters the group needs in each domain.
> > > > > > > > > > > >
> > > > > > > > > > > > I think I am becoming lost. Would a global configuration not break your
> > > > > > > > > > > > view of "event-set applied to a single counter"? If a counter is configured
> > > > > > > > > > > > globally then it would not make it possible to support the full configurability
> > > > > > > > > > > > of the hardware.
> > > > > > > > > > > > Before I add more confusion, let me try with an example that builds on your
> > > > > > > > > > > > earlier example copied below:
> > > > > > > > > > > >
> > > > > > > > > > > > > > > (per domain)
> > > > > > > > > > > > > > > group 0:
> > > > > > > > > > > > > > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > > > > counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > > > > group 1:
> > > > > > > > > > > > > > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > > > > counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > >
> > > > > > > > > > > > Since the above states "per domain" I rewrite the example to highlight that as
> > > > > > > > > > > > I understand it:
> > > > > > > > > > > >
> > > > > > > > > > > > group 0:
> > > > > > > > > > > > domain 0:
> > > > > > > > > > > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > domain 1:
> > > > > > > > > > > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > group 1:
> > > > > > > > > > > > domain 0:
> > > > > > > > > > > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > domain 1:
> > > > > > > > > > > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > >
> > > > > > > > > > > > You mention that you do not want counters to be allocated in domains that they
> > > > > > > > > > > > are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> > > > > > > > > > > > in domain 1, resulting in:
> > > > > > > > > > > >
> > > > > > > > > > > > group 0:
> > > > > > > > > > > > domain 0:
> > > > > > > > > > > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > group 1:
> > > > > > > > > > > > domain 0:
> > > > > > > > > > > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > domain 1:
> > > > > > > > > > > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > >
> > > > > > > > > > > > With counter 0 and counter 1 available in domain 1, these counters could
> > > > > > > > > > > > theoretically be configured to give group 1 more data in domain 1:
> > > > > > > > > > > >
> > > > > > > > > > > > group 0:
> > > > > > > > > > > > domain 0:
> > > > > > > > > > > > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 1: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > group 1:
> > > > > > > > > > > > domain 0:
> > > > > > > > > > > > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 3: VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > > > domain 1:
> > > > > > > > > > > > counter 0: LclFill,RmtFill
> > > > > > > > > > > > counter 1: LclNTWr,RmtNTWr
> > > > > > > > > > > > counter 2: LclSlowFill,RmtSlowFill
> > > > > > > > > > > > counter 3: VictimBW
> > > > > > > > > > > >
> > > > > > > > > > > > The counters are shown with different per-domain configurations that seems to
> > > > > > > > > > > > match with earlier goals of (a) choose events counted by each counter and
> > > > > > > > > > > > (b) do not allocate counters in domains where they are not needed. As I
> > > > > > > > > > > > understand the above does contradict global counter configuration though.
> > > > > > > > > > > > Or do you mean that only the *name* of the counter is global and then
> > > > > > > > > > > > that it is reconfigured as part of every assignment?
> > > > > > > > > > >
> > > > > > > > > > > Yes, I meant only the *name* is global. I assume based on a particular
> > > > > > > > > > > system configuration, the user will settle on a handful of useful
> > > > > > > > > > > groupings to count.
> > > > > > > > > > >
> > > > > > > > > > > Perhaps mbm_assign_control syntax is the clearest way to express an example...
> > > > > > > > > > >
> > > > > > > > > > > # define global configurations (in ABMC terms), not necessarily in this
> > > > > > > > > > > # syntax and probably not in the mbm_assign_control file.
> > > > > > > > > > >
> > > > > > > > > > > r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > > > > > > w=VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > >
> > > > > > > > > > > # legacy "total" configuration, effectively r+w
> > > > > > > > > > > t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > >
> > > > > > > > > > > /group0/0=t;1=t
> > > > > > > > > > > /group1/0=t;1=t
> > > > > > > > > > > /group2/0=_;1=t
> > > > > > > > > > > /group3/0=rw;1=_
> > > > > > > > > > >
> > > > > > > > > > > - group2 is restricted to domain 0
> > > > > > > > > > > - group3 is restricted to domain 1
> > > > > > > > > > > - the rest are unrestricted
> > > > > > > > > > > - In group3, we decided we need to separate read and write traffic
> > > > > > > > > > >
> > > > > > > > > > > This consumes 4 counters in domain 0 and 3 counters in domain 1.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I see. Thank you for the example.
> > > > > > > > > >
> > > > > > > > > > resctrl supports per-domain configurations with the following possible when
> > > > > > > > > > using mbm_total_bytes_config and mbm_local_bytes_config:
> > > > > > > > > >
> > > > > > > > > > t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > > t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> > > > > > > > > >
> > > > > > > > > > /group0/0=t;1=t
> > > > > > > > > > /group1/0=t;1=t
> > > > > > > > > >
> > > > > > > > > > Even though the flags are identical in all domains, the assigned counters will
> > > > > > > > > > be configured differently in each domain.
> > > > > > > > > >
> > > > > > > > > > With this supported by hardware and currently also supported by resctrl it seems
> > > > > > > > > > reasonable to carry this forward to what will be supported next.
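(For concreteness, the per-domain difference above can already be expressed
through the existing BMEC files. Using the BwType bit numbers, all seven
events give 0x7f and dropping the two SlowFill events gives 0x4f, so
something like:

# echo "0=0x7f;1=0x4f" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
# cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x4f

configures the "total" event differently in each domain.)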
> > > > > > > > >
> > > > > > > > > The hardware supports both a per-domain mode, where all groups in a
> > > > > > > > > domain use the same configurations and are limited to two events per
> > > > > > > > > group and a per-group mode where every group can be configured and
> > > > > > > > > assigned freely. This series is using the legacy counter access mode
> > > > > > > > > where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> > > > > > > > > in the domain can be read. If we chose to read the assigned counter
> > > > > > > > > directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> > > > > > > > > rather than asking the hardware to find the counter by RMID, we would
> > > > > > > > > not be limited to 2 counters per group/domain and the hardware would
> > > > > > > > > have the same flexibility as on MPAM.
> > > > > > > >
> > > > > > > > In extended mode, the contents of a specific counter can be read by
> > > > > > > > setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> > > > > > > > [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> > > > > > > > QM_CTR will then return the contents of the specified counter.
> > > > > > > >
> > > > > > > > It is documented below.
> > > > > > > > https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
> > > > > > > > Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
> > > > > > > >
> > > > > > > > We previously discussed this with you (off the public list) and I
> > > > > > > > initially proposed the extended assignment mode.
> > > > > > > >
> > > > > > > > Yes, the extended mode allows greater flexibility by enabling multiple
> > > > > > > > counters to be assigned to the same group, rather than being limited to
> > > > > > > > just two.
> > > > > > > >
> > > > > > > > However, the challenge is that we currently lack the necessary interfaces
> > > > > > > > to configure multiple events per group. Without these interfaces, the
> > > > > > > > extended mode is not practical at this time.
> > > > > > > >
> > > > > > > > Therefore, we ultimately agreed to use the legacy mode, as it does not
> > > > > > > > require modifications to the existing interface, allowing us to continue
> > > > > > > > using it as is.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > (I might have said something confusing in my last messages because I
> > > > > > > > > had forgotten that I switched to the extended assignment mode when
> > > > > > > > > prototyping with soft-ABMC and MPAM.)
> > > > > > > > >
> > > > > > > > > Forcing all groups on a domain to share the same 2 counter
> > > > > > > > > configurations would not be acceptable for us, as the example I gave
> > > > > > > > > earlier is one I've already been asked about.
> > > > > > > >
> > > > > > > > I don’t see this as a blocker. It should be considered an extension to the
> > > > > > > > current ABMC series. We can easily build on top of this series once we
> > > > > > > > finalize how to configure the multiple event interface for each group.
> > > > > > >
> > > > > > > I don't think it is, either. Only being able to use ABMC to assign
> > > > > > > counters is fine for our use as an incremental step. My longer-term
> > > > > > > concern is the domain-scoped mbm_total_bytes_config and
> > > > > > > mbm_local_bytes_config files, but they were introduced with BMEC, so
> > > > > > > there's already an expectation that the files are present when BMEC is
> > > > > > > supported.
> > > > > > >
> > > > > > > On ABMC hardware that also supports BMEC, I'm concerned about enabling
> > > > > > > ABMC when only the BMEC-style event configuration interface exists.
> > > > > > > The scope of my issue is just whether enabling "full" ABMC support
> > > > > > > will require an additional opt-in, since that could remove the BMEC
> > > > > > > interface. If it does, it's something we can live with.
> > > > > >
> > > > > > As you know, this series is currently blocked without further feedback.
> > > > > >
> > > > > > I’d like to begin reworking these patches to incorporate Peter’s feedback.
> > > > > > Any input or suggestions would be appreciated.
> > > > > >
> > > > > > Here’s what we’ve learned so far:
> > > > > >
> > > > > > 1. Assignments should be independent of BMEC.
> > > > > > 2. We should be able to specify multiple event types to a counter (e.g.,
> > > > > > 2. We should be able to specify multiple event types to a counter (e.g.,
> > > > > > read, write, VictimBW, etc.). This is also called a shared counter.
> > > > > > 4. Currently, only two counters can be assigned per group, but the design
> > > > > > should allow flexibility to assign more in the future as the interface
> > > > > > evolves.
> > > > > > 5. Utilize the extended RMID read mode.
> > > > > >
> > > > > >
> > > > > > Here is my proposal using Peter's earlier example:
> > > > > >
> > > > > > # define event configurations
> > > > > >
> > > > > > ========================================================
> > > > > > Bits Mnemonics Description
> > > > > > ==== ========================================================
> > > > > > 6 VictimBW Dirty Victims from all types of memory
> > > > > > 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
> > > > > > 4 LclSlowFill Reads to slow memory in the local NUMA domain
> > > > > > 3 RmtNTWr Non-temporal writes to non-local NUMA domain
> > > > > > 2 LclNTWr Non-temporal writes to local NUMA domain
> > > > > > 1 RmtFill Reads to memory in the non-local NUMA domain
> > > > > > 0 LclFill Reads to memory in the local NUMA domain
> > > > > > ==== ========================================================
> > > > > >
> > > > > > #Define flags based on combination of above event types.
> > > > > >
> > > > > > t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> > > > > > l = LclFill, LclNTWr, LclSlowFill
> > > > > > r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
> > > > > > w = VictimBW,LclNTWr,RmtNTWr
> > > > > > v = VictimBW
> > > > > >
> > > > > > Peter suggested the following format earlier :
> > > > > >
> > > > > > /group0/0=t;1=t
> > > > > > /group1/0=t;1=t
> > > > > > /group2/0=_;1=t
> > > > > > /group3/0=rw;1=_
> > > > >
> > > > > After some inquiries within Google, it sounds like nobody has invested
> > > > > much into the current mbm_assign_control format yet, so it would be
> > > > > best to drop it and distribute the configuration around the filesystem
> > > > > hierarchy[1], which should allow us to produce something more flexible
> > > > > and cleaner to implement.
> > > > >
> > > > > Roughly what I had in mind:
> > > > >
> > > > > Use mkdir in a info/<resource>_MON subdirectory to create free-form
> > > > > names for the assignable configurations rather than being restricted
> > > > > to single letters. In the resulting directory, populate a file where
> > > > > we can specify the set of events the config should represent. I think
> > > > > we should use symbolic names for the events rather than raw BMEC field
> > > > > values. Moving forward we could come up with portable names for common
> > > > > events and only support the BMEC names on AMD machines for users who
> > > > > want specific events and don't care about portability.
> > > >
> > > >
> > > > I’m still processing this. Let me start with some initial questions.
> > > >
> > > > So, we are creating event configurations here, which seems reasonable.
> > > >
> > > > Yes, we should use portable names and are not limited to BMEC names.
> > > >
> > > > How many configurations should we allow? Do we know?
> > >
> > > Do we need an upper limit?
> >
> > I think so. This needs to be maintained in some data structure. We can
> > start with 2 default configurations for now.
> >
> > >
> > > >
> > > > >
> > > > > Next, put assignment-control file nodes in per-domain directories
> > > > > (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
> > > > > counter-configuration name into the file would then allocate a counter
> > > > > in the domain, apply the named configuration, and monitor the parent
> > > > > group-directory. We can also put a group/resource-scoped assign_* file
> > > > > higher in the hierarchy to make it easier for users who want to
> > > > > configure all domains the same for a group.
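(As a rough sketch of that flow, using file names that are purely
hypothetical at this point:

# echo mbm_local_bytes > /sys/fs/resctrl/test/mon_data/mon_L3_00/assign_exclusive
# cat /sys/fs/resctrl/test/mon_data/mon_L3_00/assign_exclusive
mbm_local_bytes
# cat /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
<value>

Domain 1 is left untouched, so no counter is consumed there.)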
> > > >
> > > > What is the difference between shared and exclusive?
> > >
> > > Shared assignment[1] means that non-exclusively-assigned counters in
> > > each domain will be scheduled round-robin to the groups requesting
> > > shared access to a counter. In my tests, I assigned the counters long
> > > enough to produce a single 1-second MB/s sample for the per-domain
> > > aggregation files[2].
> > >
> > > These do not need to be implemented immediately, but knowing that they
> > > work addresses the overhead and scalability concerns of reassigning
> > > counters and reading their values.
> >
> > OK. Let's focus on exclusive assignments for now.
> >
> > >
> > > >
> > > > Having three files—assign_shared, assign_exclusive, and unassign—for each
> > > > domain seems excessive. In a system with 32 groups and 12 domains, this
> > > > results in 32 × 12 × 3 files, which is quite large.
> > > >
> > > > There should be a more efficient way to handle this.
> > > >
> > > > Initially, we started with a group-level file for this interface, but it
> > > > was rejected due to the high number of sysfs calls, making it inefficient.
> > >
> > > I had rejected it due to the high-frequency of access of a large
> > > number of files, which has since been addressed by shared assignment
> > > (or automatic reassignment) and aggregated mbps files.
> >
> > I think we should address this as well. Creating three extra files for
> > each group isn’t ideal when there are more efficient alternatives.
> >
> > >
> > > >
> > > > Additionally, how can we list all assignments with a single sysfs call?
> > > >
> > > > That was another problem we need to address.
> > >
> > > This is not a requirement I was aware of. If the user forgot where
> > > they assigned counters (or forgot to disable auto-assignment), they
> > > can read multiple sysfs nodes to remind themselves.
> >
> > I suggest we should provide users with an option to list the assignments
> > of all groups in a single command. As the number of groups increases, it
> > becomes cumbersome to query each group individually.
> >
> > To achieve this, we can reuse our existing mbm_assign_control interface
> > for this purpose. More details on this below.
> >
> > > >
> > > >
> > > > >
> > > > > The configuration names listed in assign_* would result in files of
> > > > > the same name in the appropriate mon_data domain directories from
> > > > > which the count values can be read.
> > > > >
> > > > > # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> > > > > # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > > > > # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > > > > # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > > > > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > > > > LclFill
> > > > > LclNTWr
> > > > > LclSlowFill
> > > >
> > > > I feel we can just have the configs. event_filter file is not required.
> > >
> > > That's right, I forgot that we can implement kernfs_ops::open(). I was
> > > only looking at struct kernfs_syscall_ops
> > >
> > > >
> > > > #cat info/L3_MON/counter_configs/mbm_local_bytes
> > > > LclFill <-rename these to generic names.
> > > > LclNTWr
> > > > LclSlowFill
> > > >
> > >
> > > I think portable and non-portable event names should both be available
> > > as options. There are simple bandwidth measurement mechanisms that
> > > will be applied in general, but when they turn up an issue, it can
> > > often lead to a more focused investigation, requiring more precise
> > > events.
> >
> > I agree. We should provide both portable and non-portable event names.
> >
> > Here is my draft proposal based on the discussion so far, reusing some
> > of the current interface. The idea here is to start with a basic assignment
> > feature, with options to enhance it in the future. Feel free to
> > comment/suggest.
> >
> > 1. Event configurations will be in
> > /sys/fs/resctrl/info/L3_MON/counter_configs/.
> >
> > There will be two pre-defined configurations by default.
> >
> > #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
> > LclFill,LclNTWr,LclSlowFill,VictimBW,RmtSlowFill,RmtNTWr,RmtFill
> >
> > #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> > LclFill, LclNTWr, LclSlowFill
> >
> > 2. Users will have options to update these configurations.
> >
> > #echo "LclFill, LclNTWr, RmtFill" >
> > /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
This part seems odd to me. Now the "mbm_local_bytes" files aren't
reporting "local_bytes" any more. They report something different,
and users only know if they come to check the options currently
configured in this file. Changing the contents without changing
the name seems confusing to me.
> >
> > # cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> > LclFill, LclNTWr, RmtFill
> >
> > 3. The default configurations will be used when the user mounts resctrl.
> >
> > mount -t resctrl resctrl /sys/fs/resctrl/
> > mkdir /sys/fs/resctrl/test/
> >
> > 4. The resctrl group/domains can be in one of these assignment states.
> > e: Exclusive
> > s: Shared
> > u: Unassigned
> >
> > Exclusive mode is supported now. Shared mode will be supported in the
> > future.
> >
> > 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > to list the assignment state of all the groups.
> >
> > Format:
> > "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
> >
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > test//mbm_total_bytes:0=e;1=e
> > test//mbm_local_bytes:0=e;1=e
> > //mbm_total_bytes:0=e;1=e
> > //mbm_local_bytes:0=e;1=e
> >
> > 6. Users can modify the assignment state by writing to mbm_assign_control.
> >
> > Format:
> > “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
> >
> > #echo "test//mbm_local_bytes:0=e;1=e" >
> > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >
> > #echo "test//mbm_local_bytes:0=u;1=u" >
> > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> > test//mbm_total_bytes:0=u;1=u
> > test//mbm_local_bytes:0=u;1=u
> > //mbm_total_bytes:0=e;1=e
> > //mbm_local_bytes:0=e;1=e
> >
> > The corresponding events will be read in
> >
> > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> > /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
> > /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> > /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
> > /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
> > /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
> > /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
> > /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
> >
> > 7. In the first stage, only two configurations (mbm_total_bytes and
> > mbm_local_bytes) will be supported.
> >
> > 8. In the future, there will be options to create multiple configurations
> > and a corresponding directory will be created in
> > /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
directory? Like this:
# echo "LclFill, LclNTWr, RmtFill" >
/sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
This seems OK (dependent on the user picking meaningful names for
the set of attributes picked ... but if they want to name this
monitor file "brian" then they have to live with any confusion
that they bring on themselves).
Would this involve an extension to kernfs? I don't see a function
pointer callback for file creation in kernfs_syscall_ops.
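If that can be made to work, the end-to-end flow would presumably look
something like this (the assign_exclusive file and the per-domain result
file are hypothetical, following the per-domain assignment idea discussed
earlier in the thread):

# echo "LclFill, LclNTWr, RmtFill" > /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
# echo cache_stuff > /sys/fs/resctrl/test/mon_data/mon_L3_00/assign_exclusive
# cat /sys/fs/resctrl/test/mon_data/mon_L3_00/cache_stuff
<value>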
> >
>
> I know you are all busy with multiple series going on parallel. I am still
> waiting for the inputs on this. It will be great if you can spend some time
> on this to see if we can find common ground on the interface.
>
> Thanks
> Babu
-Tony
Hi Tony,
On 3/10/2025 6:22 PM, Luck, Tony wrote:
> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>> Hi Peter,
>>>
>>> On 3/5/25 04:40, Peter Newman wrote:
>>>> Hi Babu,
>>>>
>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>
>>>>>>> Hi Peter/Reinette,
>>>>>>>
>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>> Hi Reinette,
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>
>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>
>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>
>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>
>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>
>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>
>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>
>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>
>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>
>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>
>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>
>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>
>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>
>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>
>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>
>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>
>>>>>>>>> It is documented below.
>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>
>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>
>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>> just two.
>>>>>>>>>
>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>
>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>> using it as is.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>
>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>
>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>
>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>> supported.
>>>>>>>>
>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>
>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>
>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>
>>>>>>> Here’s what we’ve learned so far:
>>>>>>>
>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>> evolves.
>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>
>>>>>>>
>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>
>>>>>>> # define event configurations
>>>>>>>
>>>>>>> ========================================================
>>>>>>> Bits Mnemonics Description
>>>>>>> ==== ========================================================
>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>> 1 RmtFill Reads to memory in the non-local NUMA domain
>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>> ==== ========================================================
>>>>>>>
>>>>>>> #Define flags based on combination of above event types.
>>>>>>>
>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>> v = VictimBW
>>>>>>>
>>>>>>> Peter suggested the following format earlier :
>>>>>>>
>>>>>>> /group0/0=t;1=t
>>>>>>> /group1/0=t;1=t
>>>>>>> /group2/0=_;1=t
>>>>>>> /group3/0=rw;1=_
>>>>>>
>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>> and cleaner to implement.
>>>>>>
>>>>>> Roughly what I had in mind:
>>>>>>
>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>> names for the assignable configurations rather than being restricted
>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>> we can specify the set of events the config should represent. I think
>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>> values. Moving forward we could come up with portable names for common
>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>> want specific events and don't care about portability.
>>>>>
>>>>>
>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>
>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>
>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>
>>>>> How many configurations should we allow? Do we know?
>>>>
>>>> Do we need an upper limit?
>>>
>>> I think so. This needs to be maintained in some data structure. We can
>>> start with 2 default configurations for now.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>> configure all domains the same for a group.
>>>>>
>>>>> What is the difference between shared and exclusive?
>>>>
>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>> each domain will be scheduled round-robin to the groups requesting
>>>> shared access to a counter. In my tests, I assigned the counters long
>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>> aggregation files[2].
>>>>
>>>> These do not need to be implemented immediately, but knowing that they
>>>> work addresses the overhead and scalability concerns of reassigning
>>>> counters and reading their values.
>>>
>>> Ok. Lets focus on exclusive assignments for now.
>>>
>>>>
>>>>>
>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>
>>>>> There should be a more efficient way to handle this.
>>>>>
>>>>> Initially, we started with a group-level file for this interface, but it
>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>
>>>> I had rejected it due to the high-frequency of access of a large
>>>> number of files, which has since been addressed by shared assignment
>>>> (or automatic reassignment) and aggregated mbps files.
>>>
>>> I think we should address this as well. Creating three extra files for
>>> each group isn’t ideal when there are more efficient alternatives.
>>>
>>>>
>>>>>
>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>
>>>>> That was another problem we need to address.
>>>>
>>>> This is not a requirement I was aware of. If the user forgot where
>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>> can read multiple sysfs nodes to remind themselves.
>>>
>>> I suggest, we should provide users with an option to list the assignments
>>> of all groups in a single command. As the number of groups increases, it
>>> becomes cumbersome to query each group individually.
>>>
>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>> for this purpose. More details on this below.
>>>
>>>>>
>>>>>
>>>>>>
>>>>>> The configuration names listed in assign_* would result in files of
>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>> which the count values can be read.
>>>>>>
>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>> LclFill
>>>>>> LclNTWr
>>>>>> LclSlowFill
>>>>>
>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>
>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>> only looking at struct kernfs_syscall_ops
>>>>
>>>>>
>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>> LclFill <-rename these to generic names.
>>>>> LclNTWr
>>>>> LclSlowFill
>>>>>
>>>>
>>>> I think portable and non-portable event names should both be available
>>>> as options. There are simple bandwidth measurement mechanisms that
>>>> will be applied in general, but when they turn up an issue, it can
>>>> often lead to a more focused investigation, requiring more precise
>>>> events.
>>>
>>> I agree. We should provide both portable and non-portable event names.
>>>
>>> Here is my draft proposal based on the discussion so far, reusing some
>>> of the current interface. The idea here is to start with a basic assignment
>>> feature, with options to enhance it in the future. Feel free to
>>> comment/suggest.
>>>
>>> 1. Event configurations will be in
>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>
>>> There will be two pre-defined configurations by default.
>>>
>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>> LclFill,LclNTWr,LclSlowFill,VictimBW,RmtSlowFill,RmtNTWr,RmtFill
>>>
>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>> LclFill, LclNTWr, LclSlowFill
>>>
>>> 2. Users will have options to update these configurations.
>>>
>>> #echo "LclFill, LclNTWr, RmtFill" >
>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>
> This part seems odd to me. Now the "mbm_local_bytes" files aren't
> reporting "local_bytes" any more. They report something different,
> and users only know if they come to check the options currently
> configured in this file. Changing the contents without changing
> the name seems confusing to me.
It is the same behaviour right now with BMEC. It is configurable.
By default it is mbm_local_bytes, but users can configure whatever they
want to monitor using /info/L3_MON/mbm_local_bytes_config.
We can continue the same behaviour with ABMC, but the configuration will
be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
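For example, with BMEC today:

# cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x15;1=0x15
# echo "0=0x7;1=0x7" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

0x15 is the documented default (LclFill, LclNTWr, LclSlowFill) and 0x7 would
correspond to the "LclFill, LclNTWr, RmtFill" example above, even though the
file is still called mbm_local_bytes_config.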
>
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>> LclFill, LclNTWr, RmtFill
>>>
>>> 3. The default configurations will be used when user mounts the resctrl.
>>>
>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>> mkdir /sys/fs/resctrl/test/
>>>
>>> 4. The resctrl group/domains can be in one of these assignment states.
>>> e: Exclusive
>>> s: Shared
>>> u: Unassigned
>>>
>>> Exclusive mode is supported now. Shared mode will be supported in the
>>> future.
>>>
>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> to list the assignment state of all the groups.
>>>
>>> Format:
>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> test//mbm_total_bytes:0=e;1=e
>>> test//mbm_local_bytes:0=e;1=e
>>> //mbm_total_bytes:0=e;1=e
>>> //mbm_local_bytes:0=e;1=e
>>>
>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>
>>> Format:
>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>
>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>> test//mbm_total_bytes:0=u;1=u
>>> test//mbm_local_bytes:0=u;1=u
>>> //mbm_total_bytes:0=e;1=e
>>> //mbm_local_bytes:0=e;1=e
>>>
>>> The corresponding events will be read in
>>>
>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>
>>> 7. In the first stage, only two configurations (mbm_total_bytes and
>>> mbm_local_bytes) will be supported.
>>>
>>> 8. In the future, there will be options to create multiple configurations
>>> and a corresponding directory will be created in
>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>
> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
> directory? Like this:
>
> # echo "LclFill, LclNTWr, RmtFill" >
> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>
> This seems OK (dependent on the user picking meaningful names for
> the set of attributes picked ... but if they want to name this
> monitor file "brian" then they have to live with any confusion
> that they bring on themselves).
>
> Would this involve an extension to kernfs? I don't see a function
> pointer callback for file creation in kernfs_syscall_ops.
>
>>>
>>
>> I know you are all busy with multiple series going on parallel. I am still
>> waiting for the inputs on this. It will be great if you can spend some time
>> on this to see if we can find common ground on the interface.
>>
>> Thanks
>> Babu
>
> -Tony
>
thanks
Babu
On 3/10/25 6:44 PM, Moger, Babu wrote:
> Hi Tony,
>
> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>> Hi All,
>>>
>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>> Hi Peter,
>>>>
>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>> Hi Babu,
>>>>>
>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter/Reinette,
>>>>>>>>
>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>> Hi Babu,
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>
>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>
>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>
>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>
>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>
>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>
>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>
>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>
>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
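For reference, the per-domain split sketched above can already be expressed with
the existing BMEC interface. A minimal illustration, assuming the BwType bit
encoding shown later in this thread (bits 4 and 5 being the SlowFill types), so
domain 0 counts all seven types and domain 1 drops the SlowFills:

# echo "0=0x7f;1=0x4f" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config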
>>>>>>>>>>>
>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>
>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>
>>>>>>>>>> It is documented below.
>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
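As a minimal, illustrative sketch of the read sequence described above (not code
from this series): the QM_EVTSEL/QM_CTR MSR numbers below are the architectural
ones, but the ExtendedEvtID bit position and the L3CacheABMC event encoding are
assumptions that should be verified against APM section 19.3.3.3.

	#define MSR_IA32_QM_EVTSEL		0xc8d
	#define MSR_IA32_QM_CTR			0xc8e
	#define QM_EVTSEL_EXT_EVT_ID		BIT_ULL(31)	/* assumed bit position */
	#define QM_EVTSEL_EVT_L3_CACHE_ABMC	0x01		/* assumed encoding */
	#define QM_CTR_ERROR			BIT_ULL(63)
	#define QM_CTR_UNAVAIL			BIT_ULL(62)

	/* Read ABMC counter 'cntr_id' directly, bypassing the RMID lookup. */
	static int abmc_read_cntr(u32 cntr_id, u64 *val)
	{
		u64 msr;

		/* The counter ID is written to the field normally used for the RMID. */
		msr = QM_EVTSEL_EXT_EVT_ID | QM_EVTSEL_EVT_L3_CACHE_ABMC |
		      ((u64)cntr_id << 32);
		wrmsrl(MSR_IA32_QM_EVTSEL, msr);
		rdmsrl(MSR_IA32_QM_CTR, msr);

		if (msr & QM_CTR_ERROR)
			return -EIO;
		if (msr & QM_CTR_UNAVAIL)
			return -EINVAL;

		*val = msr & ~(QM_CTR_ERROR | QM_CTR_UNAVAIL);
		return 0;
	}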
>>>>>>>>>>
>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>
>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>> just two.
>>>>>>>>>>
>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>
>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>> using it as is.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>
>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>
>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>
>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>> supported.
>>>>>>>>>
>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>
>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>
>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>
>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>
>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>> evolves.
>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>
>>>>>>>>
>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>
>>>>>>>> # define event configurations
>>>>>>>>
>>>>>>>> ====  ===========  =================================================
>>>>>>>> Bits  Mnemonics    Description
>>>>>>>> ====  ===========  =================================================
>>>>>>>> 6     VictimBW     Dirty Victims from all types of memory
>>>>>>>> 5     RmtSlowFill  Reads to slow memory in the non-local NUMA domain
>>>>>>>> 4     LclSlowFill  Reads to slow memory in the local NUMA domain
>>>>>>>> 3     RmtNTWr      Non-temporal writes to non-local NUMA domain
>>>>>>>> 2     LclNTWr      Non-temporal writes to local NUMA domain
>>>>>>>> 1     RmtFill      Reads to memory in the non-local NUMA domain
>>>>>>>> 0     LclFill      Reads to memory in the local NUMA domain
>>>>>>>> ====  ===========  =================================================
>>>>>>>>
>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>
>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>> v = VictimBW
>>>>>>>>
>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>
>>>>>>>> /group0/0=t;1=t
>>>>>>>> /group1/0=t;1=t
>>>>>>>> /group2/0=_;1=t
>>>>>>>> /group3/0=rw;1=_
>>>>>>>
>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>> and cleaner to implement.
>>>>>>>
>>>>>>> Roughly what I had in mind:
>>>>>>>
>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>> want specific events and don't care about portability.
>>>>>>
>>>>>>
>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>
>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>
>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>
>>>>>> How many configurations should we allow? Do we know?
>>>>>
>>>>> Do we need an upper limit?
>>>>
>>>> I think so. This needs to be maintained in some data structure. We can
>>>> start with 2 default configurations for now.
There is a big difference between no upper limit and 2. The hardware is
capable of supporting per-domain configurations so more flexibility is
certainly possible. Consider the example presented by Peter in:
https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>> configure all domains the same for a group.
>>>>>>
>>>>>> What is the difference between shared and exclusive?
>>>>>
>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>> aggregation files[2].
>>>>>
>>>>> These do not need to be implemented immediately, but knowing that they
>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>> counters and reading their values.
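Purely to make the round-robin idea concrete (this is not Peter's prototype):
shared counters in a domain could be rotated over the groups requesting them,
with each group holding a counter just long enough to produce one bandwidth
sample. All helper names and fields below are hypothetical placeholders.

	/* Rotate the shared counters of one domain over the waiting groups. */
	static void mbm_shared_cntr_rotate(struct rdt_mon_domain *d)
	{
		struct rdtgroup *rdtgrp;
		u64 start, stop;
		int cntr;

		list_for_each_entry(rdtgrp, &d->shared_waiters, shared_list) {
			cntr = assign_cntr(d, rdtgrp);		/* hypothetical */
			if (cntr < 0)
				continue;
			start = read_cntr(d, cntr);		/* hypothetical */
			msleep(1000);				/* one 1-second sample */
			stop = read_cntr(d, cntr);
			update_mbps(rdtgrp, d, stop - start);	/* hypothetical */
			unassign_cntr(d, cntr);			/* hypothetical */
		}
	}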
>>>>
>>>> OK. Let's focus on exclusive assignments for now.
>>>>
>>>>>
>>>>>>
>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>
>>>>>> There should be a more efficient way to handle this.
>>>>>>
>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>
>>>>> I had rejected it due to the high-frequency of access of a large
>>>>> number of files, which has since been addressed by shared assignment
>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>
>>>> I think we should address this as well. Creating three extra files for
>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>
>>>>>
>>>>>>
>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>
>>>>>> That was another problem we need to address.
>>>>>
>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>> can read multiple sysfs nodes to remind themselves.
>>>>
>>>> I suggest we should provide users with an option to list the assignments
>>>> of all groups in a single command. As the number of groups increases, it
>>>> becomes cumbersome to query each group individually.
>>>>
>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>> for this purpose. More details on this below.
>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>> which the count values can be read.
>>>>>>>
>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>> LclFill
>>>>>>> LclNTWr
>>>>>>> LclSlowFill
>>>>>>
>>>>>> I feel we can just have the configs; the event_filter file is not required.
>>>>>
>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>> only looking at struct kernfs_syscall_ops
>>>>>
>>>>>>
>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>> LclFill <-rename these to generic names.
>>>>>> LclNTWr
>>>>>> LclSlowFill
>>>>>>
>>>>>
>>>>> I think portable and non-portable event names should both be available
>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>> will be applied in general, but when they turn up an issue, it can
>>>>> often lead to a more focused investigation, requiring more precise
>>>>> events.
>>>>
>>>> I agree. We should provide both portable and non-portable event names.
>>>>
>>>> Here is my draft proposal based on the discussion so far, reusing some
>>>> of the current interface. The idea here is to start with a basic assignment
>>>> feature, with options to enhance it in the future. Feel free to
>>>> comment/suggest.
>>>>
>>>> 1. Event configurations will be in
>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>
>>>> There will be two pre-defined configurations by default.
>>>>
>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>> LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>
>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>> LclFill, LclNTWr, LclSlowFill
>>>>
>>>> 2. Users will have options to update these configurations.
>>>>
>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>
>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>> reporting "local_bytes" any more. They report something different,
>> and users only know if they come to check the options currently
>> configured in this file. Changing the contents without changing
>> the name seems confusing to me.
>
> It is the same behaviour right now with BMEC. It is configurable.
> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>
> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
This could be supported by following Peter's original proposal where the name
of the counter configuration is provided by the user via a mkdir:
https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>
>>
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>> LclFill, LclNTWr, RmtFill
>>>>
>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>
>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>> mkdir /sys/fs/resctrl/test/
>>>>
>>>> 4. The resctrl groups/domains can be in one of these assignment states.
>>>> e: Exclusive
>>>> s: Shared
>>>> u: Unassigned
>>>>
>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>> future.
>>>>
>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> to list the assignment state of all the groups.
>>>>
>>>> Format:
>>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> test//mbm_total_bytes:0=e;1=e
>>>> test//mbm_local_bytes:0=e;1=e
>>>> //mbm_total_bytes:0=e;1=e
>>>> //mbm_local_bytes:0=e;1=e
This would make mbm_assign_control even more unwieldy and quicker to exceed a
page of data (these examples never seem to reflect those AMD systems with many
L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
and solved when/if going this route.
There seem to be two opinions about this file at the moment. Would it be possible to
summarize the discussion with pros/cons raised to make an informed selection?
I understand that Google as represented by Peter no longer requires/requests this
file but the motivation for this change seems new and does not seem to reduce the
original motivation for this file. We may also want to separate requirements for reading
from and writing to this file.
>>>>
>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>
>>>> Format:
>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>
>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>> test//mbm_total_bytes:0=u;1=u
>>>> test//mbm_local_bytes:0=u;1=u
>>>> //mbm_total_bytes:0=e;1=e
>>>> //mbm_local_bytes:0=e;1=e
>>>>
>>>> The corresponding events will be read in
>>>>
>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>
>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>> mbm_local_bytes) will be supported.
>>>>
>>>> 8. In the future, there will be options to create multiple configurations,
>>>> and a corresponding directory will be created in
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>
>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>> directory? Like this:
>>
>> # echo "LclFill, LclNTWr, RmtFill" >
>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>
>> This seems OK (dependent on the user picking meaningful names for
>> the set of attributes picked ... but if they want to name this
>> monitor file "brian" then they have to live with any confusion
>> that they bring on themselves).
>>
>> Would this involve an extension to kernfs? I don't see a function
>> pointer callback for file creation in kernfs_syscall_ops.
>>
>>>>
>>>
>>> I know you are all busy with multiple series going on parallel. I am still
>>> waiting for the inputs on this. It will be great if you can spend some time
>>> on this to see if we can find common ground on the interface.
>>>
>>> Thanks
>>> Babu
>>
>> -Tony
>>
>
>
> thanks
> Babu
Reinette
Hi All,
On 3/10/25 22:51, Reinette Chatre wrote:
>
>
> On 3/10/25 6:44 PM, Moger, Babu wrote:
>> Hi Tony,
>>
>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>> Hi All,
>>>>
>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>> Hi Peter,
>>>>>
>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>> Hi Babu,
>>>>>>
>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>
>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>> Hi Babu,
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>
>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>
>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>
>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>
>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>
>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>
>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>
>>>>>>>>>>> It is documented below.
>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>
>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>
>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>> just two.
>>>>>>>>>>>
>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>
>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>> using it as is.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>
>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>
>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>
>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>> supported.
>>>>>>>>>>
>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>
>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>
>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>
>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>
>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>> evolves.
>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>
>>>>>>>>> # define event configurations
>>>>>>>>>
>>>>>>>>> ====  ===========  =================================================
>>>>>>>>> Bits  Mnemonics    Description
>>>>>>>>> ====  ===========  =================================================
>>>>>>>>> 6     VictimBW     Dirty Victims from all types of memory
>>>>>>>>> 5     RmtSlowFill  Reads to slow memory in the non-local NUMA domain
>>>>>>>>> 4     LclSlowFill  Reads to slow memory in the local NUMA domain
>>>>>>>>> 3     RmtNTWr      Non-temporal writes to non-local NUMA domain
>>>>>>>>> 2     LclNTWr      Non-temporal writes to local NUMA domain
>>>>>>>>> 1     RmtFill      Reads to memory in the non-local NUMA domain
>>>>>>>>> 0     LclFill      Reads to memory in the local NUMA domain
>>>>>>>>> ====  ===========  =================================================
>>>>>>>>>
>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>
>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>> v = VictimBW
>>>>>>>>>
>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>
>>>>>>>>> /group0/0=t;1=t
>>>>>>>>> /group1/0=t;1=t
>>>>>>>>> /group2/0=_;1=t
>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>
>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>> and cleaner to implement.
>>>>>>>>
>>>>>>>> Roughly what I had in mind:
>>>>>>>>
>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>> want specific events and don't care about portability.
>>>>>>>
>>>>>>>
>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>
>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>
>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>
>>>>>>> How many configurations should we allow? Do we know?
>>>>>>
>>>>>> Do we need an upper limit?
>>>>>
>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>> start with 2 default configurations for now.
>
> There is a big difference between no upper limit and 2. The hardware is
> capable of supporting per-domain configurations so more flexibility is
> certainly possible. Consider the example presented by Peter in:
> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>
>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>> configure all domains the same for a group.
>>>>>>>
>>>>>>> What is the difference between shared and exclusive?
>>>>>>
>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>> aggregation files[2].
>>>>>>
>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>> counters and reading their values.
>>>>>
>>>>> OK. Let's focus on exclusive assignments for now.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>
>>>>>>> There should be a more efficient way to handle this.
>>>>>>>
>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>
>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>> number of files, which has since been addressed by shared assignment
>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>
>>>>> I think we should address this as well. Creating three extra files for
>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>
>>>>>>> That was another problem we need to address.
>>>>>>
>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>
>>>>> I suggest we should provide users with an option to list the assignments
>>>>> of all groups in a single command. As the number of groups increases, it
>>>>> becomes cumbersome to query each group individually.
>>>>>
>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>> for this purpose. More details on this below.
>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>> which the count values can be read.
>>>>>>>>
>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>> LclFill
>>>>>>>> LclNTWr
>>>>>>>> LclSlowFill
>>>>>>>
>>>>>>> I feel we can just have the configs; the event_filter file is not required.
>>>>>>
>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>
>>>>>>>
>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> LclFill <-rename these to generic names.
>>>>>>> LclNTWr
>>>>>>> LclSlowFill
>>>>>>>
>>>>>>
>>>>>> I think portable and non-portable event names should both be available
>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>> events.
>>>>>
>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>
>>>>> Here is my draft proposal based on the discussion so far, reusing some
>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>> feature, with options to enhance it in the future. Feel free to
>>>>> comment/suggest.
>>>>>
>>>>> 1. Event configurations will be in
>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>
>>>>> There will be two pre-defined configurations by default.
>>>>>
>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>> LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>
>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>
>>>>> 2. Users will have options to update these configurations.
>>>>>
>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>
>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>> reporting "local_bytes" any more. They report something different,
>>> and users only know if they come to check the options currently
>>> configured in this file. Changing the contents without changing
>>> the name seems confusing to me.
>>
>> It is the same behaviour right now with BMEC. It is configurable.
>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>
>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>
> This could be supported by following Peter's original proposal where the name
> of the counter configuration is provided by the user via a mkdir:
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>
> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
Sure, we can do that. I was thinking that in the first phase we just provide
the default pre-defined configurations and an option to update them.
We can add mkdir support later. That way we can provide basic ABMC
support without the extra code complexity of mkdir handling.
>
>>
>>>
>>>>>
>>>>> # cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>> LclFill, LclNTWr, RmtFill
>>>>>
>>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>>
>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>> mkdir /sys/fs/resctrl/test/
>>>>>
>>>>> 4. The resctrl groups/domains can be in one of these assignment states.
>>>>> e: Exclusive
>>>>> s: Shared
>>>>> u: Unassigned
>>>>>
>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>> future.
>>>>>
>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> to list the assignment state of all the groups.
>>>>>
>>>>> Format:
>>>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>>>
>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> test//mbm_total_bytes:0=e;1=e
>>>>> test//mbm_local_bytes:0=e;1=e
>>>>> //mbm_total_bytes:0=e;1=e
>>>>> //mbm_local_bytes:0=e;1=e
>
> This would make mbm_assign_control even more unwieldy and quicker to exceed a
> page of data (these examples never seem to reflect those AMD systems with many
> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
> and solved when/if going this route.
This problem is not specific to this series. I feel it is a generic problem
common to many similar interfaces. I don't know how it is usually addressed
and may have to investigate. Any pointers would be helpful.
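One possible pointer: the resctrl info files are backed by kernfs/seq_file, and
the seq_file core grows its buffer when a show callback overflows it, so a
read-side listing larger than one page should already be handled by the core
(writes may need separate consideration). A rough sketch, where mbm_cntr_state()
and the exact struct/field names are hypothetical stand-ins:

	static int mbm_assign_control_show(struct kernfs_open_file *of,
					   struct seq_file *s, void *v)
	{
		struct rdt_resource *r = of->kn->parent->priv;
		struct rdt_mon_domain *d;
		struct rdtgroup *rdtgrp;
		char sep;

		/* One line per group; the real format would also name the configuration. */
		list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
			seq_printf(s, "%s//:", rdtgrp->kn->name);
			sep = 0;
			list_for_each_entry(d, &r->mon_domains, hdr.list) {
				if (sep)
					seq_putc(s, sep);
				/* mbm_cntr_state(): hypothetical helper returning 'e', 's' or 'u' */
				seq_printf(s, "%d=%c", d->hdr.id,
					   mbm_cntr_state(rdtgrp, d));
				sep = ';';
			}
			seq_putc(s, '\n');
		}
		return 0;
	}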
>
> There seem to be two opinions about this file at the moment. Would it be possible to
> summarize the discussion with pros/cons raised to make an informed selection?
> I understand that Google as represented by Peter no longer requires/requests this
> file but the motivation for this change seems new and does not seem to reduce the
> original motivation for this file. We may also want to separate requirements for reading
> from and writing to this file.
Yes, we can just use mbm_assign_control for reading the assignment states.
Summary: We have two proposals.
First one from Peter:
https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
Pros:
a. Allows flexible creation of free-form names for assignable
configurations, stored in info/L3_MON/counter_configs/.
b. Events can be accessed using corresponding free-form names in the
mon_data directory, making it clear to users what each event represents.
Cons:
a. Requires three separate files for assignment in each group
(assign_exclusive, assign_shared, unassign), which might be excessive.
b. No built-in listing support, meaning users must query each group
individually to check assignment states.
Second Proposal (Mine)
https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
Pros:
a. Maintains the flexibility of free-form names for assignable
configurations (info/L3_MON/counter_configs/).
b. Events remain accessible via free-form names in mon_data, ensuring
clarity on their purpose.
c. Adds the ability to list assignment states for all groups in a single
command.
Cons:
a. Potential to overflow the output buffer when handling a large number of
groups and domains, plus the code complexity needed to address that.
Third Option: A Hybrid Approach
We could combine elements from both proposals:
a. Retain the free-form naming approach for assignable configurations in
info/L3_MON/counter_configs/.
b. Use the assignment method from the first proposal:
# mkdir test
# echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
c. Introduce listing support via the info/L3_MON/mbm_assign_control
interface, enabling users to read assignment states for all groups in one
place (read-only support). A possible end-to-end flow is sketched below.
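For illustration, a possible end-to-end flow under this hybrid approach (paths
and file names as proposed above; all of them are still up for discussion):

# mount -t resctrl resctrl /sys/fs/resctrl
# mkdir /sys/fs/resctrl/test

# inspect a pre-defined counter configuration
# cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
LclFill,LclNTWr,LclSlowFill

# assign a counter with that configuration to "test" in domain 0 only
# echo mbm_local_bytes > /sys/fs/resctrl/test/mon_data/mon_L3_00/assign_exclusive

# list all assignment states in one call
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
test//mbm_local_bytes:0=e;1=u
//mbm_total_bytes:0=e;1=e
//mbm_local_bytes:0=e;1=e

# read the assigned event
# cat /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
<value>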
>
>>>>>
>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>
>>>>> Format:
>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>
>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>
>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>
>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> test//mbm_total_bytes:0=u;1=u
>>>>> test//mbm_local_bytes:0=u;1=u
>>>>> //mbm_total_bytes:0=e;1=e
>>>>> //mbm_local_bytes:0=e;1=e
>>>>>
>>>>> The corresponding events will be read in
>>>>>
>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>
>>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>>> mbm_local_bytes) will be supported.
>>>>>
>>>>> 8. In the future, there will be options to create multiple configurations,
>>>>> and a corresponding directory will be created in
>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>
>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>> directory? Like this:
>>>
>>> # echo "LclFill, LclNTWr, RmtFill" >
>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>
>>> This seems OK (dependent on the user picking meaningful names for
>>> the set of attributes picked ... but if they want to name this
>>> monitor file "brian" then they have to live with any confusion
>>> that they bring on themselves).
>>>
>>> Would this involve an extension to kernfs? I don't see a function
>>> pointer callback for file creation in kernfs_syscall_ops.
>>>
>>>>>
>>>>
>>>> I know you are all busy with multiple series going on parallel. I am still
>>>> waiting for the inputs on this. It will be great if you can spend some time
>>>> on this to see if we can find common ground on the interface.
>>>>
>>>> Thanks
>>>> Babu
>>>
>>> -Tony
>>>
>>
>>
>> thanks
>> Babu
>
> Reinette
>
>
--
Thanks
Babu Moger
Hi Babu,
On 3/11/25 1:35 PM, Moger, Babu wrote:
> Hi All,
>
> On 3/10/25 22:51, Reinette Chatre wrote:
>>
>>
>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>> Hi Tony,
>>>
>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>> Hi All,
>>>>>
>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>> Hi Babu,
>>>>>>>
>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>
>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>
>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>
>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>
>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>> just two.
>>>>>>>>>>>>
>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>
>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>
>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>> supported.
>>>>>>>>>>>
>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>
>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>
>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>
>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>
>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>> evolves.
>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>
>>>>>>>>>> # define event configurations
>>>>>>>>>>
>>>>>>>>>> ========================================================
>>>>>>>>>> Bits Mnemonics Description
>>>>>>>>>> ==== ========================================================
>>>>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>>>>> 1 RmtFill Reads to memory in the non-local NUMA domain
>>>>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>>>>> ==== ========================================================
>>>>>>>>>>
>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>
>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> v = VictimBW
>>>>>>>>>>
>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>
>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>
>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>> and cleaner to implement.
>>>>>>>>>
>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>
>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>
>>>>>>>>
>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>
>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>
>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>
>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>
>>>>>>> Do we need an upper limit?
>>>>>>
>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>> start with 2 default configurations for now.
>>
>> There is a big difference between no upper limit and 2. The hardware is
>> capable of supporting per-domain configurations so more flexibility is
>> certainly possible. Consider the example presented by Peter in:
>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>
>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>> configure all domains the same for a group.
>>>>>>>>
>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>
>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>> aggregation files[2].
>>>>>>>
>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>> counters and reading their values.
>>>>>>
>>>>>> Ok. Lets focus on exclusive assignments for now.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>
>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>
>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>
>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>
>>>>>> I think we should address this as well. Creating three extra files for
>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>
>>>>>>>> That was another problem we need to address.
>>>>>>>
>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>
>>>>>> I suggest, we should provide users with an option to list the assignments
>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>> becomes cumbersome to query each group individually.
>>>>>>
>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>> for this purpose. More details on this below.
>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>> which the count values can be read.
>>>>>>>>>
>>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>> LclFill
>>>>>>>>> LclNTWr
>>>>>>>>> LclSlowFill
>>>>>>>>
>>>>>>>> I feel we can just have the configs. The event_filter file is not required.
>>>>>>>
>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>
>>>>>>>>
>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>> LclNTWr
>>>>>>>> LclSlowFill
>>>>>>>>
>>>>>>>
>>>>>>> I think portable and non-portable event names should both be available
>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>> events.
>>>>>>
>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>
>>>>>> Here is my draft proposal based on the discussion so far, reusing some
>>>>>> of the current interface. The idea is to start with a basic assignment
>>>>>> feature, with options to enhance it in the future. Feel free to
>>>>>> comment/suggest.
>>>>>>
>>>>>> 1. Event configurations will be in
>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>
>>>>>> There will be two pre-defined configurations by default.
>>>>>>
>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>> LclFill, RmtFill, LclSlowFill, RmtSlowFill, VictimBW, LclNTWr, RmtNTWr
>>>>>>
>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>>
>>>>>> 2. Users will have options to update these configurations.
>>>>>>
>>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>
>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>> reporting "local_bytes" any more. They report something different,
>>>> and users only know if they come to check the options currently
>>>> configured in this file. Changing the contents without changing
>>>> the name seems confusing to me.
>>>
>>> It is the same behaviour right now with BMEC. It is configurable.
>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>
>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>
>> This could be supported by following Peter's original proposal where the name
>> of the counter configuration is provided by the user via a mkdir:
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>
> Sure. We can do that. I was thinking that, in the first phase, we just provide
> the default pre-defined configurations and an option to update them.
>
> We can add the mkdir support later. That way we can provide basic ABMC
> support without the extra code complexity that mkdir support brings.
It is not clear to me how you envision the "first phase". Is it what you
proposed above, for example:
#echo "LclFill, LclNTWr, RmtFill" >
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
In the above, the counter configuration name is a file.
How could mkdir support be added to this later if there are already files present?
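For illustration only (hypothetical listing): after adding mkdir support, the
pre-defined configurations would still be plain files while user-created ones
would be directories, leaving two different interfaces in one place:

# ls -F /sys/fs/resctrl/info/L3_MON/counter_configs/
cache_stuff/  mbm_local_bytes  mbm_total_bytes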
>
>>
>>>
>>>>
>>>>>>
>>>>>> # cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>> LclFill, LclNTWr, RmtFill
>>>>>>
>>>>>> 3. The default configurations will be used when the user mounts resctrl.
>>>>>>
>>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>> mkdir /sys/fs/resctrl/test/
>>>>>>
>>>>>> 4. The resctrl groups/domains can be in one of these assignment states.
>>>>>> e: Exclusive
>>>>>> s: Shared
>>>>>> u: Unassigned
>>>>>>
>>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>>> future.
>>>>>>
>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> to list the assignment state of all the groups.
>>>>>>
>>>>>> Format:
>>>>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>>>>
>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> test//mbm_total_bytes:0=e;1=e
>>>>>> test//mbm_local_bytes:0=e;1=e
>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>> //mbm_local_bytes:0=e;1=e
>>
>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>> page of data (these examples never seem to reflect those AMD systems with the many
>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>> and solved when/if going this route.
>
> This problem is not specific to this series. I feel it is a generic problem
> for many similar interfaces. I don't know how it is addressed. I may
> have to investigate this. Any pointers would be helpful.
Dave Martin already did a lot of analysis here. What other pointers do you need?
>
>
>>
>> There seems to be two opinions about this file at moment. Would it be possible to
>> summarize the discussion with pros/cons raised to make an informed selection?
>> I understand that Google as represented by Peter no longer requires/requests this
>> file but the motivation for this change seems new and does not seem to reduce the
>> original motivation for this file. We may also want to separate requirements for reading
>> from and writing to this file.
>
> Yea. We can just use mbm_assign_control for reading the assignment states.
>
> Summary: We have two proposals.
>
> First one from Peter:
>
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>
>
> Pros
> a. Allows flexible creation of free-form names for assignable
> configurations, stored in info/L3_MON/counter_configs/.
>
> b. Events can be accessed using corresponding free-form names in the
> mon_data directory, making it clear to users what each event represents.
>
>
> Cons:
> a. Requires three separate files for assignment in each group
> (assign_exclusive, assign_shared, unassign), which might be excessive.
>
> b. No built-in listing support, meaning users must query each group
> individually to check assignment states.
>
>
> Second Proposal (Mine)
>
> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>
> Pros:
>
> a. Maintains the flexibility of free-form names for assignable
> configurations (info/L3_MON/counter_configs/).
>
> b. Events remain accessible via free-form names in mon_data, ensuring
> clarity on their purpose.
>
> c. Adds the ability to list assignment states for all groups in a single
> command.
>
> Cons:
> a. The listing can overflow the output buffer when there are a large number of
> groups and domains, and the code needed to handle that adds complexity.
>
>
> Third Option: A Hybrid Approach
>
> We could combine elements from both proposals:
>
> a. Retain the free-form naming approach for assignable configurations in
> info/L3_MON/counter_configs/.
>
> b. Use the assignment method from the first proposal:
> $mkdir test
> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>
> c. Introduce read-only listing support via the info/L3_MON/mbm_assign_control
> interface, enabling users to read the assignment states of all groups in one
> place.
>
>
>>
>>>>>>
>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>
>>>>>> Format:
>>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>>
>>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>
>>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>
>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>> test//mbm_total_bytes:0=u;1=u
>>>>>> test//mbm_local_bytes:0=u;1=u
>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>
>>>>>> The corresponding events will be read in
>>>>>>
>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>
>>>>>> 7. In the first stage, only two configurations (mbm_total_bytes and
>>>>>> mbm_local_bytes) will be supported.
>>>>>>
>>>>>> 8. In the future, there will be options to create multiple configurations,
>>>>>> and a corresponding directory will be created in
>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>
>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>> directory? Like this:
>>>>
>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>
>>>> This seems OK (dependent on the user picking meaningful names for
>>>> the set of attributes picked ... but if they want to name this
>>>> monitor file "brian" then they have to live with any confusion
>>>> that they bring on themselves).
>>>>
>>>> Would this involve an extension to kernfs? I don't see a function
>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>
>>>>>>
>>>>>
>>>>> I know you are all busy with multiple series going on parallel. I am still
>>>>> waiting for the inputs on this. It will be great if you can spend some time
>>>>> on this to see if we can find common ground on the interface.
>>>>>
>>>>> Thanks
>>>>> Babu
>>>>
>>>> -Tony
>>>>
>>>
>>>
>>> thanks
>>> Babu
>>
>> Reinette
>>
>>
>
Hi Reinette,
On 3/12/25 10:07, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/11/25 1:35 PM, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>
>>>
>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>> Hi Tony,
>>>>
>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>
>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>>
>>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>>> just two.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>>> supported.
>>>>>>>>>>>>
>>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>>
>>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>>
>>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>>
>>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>>> evolves.
>>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>>
>>>>>>>>>>> # define event configurations
>>>>>>>>>>>
>>>>>>>>>>> ========================================================
>>>>>>>>>>> Bits Mnemonics Description
>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>>>>>> 1 RmtFill Reads to memory in the non-local NUMA domain
>>>>>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>>
>>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>>
>>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> v = VictimBW
>>>>>>>>>>>
>>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>>
>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>
>>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>>> and cleaner to implement.
>>>>>>>>>>
>>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>>
>>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>>
>>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>>
>>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>>
>>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>>
>>>>>>>> Do we need an upper limit?
>>>>>>>
>>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>>> start with 2 default configurations for now.
>>>
>>> There is a big difference between no upper limit and 2. The hardware is
>>> capable of supporting per-domain configurations so more flexibility is
>>> certainly possible. Consider the example presented by Peter in:
>>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>
>>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>>> configure all domains the same for a group.
>>>>>>>>>
>>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>>
>>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>>> aggregation files[2].
>>>>>>>>
>>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>>> counters and reading their values.
>>>>>>>
>>>>>>> Ok. Lets focus on exclusive assignments for now.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>>
>>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>>
>>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>>
>>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>>
>>>>>>> I think we should address this as well. Creating three extra files for
>>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>>
>>>>>>>>> That was another problem we need to address.
>>>>>>>>
>>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>>
>>>>>>> I suggest we should provide users with an option to list the assignments
>>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>>> becomes cumbersome to query each group individually.
>>>>>>>
>>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>>> for this purpose. More details on this below.
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>>> which the count values can be read.
>>>>>>>>>>
>>>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> LclFill
>>>>>>>>>> LclNTWr
>>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>> I feel we can just have the configs; the event_filter file is not required.
>>>>>>>>
>>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>>
>>>>>>>>>
>>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>>> LclNTWr
>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think portable and non-portable event names should both be available
>>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>>> events.
>>>>>>>
>>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>>
>>>>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>>>> feature with options to enhance it in the future. Feel free to
>>>>>>> comment/suggest.
>>>>>>>
>>>>>>> 1. Event configurations will be in
>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>>
>>>>>>> There will be two pre-defined configurations by default.
>>>>>>>
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>> LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>>>
>>>>>>> 2. Users will have options to update these configurations.
>>>>>>>
>>>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>
>>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>>> reporting "local_bytes" any more. They report something different,
>>>>> and users only know if they come to check the options currently
>>>>> configured in this file. Changing the contents without changing
>>>>> the name seems confusing to me.
>>>>
>>>> It is the same behaviour right now with BMEC. It is configurable.
>>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>>
>>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>>
>>> This could be supported by following Peter's original proposal where the name
>>> of the counter configuration is provided by the user via a mkdir:
>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>
>>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>>
>> Sure. We can do that. I was thinking that in the first phase we just provide the
>> default pre-defined configurations and an option to update them.
>>
>> We can add mkdir support later. That way we can provide basic ABMC
>> support without the extra code complexity of mkdir support.
>
> It is not clear to me how you envision the "first phase". Is it what you
> proposed above, for example:
> #echo "LclFill, LclNTWr, RmtFill" >
> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>
> In above the counter configuration name is a file.
Yes. That is correct.
There will be two configuration files by default when resctrl is mounted
when ABMC is enabled.
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>
> How could mkdir support be added to this later if there are already files present?
We already have these directories when resctrl is mounted.
/sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
/sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
/sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
/sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
We don't need "mkdir" support for default configurations.
My plan was to support only the default configurations in the first phase.
That way there is no difference in the usage model with ABMC when mounted.
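To make the first phase concrete, a rough usage flow under the draft proposal
quoted above could look like this (paths and the mbm_assign_control write format
are taken from points 1-6 of that proposal; exact semantics are still under
discussion):

# mount -t resctrl resctrl /sys/fs/resctrl/
# mkdir /sys/fs/resctrl/test
# cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
LclFill, LclNTWr, LclSlowFill
# echo "test//mbm_local_bytes:0=e;1=e" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
# cat /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
<bytes counted by the counter assigned to the "test" group in domain 0>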
>
>>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> LclFill, LclNTWr, RmtFill
>>>>>>>
>>>>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>>>>
>>>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>> mkdir /sys/fs/resctrl/test/
>>>>>>>
>>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>> e: Exclusive
>>>>>>> s: Shared
>>>>>>> u: Unassigned
>>>>>>>
>>>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>>>> future.
>>>>>>>
>>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> to list the assignment state of all the groups.
>>>>>>>
>>>>>>> Format:
>>>>>>> "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>>
>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> test//mbm_total_bytes:0=e;1=e
>>>>>>> test//mbm_local_bytes:0=e;1=e
>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>> //mbm_local_bytes:0=e;1=e
>>>
>>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>>> page of data (these examples never seem to reflect those AMD systems with the many
>>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>>> and solved when/if going this route.
>>
>> This problem is not specific to this series. I feel it is a generic problem
>> for many similar interfaces. I don't know how it is addressed; I may
>> have to investigate this. Any pointers would be helpful.
>
> Dave Martin already did a lot of analysis here. What other pointers do you need?
>
>>
>>
>>>
>>> There seem to be two opinions about this file at the moment. Would it be possible to
>>> summarize the discussion with pros/cons raised to make an informed selection?
>>> I understand that Google as represented by Peter no longer requires/requests this
>>> file but the motivation for this change seems new and does not seem to reduce the
>>> original motivation for this file. We may also want to separate requirements for reading
>>> from and writing to this file.
>>
>> Yea. We can just use mbm_assign_control for reading the assignment states.
>>
>> Summary: We have two proposals.
>>
>> First one from Peter:
>>
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>>
>> Pros
>> a. Allows flexible creation of free-form names for assignable
>> configurations, stored in info/L3_MON/counter_configs/.
>>
>> b. Events can be accessed using corresponding free-form names in the
>> mon_data directory, making it clear to users what each event represents.
>>
>>
>> Cons:
>> a. Requires three separate files for assignment in each group
>> (assign_exclusive, assign_shared, unassign), which might be excessive.
>>
>> b. No built-in listing support, meaning users must query each group
>> individually to check assignment states.
>>
>>
>> Second Proposal (Mine)
>>
>> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>>
>> Pros:
>>
>> a. Maintains the flexibility of free-form names for assignable
>> configurations (info/L3_MON/counter_configs/).
>>
>> b. Events remain accessible via free-form names in mon_data, ensuring
>> clarity on their purpose.
>>
>> c. Adds the ability to list assignment states for all groups in a single
>> command.
>>
>> Cons:
>> a. Potential buffer overflow issues when handling a large number of
>> groups and domains, plus the code complexity needed to fix the issue.
>>
>>
>> Third Option: A Hybrid Approach
>>
>> We could combine elements from both proposals:
>>
>> a. Retain the free-form naming approach for assignable configurations in
>> info/L3_MON/counter_configs/.
>>
>> b. Use the assignment method from the first proposal:
>> $mkdir test
>> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>>
>> c. Introduce listing support via the info/L3_MON/mbm_assign_control
>> interface, enabling users to read assignment states for all groups in one
>> place, with read support only.
>>
>>
>>>
>>>>>>>
>>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>>
>>>>>>> Format:
>>>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>>>
>>>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> test//mbm_total_bytes:0=u;1=u
>>>>>>> test//mbm_local_bytes:0=u;1=u
>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>>
>>>>>>> The corresponding event counts can then be read from:
>>>>>>>
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>
>>>>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>>>>> mbm_local_bytes) will be supported.
>>>>>>>
>>>>>>> 8. In the future, there will be options to create multiple configurations
>>>>>>> and a corresponding directory will be created in
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>>
>>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>>> directory? Like this:
>>>>>
>>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>>
>>>>> This seems OK (dependent on the user picking meaningful names for
>>>>> the set of attributes picked ... but if they want to name this
>>>>> monitor file "brian" then they have to live with any confusion
>>>>> that they bring on themselves).
>>>>>
>>>>> Would this involve an extension to kernfs? I don't see a function
>>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>>
>>>>>>>
>>>>>>
>>>>>> I know you are all busy with multiple series going on in parallel. I am still
>>>>>> waiting for input on this. It would be great if you can spend some time
>>>>>> on this to see if we can find common ground on the interface.
>>>>>>
>>>>>> Thanks
>>>>>> Babu
>>>>>
>>>>> -Tony
>>>>>
>>>>
>>>>
>>>> thanks
>>>> Babu
>>>
>>> Reinette
>>>
>>>
>>
>
>
--
Thanks
Babu Moger
Hi Babu,
On 3/12/25 9:03 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 3/12/25 10:07, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 3/11/25 1:35 PM, Moger, Babu wrote:
>>> Hi All,
>>>
>>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>>
>>>>
>>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>>> Hi Tony,
>>>>>
>>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>>> Hi Babu,
>>>>>>>>>
>>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>>>> just two.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>>>> supported.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>>>
>>>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>>>
>>>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>>>
>>>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>>>> evolves.
>>>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>>>
>>>>>>>>>>>> # define event configurations
>>>>>>>>>>>>
>>>>>>>>>>>> ========================================================
>>>>>>>>>>>> Bits Mnemonics Description
>>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>>>>>>> 1 RmtFill Reads to memory in the non-local NUMA domain
>>>>>>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>>>
>>>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>>>
>>>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>> v = VictimBW
>>>>>>>>>>>>
>>>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>>>
>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>
>>>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>>>> and cleaner to implement.
>>>>>>>>>>>
>>>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>>>
>>>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>>>
>>>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>>>
>>>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>>>
>>>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>>>
>>>>>>>>> Do we need an upper limit?
>>>>>>>>
>>>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>>>> start with 2 default configurations for now.
>>>>
>>>> There is a big difference between no upper limit and 2. The hardware is
>>>> capable of supporting per-domain configurations so more flexibility is
>>>> certainly possible. Consider the example presented by Peter in:
>>>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>>
>>>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>>>> configure all domains the same for a group.
>>>>>>>>>>
>>>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>>>
>>>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>>>> aggregation files[2].
>>>>>>>>>
>>>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>>>> counters and reading their values.
>>>>>>>>
>>>>>>>> Ok. Lets focus on exclusive assignments for now.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>>>
>>>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>>>
>>>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>>>
>>>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>>>
>>>>>>>> I think we should address this as well. Creating three extra files for
>>>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>>>
>>>>>>>>>> That was another problem we need to address.
>>>>>>>>>
>>>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>>>
>>>>>>>> I suggest we should provide users with an option to list the assignments
>>>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>>>> becomes cumbersome to query each group individually.
>>>>>>>>
>>>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>>>> for this purpose. More details on this below.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>>>> which the count values can be read.
>>>>>>>>>>>
>>>>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>> LclFill
>>>>>>>>>>> LclNTWr
>>>>>>>>>>> LclSlowFill
>>>>>>>>>>
>>>>>>>>>> I feel we can just have the configs; the event_filter file is not required.
>>>>>>>>>
>>>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>>>> LclNTWr
>>>>>>>>>> LclSlowFill
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think portable and non-portable event names should both be available
>>>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>>>> events.
>>>>>>>>
>>>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>>>
>>>>>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>>>>> feature with options to enhance it in the future. Feel free to
>>>>>>>> comment/suggest.
>>>>>>>>
>>>>>>>> 1. Event configurations will be in
>>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>>>
>>>>>>>> There will be two pre-defined configurations by default.
>>>>>>>>
>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>>> LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>
>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>>>>
>>>>>>>> 2. Users will have options to update these configurations.
>>>>>>>>
>>>>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>
>>>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>>>> reporting "local_bytes" any more. They report something different,
>>>>>> and users only know if they come to check the options currently
>>>>>> configured in this file. Changing the contents without changing
>>>>>> the name seems confusing to me.
>>>>>
>>>>> It is the same behaviour right now with BMEC. It is configurable.
>>>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>>>
>>>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>>>
>>>> This could be supported by following Peter's original proposal where the name
>>>> of the counter configuration is provided by the user via a mkdir:
>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>>
>>>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>>>
>>> Sure. We can do that. I was thinking that in the first phase we just provide the
>>> default pre-defined configurations and an option to update them.
>>>
>>> We can add mkdir support later. That way we can provide basic ABMC
>>> support without the extra code complexity of mkdir support.
>>
>> It is not clear to me how you envision the "first phase". Is it what you
>> proposed above, for example:
>> #echo "LclFill, LclNTWr, RmtFill" >
>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>
>> In above the counter configuration name is a file.
>
> Yes. That is correct.
>
> There will be two configuration files by default when resctrl is mounted
> when ABMC is enabled.
> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>
>>
>> How could mkdir support be added to this later if there are already files present?
>
> We already have these directories when resctrl is mounted.
> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>
> We don't need "mkdir" support for default configurations.
I was referring to the "mkdir" support for additional configurations that
I understood you are thinking about adding later. For example,
(copied from Peter's message
https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/):
# mkdir info/L3_MON/counter_configs/mbm_local_bytes
# echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill
LclNTWr
LclSlowFill
Any "later" work needs to be backward compatible with the first phase.
If the first phase starts with a file:
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
... I do not see how the second phase can be backward compatible when that work
needs a directory with the same name that contains a file for configuration:
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes/event_filter
Side note: I think interactions with the "event_filter" file need more
description, since it is not clear from the provided example how user space
is expected to interact with the file when adding vs. replacing event configurations.
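For example, with the interface as sketched above it is ambiguous whether

# echo RmtFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter

appends RmtFill to the events already listed in the file or replaces the whole
set with just RmtFill, and there is no obvious way to remove a single event from
the set. (This echo line only reuses the syntax from Peter's example above to
illustrate the ambiguity; the actual semantics are what still needs to be defined.)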
>
> My plan was to support only the default configurations in the first phase.
> That way there is no difference in the usage model with ABMC when mounted.
>
>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>> LclFill, LclNTWr, RmtFill
>>>>>>>>
>>>>>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>>>>>
>>>>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>> mkdir /sys/fs/resctrl/test/
>>>>>>>>
>>>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>>> e: Exclusive
>>>>>>>> s: Shared
>>>>>>>> u: Unassigned
>>>>>>>>
>>>>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>>>>> future.
>>>>>>>>
>>>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>> to list the assignment state of all the groups.
>>>>>>>>
>>>>>>>> Format:
>>>>>>>> "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>>>
>>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>> test//mbm_total_bytes:0=e;1=e
>>>>>>>> test//mbm_local_bytes:0=e;1=e
>>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>
>>>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>>>> page of data (these examples never seem to reflect those AMD systems with the many
>>>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>>>> and solved when/if going this route.
>>>
>>> This problem is not specific to this series. I feel it is a generic problem
>>> for many similar interfaces. I don't know how it is addressed; I may
>>> have to investigate this. Any pointers would be helpful.
>>
>> Dave Martin already did a lot of analysis here. What other pointers do you need?
>>
>>>
>>>
>>>>
>>>> There seem to be two opinions about this file at the moment. Would it be possible to
>>>> summarize the discussion with pros/cons raised to make an informed selection?
>>>> I understand that Google as represented by Peter no longer requires/requests this
>>>> file but the motivation for this change seems new and does not seem to reduce the
>>>> original motivation for this file. We may also want to separate requirements for reading
>>>> from and writing to this file.
>>>
>>> Yea. We can just use mbm_assign_control for reading the assignment states.
>>>
>>> Summary: We have two proposals.
>>>
>>> First one from Peter:
>>>
>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>
>>>
>>> Pros
>>> a. Allows flexible creation of free-form names for assignable
>>> configurations, stored in info/L3_MON/counter_configs/.
>>>
>>> b. Events can be accessed using corresponding free-form names in the
>>> mon_data directory, making it clear to users what each event represents.
>>>
>>>
>>> Cons:
>>> a. Requires three separate files for assignment in each group
>>> (assign_exclusive, assign_shared, unassign), which might be excessive.
>>>
>>> b. No built-in listing support, meaning users must query each group
>>> individually to check assignment states.
>>>
>>>
>>> Second Proposal (Mine)
>>>
>>> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>>>
>>> Pros:
>>>
>>> a. Maintains the flexibility of free-form names for assignable
>>> configurations (info/L3_MON/counter_configs/).
>>>
>>> b. Events remain accessible via free-form names in mon_data, ensuring
>>> clarity on their purpose.
>>>
>>> c. Adds the ability to list assignment states for all groups in a single
>>> command.
>>>
>>> Cons:
>>> a. Potential buffer overflow issues when handling a large number of
>>> groups and domains, plus the code complexity needed to fix the issue.
>>>
>>>
>>> Third Option: A Hybrid Approach
>>>
>>> We could combine elements from both proposals:
>>>
>>> a. Retain the free-form naming approach for assignable configurations in
>>> info/L3_MON/counter_configs/.
>>>
>>> b. Use the assignment method from the first proposal:
>>> $mkdir test
>>> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>>>
>>> c. Introduce listing support via the info/L3_MON/mbm_assign_control
>>> interface, enabling users to read assignment states for all groups in one
>>> place, with read support only.
>>>
>>>
>>>>
>>>>>>>>
>>>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>>>
>>>>>>>> Format:
>>>>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>>>>
>>>>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>
>>>>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>
>>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>> test//mbm_total_bytes:0=u;1=u
>>>>>>>> test//mbm_local_bytes:0=u;1=u
>>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>>>
>>>>>>>> The corresponding event counts can then be read from:
>>>>>>>>
>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>>
>>>>>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>>>>>> mbm_local_bytes) will be supported.
>>>>>>>>
>>>>>>>> 8. In the future, there will be options to create multiple configurations
>>>>>>>> and a corresponding directory will be created in
>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>>>
>>>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>>>> directory? Like this:
>>>>>>
>>>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>>>
>>>>>> This seems OK (dependent on the user picking meaningful names for
>>>>>> the set of attributes picked ... but if they want to name this
>>>>>> monitor file "brian" then they have to live with any confusion
>>>>>> that they bring on themselves).
>>>>>>
>>>>>> Would this involve an extension to kernfs? I don't see a function
>>>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> I know you are all busy with multiple series going on in parallel. I am still
>>>>>>> waiting for input on this. It would be great if you can spend some time
>>>>>>> on this to see if we can find common ground on the interface.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Babu
>>>>>>
>>>>>> -Tony
>>>>>>
>>>>>
>>>>>
>>>>> thanks
>>>>> Babu
>>>>
>>>> Reinette
>>>>
>>>>
>>>
>>
>>
>
Hi Reinette,
On 3/12/25 12:14, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/12/25 9:03 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 3/12/25 10:07, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 3/11/25 1:35 PM, Moger, Babu wrote:
>>>> Hi All,
>>>>
>>>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>>>
>>>>>
>>>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>>>> Hi Tony,
>>>>>>
>>>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>>>> Hi Babu,
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>>>>> group, and a per-group mode where every group can be configured and
>>>>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>>>>> just two.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>>>>> supported.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>>>>> evolves.
>>>>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # define event configurations
>>>>>>>>>>>>>
>>>>>>>>>>>>> ========================================================
>>>>>>>>>>>>> Bits Mnemonics Description
>>>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>>>>>>>> 1 RmtFill Reads to memory in the non-local NUMA domain
>>>>>>>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>>>>
>>>>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>>>>
>>>>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>> v = VictimBW
>>>>>>>>>>>>>
>>>>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>>>>
>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>
>>>>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>>>>> and cleaner to implement.
>>>>>>>>>>>>
>>>>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>>>>
>>>>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>>>>
>>>>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>>>>
>>>>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>>>>
>>>>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>>>>
>>>>>>>>>> Do we need an upper limit?
>>>>>>>>>
>>>>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>>>>> start with 2 default configurations for now.
>>>>>
>>>>> There is a big difference between no upper limit and 2. The hardware is
>>>>> capable of supporting per-domain configurations so more flexibility is
>>>>> certainly possible. Consider the example presented by Peter in:
>>>>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>>>
>>>>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>>>>> configure all domains the same for a group.
>>>>>>>>>>>
>>>>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>>>>
>>>>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>>>>> aggregation files[2].
>>>>>>>>>>
>>>>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>>>>> counters and reading their values.
>>>>>>>>>
>>>>>>>>> OK. Let's focus on exclusive assignments for now.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>>>>
>>>>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>>>>
>>>>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>>>>
>>>>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>>>>
>>>>>>>>> I think we should address this as well. Creating three extra files for
>>>>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>>>>
>>>>>>>>>>> That was another problem we need to address.
>>>>>>>>>>
>>>>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>>>>
>>>>>>>>> I suggest we should provide users with an option to list the assignments
>>>>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>>>>> becomes cumbersome to query each group individually.
>>>>>>>>>
>>>>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>>>>> for this purpose. More details on this below.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>>>>> which the count values can be read.
>>>>>>>>>>>>
>>>>>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>>> LclFill
>>>>>>>>>>>> LclNTWr
>>>>>>>>>>>> LclSlowFill
>>>>>>>>>>>
>>>>>>>>>>> I feel we can just have the configs. The event_filter file is not required.
>>>>>>>>>>
>>>>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>>>>> LclNTWr
>>>>>>>>>>> LclSlowFill
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think portable and non-portable event names should both be available
>>>>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>>>>> events.
>>>>>>>>>
>>>>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>>>>
>>>>>>>>> Here is my draft proposal based on the discussion so far, reusing some
>>>>>>>>> of the current interface. The idea is to start with a basic assignment
>>>>>>>>> feature, with options to enhance it in the future. Feel free to
>>>>>>>>> comment/suggest.
>>>>>>>>>
>>>>>>>>> 1. Event configurations will be in
>>>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>>>>
>>>>>>>>> There will be two pre-defined configurations by default.
>>>>>>>>>
>>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>>>> LclFill, RmtFill, LclSlowFill, RmtSlowFill, VictimBW, LclNTWr, RmtNTWr
>>>>>>>>>
>>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>>>>>
>>>>>>>>> 2. Users will have options to update these configurations.
>>>>>>>>>
>>>>>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>
>>>>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>>>>> reporting "local_bytes" any more. They report something different,
>>>>>>> and users only know if they come to check the options currently
>>>>>>> configured in this file. Changing the contents without changing
>>>>>>> the name seems confusing to me.
>>>>>>
>>>>>> It is the same behaviour right now with BMEC. It is configurable.
>>>>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>>>>
>>>>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>>>>
>>>>> This could be supported by following Peter's original proposal where the name
>>>>> of the counter configuration is provided by the user via a mkdir:
>>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>>>
>>>>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>>>>
>>>> Sure. We can do that. I was thinking that in the first phase we just provide
>>>> the default pre-defined configurations and an option to update them.
>>>>
>>>> We can add the mkdir support later. That way we can provide basic ABMC
>>>> support without the extra code complexity of mkdir support.
>>>
>>> This is not clear to me how you envision the "first phase". Is it what you
>>> proposed above, for example:
>>> #echo "LclFill, LclNTWr, RmtFill" >
>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>
>>> In above the counter configuration name is a file.
>>
>> Yes. That is correct.
>>
>> There will be two configuration files by default when resctrl is mounted
>> with ABMC enabled.
>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>
>>>
>>> How could mkdir support be added to this later if there are already files present?
>>
>> We already have these files when resctrl is mounted.
>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>
>> We don't need "mkdir" support for default configurations.
>
> I was referring to the "mkdir" support for additional configurations that
> I understood you are thinking about adding later. For example,
> (copied from Peter's message
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/):
>
>
> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclFill
> LclNTWr
> LclSlowFill
>
> Any "later" work needs to be backward compatible with the first phase.
Actually, we don't need the extra "event_filter" file.
This was discussed here.
https://lore.kernel.org/lkml/CALPaoChLL8p49eANYgQ0dJiFs7G=223fGae+LJyx3DwEhNeR8A@mail.gmail.com/
# echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes
# echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes
# echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes
# cat info/L3_MON/counter_configs/mbm_local_bytes
LclFill
LclNTWr
LclSlowFill
In the future, we can add mkdir support.
# mkdir info/L3_MON/counter_configs/mbm_read_only
# echo LclFill > info/L3_MON/counter_configs/mbm_read_only
# cat info/L3_MON/counter_configs/mbm_read_only
LclFill
# echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
This would result in the creation of test/mon_data/mon_L3_*/mbm_read_only.
So, there is no breakage of backward compatibility.
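To make the end-to-end flow concrete, here is a rough sketch of how the hybrid
proposal could look from user space. This is only illustrative: the
assign_exclusive file name, the 'e'/'u' state letters, and the assumption that
newly created groups start unassigned while the default group keeps its default
assignments are all taken from the proposals above and are not final.

# mount -t resctrl resctrl /sys/fs/resctrl/
# mkdir /sys/fs/resctrl/test
# echo mbm_local_bytes > /sys/fs/resctrl/test/mon_data/mon_L3_00/assign_exclusive
# cat /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
<value>
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
test//mbm_total_bytes:0=u;1=u
test//mbm_local_bytes:0=e;1=u
//mbm_total_bytes:0=e;1=e
//mbm_local_bytes:0=e;1=e

The counter is assigned only in domain 0, so domain 1 stays 'u' and its
counters remain free for other groups.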
>
> If the first phase starts with a file:
> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> ... I do not see how second phase can be backward compatible when that work
> needs a directory with the same name that contains a file for configuration:
> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>
> sidenote: I think interactions with the "event_filter" file need more
> description since it is not clear with the provided example how user space
> may want to interact with the file when adding vs replacing event configurations.
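To make the ambiguity concrete, here are the two obvious models, purely as
illustration (the event_filter path and event names are from Peter's example
above; the behaviour shown is hypothetical, not implemented anywhere yet):

Append model, where each write adds one event to the current filter:
# echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill
LclNTWr

Replace model, where each write specifies the complete filter:
# echo "LclFill,LclNTWr" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill
LclNTWr

The append model would also need some way to remove an event again (for
example, writing a '-' prefixed name), which the example does not define.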
>
>>
>> My plan was to support only the default configurations in the first phase.
>> That way there is no difference in the usage model with ABMC when mounted.
>>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> LclFill, LclNTWr, RmtFill
>>>>>>>>>
>>>>>>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>>>>>>
>>>>>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>>> mkdir /sys/fs/resctrl/test/
>>>>>>>>>
>>>>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>>>> e: Exclusive
>>>>>>>>> s: Shared
>>>>>>>>> u: Unassigned
>>>>>>>>>
>>>>>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>>>>>> future.
>>>>>>>>>
>>>>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>> to list the assignment state of all the groups.
>>>>>>>>>
>>>>>>>>> Format:
>>>>>>>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>>>>>>>
>>>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>> test//mbm_total_bytes:0=e;1=e
>>>>>>>>> test//mbm_local_bytes:0=e;1=e
>>>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>
>>>>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>>>>> page of data (these examples never seem to reflect those AMD systems with the many
>>>>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>>>>> and solved when/if going this route.
>>>>
>>>> This problem is not specific to this series. I feel it is a generic problem
>>>> for many similar interfaces. I don't know how it is addressed; I may
>>>> have to investigate this. Any pointers would be helpful.
>>>
>>> Dave Martin already did a lot of analysis here. What other pointers do you need?
Yes, he did. I still need a little more detail on the implementation of that.
I will come back to it when we decide which way to go.
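As a rough illustration of the size concern: with the 32 groups x 12 domains
example mentioned earlier and just the two default configurations, a synthetic
listing in the proposed format (the group names, the 'e' state and the trailing
';' are only placeholders) already exceeds a 4KB page:

$ for g in $(seq 0 31); do
    for cfg in mbm_total_bytes mbm_local_bytes; do
      line="group$g//$cfg:"
      for d in $(seq 0 11); do line="$line$d=e;"; done
      echo "$line"
    done
  done | wc -c
4844

So the read path would need to handle output larger than a single page (for
example via a seq_file style iterator) rather than a one-shot buffer.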
>>>
>>>>
>>>>
>>>>>
>>>>> There seem to be two opinions about this file at the moment. Would it be possible to
>>>>> summarize the discussion with pros/cons raised to make an informed selection?
>>>>> I understand that Google as represented by Peter no longer requires/requests this
>>>>> file but the motivation for this change seems new and does not seem to reduce the
>>>>> original motivation for this file. We may also want to separate requirements for reading
>>>>> from and writing to this file.
>>>>
>>>> Yes. We can just use mbm_assign_control for reading the assignment states.
>>>>
>>>> Summary: We have two proposals.
>>>>
>>>> First one from Peter:
>>>>
>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>>
>>>>
>>>> Pros
>>>> a. Allows flexible creation of free-form names for assignable
>>>> configurations, stored in info/L3_MON/counter_configs/.
>>>>
>>>> b. Events can be accessed using corresponding free-form names in the
>>>> mon_data directory, making it clear to users what each event represents.
>>>>
>>>>
>>>> Cons:
>>>> a. Requires three separate files for assignment in each group
>>>> (assign_exclusive, assign_shared, unassign), which might be excessive.
>>>>
>>>> b. No built-in listing support, meaning users must query each group
>>>> individually to check assignment states.
>>>>
>>>>
>>>> Second Proposal (Mine)
>>>>
>>>> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>>>>
>>>> Pros:
>>>>
>>>> a. Maintains the flexibility of free-form names for assignable
>>>> configurations (info/L3_MON/counter_configs/).
>>>>
>>>> b. Events remain accessible via free-form names in mon_data, ensuring
>>>> clarity on their purpose.
>>>>
>>>> c. Adds the ability to list assignment states for all groups in a single
>>>> command.
>>>>
>>>> Cons:
>>>> a. Potential buffer overflow issues when handling a large number of
>>>> groups and domains and code complexity to fix the issue.
>>>>
>>>>
>>>> Third Option: A Hybrid Approach
>>>>
>>>> We could combine elements from both proposals:
>>>>
>>>> a. Retain the free-form naming approach for assignable configurations in
>>>> info/L3_MON/counter_configs/.
>>>>
>>>> b. Use the assignment method from the first proposal:
>>>> $mkdir test
>>>> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>>>>
>>>> c. Introduce listing support via the info/L3_MON/mbm_assign_control
>>>> interface, enabling users to read assignment states for all groups in one
>>>> place. Only reading support.
>>>>
>>>>
>>>>>
>>>>>>>>>
>>>>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>>>>
>>>>>>>>> Format:
>>>>>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>>>>>
>>>>>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>
>>>>>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>
>>>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>> test//mbm_total_bytes:0=u;1=u
>>>>>>>>> test//mbm_local_bytes:0=u;1=u
>>>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>>>>
>>>>>>>>> The corresponding events will be read in
>>>>>>>>>
>>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>>>
>>>>>>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>>>>>>> mbm_local_bytes) will be supported.
>>>>>>>>>
>>>>>>>>> 8. In the future, there will be options to create multiple configurations
>>>>>>>>> and corresponding directory will be created in
>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>>>>
>>>>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>>>>> directory? Like this:
>>>>>>>
>>>>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>>>>
>>>>>>> This seems OK (dependent on the user picking meaningful names for
>>>>>>> the set of attributes picked ... but if they want to name this
>>>>>>> monitor file "brian" then they have to live with any confusion
>>>>>>> that they bring on themselves).
>>>>>>>
>>>>>>> Would this involve an extension to kernfs? I don't see a function
>>>>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> I know you are all busy with multiple series going on in parallel. I am still
>>>>>>>> waiting for the inputs on this. It will be great if you can spend some time
>>>>>>>> on this to see if we can find common ground on the interface.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Babu
>>>>>>>
>>>>>>> -Tony
>>>>>>>
>>>>>>
>>>>>>
>>>>>> thanks
>>>>>> Babu
>>>>>
>>>>> Reinette
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
--
Thanks
Babu Moger
Hi Babu,
On 3/12/25 11:14 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 3/12/25 12:14, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 3/12/25 9:03 AM, Moger, Babu wrote:
>>> Hi Reinette,
>>>
>>> On 3/12/25 10:07, Reinette Chatre wrote:
>>>> Hi Babu,
>>>>
>>>> On 3/11/25 1:35 PM, Moger, Babu wrote:
>>>>> Hi All,
>>>>>
>>>>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>>>>
>>>>>>
>>>>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>>>>> Hi Tony,
>>>>>>>
>>>>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
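>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In kernel terms the read sequence would be roughly the following (untested
>>>>>>>>>>>>>>>> sketch; ABMC_EXTENDED_EVT_ID and ABMC_EVT_L3CACHE are placeholders for the
>>>>>>>>>>>>>>>> ExtendedEvtID and L3CacheABMC encodings from the APM, not existing defines):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> static u64 read_abmc_cntr(u32 cntr_id)
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>         u64 val;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         /* In extended mode the RMID field of QM_EVTSEL carries the counter ID */
>>>>>>>>>>>>>>>>         wrmsr(MSR_IA32_QM_EVTSEL, ABMC_EXTENDED_EVT_ID | ABMC_EVT_L3CACHE, cntr_id);
>>>>>>>>>>>>>>>>         rdmsrl(MSR_IA32_QM_CTR, val);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         /* error/unavailable bit checks omitted in this sketch */
>>>>>>>>>>>>>>>>         return val;
>>>>>>>>>>>>>>>> }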
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>>>>>> just two.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>>>>>> supported.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>>>>>> 2. We should be able to specify multiple event types for a counter (e.g.,
>>>>>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>>>>>> evolves.
>>>>>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # define event configurations
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ====  ===========  ==================================================
>>>>>>>>>>>>>> Bits  Mnemonics    Description
>>>>>>>>>>>>>> ====  ===========  ==================================================
>>>>>>>>>>>>>> 6     VictimBW     Dirty Victims from all types of memory
>>>>>>>>>>>>>> 5     RmtSlowFill  Reads to slow memory in the non-local NUMA domain
>>>>>>>>>>>>>> 4     LclSlowFill  Reads to slow memory in the local NUMA domain
>>>>>>>>>>>>>> 3     RmtNTWr      Non-temporal writes to non-local NUMA domain
>>>>>>>>>>>>>> 2     LclNTWr      Non-temporal writes to local NUMA domain
>>>>>>>>>>>>>> 1     RmtFill      Reads to memory in the non-local NUMA domain
>>>>>>>>>>>>>> 0     LclFill      Reads to memory in the local NUMA domain
>>>>>>>>>>>>>> ====  ===========  ==================================================
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>> v = VictimBW
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>
>>>>>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>>>>>> and cleaner to implement.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>>>>>
>>>>>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>>>>>
>>>>>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>>>>>
>>>>>>>>>>> Do we need an upper limit?
>>>>>>>>>>
>>>>>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>>>>>> start with 2 default configurations for now.
>>>>>>
>>>>>> There is a big difference between no upper limit and 2. The hardware is
>>>>>> capable of supporting per-domain configurations so more flexibility is
>>>>>> certainly possible. Consider the example presented by Peter in:
>>>>>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>>>>
>>>>>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>>>>>> configure all domains the same for a group.
>>>>>>>>>>>>
>>>>>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>>>>>
>>>>>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>>>>>> aggregation files[2].
>>>>>>>>>>>
>>>>>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>>>>>> counters and reading their values.
>>>>>>>>>>
>>>>>>>>>> Ok. Lets focus on exclusive assignments for now.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>>>>>
>>>>>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>>>>>
>>>>>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>>>>>
>>>>>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>>>>>
>>>>>>>>>> I think we should address this as well. Creating three extra files for
>>>>>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>>>>>
>>>>>>>>>>>> That was another problem we need to address.
>>>>>>>>>>>
>>>>>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>>>>>
>>>>>>>>>> I suggest, we should provide users with an option to list the assignments
>>>>>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>>>>>> becomes cumbersome to query each group individually.
>>>>>>>>>>
>>>>>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>>>>>> for this purpose. More details on this below.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>>>>>> which the count values can be read.
>>>>>>>>>>>>>
>>>>>>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>>>>> LclFill
>>>>>>>>>>>>> LclNTWr
>>>>>>>>>>>>> LclSlowFill
>>>>>>>>>>>>
>>>>>>>>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>>>>>>>>
>>>>>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>>>>>> LclNTWr
>>>>>>>>>>>> LclSlowFill
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think portable and non-portable event names should both be available
>>>>>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>>>>>> events.
>>>>>>>>>>
>>>>>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>>>>>
>>>>>>>>>> Here is my draft proposal based on the discussion so far, reusing some
>>>>>>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>>>>>>> feature with options to enhance it in the future. Feel free to
>>>>>>>>>> comment/suggest.
>>>>>>>>>>
>>>>>>>>>> 1. Event configurations will be in
>>>>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>>>>>
>>>>>>>>>> There will be two pre-defined configurations by default.
>>>>>>>>>>
>>>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>>>>> LclFill, LclNTWr, LclSlowFill, VictimBW, RmtSlowFill, RmtNTWr, RmtFill
>>>>>>>>>>
>>>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>
>>>>>>>>>> 2. Users will have options to update these configurations.
>>>>>>>>>>
>>>>>>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>
>>>>>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>>>>>> reporting "local_bytes" any more. They report something different,
>>>>>>>> and users only know if they come to check the options currently
>>>>>>>> configured in this file. Changing the contents without changing
>>>>>>>> the name seems confusing to me.
>>>>>>>
>>>>>>> It is the same behaviour right now with BMEC. It is configurable.
>>>>>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>>>>>
>>>>>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>>>>>
>>>>>> This could be supported by following Peter's original proposal where the name
>>>>>> of the counter configuration is provided by the user via a mkdir:
>>>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>>>>
>>>>>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>>>>>
>>>>> Sure. We can do that. I was thinking in the first phase, just provide the
>>>>> default pre-defined configuration and option to update the configuration.
>>>>>
>>>>> We can add the mkdir support later. That way we can provide basic ABMC
>>>>> support without the extra code complexity of mkdir support.
>>>>
>>>> This is not clear to me how you envision the "first phase". Is it what you
>>>> proposed above, for example:
>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>
>>>> In above the counter configuration name is a file.
>>>
>>> Yes. That is correct.
>>>
>>> There will be two configuration files by default when resctrl is mounted
>>> when ABMC is enabled.
>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>
>>>>
>>>> How could mkdir support be added to this later if there are already files present?
>>>
>>> We already have these directories when resctrl is mounted.
>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>
>>> We dont need "mkdir" support for default configurations.
>>
>> I was referring to the "mkdir" support for additional configurations that
>> I understood you are thinking about adding later. For example,
>> (copied from Peter's message
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/):
>>
>>
>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> LclFill
>> LclNTWr
>> LclSlowFill
>>
>> Any "later" work needs to be backward compatible with the first phase.
>
> Actually, we dont need extra file "event_filter".
> This was discussed here.
> https://lore.kernel.org/lkml/CALPaoChLL8p49eANYgQ0dJiFs7G=223fGae+LJyx3DwEhNeR8A@mail.gmail.com/
I understand from that exchange that it is possible to read/write from
an *existing* kernfs file but it is not obvious to me how that file is
planned to be created.
My understanding of the motivation behind support for "mkdir" is to enable
user space to create custom counter configurations.
I understand that ABMC support aims to start with existing mbm_total_bytes/mbm_local_bytes
configurations but I believe the consensus is that custom configurations need
to be supported in the future.
If resctrl starts with support where counter configuration is
managed with a *file*, for example:
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
how will user space create future custom configurations?
As I understand that is only possible with mkdir.
>
> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes
> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes
> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes
> # cat info/L3_MON/counter_configs/mbm_local_bytes
> LclFill
> LclNTWr
> LclSlowFill
>
> In the future, we can add mkdir support.
>
> # mkdir info/L3_MON/counter_configs/mbm_read_only
This is exactly my concern. resctrl should not start with a user space where
a counter configuration is a file (mbm_local_bytes/mbm_total_bytes) and then
switch user space interface to have counter configuration be done with
directories.
> # echo LclFill > info/L3_MON/counter_configs/mbm_read_only
> # cat info/L3_MON/counter_configs/mbm_read_only
> LclFill
... wait ... user space writes to the directory?
>
> #echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
>
> Which would result in the creation of test/mon_data/mon_L3_*/mbm_read_only
>
> So, there is no breakage of backward compatibility.
The way I understand it I am seeing many incompatibilities. Perhaps I am missing
something. Could you please provide detailed steps of how first phase and
second phase would look?
>
>>
>> If the first phase starts with a file:
>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>> ... I do not see how second phase can be backward compatible when that work
>> needs a directory with the same name that contains a file for configuration:
>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>
>> sidenote: I think interactions with the "event_filter" file needs more
>> descriptions since it is not clear with the provided example how user space
>> may want to interact with the file when adding vs replacing event configurations.
>>
>>>
>>> My plan was to support only the default configurations in the first phase.
>>> That way there is no difference in the usage model with ABMC when mounted.
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>> LclFill, LclNTWr, RmtFill
>>>>>>>>>>
>>>>>>>>>> 3. The default configurations will be used when user mounts the resctrl.
>>>>>>>>>>
>>>>>>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>>>> mkdir /sys/fs/resctrl/test/
>>>>>>>>>>
>>>>>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>>>>> e: Exclusive
>>>>>>>>>> s: Shared
>>>>>>>>>> u: Unassigned
>>>>>>>>>>
>>>>>>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>>>>>>> future.
>>>>>>>>>>
>>>>>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>> to list the assignment state of all the groups.
>>>>>>>>>>
>>>>>>>>>> Format:
>>>>>>>>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>>>>>>>>
>>>>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>> test//mbm_total_bytes:0=e;1=e
>>>>>>>>>> test//mbm_local_bytes:0=e;1=e
>>>>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>
>>>>>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>>>>>> page of data (these examples never seem to reflect those AMD systems with the many
>>>>>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>>>>>> and solved when/if going this route.
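>>>>>>
>>>>>> (Rough arithmetic, reusing the 32 group / 12 domain example from earlier in
>>>>>> this thread: 32 groups x 2 configurations is 64 lines, and with 12 domains
>>>>>> each line is on the order of 80 characters, so the output is already past a
>>>>>> single 4KB page.)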
>>>>>
>>>>> This problem is not specific to this series. I feel it is a generic problem
>>>>> for many of the similar interfaces. I don't know how it is addressed. May
>>>>> have to investigate this. Any pointers would be helpful.
>>>>
>>>> Dave Martin already did a lot of analysis here. What other pointers do you need?
>
> Yea. He did. I still need a little more detail on the implementation of that.
> Will come back to that when we decide which way to go.
>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> There seems to be two opinions about this file at moment. Would it be possible to
>>>>>> summarize the discussion with pros/cons raised to make an informed selection?
>>>>>> I understand that Google as represented by Peter no longer requires/requests this
>>>>>> file but the motivation for this change seems new and does not seem to reduce the
>>>>>> original motivation for this file. We may also want to separate requirements for reading
>>>>>> from and writing to this file.
>>>>>
>>>>> Yea. We can just use mbm_assign_control for reading the assignment states.
>>>>>
>>>>> Summary: We have two proposals.
>>>>>
>>>>> First one from Peter:
>>>>>
>>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>>>
>>>>>
>>>>> Pros
>>>>> a. Allows flexible creation of free-form names for assignable
>>>>> configurations, stored in info/L3_MON/counter_configs/.
>>>>>
>>>>> b. Events can be accessed using corresponding free-form names in the
>>>>> mon_data directory, making it clear to users what each event represents.
>>>>>
>>>>>
>>>>> Cons:
>>>>> a. Requires three separate files for assignment in each group
>>>>> (assign_exclusive, assign_shared, unassign), which might be excessive.
>>>>>
>>>>> b. No built-in listing support, meaning users must query each group
>>>>> individually to check assignment states.
>>>>>
>>>>>
>>>>> Second Proposal (Mine)
>>>>>
>>>>> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>>>>>
>>>>> Pros:
>>>>>
>>>>> a. Maintains the flexibility of free-form names for assignable
>>>>> configurations (info/L3_MON/counter_configs/).
>>>>>
>>>>> b. Events remain accessible via free-form names in mon_data, ensuring
>>>>> clarity on their purpose.
>>>>>
>>>>> c. Adds the ability to list assignment states for all groups in a single
>>>>> command.
>>>>>
>>>>> Cons:
>>>>> a. Potential buffer overflow issues when handling a large number of
>>>>> groups and domains and code complexity to fix the issue.
>>>>>
>>>>>
>>>>> Third Option: A Hybrid Approach
>>>>>
>>>>> We could combine elements from both proposals:
>>>>>
>>>>> a. Retain the free-form naming approach for assignable configurations in
>>>>> info/L3_MON/counter_configs/.
>>>>>
>>>>> b. Use the assignment method from the first proposal:
>>>>> $mkdir test
>>>>> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>>>>>
>>>>> c. Introduce listing support via the info/L3_MON/mbm_assign_control
>>>>> interface, enabling users to read assignment states for all groups in one
>>>>> place. Only reading support.
>>>>>
>>>>>
>>>>>>
>>>>>>>>>>
>>>>>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>>>>>
>>>>>>>>>> Format:
>>>>>>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>>>>>>
>>>>>>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>
>>>>>>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>
>>>>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>> test//mbm_total_bytes:0=u;1=u
>>>>>>>>>> test//mbm_local_bytes:0=u;1=u
>>>>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>>>>>
>>>>>>>>>> The corresponding events will be read in
>>>>>>>>>>
>>>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>>>>
>>>>>>>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
>>>>>>>>>> mbm_local_bytes) will be supported.
>>>>>>>>>>
>>>>>>>>>> 8. In the future, there will be options to create multiple configurations
>>>>>>>>>> and corresponding directory will be created in
>>>>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>>>>>
>>>>>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>>>>>> directory? Like this:
>>>>>>>>
>>>>>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>>>>>
>>>>>>>> This seems OK (dependent on the user picking meaningful names for
>>>>>>>> the set of attributes picked ... but if they want to name this
>>>>>>>> monitor file "brian" then they have to live with any confusion
>>>>>>>> that they bring on themselves).
>>>>>>>>
>>>>>>>> Would this involve an extension to kernfs? I don't see a function
>>>>>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I know you are all busy with multiple series going on parallel. I am still
>>>>>>>>> waiting for the inputs on this. It will be great if you can spend some time
>>>>>>>>> on this to see if we can find common ground on the interface.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Babu
>>>>>>>>
>>>>>>>> -Tony
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> thanks
>>>>>>> Babu
>>>>>>
>>>>>> Reinette
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
Hi Reinette,
On 3/13/25 11:08, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/12/25 11:14 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 3/12/25 12:14, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 3/12/25 9:03 AM, Moger, Babu wrote:
>>>> Hi Reinette,
>>>>
>>>> On 3/12/25 10:07, Reinette Chatre wrote:
>>>>> Hi Babu,
>>>>>
..
>>>>>> We can add the mkdir support later. That way we can provide basic ABMC
>>>>>> support without the extra code complexity of mkdir support.
>>>>>
>>>>> This is not clear to me how you envision the "first phase". Is it what you
>>>>> proposed above, for example:
>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>
>>>>> In above the counter configuration name is a file.
>>>>
>>>> Yes. That is correct.
>>>>
>>>> There will be two configuration files by default when resctrl is mounted
>>>> when ABMC is enabled.
>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>
>>>>>
>>>>> How could mkdir support be added to this later if there are already files present?
>>>>
>>>> We already have these directories when resctrl is mounted.
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>
>>>> We dont need "mkdir" support for default configurations.
>>>
>>> I was referring to the "mkdir" support for additional configurations that
>>> I understood you are thinking about adding later. For example,
>>> (copied from Peter's message
>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/):
>>>
>>>
>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> LclFill
>>> LclNTWr
>>> LclSlowFill
>>>
>>> Any "later" work needs to be backward compatible with the first phase.
>>
>> Actually, we dont need extra file "event_filter".
>> This was discussed here.
>> https://lore.kernel.org/lkml/CALPaoChLL8p49eANYgQ0dJiFs7G=223fGae+LJyx3DwEhNeR8A@mail.gmail.com/
>
> I understand from that exchange that it is possible to read/write from
> an *existing* kernfs file but it is not obvious to me how that file is
> planned to be created.
My bad. I misspoke here. We need an "event_filter" file under each
configuration.
>
> My understanding of the motivation behind support for "mkdir" is to enable
> user space to create custom counter configurations.
>
That is correct.
> I understand that ABMC support aims to start with existing mbm_total_bytes/mbm_local_bytes
> configurations but I believe the consensus is that custom configurations need
> to be supported in the future.
> If resctrl starts with support where counter configuration is
> managed with a *file*, for example:
> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
> how will user space create future custom configurations?
> As I understand that is only possible with mkdir.
>
>>
>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes
>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes
>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes
>> # cat info/L3_MON/counter_configs/mbm_local_bytes
>> LclFill
>> LclNTWr
>> LclSlowFill
>>
>> In the future, we can add mkdir support.
>>
>> # mkdir info/L3_MON/counter_configs/mbm_read_only
>
> This is exactly my concern. resctrl should not start with a user space where
> a counter configuration is a file (mbm_local_bytes/mbm_total_bytes) and then
> switch user space interface to have counter configuration be done with
> directories.
>
>> # echo LclFill > info/L3_MON/counter_configs/mbm_read_only
>> # cat info/L3_MON/counter_configs/mbm_read_only
>> LclFill
>
> ... wait ... user space writes to the directory?
>
My bad. This is wrong. Let me rewrite the steps below.
>
>
>>
>> #echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
>>
>> Which would result in the creation of test/mon_data/mon_L3_*/mbm_read_only
>>
>> So, there is no breakage of backward compatibility.
>
> The way I understand it I am seeing many incompatibilities. Perhaps I am missing
> something. Could you please provide detailed steps of how first phase and
> second phase would look?
No. You didn't miss anything. I misspoke on a few steps.
Here are the steps. Just copying steps from Peters proposal.
https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
1. Mount the resctrl
mount -t resctrl resctrl /sys/fs/resctrl
2. When ABMC is supported two default configurations will be created.
a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
These files will be populated with default total and local events
# cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter
VictimBW
RmtSlowFill
RmtNTWr
RmtFill
LclFill
LclNTWr
LclSlowFill
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill,
LclNTWr
LclSlowFill
3. Users will have options to update the event configuration.
echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
4. As usual the events can be read from the mon_data directories.
#mkdir /sys/fs/resctrl/test
#cd /sys/fs/resctrl/test
#cat mon_data/mon_L3_00/mbm_total_bytes
101010
#cat mon_data/mon_L3_00/mbm_local_bytes
32323
5. There will be 3 files created in each group's mon_data directory when
ABMC is supported.
a. test/mon_data/mon_L3_00/assign_exclusive
b. test/mon_data/mon_L3_00/assign_shared
c. test/mon_data/mon_L3_00/unassign
6. Events can be assigned/unassigned by these commands
# echo mbm_total_bytes > test/mon_data/mon_L3_00/assign_exclusive
# echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive
# echo mbm_local_bytes > test/mon_data/mon_L3_01/unassign
Note:
I feel 3 files are excessive here. We can probably achieve everything in
just one file.
Not sure about mbm_assign_control interface as there are concerns with
group listing holding the lock for long.
-----------------------------------------------------------------------
Second phase, we can add support for "mkdir"
1. mkdir info/L3_MON/counter_configs/mbm_read_only
2. mkdir option will create "event_filter" file.
info/L3_MON/counter_configs/mbm_read_only/event_filter
3. Users can modify event configuration.
echo LclFill > info/L3_MON/counter_configs/mbm_read_only/event_filter
4. Users can assign the events
echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
5. Events can be read in
test/mon_data/mon_L3_00/mbm_read_only
--
Thanks
Babu Moger
Hi Babu,

On 3/13/25 1:13 PM, Moger, Babu wrote:
> On 3/13/25 11:08, Reinette Chatre wrote:
>> On 3/12/25 11:14 AM, Moger, Babu wrote:
>>> On 3/12/25 12:14, Reinette Chatre wrote:
>>>> On 3/12/25 9:03 AM, Moger, Babu wrote:
>>>>> On 3/12/25 10:07, Reinette Chatre wrote:

> Here are the steps. Just copying steps from Peters proposal.
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/

Thank you very much for detailing the steps. It is starting to fall into place
for me.

>
>
> 1. Mount the resctrl
> mount -t resctrl resctrl /sys/fs/resctrl

I assume that on ABMC system the plan remains to have ABMC enabled by default, which
will continue to depend on BMEC.

How would the existing BMEC implementation be impacted in this case?

Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config
files will remain and user space may continue to use them to change the event
configurations with confusing expectations/results on an ABMC system.

One possibility may be that a user may see below on ABMC system even if BMEC is supported:
# cat /sys/fs/resctrl/info/L3_MON/mon_features
llc_occupancy
mbm_total_bytes
mbm_local_bytes

With the above a user cannot be expected to want to interact with mbm_total_bytes_config
and mbm_local_bytes_config, which may be the simplest to do.

To follow that, we should also consider how "mon_features" will change with this
implementation.

>
> 2. When ABMC is supported two default configurations will be created.
>
> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>
> These files will be populated with default total and local events
> # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter
> VictimBW
> RmtSlowFill
> RmtNTWr
> RmtFill
> LclFill
> LclNTWr
> LclSlowFill

Looks good. Here we could perhaps start nitpicking about naming and line separation.
I think it may be easier if the fields are separated by comma, but more on that
below ...

>
> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclFill,
> LclNTWr
> LclSlowFill
>
> 3. Users will have options to update the event configuration.
> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter

We need to be clear on how user space interacts with this file. For example,
can user space "append" configurations? Specifically, if the file has
contents like your earlier example:
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill
LclNTWr
LclSlowFill

Should above be created with (note "append" needed for second and third):
echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
echo LclNTWr >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter
echo LclSlowFill >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter

Is it possible to set multiple configurations in one write like below?
echo "LclFill,LclNTWr,LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter

(note above where it may be easier for user space to use comma (or some other field separator)
when providing multiple configurations at a time, with this, to match, having output in
commas may be easier since it makes user interface copy&paste easier)

If file has content like:
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclNTWr
LclSlowFill

What is impact of the following:
echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter

Is it (append):
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill
LclNTWr
LclSlowFill

or (overwrite):
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill

I do think the interface will be more intuitive if it follows regular file
operations wrt "append" and such. I have not looked into how kernfs supports
"append".

As alternative, we can try to work the previous mbm_assign_control syntax in
here (use + and -). For example:
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclNTWr
# echo "+LclFill,-LclNTWr,+LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
# cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
LclFill,LclSlowFill

With something like above resctrl just deals with file writes as before.

>
> 4. As usual the events can be read from the mon_data directories.
> #mkdir /sys/fs/resctrl/test
> #cd /sys/fs/resctrl/test
> #cat mon_data/mon_L3_00/mbm_total_bytes
> 101010
> #cat mon_data/mon_L3_00/mbm_local_bytes
> 32323
>
> 5. There will be 3 files created in each group's mon_data directory when
> ABMC is supported.
>
> a. test/mon_data/mon_L3_00/assign_exclusive
> b. test/mon_data/mon_L3_00/assign_shared
> c. test/mon_data/mon_L3_00/unassign
>
> 6. Events can be assigned/unassigned by these commands
>
> # echo mbm_total_bytes > test/mon_data/mon_L3_00/assign_exclusive
> # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive
> # echo mbm_local_bytes > test/mon_data/mon_L3_01/unassign
>
> Note:
> I feel 3 files are excessive here. We can probably achieve everything in
> just one file.

Could you please elaborate what your concern is? You mention that it is
excessive but it is not clear to me what issues may arise by having three
files instead of one.

I do think, and Peter also mentioned [1] this, that it may be useful to
"put a group/resource-scoped assign_* file higher in the hierarchy to make
it easier for users who want to configure all domains the same for a group."
Placing *additional* files higher in hierarchy (used to manage counters in
all domains) may be more useful than trying to provide the
shared/exclusive/unassign in one file per domain.

>
> Not sure about mbm_assign_control interface as there are concerns with
> group listing holding the lock for long.
>
> -----------------------------------------------------------------------
> Second phase, we can add support for "mkdir"
>
> 1. mkdir info/L3_MON/counter_configs/mbm_read_only
>
> 2. mkdir option will create "event_filter" file.
> info/L3_MON/counter_configs/mbm_read_only/event_filter
>

Got it!

> 3. Users can modify event configuration.
> echo LclFill > info/L3_MON/counter_configs/mbm_read_only/event_filter
>
> 4. Users can assign the events
>
> echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
>
> 5. Events can be read in
>
> test/mon_data/mon_L3_00/mbm_read_only
>

Related to comment from Tony [2] about rmdir, please also consider that original
mbm_local_bytes/mbm_total_bytes could also be removed because at this point they
should not appear different from other counter configurations ... apart from
being pre-populated for backward compatibility.

Thank you.

Reinette

[1] https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/Z9NB0wd8ZewLjNAd@agluck-desk3/
Hi Reinette,
On Thu, Mar 13, 2025 at 10:22 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Babu,
>
> On 3/13/25 1:13 PM, Moger, Babu wrote:
> > On 3/13/25 11:08, Reinette Chatre wrote:
> >> On 3/12/25 11:14 AM, Moger, Babu wrote:
> >>> On 3/12/25 12:14, Reinette Chatre wrote:
> >>>> On 3/12/25 9:03 AM, Moger, Babu wrote:
> >>>>> On 3/12/25 10:07, Reinette Chatre wrote:
>
>
> > Here are the steps. Just copying steps from Peters proposal.
> > https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>
> Thank you very much for detailing the steps. It is starting to fall into place
> for me.
>
> >
> >
> > 1. Mount the resctrl
> > mount -t resctrl resctrl /sys/fs/resctrl
>
> I assume that on ABMC system the plan remains to have ABMC enabled by default, which
> will continue to depend on BMEC.
>
> How would the existing BMEC implementation be impacted in this case?
>
> Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config
> files will remain and user space may continue to use them to change the event
> configurations with confusing expectations/results on an ABMC system.
>
> One possibility may be that a user may see below on ABMC system even if BMEC is supported:
> # cat /sys/fs/resctrl/info/L3_MON/mon_features
> llc_occupancy
> mbm_total_bytes
> mbm_local_bytes
>
> With the above a user cannot be expected to want to interact with mbm_total_bytes_config
> and mbm_local_bytes_config, which may be the simplest to do.
How about making mbm_local_bytes and mbm_total_bytes always be
configured using mbm_{local,total}_bytes_config and only allowing the
full ABMC configurability on user-defined configurations. This could
resolve the issue of backwards compatibility with the BMEC files and
remove the need for the user opting-in to ABMC mode.
It will be less clean implementation-wise, since there will be two
classes of event configuration to deal with, but I think it seems
logical from the user's side.
>
> To follow that, we should also consider how "mon_features" will change with this
> implementation.
>
> >
> > 2. When ABMC is supported two default configurations will be created.
> >
> > a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
> > b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >
> > These files will be populated with default total and local events
> > # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter
> > VictimBW
> > RmtSlowFill
> > RmtNTWr
> > RmtFill
> > LclFill
> > LclNTWr
> > LclSlowFill
>
> Looks good. Here we could perhaps start nitpicking about naming and line separation.
> I think it may be easier if the fields are separated by comma, but more on that
> below ...
>
> >
> > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> > LclFill,
> > LclNTWr
> > LclSlowFill
> >
> > 3. Users will have options to update the event configuration.
> > echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>
> We need to be clear on how user space interacts with this file. For example,
> can user space "append" configurations? Specifically, if the file has
> contents like your earlier example:
> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclFill
> LclNTWr
> LclSlowFill
>
> Should above be created with (note "append" needed for second and third):
> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> echo LclNTWr >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> echo LclSlowFill >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>
> Is it possible to set multiple configurations in one write like below?
> echo "LclFill,LclNTWr,LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>
> (note above where it may be easier for user space to use comma (or some other field separator)
> when providing multiple configurations at a time, with this, to match, having output in
> commas may be easier since it makes user interface copy&paste easier)
>
> If file has content like:
> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclNTWr
> LclSlowFill
>
> What is impact of the following:
> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>
> Is it (append):
> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclFill
> LclNTWr
> LclSlowFill
>
> or (overwrite):
> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclFill
>
> I do think the interface will be more intuitive if it follows regular file
> operations wrt "append" and such. I have not looked into how kernfs supports
> "append".
I expect specifying counter_configs to be a rare or one-time
operation, so I think ease-of-use is the only concern. I think
multiple, appending writes is the most straightforward to implement
and invoke (for a shell user), but I think commas are easy enough to
support as well, even though it would look better when reading back to
see the entries on separate lines.
I believe you can inspect the file descriptor's flags from the
kernfs_open_file reference: of->file->f_flags & O_APPEND
I haven't tried this, though.
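Roughly (untested sketch; the counter_config type and the helpers below are
hypothetical, only the O_APPEND check is the point here):

static ssize_t event_filter_write(struct kernfs_open_file *of, char *buf,
				  size_t nbytes, loff_t off)
{
	struct counter_config *cfg = of->kn->priv;	/* hypothetical per-config data */
	int ret;

	/* a plain '>' redirect replaces the event list, '>>' (O_APPEND) extends it */
	if (!(of->file->f_flags & O_APPEND))
		clear_event_list(cfg);			/* hypothetical helper */

	ret = parse_event_list(cfg, buf);		/* hypothetical helper */
	return ret ?: nbytes;
}
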
-Peter
Hi Peter,
On 3/17/2025 11:27 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Thu, Mar 13, 2025 at 10:22 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Babu,
>>
>> On 3/13/25 1:13 PM, Moger, Babu wrote:
>>> On 3/13/25 11:08, Reinette Chatre wrote:
>>>> On 3/12/25 11:14 AM, Moger, Babu wrote:
>>>>> On 3/12/25 12:14, Reinette Chatre wrote:
>>>>>> On 3/12/25 9:03 AM, Moger, Babu wrote:
>>>>>>> On 3/12/25 10:07, Reinette Chatre wrote:
>>
>>
>>> Here are the steps. Just copying steps from Peters proposal.
>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>> Thank you very much for detailing the steps. It is starting to fall into place
>> for me.
>>
>>>
>>>
>>> 1. Mount the resctrl
>>> mount -t resctrl resctrl /sys/fs/resctrl
>>
>> I assume that on ABMC system the plan remains to have ABMC enabled by default, which
>> will continue to depend on BMEC.
>>
>> How would the existing BMEC implementation be impacted in this case?
>>
>> Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config
>> files will remain and user space may continue to use them to change the event
>> configurations with confusing expectations/results on an ABMC system.
>>
>> One possibility may be that a user may see below on ABMC system even if BMEC is supported:
>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>> llc_occupancy
>> mbm_total_bytes
>> mbm_local_bytes
>>
>> With the above a user cannot be expected to want to interact with mbm_total_bytes_config
>> and mbm_local_bytes_config, which may be the simplest to do.
>
> How about making mbm_local_bytes and mbm_total_bytes always be
> configured using mbm_{local,total}_bytes_config and only allowing the
> full ABMC configurability on user-defined configurations. This could
> resolve the issue of backwards compatibility with the BMEC files and
> remove the need for the user opting-in to ABMC mode.
There is no opt-in mode. ABMC will be enabled by default if supported.
Users will have option to go back to legacy mode.
The default configurations will be used for total(0x7f equivalent to
enable all) and local(0x15 equivalent to all local events).
Same thing will show up at
a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
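
(For reference, using the BwType bit table from earlier in the thread: 0x7f sets
bits 0-6, i.e. all seven events, and 0x15 sets bits 0, 2 and 4, i.e. LclFill,
LclNTWr and LclSlowFill.)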
>
> It will be less clean implementation-wise, since there will be two
> classes of event configuration to deal with, but I think it seems
> logical from the user's side.
>
>>
>> To follow that, we should also consider how "mon_features" will change with this
>> implementation.
>>
>>>
>>> 2. When ABMC is supported two default configurations will be created.
>>>
>>> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
>>> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>
>>> These files will be populated with default total and local events
>>> # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter
>>> VictimBW
>>> RmtSlowFill
>>> RmtNTWr
>>> RmtFill
>>> LclFill
>>> LclNTWr
>>> LclSlowFill
>>
>> Looks good. Here we could perhaps start nitpicking about naming and line separation.
>> I think it may be easier if the fields are separated by comma, but more on that
>> below ...
>>
>>>
>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> LclFill,
>>> LclNTWr
>>> LclSlowFill
>>>
>>> 3. Users will have options to update the event configuration.
>>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>
>> We need to be clear on how user space interacts with this file. For example,
>> can user space "append" configurations? Specifically, if the file has
>> contents like your earlier example:
>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> LclFill
>> LclNTWr
>> LclSlowFill
>>
>> Should above be created with (note "append" needed for second and third):
>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> echo LclNTWr >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> echo LclSlowFill >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>
>> Is it possible to set multiple configurations in one write like below?
>> echo "LclFill,LclNTWr,LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>
>> (note above where it may be easier for user space to use comma (or some other field separator)
>> when providing multiple configurations at a time, with this, to match, having output in
>> commas may be easier since it makes user interface copy&paste easier)
>>
>> If file has content like:
>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> LclNTWr
>> LclSlowFill
>>
>> What is impact of the following:
>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>
>> Is it (append):
>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> LclFill
>> LclNTWr
>> LclSlowFill
>>
>> or (overwrite):
>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> LclFill
>>
>> I do think the interface will be more intuitive if it follows regular file
>> operations wrt "append" and such. I have not looked into how kernfs supports
>> "append".
>
> I expect specifying counter_configs to be a rare or one-time
> operation, so I think ease-of-use is the only concern. I think
> multiple, appending writes is the most straightforward to implement
> and invoke (for a shell user), but I think commas are easy enough to
> support as well, even though it would look better when reading back to
> see the entries on separate lines.
>
> I believe you can inspect the file descriptor's flags from the
> kernfs_open_file reference: of->file->f_flags & O_APPEND
>
> I haven't tried this, though.
>
> -Peter
>
Hi Babu and Peter,
On 3/17/25 4:00 PM, Moger, Babu wrote:
> Hi Peter,
>
> On 3/17/2025 11:27 AM, Peter Newman wrote:
>> Hi Reinette,
>>
>> On Thu, Mar 13, 2025 at 10:22 PM Reinette Chatre
>> <reinette.chatre@intel.com> wrote:
>>>
>>> Hi Babu,
>>>
>>> On 3/13/25 1:13 PM, Moger, Babu wrote:
>>>> On 3/13/25 11:08, Reinette Chatre wrote:
>>>>> On 3/12/25 11:14 AM, Moger, Babu wrote:
>>>>>> On 3/12/25 12:14, Reinette Chatre wrote:
>>>>>>> On 3/12/25 9:03 AM, Moger, Babu wrote:
>>>>>>>> On 3/12/25 10:07, Reinette Chatre wrote:
>>>
>>>
>>>> Here are the steps. Just copying steps from Peters proposal.
>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>
>>> Thank you very much for detailing the steps. It is starting to fall into place
>>> for me.
>>>
>>>>
>>>>
>>>> 1. Mount the resctrl
>>>> mount -t resctrl resctrl /sys/fs/resctrl
>>>
>>> I assume that on ABMC system the plan remains to have ABMC enabled by default, which
>>> will continue to depend on BMEC.
>>>
>>> How would the existing BMEC implementation be impacted in this case?
>>>
>>> Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config
>>> files will remain and user space may continue to use them to change the event
>>> configurations with confusing expectations/results on an ABMC system.
>>>
>>> One possibility may be that a user may see below on ABMC system even if BMEC is supported:
>>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>>> llc_occupancy
>>> mbm_total_bytes
>>> mbm_local_bytes
>>>
>>> With the above a user cannot be expected to want to interact with mbm_total_bytes_config
>>> and mbm_local_bytes_config, which may be the simplest to do.
>>
>> How about making mbm_local_bytes and mbm_total_bytes always be
>> configured using mbm_{local,total}_bytes_config and only allowing the
>> full ABMC configurability on user-defined configurations. This could
>> resolve the issue of backwards compatibility with the BMEC files and
>> remove the need for the user opting-in to ABMC mode.
hmmm, yes, backward compatibility is a big issue with an earlier suggestion
from me. Users with scripts/tools using mbm_{local,total}_bytes_config
would expect that to continue to work on systems that support BMEC.
resctrl could continue to use mbm_{local,total}_bytes_config
even though the inconsistent interface is not ideal.
>
> There is no opt-in mode. ABMC will be enabled by default if supported.
> Users will have option to go back to legacy mode.
I assume there will still be the opt-in for automatic counter assignment
on monitor group creation (mkdir)?
>
> The default configurations will be used for total(0x7f equivalent to enable all) and local(0x15 equivalent to all local events).
>
> Same thing will show up at
> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
These files could possibly be read-only but the moment user space uses
mbm_{local,total}_bytes_config to change the configurations between domains
this will be invalid. In this case the file could also perhaps
read "Configured using <path to>mbm_{local,total}_bytes_config". It is
not clear to me what would be most intuitive to user space.
>
>>
>> It will be less clean implementation-wise, since there will be two
>> classes of event configuration to deal with, but I think it seems
>> logical from the user's side.
>>
>>>
>>> To follow that, we should also consider how "mon_features" will change with this
>>> implementation.
>>>
>>>>
>>>> 2. When ABMC is supported two default configurations will be created.
>>>>
>>>> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
>>>> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>
>>>> These files will be populated with default total and local events
>>>> # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter
>>>> VictimBW
>>>> RmtSlowFill
>>>> RmtNTWr
>>>> RmtFill
>>>> LclFill
>>>> LclNTWr
>>>> LclSlowFill
>>>
>>> Looks good. Here we could perhaps start nitpicking about naming and line separation.
>>> I think it may be easier if the fields are separated by comma, but more on that
>>> below ...
>>>
>>>>
>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>> LclFill,
>>>> LclNTWr
>>>> LclSlowFill
>>>>
>>>> 3. Users will have options to update the event configuration.
>>>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>
>>> We need to be clear on how user space interacts with this file. For example,
>>> can user space "append" configurations? Specifically, if the file has
>>> contents like your earlier example:
>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> LclFill
>>> LclNTWr
>>> LclSlowFill
>>>
>>> Should above be created with (note "append" needed for second and third):
>>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> echo LclNTWr >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> echo LclSlowFill >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>
>>> Is it possible to set multiple configurations in one write like below?
>>> echo "LclFill,LclNTWr,LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>
>>> (note above that it may be easier for user space to use a comma (or some other field separator)
>>> when providing multiple configurations at a time; to match, having the output
>>> comma-separated may be easier since it makes copy & paste at the user interface easier)
>>>
>>> If file has content like:
>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> LclNTWr
>>> LclSlowFill
>>>
>>> What is the impact of the following:
>>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>
>>> Is it (append):
>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> LclFill
>>> LclNTWr
>>> LclSlowFill
>>>
>>> or (overwrite):
>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>> LclFill
>>>
>>> I do think the interface will be more intuitive if it follows regular file
>>> operations wrt "append" and such. I have not looked into how kernfs supports
>>> "append".
>>
>> I expect specifying counter_configs to be a rare or one-time
>> operation, so I think ease-of-use is the only concern. I think
>> multiple, appending writes is the most straightforward to implement
>> and invoke (for a shell user), but I think commas are easy enough to
>> support as well, even though it would look better when reading back to
>> see the entries on separate lines.
When the counter configuration consists of multiple events it may
be convenient to just write it all in one go, and having a shell user use
newline as the field separator does not seem convenient. Appending writes sound
good no matter the field separator.
Reading back we may have to consider both what looks good to user space and
what is easy to parse by a script.
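For example, a script could read back a comma-separated filter with something like
the below (bash, purely illustrative and assuming the proposed file exists):
IFS=', ' read -r -a events < info/L3_MON/counter_configs/mbm_local_bytes/event_filter
printf 'event: %s\n' "${events[@]}"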
>>
>> I believe you can inspect the file descriptor's flags from the
>> kernfs_open_file reference: of->file->f_flags & O_APPEND
>>
>> I haven't tried this, though.
Thanks for looking this up.
Reinette
Hi Reinette,
On 3/19/25 15:53, Reinette Chatre wrote:
> Hi Babu and Peter,
>
> On 3/17/25 4:00 PM, Moger, Babu wrote:
>> Hi Peter,
>>
>> On 3/17/2025 11:27 AM, Peter Newman wrote:
>>> Hi Reinette,
>>>
>>> On Thu, Mar 13, 2025 at 10:22 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>>
>>>> Hi Babu,
>>>>
>>>> On 3/13/25 1:13 PM, Moger, Babu wrote:
>>>>> On 3/13/25 11:08, Reinette Chatre wrote:
>>>>>> On 3/12/25 11:14 AM, Moger, Babu wrote:
>>>>>>> On 3/12/25 12:14, Reinette Chatre wrote:
>>>>>>>> On 3/12/25 9:03 AM, Moger, Babu wrote:
>>>>>>>>> On 3/12/25 10:07, Reinette Chatre wrote:
>>>>
>>>>
>>>>> Here are the steps. Just copying steps from Peters proposal.
>>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>>
>>>> Thank you very much for detailing the steps. It is starting to fall into place
>>>> for me.
>>>>
>>>>>
>>>>>
>>>>> 1. Mount the resctrl
>>>>> mount -t resctrl resctrl /sys/fs/resctrl
>>>>
>>>> I assume that on ABMC system the plan remains to have ABMC enabled by default, which
>>>> will continue to depend on BMEC.
>>>>
>>>> How would the existing BMEC implementation be impacted in this case?
>>>>
>>>> Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config
>>>> files will remain and user space may continue to use them to change the event
>>>> configurations with confusing expectations/results on an ABMC system.
>>>>
>>>> One possibility may be that a user may see below on ABMC system even if BMEC is supported:
>>>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>>>> llc_occupancy
>>>> mbm_total_bytes
>>>> mbm_local_bytes
>>>>
>>>> With the above a user cannot be expected to want to interact with mbm_total_bytes_config
>>>> and mbm_local_bytes_config, which may be the simplest to do.
>>>
>>> How about making mbm_local_bytes and mbm_total_bytes always be
>>> configured using mbm_{local,total}_bytes_config and only allowing the
>>> full ABMC configurability on user-defined configurations. This could
>>> resolve the issue of backwards compatibility with the BMEC files and
>>> remove the need for the user opting-in to ABMC mode.
>
> hmmm, yes, backward compatibility is a big issue with an earlier suggestion
> from me. Users with scripts/tools using mbm_{local,total}_bytes_config
> would expect that to continue to work on systems that support BMEC.
> resctrl could continue to use mbm_{local,total}_bytes_config
> even though the inconsistent interface is not ideal
>
>>
>> There is no opt-in mode. ABMC will be enabled by default if supported.
>> Users will have option to go back to legacy mode.
>
> I assume there will still be the opt-in for automatic counter assignment
> on monitor group creation (mkdir)?
Yes. It will be available.
--
Thanks
Babu Moger
Hi Reinette, On 3/13/2025 4:21 PM, Reinette Chatre wrote: > Hi Babu, > > On 3/13/25 1:13 PM, Moger, Babu wrote: >> On 3/13/25 11:08, Reinette Chatre wrote: >>> On 3/12/25 11:14 AM, Moger, Babu wrote: >>>> On 3/12/25 12:14, Reinette Chatre wrote: >>>>> On 3/12/25 9:03 AM, Moger, Babu wrote: >>>>>> On 3/12/25 10:07, Reinette Chatre wrote: > > >> Here are the steps. Just copying steps from Peters proposal. >> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/ > > Thank you very much for detailing the steps. It is starting the fall into place > for me. > >> >> >> 1. Mount the resctrl >> mount -t resctrl resctrl /sys/fs/resctrl > > I assume that on ABMC system the plan remains to have ABMC enabled by default, which > will continue to depend on BMEC. Yes. ABMC will be enabled by default. ABMC will use the configurations from info/L3_MON/counter_configs. ABMC will not depend on BMEC. > How would the existing BMEC implementation be impacted in this case? BMEC will only work with pre-ABMC(or default) mode. > > Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config > files will remain and user space may continue to use them to change the event > configurations with confusing expecations/results on an ABMC system. > > One possibility may be that a user may see below on ABMC system even if BMEC is supported: > # cat /sys/fs/resctrl/info/L3_MON/mon_features > llc_occupancy > mbm_total_bytes > mbm_local_bytes > > With the above a user cannot be expected to want to interact with mbm_total_bytes_config > and mbm_local_bytes_config, which may be the simplest to do. yes. > > To follow that, we should also consider how "mon_features" will change with this > implementation. May be # cat /sys/fs/resctrl/info/L3_MON/mon_features llc_occupancy mbm_total_bytes mbm_local_bytes counter_configs/mbm_total_bytes/event_filter counter_configs/mbm_local_bytes/event_filter > >> >> 2. When ABMC is supported two default configurations will be created. >> >> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter >> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> >> These files will be populated with default total and local events >> # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter >> VictimBW >> RmtSlowFill >> RmtNTWr >> RmtFill >> LclFill >> LclNTWr >> LclSlowFill > > Looks good. Here we could perhaps start nitpicking about naming and line separation. > I think it may be easier if the fields are separated by comma, but more on that > below ... > >> >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclFill, >> LclNTWr >> LclSlowFill >> >> 3. Users will have options to update the event configuration. >> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter > > We need to be clear on how user space interacts with this file. For example, > can user space "append" configurations? Specifically, if the file has > contents like your earlier example: > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclFill > LclNTWr > LclSlowFill > > Should above be created with (note "append" needed for second and third): > echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter > echo LclNTWr >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter > echo LclSlowFill >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter > > Is it possible to set multiple configurations in one write like below? 
> echo "LclFill,LclNTWr,LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter Yes. We should support that. > > (note above where it may be easier for user space to use comma (or some other field separator) > when providing multiple configurations at a time, with this, to match, having output in > commas may be easier since it makes user interface copy&paste easier) > > If file has content like: > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclNTWr > LclSlowFill > > What is impact of the following: > echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter > > Is it (append): > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclFill > LclNTWr > LclSlowFill > > or (overwrite): > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclFill > > I do think the interface will be more intuitive it if follows regular file > operations wrt "append" and such. I have not looked into how kernfs supports > "append". Just searching quickly, I have not seen any append operations on kernfs. > As alternative, we can try to work the previous mbm_assign_control syntax in here (use + and -). > > For example: > > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclNTWr > # echo "+LclFill,-LclNTWr,+LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclFill,LclSlowFill > > With something like above resctrl just deals with file writes as before. Or without complicating much we can just support basic operations. # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter LclFill, LclNTWr, LclSlowFill # echo "LclFill, LclNTWr, LclSlowFill, VictimBW" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter LclFill, LclNTWr, LclSlowFill, VictimBW # echo "LclFill, LclNTWr" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter LclFill, LclNTWr > > >> >> 4. As usual the events can be read from the mon_data directories. >> #mkdir /sys/fs/resctrl/test >> #cd /sys/fs/resctr/test >> #cat test/mon_data/mon_data/mon_L3_00/mbm_tota_bytes >> 101010 >> #cat test/mon_data/mon_data/mon_L3_00/mbm_local_bytes >> 32323 >> >> 5. There will be 3 files created in each group's mon_data directory when >> ABMC is supported. >> >> a. test/mon_data/mon_L3_00/assign_exclusive >> b. test/mon_data/mon_L3_00/assign_shared >> c. test/mon_data/mon_L3_00/unassign >> >> >> 6. Events can be assigned/unassigned by these commands >> >> # echo mbm_total_bytes > test/mon_data/mon_L3_00/assign_exclusive >> # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive >> # echo mbm_local_bytes > test/mon_data/mon_L3_01/unassign >> >> >> Note: >> I feel 3 files are excessive here. We can probably achieve everything in >> just one file. > > Could you please elaborate what your concern is? You mention that it is > excessive but it is not clear to me what issues may arise by > having three files instead of one. All these 3 properties are mutually exclusive. Only one can true at a time. Example: #cat assign_exclusive 0 #cat assign_shared 0 #cat uassigned 1 Three operations to find out the assign state. 
Instead of that #cat mon_l3_assignments unassigned > > I do think, and Peter also mentioned [1] this, that it may be useful, > to "put a group/resource-scoped assign_* file higher in the hierarchy > to make it easier for users who want to configure all domains the > same for a group." > > Placing *additional* files higher in hierarchy (used to manage counters in all > domains) may be more useful that trying to provide the shared/exclusive/unassign > in one file per domain. Yea. To make it better we can add "mon_l3_assignments" in groups main directory. We can do all the operation in just one file. https://lore.kernel.org/lkml/efb5293f-b0ef-4c94-bf10-9ca7ebb3b53f@amd.com/ > >> >> Not sure about mbm_assign_control interface as there are concerns with >> group listing holding the lock for long. >> >> ----------------------------------------------------------------------- >> Second phase, we can add support for "mkdir" >> >> 1. mkdir info/L3_MON/counter_configs/mbm_read_only >> >> 2. mkdir option will create "event_filter" file. >> info/L3_MON/counter_configs/mbm_read_only/event_filter >> > > Got it! > >> 3. Users can modify event configuration. >> echo LclFill > info/L3_MON/counter_configs/mbm_read_only/event_filter >> >> 4. Users can assign the events >> >> echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive >> >> 5. Events can be read in >> >> test/mon_data/mon_data/mon_L3_00/mbm_read_only >> > > Related to comment from Tony [2] about rmdir, please also consider that > original mbm_local_bytes/mbm_total_bytes could also be removed because at this > point they should not appear different from other counter configurations ... apart > from being pre-populated for backward compatibility. Sure. > > Thank you. > > Reinette > > > [1] https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/ > [2] https://lore.kernel.org/lkml/Z9NB0wd8ZewLjNAd@agluck-desk3/ > Thanks Babu
Hi Babu, On 3/14/25 9:18 AM, Moger, Babu wrote: > On 3/13/2025 4:21 PM, Reinette Chatre wrote: >> On 3/13/25 1:13 PM, Moger, Babu wrote: >>> On 3/13/25 11:08, Reinette Chatre wrote: >>>> On 3/12/25 11:14 AM, Moger, Babu wrote: >>>>> On 3/12/25 12:14, Reinette Chatre wrote: >>>>>> On 3/12/25 9:03 AM, Moger, Babu wrote: >>>>>>> On 3/12/25 10:07, Reinette Chatre wrote: >> >> >>> Here are the steps. Just copying steps from Peters proposal. >>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/ >> >> Thank you very much for detailing the steps. It is starting the fall into place >> for me. >> >>> >>> >>> 1. Mount the resctrl >>> mount -t resctrl resctrl /sys/fs/resctrl >> >> I assume that on ABMC system the plan remains to have ABMC enabled by default, which >> will continue to depend on BMEC. > > Yes. ABMC will be enabled by default. ABMC will use the configurations from info/L3_MON/counter_configs. ABMC will not depend on BMEC. I see. The previous dependency was thus just something enforced by OS to support the chosen implementation? Looks like the two features share some registers. > >> How would the existing BMEC implementation be impacted in this case? > > BMEC will only work with pre-ABMC(or default) mode. ok. Does this mean that if a user boots kernel with "rdt=!bmec" then ABMC will keep working? >> Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config >> files will remain and user space may continue to use them to change the event >> configurations with confusing expecations/results on an ABMC system. >> >> One possibility may be that a user may see below on ABMC system even if BMEC is supported: >> # cat /sys/fs/resctrl/info/L3_MON/mon_features >> llc_occupancy >> mbm_total_bytes >> mbm_local_bytes >> >> With the above a user cannot be expected to want to interact with mbm_total_bytes_config >> and mbm_local_bytes_config, which may be the simplest to do. > > yes. > >> >> To follow that, we should also consider how "mon_features" will change with this >> implementation. > > May be > > # cat /sys/fs/resctrl/info/L3_MON/mon_features > llc_occupancy > mbm_total_bytes > mbm_local_bytes > counter_configs/mbm_total_bytes/event_filter > counter_configs/mbm_local_bytes/event_filter > I read the docs again to understand what kind of flexibility we have here. As I interpret it the "mon_features" is associated with "events" and there is a clear documented association between the "events" listed in this file and which files a user can expect to exist in the "mon_data" directory. Considering this I think it may be helpful to provide the counter configurations in this file. This matches well with mbm_total_bytes/mbm_local_bytes being treated as "counter configurations". Whether counter configuration is supported could be determined by existence of the "counter_configs" directory? For example, # cat /sys/fs/resctrl/info/L3_MON/mon_features llc_occupancy mbm_total_bytes mbm_local_bytes # mkdir /sys/fs/resctrl/info/L3_MON/counter_configs/only_read_fills # cat /sys/fs/resctrl/info/L3_MON/mon_features llc_occupancy mbm_total_bytes mbm_local_bytes only_read_fills This could possibly be a way to support user interface when configuring the counter. For example, a user may easily create a new counter configuration by creating a directory, but there may be some requirements wrt its configuration that need to be met before that configuration/event may appear in the "mon_features" file. >>> 2. 
When ABMC is supported two default configurations will be created. >>> >>> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter >>> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> >>> These files will be populated with default total and local events >>> # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter >>> VictimBW >>> RmtSlowFill >>> RmtNTWr >>> RmtFill >>> LclFill >>> LclNTWr >>> LclSlowFill >> >> Looks good. Here we could perhaps start nitpicking about naming and line separation. >> I think it may be easier if the fields are separated by comma, but more on that >> below ... >> >>> >>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> LclFill, >>> LclNTWr >>> LclSlowFill >>> >>> 3. Users will have options to update the event configuration. >>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> >> We need to be clear on how user space interacts with this file. For example, >> can user space "append" configurations? Specifically, if the file has >> contents like your earlier example: >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclFill >> LclNTWr >> LclSlowFill >> >> Should above be created with (note "append" needed for second and third): >> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> echo LclNTWr >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> echo LclSlowFill >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> >> Is it possible to set multiple configurations in one write like below? >> echo "LclFill,LclNTWr,LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter > > Yes. We should support that. Reading Peter's response (https://lore.kernel.org/lkml/CALPaoCj7aSVxHisQTdKQ5KN0-aNzN8rRkRPVc7pjGMLSxfPvrA@mail.gmail.com/) it sounds as though this part is now in the fine-tuning phase. If there are other formats that is more convenient for user space then we should surely consider that. > >> >> (note above where it may be easier for user space to use comma (or some other field separator) >> when providing multiple configurations at a time, with this, to match, having output in >> commas may be easier since it makes user interface copy&paste easier) >> >> If file has content like: >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclNTWr >> LclSlowFill >> >> What is impact of the following: >> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> >> Is it (append): >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclFill >> LclNTWr >> LclSlowFill >> >> or (overwrite): >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclFill >> >> I do think the interface will be more intuitive it if follows regular file >> operations wrt "append" and such. I have not looked into how kernfs supports >> "append". > > Just searching quickly, I have not seen any append operations on kernfs. > > >> As alternative, we can try to work the previous mbm_assign_control syntax in here (use + and -). >> >> For example: >> >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclNTWr >> # echo "+LclFill,-LclNTWr,+LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclFill,LclSlowFill >> >> With something like above resctrl just deals with file writes as before. > > Or without complicating much we can just support basic operations. 
> > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclFill, LclNTWr, LclSlowFill > > # echo "LclFill, LclNTWr, LclSlowFill, VictimBW" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter > > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclFill, LclNTWr, LclSlowFill, VictimBW > > # echo "LclFill, LclNTWr" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter > > # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter > LclFill, LclNTWr > Looks good to me. As I see it this could be expanded to support "append" if needed. >> >> >>> >>> 4. As usual the events can be read from the mon_data directories. >>> #mkdir /sys/fs/resctrl/test >>> #cd /sys/fs/resctr/test >>> #cat test/mon_data/mon_data/mon_L3_00/mbm_tota_bytes >>> 101010 >>> #cat test/mon_data/mon_data/mon_L3_00/mbm_local_bytes >>> 32323 >>> >>> 5. There will be 3 files created in each group's mon_data directory when >>> ABMC is supported. >>> >>> a. test/mon_data/mon_L3_00/assign_exclusive >>> b. test/mon_data/mon_L3_00/assign_shared >>> c. test/mon_data/mon_L3_00/unassign >>> >>> >>> 6. Events can be assigned/unassigned by these commands >>> >>> # echo mbm_total_bytes > test/mon_data/mon_L3_00/assign_exclusive >>> # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive >>> # echo mbm_local_bytes > test/mon_data/mon_L3_01/unassign >>> >>> >>> Note: >>> I feel 3 files are excessive here. We can probably achieve everything in >>> just one file. >> >> Could you please elaborate what your concern is? You mention that it is >> excessive but it is not clear to me what issues may arise by >> having three files instead of one. > > All these 3 properties are mutually exclusive. Only one can true at a time. Example: > #cat assign_exclusive > 0 > #cat assign_shared > 0 > #cat uassigned > 1 > > Three operations to find out the assign state. ah - good point. > > Instead of that > #cat mon_l3_assignments > unassigned > > >> >> I do think, and Peter also mentioned [1] this, that it may be useful, >> to "put a group/resource-scoped assign_* file higher in the hierarchy >> to make it easier for users who want to configure all domains the >> same for a group." >> >> Placing *additional* files higher in hierarchy (used to manage counters in all >> domains) may be more useful that trying to provide the shared/exclusive/unassign >> in one file per domain. > > Yea. To make it better we can add "mon_l3_assignments" in groups main directory. We can do all the operation in just one file. > > https://lore.kernel.org/lkml/efb5293f-b0ef-4c94-bf10-9ca7ebb3b53f@amd.com/ I am hesitant to respond to that message considering the corporate preamble that sneaked in so I'll just add some thoughts here: Having the file higher in hierarchy does seem more useful. It may be useful to reduce amount of parsing to get to needed information. Compare with below two examples that can be used to convey the same information: # cat /sys/fs/resctrl/test/mon_L3_assignments mbm_total_bytes: 0=unassigned; 1=unassigned mbm_local_bytes: 0=unassigned; 1=unassigned #cat /sys/fs/resctrl/test/mon_L3_assignments 0=_; 1=_ We need to take care that it is always clear what "0" or "1" means ... Tony has been mentioning a lot of interesting things about scope changes. I assume the "L3" in mon_L3_assignments will dictate the scope? With a syntax like above the needed information can be presented in one line. 
For example, #cat /sys/fs/resctrl/test/mon_L3_assignments 0=mbm_total_bytes; 1=mbm_local_bytes The caveat is that is only for assigned counters, not shared, so this needs to change. To support shared assignment ... I wonder if it will be useful to users to get the information on which other monitor groups the counter is shared _with_? Some examples: a) Just indicate whether a counter is shared or dedicated. (Introduce flags). #cat /sys/fs/resctrl/test/mon_L3_assignments 0=mbm_total_bytes:s; 1=mbm_local_bytes:d b) Indicate which groups a counter is shared with: #cat /sys/fs/resctrl/testA/mon_L3_assignments 0=mbm_total_bytes:s(testB); 1=mbm_local_bytes:d(not needed but perhaps empty for consistent interface?) #cat /sys/fs/resctrl/testB/mon_L3_assignments 0=mbm_total_bytes:s(testA); 1=mbm_local_bytes:d(?) ... (b) may just be overkill and we should instead follow Tony's guideline (see https://lore.kernel.org/lkml/Z9CiwLrhuTODruCj@agluck-desk3/ ) that users should be able to keep track themselves. Reinette
Hi Reinette, On 3/19/25 13:36, Reinette Chatre wrote: > Hi Babu, > > On 3/14/25 9:18 AM, Moger, Babu wrote: >> On 3/13/2025 4:21 PM, Reinette Chatre wrote: >>> On 3/13/25 1:13 PM, Moger, Babu wrote: >>>> On 3/13/25 11:08, Reinette Chatre wrote: >>>>> On 3/12/25 11:14 AM, Moger, Babu wrote: >>>>>> On 3/12/25 12:14, Reinette Chatre wrote: >>>>>>> On 3/12/25 9:03 AM, Moger, Babu wrote: >>>>>>>> On 3/12/25 10:07, Reinette Chatre wrote: >>> >>> >>>> Here are the steps. Just copying steps from Peters proposal. >>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/ >>> >>> Thank you very much for detailing the steps. It is starting the fall into place >>> for me. >>> >>>> >>>> >>>> 1. Mount the resctrl >>>> mount -t resctrl resctrl /sys/fs/resctrl >>> >>> I assume that on ABMC system the plan remains to have ABMC enabled by default, which >>> will continue to depend on BMEC. >> >> Yes. ABMC will be enabled by default. ABMC will use the configurations from info/L3_MON/counter_configs. ABMC will not depend on BMEC. > > I see. The previous dependency was thus just something enforced by OS to support the > chosen implementation? Yes. That is correct. We went that route mainly not to change the rmid_read operation. With ABMC, we need to set Extended EVTID and ABMC bit in QM_EVTSEL register while reading the cntr_id events. Will add those patches in next version to make it clear. > Looks like the two features share some registers. > >> >>> How would the existing BMEC implementation be impacted in this case? >> >> BMEC will only work with pre-ABMC(or default) mode. > > ok. Does this mean that if a user boots kernel with "rdt=!bmec" then ABMC will keep working? Yes. That is correct. > > >>> Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config >>> files will remain and user space may continue to use them to change the event >>> configurations with confusing expecations/results on an ABMC system. >>> >>> One possibility may be that a user may see below on ABMC system even if BMEC is supported: >>> # cat /sys/fs/resctrl/info/L3_MON/mon_features >>> llc_occupancy >>> mbm_total_bytes >>> mbm_local_bytes >>> >>> With the above a user cannot be expected to want to interact with mbm_total_bytes_config >>> and mbm_local_bytes_config, which may be the simplest to do. >> >> yes. >> >>> >>> To follow that, we should also consider how "mon_features" will change with this >>> implementation. >> >> May be >> >> # cat /sys/fs/resctrl/info/L3_MON/mon_features >> llc_occupancy >> mbm_total_bytes >> mbm_local_bytes >> counter_configs/mbm_total_bytes/event_filter >> counter_configs/mbm_local_bytes/event_filter >> > > I read the docs again to understand what kind of flexibility we have here. As I interpret it > the "mon_features" is associated with "events" and there is a clear documented association > between the "events" listed in this file and which files a user can expect to exist in the > "mon_data" directory. Considering this I think it may be helpful to provide the > counter configurations in this file. This matches well with mbm_total_bytes/mbm_local_bytes > being treated as "counter configurations". > > Whether counter configuration is supported could be determined by existence of > the "counter_configs" directory? 
> > For example, > # cat /sys/fs/resctrl/info/L3_MON/mon_features > llc_occupancy > mbm_total_bytes > mbm_local_bytes > > # mkdir /sys/fs/resctrl/info/L3_MON/counter_configs/only_read_fills > > # cat /sys/fs/resctrl/info/L3_MON/mon_features > llc_occupancy > mbm_total_bytes > mbm_local_bytes > only_read_fills > > This could possibly be a way to support user interface when configuring the > counter. For example, a user may easily create a new counter configuration > by creating a directory, but there may be some requirements wrt its configuration > that need to be met before that configuration/event may appear in the > "mon_features" file. Yes. I am fine with this approach. > >>>> 2. When ABMC is supported two default configurations will be created. >>>> >>>> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter >>>> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>>> >>>> These files will be populated with default total and local events >>>> # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter >>>> VictimBW >>>> RmtSlowFill >>>> RmtNTWr >>>> RmtFill >>>> LclFill >>>> LclNTWr >>>> LclSlowFill >>> >>> Looks good. Here we could perhaps start nitpicking about naming and line separation. >>> I think it may be easier if the fields are separated by comma, but more on that >>> below ... >>> >>>> >>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>>> LclFill, >>>> LclNTWr >>>> LclSlowFill >>>> >>>> 3. Users will have options to update the event configuration. >>>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> >>> We need to be clear on how user space interacts with this file. For example, >>> can user space "append" configurations? Specifically, if the file has >>> contents like your earlier example: >>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> LclFill >>> LclNTWr >>> LclSlowFill >>> >>> Should above be created with (note "append" needed for second and third): >>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> echo LclNTWr >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> echo LclSlowFill >> info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> >>> Is it possible to set multiple configurations in one write like below? >>> echo "LclFill,LclNTWr,LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> >> Yes. We should support that. > > Reading Peter's response (https://lore.kernel.org/lkml/CALPaoCj7aSVxHisQTdKQ5KN0-aNzN8rRkRPVc7pjGMLSxfPvrA@mail.gmail.com/) > it sounds as though this part is now in the fine-tuning phase. > If there are other formats that is more convenient for user space then we should surely > consider that. I aggee. We can revise it further as we review. 
> >> >>> >>> (note above where it may be easier for user space to use comma (or some other field separator) >>> when providing multiple configurations at a time, with this, to match, having output in >>> commas may be easier since it makes user interface copy&paste easier) >>> >>> If file has content like: >>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> LclNTWr >>> LclSlowFill >>> >>> What is impact of the following: >>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> >>> Is it (append): >>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> LclFill >>> LclNTWr >>> LclSlowFill >>> >>> or (overwrite): >>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> LclFill >>> >>> I do think the interface will be more intuitive it if follows regular file >>> operations wrt "append" and such. I have not looked into how kernfs supports >>> "append". >> >> Just searching quickly, I have not seen any append operations on kernfs. >> >> >>> As alternative, we can try to work the previous mbm_assign_control syntax in here (use + and -). >>> >>> For example: >>> >>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> LclNTWr >>> # echo "+LclFill,-LclNTWr,+LclSlowFill" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >>> LclFill,LclSlowFill >>> >>> With something like above resctrl just deals with file writes as before. >> >> Or without complicating much we can just support basic operations. >> >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclFill, LclNTWr, LclSlowFill >> >> # echo "LclFill, LclNTWr, LclSlowFill, VictimBW" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclFill, LclNTWr, LclSlowFill, VictimBW >> >> # echo "LclFill, LclNTWr" > info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> >> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter >> LclFill, LclNTWr >> > > Looks good to me. As I see it this could be expanded to support "append" if needed. thanks. > >>> >>> >>>> >>>> 4. As usual the events can be read from the mon_data directories. >>>> #mkdir /sys/fs/resctrl/test >>>> #cd /sys/fs/resctr/test >>>> #cat test/mon_data/mon_data/mon_L3_00/mbm_tota_bytes >>>> 101010 >>>> #cat test/mon_data/mon_data/mon_L3_00/mbm_local_bytes >>>> 32323 >>>> >>>> 5. There will be 3 files created in each group's mon_data directory when >>>> ABMC is supported. >>>> >>>> a. test/mon_data/mon_L3_00/assign_exclusive >>>> b. test/mon_data/mon_L3_00/assign_shared >>>> c. test/mon_data/mon_L3_00/unassign >>>> >>>> >>>> 6. Events can be assigned/unassigned by these commands >>>> >>>> # echo mbm_total_bytes > test/mon_data/mon_L3_00/assign_exclusive >>>> # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive >>>> # echo mbm_local_bytes > test/mon_data/mon_L3_01/unassign >>>> >>>> >>>> Note: >>>> I feel 3 files are excessive here. We can probably achieve everything in >>>> just one file. >>> >>> Could you please elaborate what your concern is? You mention that it is >>> excessive but it is not clear to me what issues may arise by >>> having three files instead of one. >> >> All these 3 properties are mutually exclusive. Only one can true at a time. Example: >> #cat assign_exclusive >> 0 >> #cat assign_shared >> 0 >> #cat uassigned >> 1 >> >> Three operations to find out the assign state. > > ah - good point. 
> >> >> Instead of that >> #cat mon_l3_assignments >> unassigned >> >> >>> >>> I do think, and Peter also mentioned [1] this, that it may be useful, >>> to "put a group/resource-scoped assign_* file higher in the hierarchy >>> to make it easier for users who want to configure all domains the >>> same for a group." >>> >>> Placing *additional* files higher in hierarchy (used to manage counters in all >>> domains) may be more useful that trying to provide the shared/exclusive/unassign >>> in one file per domain. >> >> Yea. To make it better we can add "mon_l3_assignments" in groups main directory. We can do all the operation in just one file. >> >> https://lore.kernel.org/lkml/efb5293f-b0ef-4c94-bf10-9ca7ebb3b53f@amd.com/ > > I am hesitant to respond to that message considering the corporate preamble that > sneaked in so I'll just add some thoughts here: Yea. I noticed it later. Will take care next time. > > Having the file higher in hierarchy does seem more useful. It may be useful to reduce > amount of parsing to get to needed information. Compare with below two examples that can > be used to convey the same information: > > # cat /sys/fs/resctrl/test/mon_L3_assignments > mbm_total_bytes: 0=unassigned; 1=unassigned > mbm_local_bytes: 0=unassigned; 1=unassigned > > #cat /sys/fs/resctrl/test/mon_L3_assignments > 0=_; 1=_ > > We need to take care that it is always clear what "0" or "1" means ... > Tony has been mentioning a lot of interesting things about scope > changes. I assume the "L3" in mon_L3_assignments will dictate the scope? I didnt think about the scope here. I was thinking of changing it to "mbm_assignments". > > With a syntax like above the needed information can be presented in one line. > For example, > > #cat /sys/fs/resctrl/test/mon_L3_assignments > 0=mbm_total_bytes; 1=mbm_local_bytes > > The caveat is that is only for assigned counters, not shared, so this needs > to change. > > To support shared assignment ... I wonder if it will be useful to users to > get the information on which other monitor groups the counter is shared _with_? > > Some examples: > > a) Just indicate whether a counter is shared or dedicated. (Introduce flags). > #cat /sys/fs/resctrl/test/mon_L3_assignments > 0=mbm_total_bytes:s; 1=mbm_local_bytes:d > > b) Indicate which groups a counter is shared with: > #cat /sys/fs/resctrl/testA/mon_L3_assignments > 0=mbm_total_bytes:s(testB); 1=mbm_local_bytes:d(not needed but perhaps empty for consistent interface?) > #cat /sys/fs/resctrl/testB/mon_L3_assignments > 0=mbm_total_bytes:s(testA); 1=mbm_local_bytes:d(?) This format does not tell what is going on with all other domains. We need to display all the domains. I think that is important because users need to know what to expect reading the events on specific domains. I think we need to convey all the following information to the user. 1. Event Configuation: What is event configuration applied here? 2. Domains: To which all the domains the configaration is applied? This is useful in multi-domain configuration. We also need to know if which domains are assigned or unassigned. 3. Type of assignment: Exclusive(e or d) or Shared(s) or Unassigned(_) 4. Finally: Where to read the events? This is important when we add "mkdir" support in the future, mon_data/mon_l3_*/config_name will be created. With that in mind this might be helpful. # cat /sys/fs/resctrl/test/mon_L3_assignments mbm_total_bytes: 0=e; 1=_ mbm_local_bytes: 0=_; 1=s This format tells the user all the information. 
mbm_total_bytes and mbm_local_bytes configurations are applied and configuration are coming from counter_configs. User can read the events in mon_data/mon_L3_*/mbm_total_bytes mon_data/mon_L3_*/mbm_local_bytes mbm_total_bytes is assigned on domain 0 and not on domain 1. Reading the mbm_total_bytes on domain 1 will report "unassigned". mbm_local_bytes is not assigned on domain 0 and assigned on domain 1. Reading the mbm_local_bytes on domain 0 will report "unassigned". I dont have much information on shared assignment at this point. Dont know if we can display shared group. > > ... (b) may just be overkill and we should instead follow Tony's > guideline (see https://lore.kernel.org/lkml/Z9CiwLrhuTODruCj@agluck-desk3/ ) > that users should be able to keep track themselves. > > Reinette > -- Thanks Babu Moger
Hi Babu,
On 3/20/25 11:12 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 3/19/25 13:36, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 3/14/25 9:18 AM, Moger, Babu wrote:
>>> On 3/13/2025 4:21 PM, Reinette Chatre wrote:
>>>> On 3/13/25 1:13 PM, Moger, Babu wrote:
>>>>> On 3/13/25 11:08, Reinette Chatre wrote:
>>>>>> On 3/12/25 11:14 AM, Moger, Babu wrote:
>>>>>>> On 3/12/25 12:14, Reinette Chatre wrote:
>>>>>>>> On 3/12/25 9:03 AM, Moger, Babu wrote:
>>>>>>>>> On 3/12/25 10:07, Reinette Chatre wrote:
>>>>
>>>>
>>>>> Here are the steps. Just copying steps from Peters proposal.
>>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>>
>>>> Thank you very much for detailing the steps. It is starting to fall into place
>>>> for me.
>>>>
>>>>>
>>>>>
>>>>> 1. Mount the resctrl
>>>>> mount -t resctrl resctrl /sys/fs/resctrl
>>>>
>>>> I assume that on ABMC system the plan remains to have ABMC enabled by default, which
>>>> will continue to depend on BMEC.
>>>
>>> Yes. ABMC will be enabled by default. ABMC will use the configurations from info/L3_MON/counter_configs. ABMC will not depend on BMEC.
>>
>> I see. The previous dependency was thus just something enforced by OS to support the
>> chosen implementation?
>
> Yes. That is correct. We went that route mainly not to change the
> rmid_read operation.
>
> With ABMC, we need to set Extended EVTID and ABMC bit in QM_EVTSEL
> register while reading the cntr_id events. Will add those patches in next
> version to make it clear.
Thank you.
>
>> Looks like the two features share some registers.
>>
>>>
>>>> How would the existing BMEC implementation be impacted in this case?
>>>
>>> BMEC will only work with pre-ABMC(or default) mode.
>>
>> ok. Does this mean that if a user boots kernel with "rdt=!bmec" then ABMC will keep working?
>
> Yes. That is correct.
Just to confirm and bring the two email threads together ... it sounds like the
expectation is that existing users of BMEC are expected to use mon_features to
know if mbm_{total,local}_bytes_config are supported. If system supports ABMC
then BMEC will not be available and thus mon_features will not contain
mbm_{total,local}_bytes_config. Existing users that rely on
mbm_{total,local}_bytes_config will experience failures and are expected
to switch to ABMC?
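If that is the expectation, a script that wants to keep working in both cases could,
for example, key off mon_features before touching either interface (sketch only; the
counter_configs path is the proposal in this thread, not an existing file, and the
domain IDs/values are illustrative):
if grep -qx mbm_local_bytes_config /sys/fs/resctrl/info/L3_MON/mon_features; then
        # BMEC present: use the existing per-domain bitmask interface
        echo "0=0x15;1=0x15" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
else
        # otherwise fall back to the event_filter file proposed in this thread
        echo "LclFill, LclNTWr, LclSlowFill" > /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes/event_filter
fi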
>
>>
>>
>>>> Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config
>>>> files will remain and user space may continue to use them to change the event
>>>> configurations with confusing expectations/results on an ABMC system.
>>>>
>>>> One possibility may be that a user may see below on ABMC system even if BMEC is supported:
>>>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>>>> llc_occupancy
>>>> mbm_total_bytes
>>>> mbm_local_bytes
>>>>
>>>> With the above a user cannot be expected to want to interact with mbm_total_bytes_config
>>>> and mbm_local_bytes_config, which may be the simplest to do.
>>>
>>> yes.
...
>>>>
>>>> I do think, and Peter also mentioned [1] this, that it may be useful,
>>>> to "put a group/resource-scoped assign_* file higher in the hierarchy
>>>> to make it easier for users who want to configure all domains the
>>>> same for a group."
>>>>
>>>> Placing *additional* files higher in hierarchy (used to manage counters in all
>>>> domains) may be more useful that trying to provide the shared/exclusive/unassign
>>>> in one file per domain.
>>>
>>> Yea. To make it better we can add "mon_l3_assignments" in groups main directory. We can do all the operation in just one file.
>>>
>>> https://lore.kernel.org/lkml/efb5293f-b0ef-4c94-bf10-9ca7ebb3b53f@amd.com/
>>
>> I am hesitant to respond to that message considering the corporate preamble that
>> sneaked in so I'll just add some thoughts here:
>
> Yea. I noticed it later. Will take care next time.
>
>>
>> Having the file higher in hierarchy does seem more useful. It may be useful to reduce
>> amount of parsing to get to needed information. Compare with below two examples that can
>> be used to convey the same information:
>>
>> # cat /sys/fs/resctrl/test/mon_L3_assignments
>> mbm_total_bytes: 0=unassigned; 1=unassigned
>> mbm_local_bytes: 0=unassigned; 1=unassigned
>>
>> #cat /sys/fs/resctrl/test/mon_L3_assignments
>> 0=_; 1=_
>>
>> We need to take care that it is always clear what "0" or "1" means ...
>> Tony has been mentioning a lot of interesting things about scope
>> changes. I assume the "L3" in mon_L3_assignments will dictate the scope?
>
> I didn't think about the scope here. I was thinking of changing it to
> "mbm_assignments".
ah, I see, not a general "monitoring" file but specific to MBM. This still
may encounter difficulty if AMD does something like SNC where MBM could
be done per NUMA node. Perhaps we could constrain this even more with a
"mbm_L3_assignments". If anything ever shows up that needs to do MBM
counter assignment at some other scope then at least we have the option
to create another file "mbm_?_assignments".
>
>>
>> With a syntax like above the needed information can be presented in one line.
>> For example,
>>
>> #cat /sys/fs/resctrl/test/mon_L3_assignments
>> 0=mbm_total_bytes; 1=mbm_local_bytes
>>
>> The caveat is that is only for assigned counters, not shared, so this needs
>> to change.
>>
>> To support shared assignment ... I wonder if it will be useful to users to
>> get the information on which other monitor groups the counter is shared _with_?
>>
>> Some examples:
>>
>> a) Just indicate whether a counter is shared or dedicated. (Introduce flags).
>> #cat /sys/fs/resctrl/test/mon_L3_assignments
>> 0=mbm_total_bytes:s; 1=mbm_local_bytes:d
>>
>> b) Indicate which groups a counter is shared with:
>> #cat /sys/fs/resctrl/testA/mon_L3_assignments
>> 0=mbm_total_bytes:s(testB); 1=mbm_local_bytes:d(not needed but perhaps empty for consistent interface?)
>> #cat /sys/fs/resctrl/testB/mon_L3_assignments
>> 0=mbm_total_bytes:s(testA); 1=mbm_local_bytes:d(?)
>
> This format does not tell what is going on with all other domains. We need
> to display all the domains. I think that is important because users need
> to know what to expect reading the events on specific domains.
>
> I think we need to convey all the following information to the user.
>
> 1. Event Configuration: Which event configuration is applied here?
>
> 2. Domains: To which domains is the configuration applied?
> This is useful in multi-domain configuration.
> We also need to know which domains are assigned or unassigned.
>
> 3. Type of assignment: Exclusive(e or d) or Shared(s) or Unassigned(_)
>
> 4. Finally: Where to read the events?
> This is important when we add "mkdir" support in the future,
> mon_data/mon_l3_*/config_name will be created.
>
>
> With that in mind this might be helpful.
>
> # cat /sys/fs/resctrl/test/mon_L3_assignments
> mbm_total_bytes: 0=e; 1=_
> mbm_local_bytes: 0=_; 1=s
>
> This format tells the user all the information.
> mbm_total_bytes and mbm_local_bytes configurations are applied and
> the configurations are coming from counter_configs.
>
> User can read the events in
> mon_data/mon_L3_*/mbm_total_bytes
> mon_data/mon_L3_*/mbm_local_bytes
>
> mbm_total_bytes is assigned on domain 0 and not on domain 1.
> Reading the mbm_total_bytes on domain 1 will report "unassigned".
>
> mbm_local_bytes is not assigned on domain 0 and assigned on domain 1.
> Reading the mbm_local_bytes on domain 0 will report "unassigned".
Thank you very much for spelling it out. Much appreciated. This looks good to me.
Please include your list of requirements for the interface in the cover-letter and/or
patch that introduces the interface.
>
> I don't have much information on shared assignment at this point. I don't know
> if we can display the shared group.
The proposed interface accommodates shared counters. The expectation is that
users can keep track themselves and if not, then the information can be
obtained with a read of every group's counter assignment. The issue here is
that this may, in the worst case, need a large number of file operations if the
expectation is that it will still be possible to create as many monitoring groups as there are RMIDs.
Using files inside a monitor group for this information may actually not be ideal.
If this information is needed then we could perhaps add a new file. For
example:
/sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes/<file reporting which monitor groups share this counter configuration in different domains>
Of course, I do not know if this will be required and this seems manageable as
a later enhancement if needed.
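To make that worst case concrete, gathering the information would need a walk over
every group, roughly like below (the per-group file name follows the proposal above,
so this is illustrative only):
for grp in /sys/fs/resctrl /sys/fs/resctrl/*/ /sys/fs/resctrl/*/mon_groups/*/; do
        [ -f "$grp/mbm_L3_assignments" ] || continue
        echo "== $grp"
        cat "$grp/mbm_L3_assignments"
done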
>
>>
>> ... (b) may just be overkill and we should instead follow Tony's
>> guideline (see https://lore.kernel.org/lkml/Z9CiwLrhuTODruCj@agluck-desk3/ )
>> that users should be able to keep track themselves.
>>
>> Reinette
>>
>
Reinette
Hi Reinette,
On 3/20/2025 5:35 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 3/20/25 11:12 AM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 3/19/25 13:36, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 3/14/25 9:18 AM, Moger, Babu wrote:
>>>> On 3/13/2025 4:21 PM, Reinette Chatre wrote:
>>>>> On 3/13/25 1:13 PM, Moger, Babu wrote:
>>>>>> On 3/13/25 11:08, Reinette Chatre wrote:
>>>>>>> On 3/12/25 11:14 AM, Moger, Babu wrote:
>>>>>>>> On 3/12/25 12:14, Reinette Chatre wrote:
>>>>>>>>> On 3/12/25 9:03 AM, Moger, Babu wrote:
>>>>>>>>>> On 3/12/25 10:07, Reinette Chatre wrote:
>>>>>
>>>>>
>>>>>> Here are the steps. Just copying steps from Peters proposal.
>>>>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>>>
>>>>> Thank you very much for detailing the steps. It is starting to fall into place
>>>>> for me.
>>>>>
>>>>>>
>>>>>>
>>>>>> 1. Mount the resctrl
>>>>>> mount -t resctrl resctrl /sys/fs/resctrl
>>>>>
>>>>> I assume that on ABMC system the plan remains to have ABMC enabled by default, which
>>>>> will continue to depend on BMEC.
>>>>
>>>> Yes. ABMC will be enabled by default. ABMC will use the configurations from info/L3_MON/counter_configs. ABMC will not depend on BMEC.
>>>
>>> I see. The previous dependency was thus just something enforced by OS to support the
>>> chosen implementation?
>>
>> Yes. That is correct. We went that route mainly not to change the
>> rmid_read operation.
>>
>> With ABMC, we need to set Extended EVTID and ABMC bit in QM_EVTSEL
>> register while reading the cntr_id events. Will add those patches in next
>> version to make it clear.
>
> Thank you.
>
>>
>>> Looks like the two features share some registers.
>>>
>>>>
>>>>> How would the existing BMEC implementation be impacted in this case?
>>>>
>>>> BMEC will only work with pre-ABMC(or default) mode.
>>>
>>> ok. Does this mean that if a user boots kernel with "rdt=!bmec" then ABMC will keep working?
>>
>> Yes. That is correct.
>
> Just to confirm and bring the two email threads together ... it sounds like the
> expectation is that existing users of BMEC are expected to use mon_features to
> know if mbm_{total,local}_bytes_config are supported. If system supports ABMC
> then BMEC will not be available and thus mon_features will not contain
> mbm_{total,local}_bytes_config. Existing users that rely on
> mbm_{total,local}_bytes_config will experience failures and are expected
> to switch to ABMC?
Yes. Exactly.
>
>
>>
>>>
>>>
>>>>> Without any changes to BMEC support the mbm_total_bytes_config and mbm_local_bytes_config
>>>>> files will remain and user space may continue to use them to change the event
>>>>> configurations with confusing expectations/results on an ABMC system.
>>>>>
>>>>> One possibility may be that a user may see below on ABMC system even if BMEC is supported:
>>>>> # cat /sys/fs/resctrl/info/L3_MON/mon_features
>>>>> llc_occupancy
>>>>> mbm_total_bytes
>>>>> mbm_local_bytes
>>>>>
>>>>> With the above a user cannot be expected to want to interact with mbm_total_bytes_config
>>>>> and mbm_local_bytes_config, which may be the simplest to do.
>>>>
>>>> yes.
>
>
> ...
>
>>>>>
>>>>> I do think, and Peter also mentioned [1] this, that it may be useful,
>>>>> to "put a group/resource-scoped assign_* file higher in the hierarchy
>>>>> to make it easier for users who want to configure all domains the
>>>>> same for a group."
>>>>>
>>>>> Placing *additional* files higher in hierarchy (used to manage counters in all
>>>>> domains) may be more useful that trying to provide the shared/exclusive/unassign
>>>>> in one file per domain.
>>>>
>>>> Yea. To make it better we can add "mon_l3_assignments" in groups main directory. We can do all the operation in just one file.
>>>>
>>>> https://lore.kernel.org/lkml/efb5293f-b0ef-4c94-bf10-9ca7ebb3b53f@amd.com/
>>>
>>> I am hesitant to respond to that message considering the corporate preamble that
>>> sneaked in so I'll just add some thoughts here:
>>
>> Yea. I noticed it later. Will take care next time.
>>
>>>
>>> Having the file higher in hierarchy does seem more useful. It may be useful to reduce
>>> amount of parsing to get to needed information. Compare with below two examples that can
>>> be used to convey the same information:
>>>
>>> # cat /sys/fs/resctrl/test/mon_L3_assignments
>>> mbm_total_bytes: 0=unassigned; 1=unassigned
>>> mbm_local_bytes: 0=unassigned; 1=unassigned
>>>
>>> #cat /sys/fs/resctrl/test/mon_L3_assignments
>>> 0=_; 1=_
>>>
>>> We need to take care that it is always clear what "0" or "1" means ...
>>> Tony has been mentioning a lot of interesting things about scope
>>> changes. I assume the "L3" in mon_L3_assignments will dictate the scope?
>>
>> I didn't think about the scope here. I was thinking of changing it to
>> "mbm_assignments".
>
> ah, I see, not a general "monitoring" file but specific to MBM. This still
> may encounter difficulty if AMD does something like SNC where MBM could
> be done per NUMA node. Perhaps we could constrain this even more with a
> "mbm_L3_assignments". If anything ever shows up that needs to do MBM
> counter assignment at some other scope then at least we have the option
> to create another file "mbm_?_assignments".
Yes. Sounds good to me.
>
>>
>>>
>>> With a syntax like above the needed information can be presented in one line.
>>> For example,
>>>
>>> #cat /sys/fs/resctrl/test/mon_L3_assignments
>>> 0=mbm_total_bytes; 1=mbm_local_bytes
>>>
>>> The caveat is that is only for assigned counters, not shared, so this needs
>>> to change.
>>>
>>> To support shared assignment ... I wonder if it will be useful to users to
>>> get the information on which other monitor groups the counter is shared _with_?
>>>
>>> Some examples:
>>>
>>> a) Just indicate whether a counter is shared or dedicated. (Introduce flags).
>>> #cat /sys/fs/resctrl/test/mon_L3_assignments
>>> 0=mbm_total_bytes:s; 1=mbm_local_bytes:d
>>>
>>> b) Indicate which groups a counter is shared with:
>>> #cat /sys/fs/resctrl/testA/mon_L3_assignments
>>> 0=mbm_total_bytes:s(testB); 1=mbm_local_bytes:d(not needed but perhaps empty for consistent interface?)
>>> #cat /sys/fs/resctrl/testB/mon_L3_assignments
>>> 0=mbm_total_bytes:s(testA); 1=mbm_local_bytes:d(?)
>>
>> This format does not tell what is going on with all other domains. We need
>> to display all the domains. I think that is important because users need
>> to know what to expect reading the events on specific domains.
>>
>> I think we need to convey all the following information to the user.
>>
>> 1. Event Configuration: Which event configuration is applied here?
>>
>> 2. Domains: To which domains is the configuration applied?
>> This is useful in multi-domain configuration.
>> We also need to know which domains are assigned or unassigned.
>>
>> 3. Type of assignment: Exclusive(e or d) or Shared(s) or Unassigned(_)
>>
>> 4. Finally: Where to read the events?
>> This is important when we add "mkdir" support in the future,
>> mon_data/mon_l3_*/config_name will be created.
>>
>>
>> With that in mind this might be helpful.
>>
>> # cat /sys/fs/resctrl/test/mon_L3_assignments
>> mbm_total_bytes: 0=e; 1=_
>> mbm_local_bytes: 0=_; 1=s
>>
>> This format tells the user all the information.
>> mbm_total_bytes and mbm_local_bytes configurations are applied and
>> the configurations are coming from counter_configs.
>>
>> User can read the events in
>> mon_data/mon_L3_*/mbm_total_bytes
>> mon_data/mon_L3_*/mbm_local_bytes
>>
>> mbm_total_bytes is assigned on domain 0 and not on domain 1.
>> Reading the mbm_total_bytes on domain 1 will report "unassigned".
>>
>> mbm_local_bytes is not assigned on domain 0 and assigned on domain 1.
>> Reading the mbm_local_bytes on domain 0 will report "unassigned".
>
> Thank you very much for spelling it out. Much appreciated. This looks good to me.
> Please include your list of requirements for the interface in the cover-letter and/or
> patch that introduces the interface.
Sure. Will do.
>
>>
>> I don't have much information on shared assignment at this point. I don't know
>> if we can display the shared group.
>
> The proposed interface accommodates shared counters. The expectation is that
> users can keep track themselves and if not, then the information can be
> obtained with a read of every group's counter assignment. The issue here is
> that this may, in the worst case, need a large number of file operations if the
> expectation is that it will still be possible to create as many monitoring groups as there are RMIDs.
>
> Using files inside monitor group for this information may actually not be ideal.
> If this information is needed then we could perhaps add a new file. For
> example:
> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes/<file reporting which monitor groups share this counter configuration in different domains>
> Of course, I do not know if this will be required and this seems manageable as
> a later enhancement if needed.
>
Yes. It can be done this way.
Thanks
Babu
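
A minimal sketch of how the per-group "mon_L3_assignments" file discussed
above might behave, assuming the e/s/_ per-domain states are adopted as
proposed (the "test" group and the domain IDs are illustrative, not a
committed interface):

# cat /sys/fs/resctrl/test/mon_L3_assignments
mbm_total_bytes: 0=e; 1=_
mbm_local_bytes: 0=_; 1=s

# cat /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
unassigned

Here mbm_total_bytes has a counter assigned only on domain 0, so reading it
on domain 1 reports "unassigned" instead of a byte count.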
On Thu, Mar 13, 2025 at 03:13:32PM -0500, Moger, Babu wrote:
> Hi Reinette,
>
> On 3/13/25 11:08, Reinette Chatre wrote:
> > Hi Babu,
> >
> > On 3/12/25 11:14 AM, Moger, Babu wrote:
> >> Hi Reinette,
> >>
> >> On 3/12/25 12:14, Reinette Chatre wrote:
> >>> Hi Babu,
> >>>
> >>> On 3/12/25 9:03 AM, Moger, Babu wrote:
> >>>> Hi Reinette,
> >>>>
> >>>> On 3/12/25 10:07, Reinette Chatre wrote:
> >>>>> Hi Babu,
> >>>>>
> ..
>
> >>>>>> We can add the mkdir support later. That way we can provide basic ABMC
> >>>>>> support without too much code complexity with mkdir support.
> >>>>>
> >>>>> This is not clear to me how you envision the "first phase". Is it what you
> >>>>> proposed above, for example:
> >>>>> #echo "LclFill, LclNTWr, RmtFill" >
> >>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >>>>>
> >>>>> In above the counter configuration name is a file.
> >>>>
> >>>> Yes. That is correct.
> >>>>
> >>>> There will be two configuration files by default when resctrl is mounted
> >>>> when ABMC is enabled.
> >>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
> >>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >>>>
> >>>>> How could mkdir support be added to this later if there are already files present?
> >>>>
> >>>> We already have these directories when resctrl is mounted.
> >>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
> >>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
> >>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
> >>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
> >>>>
> >>>> We dont need "mkdir" support for default configurations.
> >>>
> >>> I was referring to the "mkdir" support for additional configurations that
> >>> I understood you are thinking about adding later. For example,
> >>> (copied from Peter's message
> >>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/):
> >>>
> >>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> >>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>> LclFill
> >>> LclNTWr
> >>> LclSlowFill
> >>>
> >>> Any "later" work needs to be backward compatible with the first phase.
> >>
> >> Actually, we dont need extra file "event_filter".
> >> This was discussed here.
> >> https://lore.kernel.org/lkml/CALPaoChLL8p49eANYgQ0dJiFs7G=223fGae+LJyx3DwEhNeR8A@mail.gmail.com/
> >
> > I undestand from that exchange that it is possible to read/write from
> > an *existing* kernfs file but it is not obvious to me how that file is
> > planned to be created.
>
> My bad.. I misspoke here. We need "event_filter" file under each
> configuration.
>
> > My understanding of the motivation behind support for "mkdir" is to enable
> > user space to create custom counter configurations.
>
> That is correct.
>
> > I understand that ABMC support aims to start with existing mbm_total_bytes/mbm_local_bytes
> > configurations but I believe the consensus is that custom configurations need
> > to be supported in the future.
> > If resctrl starts with support where counter configuration as
> > managed with a *file*, for example:
> > /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
> > how will user space create future custom configurations?
> > As I understand that is only possible with mkdir.
> >
> >> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes
> >> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes
> >> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes
> >> # cat info/L3_MON/counter_configs/mbm_local_bytes
> >> LclFill
> >> LclNTWr
> >> LclSlowFill
> >>
> >> In the future, we can add mkdir support.
> >>
> >> # mkdir info/L3_MON/counter_configs/mbm_read_only
> >
> > This is exactly my concern. resctrl should not start with a user space where
> > a counter configuration is a file (mbm_local_bytes/mbm_total_bytes) and then
> > switch user space interface to have counter configuration be done with
> > directories.
> >
> >> # echo LclFill > info/L3_MON/counter_configs/mbm_read_only
> >> # cat info/L3_MON/counter_configs/mbm_read_only
> >> LclFill
> >
> > ... wait ... user space writes to the directory?
>
> My bad. This is wrong. Let me rewrite the steps below.
>
> >> #echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
> >>
> >> Which would result in the creation of test/mon_data/mon_L3_*/mbm_read_only
> >>
> >> So, there is not breakage of backword compatibility.
> >
> > The way I understand it I am seeing many incompatibilities. Perhaps I am missing
> > something. Could you please provide detailed steps of how first phase and
> > second phase would look?
>
> No. You didn't miss anything. I misspoke on few steps.
>
> Here are the steps. Just copying steps from Peters proposal.
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>
> 1. Mount the resctrl
> mount -t resctrl resctrl /sys/fs/resctrl
>
> 2. When ABMC is supported two default configurations will be created.
>
> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>
> These files will be populated with default total and local events
> # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter
> VictimBW
> RmtSlowFill
> RmtNTWr
> RmtFill
> LclFill
> LclNTWr
> LclSlowFill
>
> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> LclFill,
> LclNTWr
> LclSlowFill
>
> 3. Users will have options to update the event configuration.
> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter

Once the "mkdir" support described below is implemented users will not
need to redefine these legacy event file names. That makes me happy.

>
> 4. As usual the events can be read from the mon_data directories.
> #mkdir /sys/fs/resctrl/test
> #cd /sys/fs/resctr/test
> #cat test/mon_data/mon_data/mon_L3_00/mbm_tota_bytes
> 101010
> #cat test/mon_data/mon_data/mon_L3_00/mbm_local_bytes
> 32323
>
> 5. There will be 3 files created in each group's mon_data directory when
> ABMC is supported.
>
> a. test/mon_data/mon_L3_00/assign_exclusive
> b. test/mon_data/mon_L3_00/assign_shared
> c. test/mon_data/mon_L3_00/unassign
>
> 6. Events can be assigned/unassigned by these commands
>
> # echo mbm_total_bytes > test/mon_data/mon_L3_00/assign_exclusive
> # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive
> # echo mbm_local_bytes > test/mon_data/mon_L3_01/unassign
>
> Note:
> I feel 3 files are excessive here. We can probably achieve everything in
> just one file.

Maybe the one file could look like:

# cat mon_L3_assignments
mbm_total_bytes: exclusive
mbm_local_bytes: shared
mbm_read_only: unassigned

with new lines appearing when mkdir creates new events, and the obvious
write semantics:

# echo "mbm_total_bytes: unassigned" > mon_L3_assignments

to make updates.

> Not sure about mbm_assign_control interface as there are concerns with
> group listing holding the lock for long.
>
> -----------------------------------------------------------------------
> Second phase, we can add support for "mkdir"
>
> 1. mkdir info/L3_MON/counter_configs/mbm_read_only
>
> 2. mkdir option will create "event_filter" file.
> info/L3_MON/counter_configs/mbm_read_only/event_filter
>
> 3. Users can modify event configuration.
> echo LclFill > info/L3_MON/counter_configs/mbm_read_only/event_filter
>
> 4. Users can assign the events
>
> echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
>
> 5. Events can be read in
>
> test/mon_data/mon_data/mon_L3_00/mbm_read_only

Is there a matching "rmdir" to make this go away again?

> --
> Thanks
> Babu Moger
Hi Tony,

On 3/13/2025 3:36 PM, Luck, Tony wrote:
> On Thu, Mar 13, 2025 at 03:13:32PM -0500, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 3/13/25 11:08, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>>> #echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
>>>>
>>>> Which would result in the creation of test/mon_data/mon_L3_*/mbm_read_only
>>>>
>>>> So, there is not breakage of backword compatibility.
>>>
>>> The way I understand it I am seeing many incompatibilities. Perhaps I am missing
>>> something. Could you please provide detailed steps of how first phase and
>>> second phase would look?
>>
>> No. You didn't miss anything. I misspoke on few steps.
>>
>> Here are the steps. Just copying steps from Peters proposal.
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>> 1. Mount the resctrl
>> mount -t resctrl resctrl /sys/fs/resctrl
>>
>> 2. When ABMC is supported two default configurations will be created.
>>
>> a. info/L3_MON/counter_configs/mbm_total_bytes/event_filter
>> b. info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>
>> These files will be populated with default total and local events
>> # cat info/L3_MON/counter_configs/mbm_total_bytes/event_filter
>> VictimBW
>> RmtSlowFill
>> RmtNTWr
>> RmtFill
>> LclFill
>> LclNTWr
>> LclSlowFill
>>
>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>> LclFill,
>> LclNTWr
>> LclSlowFill
>>
>> 3. Users will have options to update the event configuration.
>> echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>
> Once the "mkdir" support described below is implemented users will not
> need to redefine these legacy event file names. That makes me happy.

Yea. That is correct.

>
>>
>> 4. As usual the events can be read from the mon_data directories.
>> #mkdir /sys/fs/resctrl/test
>> #cd /sys/fs/resctr/test
>> #cat test/mon_data/mon_data/mon_L3_00/mbm_tota_bytes
>> 101010
>> #cat test/mon_data/mon_data/mon_L3_00/mbm_local_bytes
>> 32323
>>
>> 5. There will be 3 files created in each group's mon_data directory when
>> ABMC is supported.
>>
>> a. test/mon_data/mon_L3_00/assign_exclusive
>> b. test/mon_data/mon_L3_00/assign_shared
>> c. test/mon_data/mon_L3_00/unassign
>>
>> 6. Events can be assigned/unassigned by these commands
>>
>> # echo mbm_total_bytes > test/mon_data/mon_L3_00/assign_exclusive
>> # echo mbm_local_bytes > test/mon_data/mon_L3_01/assign_exclusive
>> # echo mbm_local_bytes > test/mon_data/mon_L3_01/unassign
>>
>> Note:
>> I feel 3 files are excessive here. We can probably achieve everything in
>> just one file.
>
> Maybe the one file could look like:
>
> # cat mon_L3_assignments
> mbm_total_bytes: exclusive
> mbm_local_bytes: shared
> mbm_read_only: unassigned
>
> with new lines appearing when mkdir creates new events, and the obvious
> write semantics:
>
> # echo "mbm_total_bytes: unassigned" > mon_L3_assignments
>
> to make updates.

Yes. That would work.

Also we could move the file to group's main directory like we have other
files already.

#cat /sys/fs/resctrl/test/mon_L3_assignments
mbm_total_bytes: 0=unassigned; 1=unassigned
mbm_local_bytes: 0=unassigned; 1=unassigned

To assign mbm_total_bytes config on domain 0.
$echo "mbm_total_bytes: 0=exclusive " > mon_L3_assignments

To assign mbm_total_bytes config on all the domains.
$echo "mbm_total_bytes: *=exclusive " > mon_L3_assignments

#cat /sys/fs/resctrl/test/mon_L3_assignments
mbm_total_bytes: 0=exclusive; 1=exclusive
mbm_local_bytes: 0=unassigned; 1=unassigned

>
>> Not sure about mbm_assign_control interface as there are concerns with
>> group listing holding the lock for long.
>>
>> -----------------------------------------------------------------------
>> Second phase, we can add support for "mkdir"
>>
>> 1. mkdir info/L3_MON/counter_configs/mbm_read_only
>>
>> 2. mkdir option will create "event_filter" file.
>> info/L3_MON/counter_configs/mbm_read_only/event_filter
>>
>> 3. Users can modify event configuration.
>> echo LclFill > info/L3_MON/counter_configs/mbm_read_only/event_filter
>>
>> 4. Users can assign the events
>>
>> echo mbm_read_only > test/mon_data/mon_L3_00/assign_exclusive
>>
>> 5. Events can be read in
>>
>> test/mon_data/mon_data/mon_L3_00/mbm_read_only
>
> Is there a matching "rmdir" to make this go away again?

I would think so.

Thanks
Babu
On Tue, Mar 11, 2025 at 03:35:28PM -0500, Moger, Babu wrote:
> Hi All,
>
> On 3/10/25 22:51, Reinette Chatre wrote:
> >
> >
> > On 3/10/25 6:44 PM, Moger, Babu wrote:
> >> Hi Tony,
> >>
> >> On 3/10/2025 6:22 PM, Luck, Tony wrote:
> >>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
> >>>> Hi All,
> >>>>
> >>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
> >>>>> Hi Peter,
> >>>>>
> >>>>> On 3/5/25 04:40, Peter Newman wrote:
> >>>>>> Hi Babu,
> >>>>>>
> >>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>>>>>>
> >>>>>>> Hi Peter,
> >>>>>>>
> >>>>>>> On 3/4/25 10:44, Peter Newman wrote:
> >>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Peter/Reinette,
> >>>>>>>>>
> >>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
> >>>>>>>>>> Hi Babu,
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>
> >>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
> >>>>>>>>>>>> Hi Reinette,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
> >>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
> >>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
> >>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
> >>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
> >>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
> >>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
> >>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
> >>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
> >>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
> >>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
> >>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
> >>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
> >>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
> >>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
> >>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
> >>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
> >>>>>>>>>>>>>>>>>>>>>>>>> <value>
> >>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
> >>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
> >>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
> >>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
> >>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
> >>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
> >>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
> >>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
> >>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
> >>>>>>>>>>>>>>>>>>>> for.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
> >>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
> >>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
> >>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
> >>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
> >>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
> >>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
> >>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
> >>>>>>>>>>>>>>>>>>> customers.
> >>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
> >>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
> >>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
> >>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
> >>>>>>>>>>>>>>>>>> event names.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thank you for clarifying.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
> >>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
> >>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
> >>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
> >>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
> >>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
> >>>>>>>>>>>>>>>>>> writes in ABMC would look like...
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (per domain)
> >>>>>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
> >>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
> >>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
> >>>>>>>>>>>>>>>>> configuration is a requirement?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
> >>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
> >>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
> >>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
> >>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
> >>>>>>>>>>>>>>>> there's less pressure on the counters.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
> >>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
> >>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
> >>>>>>>>>>>>>>>> many counters the group needs in each domain.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
> >>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
> >>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
> >>>>>>>>>>>>>>> of the hardware.
> >>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
> >>>>>>>>>>>>>>> earlier example copied below:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (per domain)
> >>>>>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>>>> ...
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
> >>>>>>>>>>>>>>> I understand it:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>> domain 0:
> >>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> domain 1:
> >>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>> domain 0:
> >>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> domain 1:
> >>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
> >>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
> >>>>>>>>>>>>>>> in domain 1, resulting in:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>> domain 0:
> >>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>> domain 0:
> >>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> domain 1:
> >>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
> >>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> group 0:
> >>>>>>>>>>>>>>> domain 0:
> >>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> group 1:
> >>>>>>>>>>>>>>> domain 0:
> >>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> domain 1:
> >>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
> >>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>>> counter 3: VictimBW
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
> >>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
> >>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
> >>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
> >>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
> >>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
> >>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
> >>>>>>>>>>>>>> groupings to count.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
> >>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
> >>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> /group0/0=t;1=t
> >>>>>>>>>>>>>> /group1/0=t;1=t
> >>>>>>>>>>>>>> /group2/0=_;1=t
> >>>>>>>>>>>>>> /group3/0=rw;1=_
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> - group2 is restricted to domain 0
> >>>>>>>>>>>>>> - group3 is restricted to domain 1
> >>>>>>>>>>>>>> - the rest are unrestricted
> >>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I see. Thank you for the example.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
> >>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> /group0/0=t;1=t
> >>>>>>>>>>>>> /group1/0=t;1=t
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
> >>>>>>>>>>>>> be configured differently in each domain.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
> >>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
> >>>>>>>>>>>> domain use the same configurations and are limited to two events per
> >>>>>>>>>>>> group and a per-group mode where every group can be configured and
> >>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
> >>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
> >>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
> >>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
> >>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
> >>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
> >>>>>>>>>>>> have the same flexibility as on MPAM.
> >>>>>>>>>>>
> >>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
> >>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
> >>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
> >>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
> >>>>>>>>>>>
> >>>>>>>>>>> It is documented below.
> >>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
> >>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
> >>>>>>>>>>>
> >>>>>>>>>>> We previously discussed this with you (off the public list) and I
> >>>>>>>>>>> initially proposed the extended assignment mode.
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
> >>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
> >>>>>>>>>>> just two.
> >>>>>>>>>>>
> >>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
> >>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
> >>>>>>>>>>> extended mode is not practical at this time.
> >>>>>>>>>>>
> >>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
> >>>>>>>>>>> require modifications to the existing interface, allowing us to continue
> >>>>>>>>>>> using it as is.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> (I might have said something confusing in my last messages because I
> >>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
> >>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
> >>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
> >>>>>>>>>>>> earlier is one I've already been asked about.
> >>>>>>>>>>>
> >>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
> >>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
> >>>>>>>>>>> finalize how to configure the multiple event interface for each group.
> >>>>>>>>>>
> >>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
> >>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
> >>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
> >>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
> >>>>>>>>>> there's already an expectation that the files are present when BMEC is
> >>>>>>>>>> supported.
> >>>>>>>>>>
> >>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
> >>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
> >>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
> >>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
> >>>>>>>>>> interface. If it does, it's something we can live with.
> >>>>>>>>>
> >>>>>>>>> As you know, this series is currently blocked without further feedback.
> >>>>>>>>>
> >>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
> >>>>>>>>> Any input or suggestions would be appreciated.
> >>>>>>>>>
> >>>>>>>>> Here’s what we’ve learned so far:
> >>>>>>>>>
> >>>>>>>>> 1. Assignments should be independent of BMEC.
> >>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
> >>>>>>>>> read, write, victimBM, etc.). This is also called shared counter
> >>>>>>>>> 3. There should be an option to assign events per domain.
> >>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
> >>>>>>>>> should allow flexibility to assign more in the future as the interface
> >>>>>>>>> evolves.
> >>>>>>>>> 5. Utilize the extended RMID read mode.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Here is my proposal using Peter's earlier example:
> >>>>>>>>>
> >>>>>>>>> # define event configurations
> >>>>>>>>>
> >>>>>>>>> ========================================================
> >>>>>>>>> Bits Mnemonics Description
> >>>>>>>>> ==== ========================================================
> >>>>>>>>> 6 VictimBW Dirty Victims from all types of memory
> >>>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
> >>>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
> >>>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
> >>>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
> >>>>>>>>> 1 RmtFill Reads to memory in the non-local NUMA domain
> >>>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
> >>>>>>>>> ==== ========================================================
> >>>>>>>>>
> >>>>>>>>> #Define flags based on combination of above event types.
> >>>>>>>>>
> >>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
> >>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
> >>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
> >>>>>>>>> v = VictimBW
> >>>>>>>>>
> >>>>>>>>> Peter suggested the following format earlier :
> >>>>>>>>>
> >>>>>>>>> /group0/0=t;1=t
> >>>>>>>>> /group1/0=t;1=t
> >>>>>>>>> /group2/0=_;1=t
> >>>>>>>>> /group3/0=rw;1=_
> >>>>>>>>
> >>>>>>>> After some inquiries within Google, it sounds like nobody has invested
> >>>>>>>> much into the current mbm_assign_control format yet, so it would be
> >>>>>>>> best to drop it and distribute the configuration around the filesystem
> >>>>>>>> hierarchy[1], which should allow us to produce something more flexible
> >>>>>>>> and cleaner to implement.
> >>>>>>>>
> >>>>>>>> Roughly what I had in mind:
> >>>>>>>>
> >>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
> >>>>>>>> names for the assignable configurations rather than being restricted
> >>>>>>>> to single letters. In the resulting directory, populate a file where
> >>>>>>>> we can specify the set of events the config should represent. I think
> >>>>>>>> we should use symbolic names for the events rather than raw BMEC field
> >>>>>>>> values. Moving forward we could come up with portable names for common
> >>>>>>>> events and only support the BMEC names on AMD machines for users who
> >>>>>>>> want specific events and don't care about portability.
> >>>>>>>
> >>>>>>>
> >>>>>>> I’m still processing this. Let me start with some initial questions.
> >>>>>>>
> >>>>>>> So, we are creating event configurations here, which seems reasonable.
> >>>>>>>
> >>>>>>> Yes, we should use portable names and are not limited to BMEC names.
> >>>>>>>
> >>>>>>> How many configurations should we allow? Do we know?
> >>>>>>
> >>>>>> Do we need an upper limit?
> >>>>>
> >>>>> I think so. This needs to be maintained in some data structure. We can
> >>>>> start with 2 default configurations for now.
> >
> > There is a big difference between no upper limit and 2. The hardware is
> > capable of supporting per-domain configurations so more flexibility is
> > certainly possible. Consider the example presented by Peter in:
> > https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
> >
> >>>>>>>> Next, put assignment-control file nodes in per-domain directories
> >>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
> >>>>>>>> counter-configuration name into the file would then allocate a counter
> >>>>>>>> in the domain, apply the named configuration, and monitor the parent
> >>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
> >>>>>>>> higher in the hierarchy to make it easier for users who want to
> >>>>>>>> configure all domains the same for a group.
> >>>>>>>
> >>>>>>> What is the difference between shared and exclusive?
> >>>>>>
> >>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
> >>>>>> each domain will be scheduled round-robin to the groups requesting
> >>>>>> shared access to a counter. In my tests, I assigned the counters long
> >>>>>> enough to produce a single 1-second MB/s sample for the per-domain
> >>>>>> aggregation files[2].
> >>>>>>
> >>>>>> These do not need to be implemented immediately, but knowing that they
> >>>>>> work addresses the overhead and scalability concerns of reassigning
> >>>>>> counters and reading their values.
> >>>>>
> >>>>> Ok. Lets focus on exclusive assignments for now.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
> >>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
> >>>>>>> results in 32 × 12 × 3 files, which is quite large.
> >>>>>>>
> >>>>>>> There should be a more efficient way to handle this.
> >>>>>>>
> >>>>>>> Initially, we started with a group-level file for this interface, but it
> >>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
> >>>>>>
> >>>>>> I had rejected it due to the high-frequency of access of a large
> >>>>>> number of files, which has since been addressed by shared assignment
> >>>>>> (or automatic reassignment) and aggregated mbps files.
> >>>>>
> >>>>> I think we should address this as well. Creating three extra files for
> >>>>> each group isn’t ideal when there are more efficient alternatives.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Additionally, how can we list all assignments with a single sysfs call?
> >>>>>>>
> >>>>>>> That was another problem we need to address.
> >>>>>>
> >>>>>> This is not a requirement I was aware of. If the user forgot where
> >>>>>> they assigned counters (or forgot to disable auto-assignment), they
> >>>>>> can read multiple sysfs nodes to remind themselves.
> >>>>>
> >>>>> I suggest, we should provide users with an option to list the assignments
> >>>>> of all groups in a single command. As the number of groups increases, it
> >>>>> becomes cumbersome to query each group individually.
> >>>>>
> >>>>> To achieve this, we can reuse our existing mbm_assign_control interface
> >>>>> for this purpose. More details on this below.
> >>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> The configuration names listed in assign_* would result in files of
> >>>>>>>> the same name in the appropriate mon_data domain directories from
> >>>>>>>> which the count values can be read.
> >>>>>>>>
> >>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
> >>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
> >>>>>>>> LclFill
> >>>>>>>> LclNTWr
> >>>>>>>> LclSlowFill
> >>>>>>>
> >>>>>>> I feel we can just have the configs. event_filter file is not required.
> >>>>>>
> >>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
> >>>>>> only looking at struct kernfs_syscall_ops
> >>>>>>
> >>>>>>>
> >>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
> >>>>>>> LclFill <-rename these to generic names.
> >>>>>>> LclNTWr
> >>>>>>> LclSlowFill
> >>>>>>>
> >>>>>>
> >>>>>> I think portable and non-portable event names should both be available
> >>>>>> as options. There are simple bandwidth measurement mechanisms that
> >>>>>> will be applied in general, but when they turn up an issue, it can
> >>>>>> often lead to a more focused investigation, requiring more precise
> >>>>>> events.
> >>>>>
> >>>>> I agree. We should provide both portable and non-portable event names.
> >>>>>
> >>>>> Here is my draft proposal based on the discussion so far and reusing some
> >>>>> of the current interface. The idea here is to start with a basic
> >>>>> assignment feature, with options to enhance it in the future. Feel free to
> >>>>> comment/suggest.
> >>>>>
> >>>>> 1. Event configurations will be in
> >>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
> >>>>>
> >>>>> There will be two pre-defined configurations by default.
> >>>>>
> >>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
> >>>>> LclFill, LclNTWr, LclSlowFill, VictimBW, RmtSlowFill, RmtNTWr, RmtFill
> >>>>>
> >>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >>>>> LclFill, LclNTWr, LclSlowFill
> >>>>>
> >>>>> 2. Users will have options to update these configurations.
> >>>>>
> >>>>> #echo "LclFill, LclNTWr, RmtFill" >
> >>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >>>
> >>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
> >>> reporting "local_bytes" any more. They report something different,
> >>> and users only know if they come to check the options currently
> >>> configured in this file. Changing the contents without changing
> >>> the name seems confusing to me.
> >>
> >> It is the same behaviour right now with BMEC. It is configurable.
> >> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
> >>
> >> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
> >
> > This could be supported by following Peter's original proposal where the name
> > of the counter configuration is provided by the user via a mkdir:
> > https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
> >
> > As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>
> Sure. We can do that. I was thinking in the first phase, just provide the
> default pre-defined configuration and option to update the configuration.
>
> We can add the mkdir support later. That way we can provide basic ABMC
> support without too much code complexity with mkdir support.
>
> >
> >>
> >>>
> >>>>>
> >>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
> >>>>> LclFill, LclNTWr, RmtFill
> >>>>>
> >>>>> 3. The default configurations will be used when the user mounts resctrl.
> >>>>>
> >>>>> mount -t resctrl resctrl /sys/fs/resctrl/
> >>>>> mkdir /sys/fs/resctrl/test/
> >>>>>
> >>>>> 4. The resctrl group/domains can be in one of these assignment states.
> >>>>> e: Exclusive
> >>>>> s: Shared
> >>>>> u: Unassigned
> >>>>>
> >>>>> Exclusive mode is supported now. Shared mode will be supported in the
> >>>>> future.
> >>>>>
> >>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>> to list the assignment state of all the groups.
> >>>>>
> >>>>> Format:
> >>>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
> >>>>>
> >>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>> test//mbm_total_bytes:0=e;1=e
> >>>>> test//mbm_local_bytes:0=e;1=e
> >>>>> //mbm_total_bytes:0=e;1=e
> >>>>> //mbm_local_bytes:0=e;1=e
> >
> > This would make mbm_assign_control even more unwieldy and quicker to exceed a
> > page of data (these examples never seem to reflect those AMD systems with the many
> > L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
> > and solved when/if going this route.
>
> This problem is not specific to this series. I feel it is a generic problem
> for many similar interfaces. I don't know how it is addressed. I may
> have to investigate this. Any pointers would be helpful.
>
>
> >
> > There seems to be two opinions about this file at moment. Would it be possible to
> > summarize the discussion with pros/cons raised to make an informed selection?
> > I understand that Google as represented by Peter no longer requires/requests this
> > file but the motivation for this change seems new and does not seem to reduce the
> > original motivation for this file. We may also want to separate requirements for reading
> > from and writing to this file.
>
> Yea. We can just use mbm_assign_control for reading the assignment states.
>
> Summary: We have two proposals.
>
> First one from Peter:
>
> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>
>
> Pros
> a. Allows flexible creation of free-form names for assignable
> configurations, stored in info/L3_MON/counter_configs/.
>
> b. Events can be accessed using corresponding free-form names in the
> mon_data directory, making it clear to users what each event represents.
>
>
> Cons:
> a. Requires three separate files for assignment in each group
> (assign_exclusive, assign_shared, unassign), which might be excessive.
>
> b. No built-in listing support, meaning users must query each group
> individually to check assignment states.
How big of a problem is this in reality? I'd assume that users of this
feature would only reassign counter attributes at some slow rate (set
up counters, measure for at least a few seconds, then set up for next
measurement). Cost to open/read/close a few hundred kernfs files isn't
very high. Biggest cost might be hogging the resctrl mutex which would
cause jitter in the tasks reading data from resctrl monitors.
Anyone doing this at scale should be able to keep track of what they set,
so wouldn't need to read at all. I'm not a big believer in "multiple
agents independently tweaking resctrl without knowledge of each other".
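
As a rough illustration of that cost, assuming one assignment file per group
as in the first proposal (the "mon_L3_assignments" name is illustrative),
every group's state can be re-read with a single command:

# grep . /sys/fs/resctrl/*/mon_L3_assignments

Even with a few hundred groups this is only a few hundred open/read/close
cycles; the bigger concern, as noted above, is how long the resctrl mutex is
held while generating the output.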
>
> Second Proposal (Mine)
>
> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>
> Pros:
>
> a. Maintains the flexibility of free-form names for assignable
> configurations (info/L3_MON/counter_configs/).
>
> b. Events remain accessible via free-form names in mon_data, ensuring
> clarity on their purpose.
>
> c. Adds the ability to list assignment states for all groups in a single
> command.
>
> Cons:
> a. Potential buffer overflow issues when handling a large number of
> groups and domains, and the code complexity needed to fix the issue.
>
>
> Third Option: A Hybrid Approach
>
> We could combine elements from both proposals:
>
> a. Retain the free-form naming approach for assignable configurations in
> info/L3_MON/counter_configs/.
>
> b. Use the assignment method from the first proposal:
> $mkdir test
> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>
> c. Introduce listing support via the info/L3_MON/mbm_assign_control
> interface, enabling users to read assignment states for all groups in one
> place (read-only support).
>
>
> >
> >>>>>
> >>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
> >>>>>
> >>>>> Format:
> >>>>> "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
> >>>>>
> >>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
> >>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>
> >>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
> >>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>>
> >>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>>> test//mbm_total_bytes:0=u;1=u
> >>>>> test//mbm_local_bytes:0=u;1=u
> >>>>> //mbm_total_bytes:0=e;1=e
> >>>>> //mbm_local_bytes:0=e;1=e
> >>>>>
> >>>>> The corresponding events will be read in
> >>>>>
> >>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> >>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
> >>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> >>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
> >>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
> >>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
> >>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
> >>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
> >>>>>
> >>>>> 7. In the first stage, only two configurations(mbm_total_bytes and
> >>>>> mbm_local_bytes) will be supported.
> >>>>>
> >>>>> 8. In the future, there will be options to create multiple configurations
> >>>>> and a corresponding directory will be created in
> >>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
> >>>
> >>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
> >>> directory? Like this:
> >>>
> >>> # echo "LclFill, LclNTWr, RmtFill" >
> >>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
> >>>
> >>> This seems OK (dependent on the user picking meaningful names for
> >>> the set of attributes picked ... but if they want to name this
> >>> monitor file "brian" then they have to live with any confusion
> >>> that they bring on themselves).
> >>>
> >>> Would this involve an extension to kernfs? I don't see a function
> >>> pointer callback for file creation in kernfs_syscall_ops.
> >>>
> >>>>>
> >>>>
> >>>> I know you are all busy with multiple series going on parallel. I am still
> >>>> waiting for the inputs on this. It will be great if you can spend some time
> >>>> on this to see if we can find common ground on the interface.
> >>>>
> >>>> Thanks
> >>>> Babu
> >>>
> >>> -Tony
> >>>
> >>
> >>
> >> thanks
> >> Babu
> >
> > Reinette
> >
> >
>
> --
> Thanks
> Babu Moger
-Tony
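
A compact sketch of the hybrid approach above, assuming the counter_configs
directory, the per-domain assign_exclusive files, and a read-only
mbm_assign_control listing are all adopted as described (group names, domain
IDs, and states shown are illustrative):

# cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
LclFill, LclNTWr, LclSlowFill
# mkdir /sys/fs/resctrl/test
# echo mbm_local_bytes > /sys/fs/resctrl/test/mon_data/mon_L3_00/assign_exclusive
# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
test//mbm_total_bytes:0=u;1=u
test//mbm_local_bytes:0=e;1=u
//mbm_total_bytes:0=u;1=u
//mbm_local_bytes:0=u;1=u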
Hi Tony,
On 3/11/25 1:53 PM, Luck, Tony wrote:
> On Tue, Mar 11, 2025 at 03:35:28PM -0500, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>
>>>
>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>> Hi Tony,
>>>>
>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>
>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
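
For illustration only, a minimal user-space sketch of such an extended-mode read using msr-tools (QM_EVTSEL is MSR 0xC8D and QM_CTR is MSR 0xC8E; the counter-ID-in-RMID-field placement, the ExtendedEvtID bit position and the L3CacheABMC event ID value of 0x1 are assumptions taken from the APM description above, and in practice the kernel issues this sequence itself):

  # Illustrative only: read ABMC counter 5 on a CPU in the target domain (CPU 0 here).
  # QM_EVTSEL: counter ID in the RMID field [41:32], ExtendedEvtID assumed at bit 31,
  # EvtID = L3CacheABMC (assumed 0x1).
  wrmsr -p 0 0xc8d $(( (5 << 32) | (1 << 31) | 0x1 ))
  rdmsr -p 0 0xc8e    # QM_CTR then returns the raw count from counter 5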
>>>>>>>>>>>>>
>>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>>> just two.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>>> supported.
>>>>>>>>>>>>
>>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>>
>>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>>
>>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>>
>>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>>> evolves.
>>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>>
>>>>>>>>>>> # define event configurations
>>>>>>>>>>>
>>>>>>>>>>> ========================================================
>>>>>>>>>>> Bits Mnemonics Description
>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>>>>>> 1 RmtFill Reads to memory in the non-local NUMA domain
>>>>>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>>
>>>>>>>>>>> #Define flags based on combinations of the above event types (see the bitmask sketch after this list).
>>>>>>>>>>>
>>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> v = VictimBW
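
For reference, a small worked example of how these mnemonic sets collapse into the event bitmask implied by the bit numbers in the table above (shell arithmetic shown only for illustration; note that t and l come out to 0x7F and 0x15, the documented defaults of mbm_total_bytes_config and mbm_local_bytes_config):

  # bit numbers from the table: LclFill=0 RmtFill=1 LclNTWr=2 RmtNTWr=3 LclSlowFill=4 RmtSlowFill=5 VictimBW=6
  printf 't=0x%02x\n' $(( (1<<0)|(1<<1)|(1<<2)|(1<<3)|(1<<4)|(1<<5)|(1<<6) ))  # t=0x7f
  printf 'l=0x%02x\n' $(( (1<<0)|(1<<2)|(1<<4) ))                              # l=0x15
  printf 'r=0x%02x\n' $(( (1<<0)|(1<<1)|(1<<4)|(1<<5) ))                       # r=0x33
  printf 'w=0x%02x\n' $(( (1<<2)|(1<<3)|(1<<6) ))                              # w=0x4c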
>>>>>>>>>>>
>>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>>
>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>
>>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>>> and cleaner to implement.
>>>>>>>>>>
>>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>>
>>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>>
>>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>>
>>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>>
>>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>>
>>>>>>>> Do we need an upper limit?
>>>>>>>
>>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>>> start with 2 default configurations for now.
>>>
>>> There is a big difference between no upper limit and 2. The hardware is
>>> capable of supporting per-domain configurations so more flexibility is
>>> certainly possible. Consider the example presented by Peter in:
>>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>
>>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>>> configure all domains the same for a group.
>>>>>>>>>
>>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>>
>>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>>> aggregation files[2].
>>>>>>>>
>>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>>> counters and reading their values.
>>>>>>>
>>>>>>> Ok. Let's focus on exclusive assignments for now.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>>
>>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>>
>>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>>
>>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>>
>>>>>>> I think we should address this as well. Creating three extra files for
>>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>>
>>>>>>>>> That was another problem we need to address.
>>>>>>>>
>>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>>
>>>>>>> I suggest, we should provide users with an option to list the assignments
>>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>>> becomes cumbersome to query each group individually.
>>>>>>>
>>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>>> for this purpose. More details on this below.
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>>> which the count values can be read.
>>>>>>>>>>
>>>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> LclFill
>>>>>>>>>> LclNTWr
>>>>>>>>>> LclSlowFill
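
Continuing that sketch (illustrative only; the assign_* file names, the "test" group, and the rule that counts appear under the configuration name all follow the proposal above rather than any existing interface), the assignment and readback step for one domain might look like:

  # assign a counter in domain 0 to the "test" group using the named configuration
  echo mbm_local_bytes > /sys/fs/resctrl/test/mon_data/mon_L3_00/assign_exclusive
  # the count is then read back under the same name in that domain directory
  cat /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes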
>>>>>>>>>
>>>>>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>>>>>
>>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>>
>>>>>>>>>
>>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>>> LclNTWr
>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think portable and non-portable event names should both be available
>>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>>> events.
>>>>>>>
>>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>>
>>>>>>> Here is my draft proposal based on the discussion so far and reusing some
>>>>>>> of the current interface. The idea here is to start with a basic assignment
>>>>>>> feature, with options to enhance it in the future. Feel free to
>>>>>>> comment/suggest.
>>>>>>>
>>>>>>> 1. Event configurations will be in
>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>>
>>>>>>> There will be two pre-defined configurations by default.
>>>>>>>
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>> LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>>>
>>>>>>> 2. Users will have options to update these configurations.
>>>>>>>
>>>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>
>>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>>> reporting "local_bytes" any more. They report something different,
>>>>> and users only know if they come to check the options currently
>>>>> configured in this file. Changing the contents without changing
>>>>> the name seems confusing to me.
>>>>
>>>> It is the same behaviour right now with BMEC. It is configurable.
>>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>>
>>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>>
>>> This could be supported by following Peter's original proposal where the name
>>> of the counter configuration is provided by the user via a mkdir:
>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>
>>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>>
>> Sure. We can do that. I was thinking that in the first phase we just provide the
>> default pre-defined configurations and an option to update them.
>>
>> We can add the mkdir support later. That way we can provide basic ABMC
>> support without the added code complexity of mkdir support.
>>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> LclFill, LclNTWr, RmtFill
>>>>>>>
>>>>>>> 3. The default configurations will be used when the user mounts resctrl.
>>>>>>>
>>>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>> mkdir /sys/fs/resctrl/test/
>>>>>>>
>>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>> e: Exclusive
>>>>>>> s: Shared
>>>>>>> u: Unassigned
>>>>>>>
>>>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>>>> future.
>>>>>>>
>>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> to list the assignment state of all the groups.
>>>>>>>
>>>>>>> Format:
>>>>>>> "<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>"
>>>>>>>
>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> test//mbm_total_bytes:0=e;1=e
>>>>>>> test//mbm_local_bytes:0=e;1=e
>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>> //mbm_local_bytes:0=e;1=e
>>>
>>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>>> page of data (these examples never seem to reflect those AMD systems with the many
>>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>>> and solved when/if going this route.
>>
>> This problem is not specific to this series. I feel it is a generic problem
>> for many similar interfaces. I don't know how it is addressed. We may
>> have to investigate this. Any pointers would be helpful.
>>
>>
>>>
>>> There seem to be two opinions about this file at the moment. Would it be possible to
>>> summarize the discussion with pros/cons raised to make an informed selection?
>>> I understand that Google as represented by Peter no longer requires/requests this
>>> file but the motivation for this change seems new and does not seem to reduce the
>>> original motivation for this file. We may also want to separate requirements for reading
>>> from and writing to this file.
>>
>> Yea. We can just use mbm_assign_control for reading the assignment states.
>>
>> Summary: We have two proposals.
>>
>> First one from Peter:
>>
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>>
>> Pros
>> a. Allows flexible creation of free-form names for assignable
>> configurations, stored in info/L3_MON/counter_configs/.
>>
>> b. Events can be accessed using corresponding free-form names in the
>> mon_data directory, making it clear to users what each event represents.
>>
>>
>> Cons:
>> a. Requires three separate files for assignment in each group
>> (assign_exclusive, assign_shared, unassign), which might be excessive.
>>
>> b. No built-in listing support, meaning users must query each group
>> individually to check assignment states.
>
> How big of a problem is this in reality? I'd assume that users of this
> feature would only reassign counter attributes at some slow rate (set
> up counters, measure for at least a few seconds, then set up for next
> measurement). Cost to open/read/close a few hundred kernfs files isn't
> very high. Biggest cost might be hogging the resctrl mutex which would
> cause jitter in the tasks reading data from resctrl monitors.
Good point. How long the resctrl mutex is held should also be
considered when exploring the mbm_assign_control file. If a user attempts
to make many changes using a single file like that, then holding the resctrl
mutex during the entire configuration may also have a big impact. This may be
of more concern with the additional automation being added to resctrl, for
example the upcoming "shared assignment" that does automatic assignment of
counters.
>
> Anyone doing this at scale should be able to keep track of what they set,
> so wouldn't need to read at all. I'm not a big believer in "multiple
> agents independently tweaking resctrl without knowledge of each other".
>
>>
>> Second Proposal (Mine)
>>
>> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>>
>> Pros:
>>
>> a. Maintains the flexibility of free-form names for assignable
>> configurations (info/L3_MON/counter_configs/).
>>
>> b. Events remain accessible via free-form names in mon_data, ensuring
>> clarity on their purpose.
>>
>> c. Adds the ability to list assignment states for all groups in a single
>> command.
>>
>> Cons:
>> a. Potential buffer overflow issues when handling a large number of
>> groups and domains, and the code complexity needed to address them.
>>
>>
>> Third Option: A Hybrid Approach
>>
>> We could combine elements from both proposals:
>>
>> a. Retain the free-form naming approach for assignable configurations in
>> info/L3_MON/counter_configs/.
>>
>> b. Use the assignment method from the first proposal:
>> $mkdir test
>> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>>
>> c. Introduce listing support via the info/L3_MON/mbm_assign_control
>> interface, enabling users to read assignment states for all groups in one
>> place (read-only; see the sketch below).
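
As a rough end-to-end sketch of this hybrid flow (all paths and names are only illustrative of the proposals above, not a settled interface):

  # 1. inspect or adjust a named counter configuration
  cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
  # 2. assign it per domain, as in the first proposal
  mkdir /sys/fs/resctrl/test
  echo mbm_local_bytes > /sys/fs/resctrl/test/mon_data/mon_L3_00/assign_exclusive
  # 3. list the assignment state of all groups in one read-only file
  cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
  # 4. read the resulting event count
  cat /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes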
>>
>>
>>>
>>>>>>>
>>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>>
>>>>>>> Format:
>>>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>>>
>>>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> test//mbm_total_bytes:0=u;1=u
>>>>>>> test//mbm_local_bytes:0=u;1=u
>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>>
>>>>>>> The corresponding events will be read in
>>>>>>>
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>
>>>>>>> 7. In the first stage, only two configurations (mbm_total_bytes and
>>>>>>> mbm_local_bytes) will be supported.
>>>>>>>
>>>>>>> 8. In the future, there will be options to create multiple configurations
>>>>>>> and a corresponding directory will be created in
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>>
>>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>>> directory? Like this:
>>>>>
>>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>>
>>>>> This seems OK (dependent on the user picking meaningful names for
>>>>> the set of attributes picked ... but if they want to name this
>>>>> monitor file "brian" then they have to live with any confusion
>>>>> that they bring on themselves).
>>>>>
>>>>> Would this involve an extension to kernfs? I don't see a function
>>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>>
>>>>>>>
>>>>>>
>>>>>> I know you are all busy with multiple series going on in parallel. I am still
>>>>>> waiting for input on this. It would be great if you could spend some time
>>>>>> on this to see if we can find common ground on the interface.
>>>>>>
>>>>>> Thanks
>>>>>> Babu
>>>>>
>>>>> -Tony
>>>>>
>>>>
>>>>
>>>> thanks
>>>> Babu
>>>
>>> Reinette
>>>
>>>
>>
>> --
>> Thanks
>> Babu Moger
>
> -Tony
Hi Tony,
On 3/11/25 15:53, Luck, Tony wrote:
> On Tue, Mar 11, 2025 at 03:35:28PM -0500, Moger, Babu wrote:
>> Hi All,
>>
>> On 3/10/25 22:51, Reinette Chatre wrote:
>>>
>>>
>>> On 3/10/25 6:44 PM, Moger, Babu wrote:
>>>> Hi Tony,
>>>>
>>>> On 3/10/2025 6:22 PM, Luck, Tony wrote:
>>>>> On Mon, Mar 10, 2025 at 05:48:44PM -0500, Moger, Babu wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> On 3/5/2025 1:34 PM, Moger, Babu wrote:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> On 3/5/25 04:40, Peter Newman wrote:
>>>>>>>> Hi Babu,
>>>>>>>>
>>>>>>>> On Tue, Mar 4, 2025 at 10:49 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> On 3/4/25 10:44, Peter Newman wrote:
>>>>>>>>>> On Mon, Mar 3, 2025 at 8:16 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter/Reinette,
>>>>>>>>>>>
>>>>>>>>>>> On 2/26/25 07:27, Peter Newman wrote:
>>>>>>>>>>>> Hi Babu,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/25/25 11:11, Peter Newman wrote:
>>>>>>>>>>>>>> Hi Reinette,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>>>>>>>>>>>> customers.
>>>>>>>>>>>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>>>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>>>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>>>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>>>>>>>>>>>> event names.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for clarifying.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>>>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>>>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>>>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>>>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>>>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>>>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>>>>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>>>>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>>>>>>>>>>>> configuration is a requirement?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>>>>>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>>>>>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>>>>>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>>>>>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>>>>>>>>>>>> there's less pressure on the counters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>>>>>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>>>>>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>>>>>>>>>>>> many counters the group needs in each domain.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>>>>>>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>>>>>>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>>>>>>>>>>>> of the hardware.
>>>>>>>>>>>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>>>>>>>>>>>> earlier example copied below:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (per domain)
>>>>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>>>>>>>>>>>> I understand it:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>>>>>>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>>>>>>>>>>>> in domain 1, resulting in:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>>>>>>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> group 0:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> group 1:
>>>>>>>>>>>>>>>>> domain 0:
>>>>>>>>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> domain 1:
>>>>>>>>>>>>>>>>> counter 0: LclFill,RmtFill
>>>>>>>>>>>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>>> counter 3: VictimBW
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>>>>>>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>>>>>>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>>>>>>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>>>>>>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>>>>>>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>>>>>>>>>>>> system configuration, the user will settle on a handful of useful
>>>>>>>>>>>>>>>> groupings to count.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>>>>>>>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # legacy "total" configuration, effectively r+w
>>>>>>>>>>>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - group2 is restricted to domain 0
>>>>>>>>>>>>>>>> - group3 is restricted to domain 1
>>>>>>>>>>>>>>>> - the rest are unrestricted
>>>>>>>>>>>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see. Thank you for the example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> resctrl supports per-domain configurations with the following possible when
>>>>>>>>>>>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>>>>>>>>>>>> be configured differently in each domain.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>>>>>>>>>>>> reasonable to carry this forward to what will be supported next.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The hardware supports both a per-domain mode, where all groups in a
>>>>>>>>>>>>>> domain use the same configurations and are limited to two events per
>>>>>>>>>>>>>> group and a per-group mode where every group can be configured and
>>>>>>>>>>>>>> assigned freely. This series is using the legacy counter access mode
>>>>>>>>>>>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>>>>>>>>>>>> in the domain can be read. If we chose to read the assigned counter
>>>>>>>>>>>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>>>>>>>>>>>> rather than asking the hardware to find the counter by RMID, we would
>>>>>>>>>>>>>> not be limited to 2 counters per group/domain and the hardware would
>>>>>>>>>>>>>> have the same flexibility as on MPAM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In extended mode, the contents of a specific counter can be read by
>>>>>>>>>>>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>>>>>>>>>>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>>>>>>>>>>>> QM_CTR will then return the contents of the specified counter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is documented below.
>>>>>>>>>>>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>>>>>>>>>>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>>>>>>>>>>>
>>>>>>>>>>>>> We previously discussed this with you (off the public list) and I
>>>>>>>>>>>>> initially proposed the extended assignment mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>>>>>>>>>>>> counters to be assigned to the same group, rather than being limited to
>>>>>>>>>>>>> just two.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, the challenge is that we currently lack the necessary interfaces
>>>>>>>>>>>>> to configure multiple events per group. Without these interfaces, the
>>>>>>>>>>>>> extended mode is not practical at this time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>>>>>>>>>>>> require modifications to the existing interface, allowing us to continue
>>>>>>>>>>>>> using it as is.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (I might have said something confusing in my last messages because I
>>>>>>>>>>>>>> had forgotten that I switched to the extended assignment mode when
>>>>>>>>>>>>>> prototyping with soft-ABMC and MPAM.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Forcing all groups on a domain to share the same 2 counter
>>>>>>>>>>>>>> configurations would not be acceptable for us, as the example I gave
>>>>>>>>>>>>>> earlier is one I've already been asked about.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don’t see this as a blocker. It should be considered an extension to the
>>>>>>>>>>>>> current ABMC series. We can easily build on top of this series once we
>>>>>>>>>>>>> finalize how to configure the multiple event interface for each group.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think it is, either. Only being able to use ABMC to assign
>>>>>>>>>>>> counters is fine for our use as an incremental step. My longer-term
>>>>>>>>>>>> concern is the domain-scoped mbm_total_bytes_config and
>>>>>>>>>>>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>>>>>>>>>>>> there's already an expectation that the files are present when BMEC is
>>>>>>>>>>>> supported.
>>>>>>>>>>>>
>>>>>>>>>>>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>>>>>>>>>>>> ABMC when only the BMEC-style event configuration interface exists.
>>>>>>>>>>>> The scope of my issue is just whether enabling "full" ABMC support
>>>>>>>>>>>> will require an additional opt-in, since that could remove the BMEC
>>>>>>>>>>>> interface. If it does, it's something we can live with.
>>>>>>>>>>>
>>>>>>>>>>> As you know, this series is currently blocked without further feedback.
>>>>>>>>>>>
>>>>>>>>>>> I’d like to begin reworking these patches to incorporate Peter’s feedback.
>>>>>>>>>>> Any input or suggestions would be appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Here’s what we’ve learned so far:
>>>>>>>>>>>
>>>>>>>>>>> 1. Assignments should be independent of BMEC.
>>>>>>>>>>> 2. We should be able to specify multiple event types to a counter (e.g.,
>>>>>>>>>>> read, write, VictimBW, etc.). This is also called a shared counter.
>>>>>>>>>>> 3. There should be an option to assign events per domain.
>>>>>>>>>>> 4. Currently, only two counters can be assigned per group, but the design
>>>>>>>>>>> should allow flexibility to assign more in the future as the interface
>>>>>>>>>>> evolves.
>>>>>>>>>>> 5. Utilize the extended RMID read mode.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Here is my proposal using Peter's earlier example:
>>>>>>>>>>>
>>>>>>>>>>> # define event configurations
>>>>>>>>>>>
>>>>>>>>>>> ========================================================
>>>>>>>>>>> Bits Mnemonics Description
>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>> 6 VictimBW Dirty Victims from all types of memory
>>>>>>>>>>> 5 RmtSlowFill Reads to slow memory in the non-local NUMA domain
>>>>>>>>>>> 4 LclSlowFill Reads to slow memory in the local NUMA domain
>>>>>>>>>>> 3 RmtNTWr Non-temporal writes to non-local NUMA domain
>>>>>>>>>>> 2 LclNTWr Non-temporal writes to local NUMA domain
>>>>>>>>>>> 1 RmtFill Reads to memory in the non-local NUMA domain
>>>>>>>>>>> 0 LclFill Reads to memory in the local NUMA domain
>>>>>>>>>>> ==== ========================================================
>>>>>>>>>>>
>>>>>>>>>>> #Define flags based on combination of above event types.
>>>>>>>>>>>
>>>>>>>>>>> t = LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> l = LclFill, LclNTWr, LclSlowFill
>>>>>>>>>>> r = LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>>> w = VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>>> v = VictimBW
>>>>>>>>>>>
>>>>>>>>>>> Peter suggested the following format earlier :
>>>>>>>>>>>
>>>>>>>>>>> /group0/0=t;1=t
>>>>>>>>>>> /group1/0=t;1=t
>>>>>>>>>>> /group2/0=_;1=t
>>>>>>>>>>> /group3/0=rw;1=_
>>>>>>>>>>
>>>>>>>>>> After some inquiries within Google, it sounds like nobody has invested
>>>>>>>>>> much into the current mbm_assign_control format yet, so it would be
>>>>>>>>>> best to drop it and distribute the configuration around the filesystem
>>>>>>>>>> hierarchy[1], which should allow us to produce something more flexible
>>>>>>>>>> and cleaner to implement.
>>>>>>>>>>
>>>>>>>>>> Roughly what I had in mind:
>>>>>>>>>>
>>>>>>>>>> Use mkdir in a info/<resource>_MON subdirectory to create free-form
>>>>>>>>>> names for the assignable configurations rather than being restricted
>>>>>>>>>> to single letters. In the resulting directory, populate a file where
>>>>>>>>>> we can specify the set of events the config should represent. I think
>>>>>>>>>> we should use symbolic names for the events rather than raw BMEC field
>>>>>>>>>> values. Moving forward we could come up with portable names for common
>>>>>>>>>> events and only support the BMEC names on AMD machines for users who
>>>>>>>>>> want specific events and don't care about portability.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I’m still processing this. Let me start with some initial questions.
>>>>>>>>>
>>>>>>>>> So, we are creating event configurations here, which seems reasonable.
>>>>>>>>>
>>>>>>>>> Yes, we should use portable names and are not limited to BMEC names.
>>>>>>>>>
>>>>>>>>> How many configurations should we allow? Do we know?
>>>>>>>>
>>>>>>>> Do we need an upper limit?
>>>>>>>
>>>>>>> I think so. This needs to be maintained in some data structure. We can
>>>>>>> start with 2 default configurations for now.
>>>
>>> There is a big difference between no upper limit and 2. The hardware is
>>> capable of supporting per-domain configurations so more flexibility is
>>> certainly possible. Consider the example presented by Peter in:
>>> https://lore.kernel.org/lkml/CALPaoCi0mFZ9TycyNs+SCR+2tuRJovQ2809jYMun4HtC64hJmA@mail.gmail.com/
>>>
>>>>>>>>>> Next, put assignment-control file nodes in per-domain directories
>>>>>>>>>> (i.e., mon_data/mon_L3_00/assign_{exclusive,shared}). Writing a
>>>>>>>>>> counter-configuration name into the file would then allocate a counter
>>>>>>>>>> in the domain, apply the named configuration, and monitor the parent
>>>>>>>>>> group-directory. We can also put a group/resource-scoped assign_* file
>>>>>>>>>> higher in the hierarchy to make it easier for users who want to
>>>>>>>>>> configure all domains the same for a group.
>>>>>>>>>
>>>>>>>>> What is the difference between shared and exclusive?
>>>>>>>>
>>>>>>>> Shared assignment[1] means that non-exclusively-assigned counters in
>>>>>>>> each domain will be scheduled round-robin to the groups requesting
>>>>>>>> shared access to a counter. In my tests, I assigned the counters long
>>>>>>>> enough to produce a single 1-second MB/s sample for the per-domain
>>>>>>>> aggregation files[2].
>>>>>>>>
>>>>>>>> These do not need to be implemented immediately, but knowing that they
>>>>>>>> work addresses the overhead and scalability concerns of reassigning
>>>>>>>> counters and reading their values.
>>>>>>>
>>>>>>> Ok. Let's focus on exclusive assignments for now.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Having three files—assign_shared, assign_exclusive, and unassign—for each
>>>>>>>>> domain seems excessive. In a system with 32 groups and 12 domains, this
>>>>>>>>> results in 32 × 12 × 3 files, which is quite large.
>>>>>>>>>
>>>>>>>>> There should be a more efficient way to handle this.
>>>>>>>>>
>>>>>>>>> Initially, we started with a group-level file for this interface, but it
>>>>>>>>> was rejected due to the high number of sysfs calls, making it inefficient.
>>>>>>>>
>>>>>>>> I had rejected it due to the high-frequency of access of a large
>>>>>>>> number of files, which has since been addressed by shared assignment
>>>>>>>> (or automatic reassignment) and aggregated mbps files.
>>>>>>>
>>>>>>> I think we should address this as well. Creating three extra files for
>>>>>>> each group isn’t ideal when there are more efficient alternatives.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Additionally, how can we list all assignments with a single sysfs call?
>>>>>>>>>
>>>>>>>>> That was another problem we need to address.
>>>>>>>>
>>>>>>>> This is not a requirement I was aware of. If the user forgot where
>>>>>>>> they assigned counters (or forgot to disable auto-assignment), they
>>>>>>>> can read multiple sysfs nodes to remind themselves.
>>>>>>>
>>>>>>> I suggest, we should provide users with an option to list the assignments
>>>>>>> of all groups in a single command. As the number of groups increases, it
>>>>>>> becomes cumbersome to query each group individually.
>>>>>>>
>>>>>>> To achieve this, we can reuse our existing mbm_assign_control interface
>>>>>>> for this purpose. More details on this below.
>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The configuration names listed in assign_* would result in files of
>>>>>>>>>> the same name in the appropriate mon_data domain directories from
>>>>>>>>>> which the count values can be read.
>>>>>>>>>>
>>>>>>>>>> # mkdir info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>>> # echo LclFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # echo LclNTWr > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # echo LclSlowFill > info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> # cat info/L3_MON/counter_configs/mbm_local_bytes/event_filter
>>>>>>>>>> LclFill
>>>>>>>>>> LclNTWr
>>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>> I feel we can just have the configs. event_filter file is not required.
>>>>>>>>
>>>>>>>> That's right, I forgot that we can implement kernfs_ops::open(). I was
>>>>>>>> only looking at struct kernfs_syscall_ops
>>>>>>>>
>>>>>>>>>
>>>>>>>>> #cat info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>>>> LclFill <-rename these to generic names.
>>>>>>>>> LclNTWr
>>>>>>>>> LclSlowFill
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think portable and non-portable event names should both be available
>>>>>>>> as options. There are simple bandwidth measurement mechanisms that
>>>>>>>> will be applied in general, but when they turn up an issue, it can
>>>>>>>> often lead to a more focused investigation, requiring more precise
>>>>>>>> events.
>>>>>>>
>>>>>>> I agree. We should provide both portable and non-portable event names.
>>>>>>>
>>>>>>> Here is my draft proposal based on the discussion so far, reusing some of
>>>>>>> the current interface. The idea is to start with a basic assignment
>>>>>>> feature, with options to enhance it in the future. Feel free to
>>>>>>> comment/suggest.
>>>>>>>
>>>>>>> 1. Event configurations will be in
>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/.
>>>>>>>
>>>>>>> There will be two pre-defined configurations by default.
>>>>>>>
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_total_bytes
>>>>>>> LclFill, LclNTWr, LclSlowFill, VictimBW, RmtSlowFill, RmtNTWr, RmtFill
>>>>>>>
>>>>>>> #cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> LclFill, LclNTWr, LclSlowFill
>>>>>>>
>>>>>>> 2. Users will have options to update these configurations.
>>>>>>>
>>>>>>> #echo "LclFill, LclNTWr, RmtFill" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>
>>>>> This part seems odd to me. Now the "mbm_local_bytes" files aren't
>>>>> reporting "local_bytes" any more. They report something different,
>>>>> and users only know if they come to check the options currently
>>>>> configured in this file. Changing the contents without changing
>>>>> the name seems confusing to me.
>>>>
>>>> It is the same behaviour right now with BMEC. It is configurable.
>>>> By default it is mbm_local_bytes, but users can configure whatever they want to monitor using /info/L3_MON/mbm_local_bytes_config.
>>>>
>>>> We can continue the same behaviour with ABMC, but the configuration will be in /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes.
>>>
>>> This could be supported by following Peter's original proposal where the name
>>> of the counter configuration is provided by the user via a mkdir:
>>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>>
>>> As he mentioned there could be pre-populated mbm_local_bytes/mbm_total_bytes.
>>
>> Sure, we can do that. I was thinking that in the first phase we just
>> provide the default pre-defined configurations and an option to update
>> them.
>>
>> We can add the mkdir support later. That way we can provide basic ABMC
>> support without the extra code complexity that mkdir support brings.
>>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/counter_configs/mbm_local_bytes
>>>>>>> LclFill, LclNTWr, RmtFill
>>>>>>>
>>>>>>> 3. The default configurations will be used when the user mounts resctrl.
>>>>>>>
>>>>>>> mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>> mkdir /sys/fs/resctrl/test/
>>>>>>>
>>>>>>> 4. The resctrl group/domains can be in one of these assignment states.
>>>>>>> e: Exclusive
>>>>>>> s: Shared
>>>>>>> u: Unassigned
>>>>>>>
>>>>>>> Exclusive mode is supported now. Shared mode will be supported in the
>>>>>>> future.
>>>>>>>
>>>>>>> 5. We can use the current /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> to list the assignment state of all the groups.
>>>>>>>
>>>>>>> Format:
>>>>>>> "<CTRL_MON group>/<MON group>/<confguration>:<domain_id>=<assign state>"
>>>>>>>
>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> test//mbm_total_bytes:0=e;1=e
>>>>>>> test//mbm_local_bytes:0=e;1=e
>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>> //mbm_local_bytes:0=e;1=e
>>>
>>> This would make mbm_assign_control even more unwieldy and quicker to exceed a
>>> page of data (these examples never seem to reflect those AMD systems with the many
>>> L3 domains). How to handle resctrl files larger than 4KB needs to be well understood
>>> and solved when/if going this route.
>>
>> This problem is not specific to this series. I feel it is a generic problem
>> for many similar interfaces. I don't know how it is addressed; I may have
>> to investigate it. Any pointers would be helpful.
>>
>>
>>>
>>> There seems to be two opinions about this file at moment. Would it be possible to
>>> summarize the discussion with pros/cons raised to make an informed selection?
>>> I understand that Google as represented by Peter no longer requires/requests this
>>> file but the motivation for this change seems new and does not seem to reduce the
>>> original motivation for this file. We may also want to separate requirements for reading
>>> from and writing to this file.
>>
>> Yea. We can just use mbm_assign_control for reading the assignment states.
>>
>> Summary: We have two proposals.
>>
>> First one from Peter:
>>
>> https://lore.kernel.org/lkml/CALPaoCiii0vXOF06mfV=kVLBzhfNo0SFqt4kQGwGSGVUqvr2Dg@mail.gmail.com/
>>
>>
>> Pros
>> a. Allows flexible creation of free-form names for assignable
>> configurations, stored in info/L3_MON/counter_configs/.
>>
>> b. Events can be accessed using corresponding free-form names in the
>> mon_data directory, making it clear to users what each event represents.
>>
>>
>> Cons:
>> a. Requires three separate files for assignment in each group
>> (assign_exclusive, assign_shared, unassign), which might be excessive.
>>
>> b. No built-in listing support, meaning users must query each group
>> individually to check assignment states.
>
> How big of a problem is this in reality? I'd assume that users of this
> feature would only reassign counter attributes at some slow rate (set
> up counters, measure for at least a few seconds, then set up for next
> measurement). Cost to open/read/close a few hundred kernfs files isn't
> very high. Biggest cost might be hogging the resctrl mutex which would
> cause jitter in the tasks reading data from resctrl monitors.
Yes. That is a good point. I don't know how big a problem it is in practice.
But we all need to agree that group listing is not a requirement. We can go
ahead on that route.
Let's hear from all the parties.
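
For context, here is a rough, self-contained userspace sketch of what
querying each group individually would look like under the first proposal.
The mon_data/mon_L3_00/assign_exclusive path is hypothetical (it exists only
in that proposal); the point is just that the per-file open/read/close cost
is small:

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	DIR *root = opendir("/sys/fs/resctrl");
	struct dirent *de;
	char path[512], buf[256];
	ssize_t n;
	int fd;

	if (!root)
		return 1;
	while ((de = readdir(root)) != NULL) {
		/* Hypothetical per-domain file from the first proposal */
		snprintf(path, sizeof(path),
			 "/sys/fs/resctrl/%s/mon_data/mon_L3_00/assign_exclusive",
			 de->d_name);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			continue;	/* not a monitoring group directory */
		n = read(fd, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = '\0';
			printf("%s: %s", de->d_name, buf);
		}
		close(fd);
	}
	closedir(root);
	return 0;
}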
>
> Anyone doing this at scale should be able to keep track of what they set,
> so wouldn't need to read at all. I'm not a big believer in "multiple
> agents independently tweaking resctrl without knowledge of each other".
>
>>
>> Second Proposal (Mine)
>>
>> https://lore.kernel.org/lkml/a4ab53b5-03be-4299-8853-e86270d46f2e@amd.com/
>>
>> Pros:
>>
>> a. Maintains the flexibility of free-form names for assignable
>> configurations (info/L3_MON/counter_configs/).
>>
>> b. Events remain accessible via free-form names in mon_data, ensuring
>> clarity on their purpose.
>>
>> c. Adds the ability to list assignment states for all groups in a single
>> command.
>>
>> Cons:
>> a. Potential buffer overflow issues when handling a large number of
>> groups and domains, plus the code complexity needed to fix that.
>>
>>
>> Third Option: A Hybrid Approach
>>
>> We could combine elements from both proposals:
>>
>> a. Retain the free-form naming approach for assignable configurations in
>> info/L3_MON/counter_configs/.
>>
>> b. Use the assignment method from the first proposal:
>> $mkdir test
>> $echo mbm_local_bytes > test/mon_data/mon_L3_00/assign_exclusive
>>
>> c. Introduce listing support via the info/L3_MON/mbm_assign_control
>> interface, enabling users to read assignment states for all groups in one
>> place. This would be read-only support.
>>
>>
>>>
>>>>>>>
>>>>>>> 6. Users can modify the assignment state by writing to mbm_assign_control.
>>>>>>>
>>>>>>> Format:
>>>>>>> “<CTRL_MON group>/<MON group>/<configuration>:<domain_id>=<assign state>”
>>>>>>>
>>>>>>> #echo "test//mbm_local_bytes:0=e;1=e" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>> #echo "test//mbm_local_bytes:0=u;1=u" >
>>>>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>
>>>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>> test//mbm_total_bytes:0=u;1=u
>>>>>>> test//mbm_local_bytes:0=u;1=u
>>>>>>> //mbm_total_bytes:0=e;1=e
>>>>>>> //mbm_local_bytes:0=e;1=e
>>>>>>>
>>>>>>> The corresponding events will be read in
>>>>>>>
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_total_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_01/mbm_local_bytes
>>>>>>>
>>>>>>> 7. In the first stage, only two configurations (mbm_total_bytes and
>>>>>>> mbm_local_bytes) will be supported.
>>>>>>>
>>>>>>> 8. In the future, there will be options to create multiple configurations,
>>>>>>> and corresponding directories will be created in
>>>>>>> /sys/fs/resctrl/test/mon_data/mon_L3_00/<configuration name>.
>>>>>
>>>>> Would this be done by creating a new file in the /sys/fs/resctrl/info/L3_MON/counter_configs
>>>>> directory? Like this:
>>>>>
>>>>> # echo "LclFill, LclNTWr, RmtFill" >
>>>>> /sys/fs/resctrl/info/L3_MON/counter_configs/cache_stuff
>>>>>
>>>>> This seems OK (dependent on the user picking meaningful names for
>>>>> the set of attributes picked ... but if they want to name this
>>>>> monitor file "brian" then they have to live with any confusion
>>>>> that they bring on themselves).
>>>>>
>>>>> Would this involve an extension to kernfs? I don't see a function
>>>>> pointer callback for file creation in kernfs_syscall_ops.
>>>>>
>>>>>>>
>>>>>>
>>>>>> I know you are all busy with multiple series going on in parallel. I am
>>>>>> still waiting for input on this. It would be great if you could spend some
>>>>>> time on this to see if we can find common ground on the interface.
>>>>>>
>>>>>> Thanks
>>>>>> Babu
>>>>>
>>>>> -Tony
>>>>>
>>>>
>>>>
>>>> thanks
>>>> Babu
>>>
>>> Reinette
>>>
>>>
>>
>> --
>> Thanks
>> Babu Moger
>
> -Tony
>
--
Thanks
Babu Moger
Hi Peter/Reinette,
On 2/26/25 10:25, Reinette Chatre wrote:
> Hi Peter,
>
> On 2/26/25 5:27 AM, Peter Newman wrote:
>> Hi Babu,
>>
>> On Tue, Feb 25, 2025 at 10:31 PM Moger, Babu <babu.moger@amd.com> wrote:
>>>
>>> Hi Peter,
>>>
>>> On 2/25/25 11:11, Peter Newman wrote:
>>>> Hi Reinette,
>>>>
>>>> On Fri, Feb 21, 2025 at 11:43 PM Reinette Chatre
>>>> <reinette.chatre@intel.com> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> On 2/21/25 5:12 AM, Peter Newman wrote:
>>>>>> On Thu, Feb 20, 2025 at 7:36 PM Reinette Chatre
>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>> On 2/20/25 6:53 AM, Peter Newman wrote:
>>>>>>>> On Wed, Feb 19, 2025 at 7:21 PM Reinette Chatre
>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>> On 2/19/25 3:28 AM, Peter Newman wrote:
>>>>>>>>>> On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre
>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>> On 2/17/25 2:26 AM, Peter Newman wrote:
>>>>>>>>>>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>>>>>>>>>>> <reinette.chatre@intel.com> wrote:
>>>>>>>>>>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>>>>>>>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>>>>>>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>>>>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>>>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>>>>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>>>>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>>>>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> mbm_local_read_bytes a
>>>>>>>>>>>>>>>>> mbm_local_write_bytes b
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>>>>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>>>>>>>>>>> <value>
>>>>>>>>>>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>>>>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>>>>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>>>>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>>>>>>>>>>
>>>>>>>>>>>>> As mentioned above, one possible issue with existing interface is that
>>>>>>>>>>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>>>>>>>>>>> is low enough to be of concern.
>>>>>>>>>>>>
>>>>>>>>>>>> The events which can be monitored by a single counter on ABMC and MPAM
>>>>>>>>>>>> so far are combinable, so 26 counters per group today means it limits
>>>>>>>>>>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>>>>>>>>>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>>>>>>>>>>> investigation, I would question whether they know what they're looking
>>>>>>>>>>>> for.
>>>>>>>>>>>
>>>>>>>>>>> The key here is "so far" as well as the focus on MBM only.
>>>>>>>>>>>
>>>>>>>>>>> It is impossible for me to predict what we will see in a couple of years
>>>>>>>>>>> from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface
>>>>>>>>>>> to support their users. Just looking at the Intel RDT spec the event register
>>>>>>>>>>> has space for 32 events for each "CPU agent" resource. That does not take into
>>>>>>>>>>> account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned
>>>>>>>>>>> that he is working on patches [1] that will add new events and shared the idea
>>>>>>>>>>> that we may be trending to support "perf" like events associated with RMID. I
>>>>>>>>>>> expect AMD PQoS and Arm MPAM to provide related enhancements to support their
>>>>>>>>>>> customers.
>>>>>>>>>>> This all makes me think that resctrl should be ready to support more events than 26.
>>>>>>>>>>
>>>>>>>>>> I was thinking of the letters as representing a reusable, user-defined
>>>>>>>>>> event-set for applying to a single counter rather than as individual
>>>>>>>>>> events, since MPAM and ABMC allow us to choose the set of events each
>>>>>>>>>> one counts. Wherever we define the letters, we could use more symbolic
>>>>>>>>>> event names.
>>>>>>>>>
>>>>>>>>> Thank you for clarifying.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the letters as events model, choosing the events assigned to a
>>>>>>>>>> group wouldn't be enough information, since we would want to control
>>>>>>>>>> which events should share a counter and which should be counted by
>>>>>>>>>> separate counters. I think the amount of information that would need
>>>>>>>>>> to be encoded into mbm_assign_control to represent the level of
>>>>>>>>>> configurability supported by hardware would quickly get out of hand.
>>>>>>>>>>
>>>>>>>>>> Maybe as an example, one counter for all reads, one counter for all
>>>>>>>>>> writes in ABMC would look like...
>>>>>>>>>>
>>>>>>>>>> (L3_QOS_ABMC_CFG.BwType field names below)
>>>>>>>>>>
>>>>>>>>>> (per domain)
>>>>>>>>>> group 0:
>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think this may also be what Dave was heading towards in [2] but in that
>>>>>>>>> example and above the counter configuration appears to be global. You do mention
>>>>>>>>> "configurability supported by hardware" so I wonder if per-domain counter
>>>>>>>>> configuration is a requirement?
>>>>>>>>
>>>>>>>> If it's global and we want a particular group to be watched by more
>>>>>>>> counters, I wouldn't want this to result in allocating more counters
>>>>>>>> for that group in all domains, or allocating counters in domains where
>>>>>>>> they're not needed. I want to encourage my users to avoid allocating
>>>>>>>> monitoring resources in domains where a job is not allowed to run so
>>>>>>>> there's less pressure on the counters.
>>>>>>>>
>>>>>>>> In Dave's proposal it looks like global configuration means
>>>>>>>> globally-defined "named counter configurations", which works because
>>>>>>>> it's really per-domain assignment of the configurations to however
>>>>>>>> many counters the group needs in each domain.
>>>>>>>
>>>>>>> I think I am becoming lost. Would a global configuration not break your
>>>>>>> view of "event-set applied to a single counter"? If a counter is configured
>>>>>>> globally then it would not make it possible to support the full configurability
>>>>>>> of the hardware.
>>>>>>> Before I add more confusion, let me try with an example that builds on your
>>>>>>> earlier example copied below:
>>>>>>>
>>>>>>>>>> (per domain)
>>>>>>>>>> group 0:
>>>>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> group 1:
>>>>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>>>> ...
>>>>>>>
>>>>>>> Since the above states "per domain" I rewrite the example to highlight that as
>>>>>>> I understand it:
>>>>>>>
>>>>>>> group 0:
>>>>>>> domain 0:
>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> domain 1:
>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>> domain 0:
>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> domain 1:
>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>> You mention that you do not want counters to be allocated in domains that they
>>>>>>> are not needed in. So, let's say group 0 does not need counter 0 and counter 1
>>>>>>> in domain 1, resulting in:
>>>>>>>
>>>>>>> group 0:
>>>>>>> domain 0:
>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>> domain 0:
>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> domain 1:
>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>>
>>>>>>> With counter 0 and counter 1 available in domain 1, these counters could
>>>>>>> theoretically be configured to give group 1 more data in domain 1:
>>>>>>>
>>>>>>> group 0:
>>>>>>> domain 0:
>>>>>>> counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 1: VictimBW,LclNTWr,RmtNTWr
>>>>>>> group 1:
>>>>>>> domain 0:
>>>>>>> counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>>> counter 3: VictimBW,LclNTWr,RmtNTWr
>>>>>>> domain 1:
>>>>>>> counter 0: LclFill,RmtFill
>>>>>>> counter 1: LclNTWr,RmtNTWr
>>>>>>> counter 2: LclSlowFill,RmtSlowFill
>>>>>>> counter 3: VictimBW
>>>>>>>
>>>>>>> The counters are shown with different per-domain configurations that seems to
>>>>>>> match with earlier goals of (a) choose events counted by each counter and
>>>>>>> (b) do not allocate counters in domains where they are not needed. As I
>>>>>>> understand the above does contradict global counter configuration though.
>>>>>>> Or do you mean that only the *name* of the counter is global and then
>>>>>>> that it is reconfigured as part of every assignment?
>>>>>>
>>>>>> Yes, I meant only the *name* is global. I assume based on a particular
>>>>>> system configuration, the user will settle on a handful of useful
>>>>>> groupings to count.
>>>>>>
>>>>>> Perhaps mbm_assign_control syntax is the clearest way to express an example...
>>>>>>
>>>>>> # define global configurations (in ABMC terms), not necessarily in this
>>>>>> # syntax and probably not in the mbm_assign_control file.
>>>>>>
>>>>>> r=LclFill,RmtFill,LclSlowFill,RmtSlowFill
>>>>>> w=VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>> # legacy "total" configuration, effectively r+w
>>>>>> t=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>>>
>>>>>> /group0/0=t;1=t
>>>>>> /group1/0=t;1=t
>>>>>> /group2/0=_;1=t
>>>>>> /group3/0=rw;1=_
>>>>>>
>>>>>> - group2 is restricted to domain 0
>>>>>> - group3 is restricted to domain 1
>>>>>> - the rest are unrestricted
>>>>>> - In group3, we decided we need to separate read and write traffic
>>>>>>
>>>>>> This consumes 4 counters in domain 0 and 3 counters in domain 1.
>>>>>>
>>>>>
>>>>> I see. Thank you for the example.
>>>>>
>>>>> resctrl supports per-domain configurations with the following possible when
>>>>> using mbm_total_bytes_config and mbm_local_bytes_config:
>>>>>
>>>>> t(domain 0)=LclFill,RmtFill,LclSlowFill,RmtSlowFill,VictimBW,LclNTWr,RmtNTWr
>>>>> t(domain 1)=LclFill,RmtFill,VictimBW,LclNTWr,RmtNTWr
>>>>>
>>>>> /group0/0=t;1=t
>>>>> /group1/0=t;1=t
>>>>>
>>>>> Even though the flags are identical in all domains, the assigned counters will
>>>>> be configured differently in each domain.
>>>>>
>>>>> With this supported by hardware and currently also supported by resctrl it seems
>>>>> reasonable to carry this forward to what will be supported next.
>>>>
>>>> The hardware supports both a per-domain mode, where all groups in a
>>>> domain use the same configurations and are limited to two events per
>>>> group and a per-group mode where every group can be configured and
>>>> assigned freely. This series is using the legacy counter access mode
>>>> where only counters whose BwType matches an instance of QOS_EVT_CFG_n
>>>> in the domain can be read. If we chose to read the assigned counter
>>>> directly (QM_EVTSEL[ExtendedEvtID]=1, QM_EVTSEL[EvtID]=L3CacheABMC)
>>>> rather than asking the hardware to find the counter by RMID, we would
>>>> not be limited to 2 counters per group/domain and the hardware would
>>>> have the same flexibility as on MPAM.
>>>
>>> In extended mode, the contents of a specific counter can be read by
>>> setting the following fields in QM_EVTSEL: [ExtendedEvtID]=1,
>>> [EvtID]=L3CacheABMC and setting [RMID] to the desired counter ID. Reading
>>> QM_CTR will then return the contents of the specified counter.
>>>
>>> It is documented below.
>>> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>>> Section: 19.3.3.3 Assignable Bandwidth Monitoring (ABMC)
>>>
>>> We previously discussed this with you (off the public list) and I
>>> initially proposed the extended assignment mode.
>>>
>>> Yes, the extended mode allows greater flexibility by enabling multiple
>>> counters to be assigned to the same group, rather than being limited to
>>> just two.
>>>
>>> However, the challenge is that we currently lack the necessary interfaces
>>> to configure multiple events per group. Without these interfaces, the
>>> extended mode is not practical at this time.
>>>
>>> Therefore, we ultimately agreed to use the legacy mode, as it does not
>>> require modifications to the existing interface, allowing us to continue
>>> using it as is.
>>>
>>>>
>>>> (I might have said something confusing in my last messages because I
>>>> had forgotten that I switched to the extended assignment mode when
>>>> prototyping with soft-ABMC and MPAM.)
>>>>
>>>> Forcing all groups on a domain to share the same 2 counter
>>>> configurations would not be acceptable for us, as the example I gave
>>>> earlier is one I've already been asked about.
>>>
>>> I don’t see this as a blocker. It should be considered an extension to the
>>> current ABMC series. We can easily build on top of this series once we
>>> finalize how to configure the multiple event interface for each group.
>>
>> I don't think it is, either. Only being able to use ABMC to assign
>> counters is fine for our use as an incremental step. My longer-term
>> concern is the domain-scoped mbm_total_bytes_config and
>> mbm_local_bytes_config files, but they were introduced with BMEC, so
>> there's already an expectation that the files are present when BMEC is
>> supported.
It's good that we at least know about this concern now. Let's take a step
back and figure out how we can address it.
>>
>> On ABMC hardware that also supports BMEC, I'm concerned about enabling
>> ABMC when only the BMEC-style event configuration interface exists.
>
> ABMC currently depends on BMEC making the current implementation the
> one you are concerned about?
> https://lore.kernel.org/lkml/e4111779ebb0e7004dbedc258eeae2677f578ab1.1737577229.git.babu.moger@amd.com/
I think it is more than that.
The ABMC feature allows event configuration by writing to L3_QOS_ABMC_CFG,
where we can set cntr_id, RMID, and event configuration. Currently, we
derive event configuration from BMEC settings (either
mbm_total_bytes_config or mbm_local_bytes_config).
If we don't use the BMEC values, we would need users to specify the event
configuration settings manually.
struct mbm_cntr_cfg {
	enum resctrl_event_id	evtid;
	struct rdtgroup		*rdtgrp;
};
Currently, we determine the RMID from the rdtgroup and the event type,
while the event configuration relies on BMEC.
To make event configuration independent of BMEC, we can include an
explicit event configuration field:
struct mbm_cntr_cfg {
	enum resctrl_event_id	evtid;
	u32			evt_cfg;	/* user-provided config value */
	struct rdtgroup		*rdtgrp;
};
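
As a minimal, self-contained sketch (all names below are illustrative, not
from the posted series), the point of carrying evt_cfg in the counter state
is that the ABMC programming path can stop consulting the BMEC config files:

#include <stdint.h>

struct example_cntr_cfg {
	int		evtid;		/* which resctrl event this counter backs */
	uint32_t	evt_cfg;	/* user-provided bandwidth-type bitmask */
	void		*rdtgrp;	/* owning monitor group */
};

/* Prefer the explicit per-counter configuration; fall back to a
 * BMEC-derived default only when the user never supplied one. */
static uint32_t example_pick_evt_cfg(const struct example_cntr_cfg *cfg,
				     uint32_t bmec_default)
{
	return cfg->evt_cfg ? cfg->evt_cfg : bmec_default;
}

int main(void)
{
	struct example_cntr_cfg cfg = { .evtid = 0, .evt_cfg = 0x15 };

	/* 0x7f stands in for a BMEC default covering all bandwidth types */
	return example_pick_evt_cfg(&cfg, 0x7f) == 0x15 ? 0 : 1;
}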
Key Considerations
1. Counter Management: Managing counters globally (like CLOSID
management) would be simpler than handling them at the domain level,
though domain-level management is feasible.
2. User Input: Users will need to specify event configuration when
assigning events.
Here is the quick example using our current interface:
a. List the group.
#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=t:0x1F,l:0x15;1=t:0x1F,l:0x15
b. Unassign an Event:
#echo "//0-l" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
#cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=t:0x1F;1=t:0x1F,l:0x15
c. Assign an Event:
#echo "//0+l:0x15" > /sys/fs/resctrl/info/L3_MON/mbm_assign_control
Note that I don't want to rush here.
Peter, can you please spend some time and propose the interface you are
thinking of, based on both ABMC and MPAM?
>
>> The scope of my issue is just whether enabling "full" ABMC support
>> will require an additional opt-in, since that could remove the BMEC
>> interface. If it does, it's something we can live with.
>
>
> Reinette
>
>
--
Thanks
Babu Moger
Hi, On Wed, Feb 19, 2025 at 12:28:16PM +0100, Peter Newman wrote: > Hi Reinette, > > On Tue, Feb 18, 2025 at 6:50 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: > > > > Hi Peter, > > > > On 2/17/25 2:26 AM, Peter Newman wrote: > > > Hi Reinette, > > > > > > On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre > > > <reinette.chatre@intel.com> wrote: [...] > > >> As mentioned above, one possible issue with existing interface is that > > >> it is limited to 26 events (assuming only lower case letters are used). The limit > > >> is low enough to be of concern. > > > > > > The events which can be monitored by a single counter on ABMC and MPAM > > > so far are combinable, so 26 counters per group today means it limits > > > breaking down MBM traffic for each group 26 ways. If a user complained > > > that a 26-way breakdown of a group's MBM traffic was limiting their > > > investigation, I would question whether they know what they're looking > > > for. > > > > The key here is "so far" as well as the focus on MBM only. > > > > It is impossible for me to predict what we will see in a couple of years > > from Intel RDT, AMD PQoS, and Arm MPAM that now all rely on resctrl interface > > to support their users. Just looking at the Intel RDT spec the event register > > has space for 32 events for each "CPU agent" resource. That does not take into > > account the "non-CPU agents" that are enumerated via ACPI. Tony already mentioned > > that he is working on patches [1] that will add new events and shared the idea > > that we may be trending to support "perf" like events associated with RMID. I > > expect AMD PQoS and Arm MPAM to provide related enhancements to support their > > customers. > > This all makes me think that resctrl should be ready to support more events than 26. > > I was thinking of the letters as representing a reusable, user-defined > event-set for applying to a single counter rather than as individual > events, since MPAM and ABMC allow us to choose the set of events each > one counts. Wherever we define the letters, we could use more symbolic > event names. > > In the letters as events model, choosing the events assigned to a > group wouldn't be enough information, since we would want to control > which events should share a counter and which should be counted by > separate counters. I think the amount of information that would need > to be encoded into mbm_assign_control to represent the level of > configurability supported by hardware would quickly get out of hand. > > Maybe as an example, one counter for all reads, one counter for all > writes in ABMC would look like... > > (L3_QOS_ABMC_CFG.BwType field names below) > > (per domain) > group 0: > counter 0: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 1: VictimBW,LclNTWr,RmtNTWr > group 1: > counter 2: LclFill,RmtFill,LclSlowFill,RmtSlowFill > counter 3: VictimBW,LclNTWr,RmtNTWr > ... > > I assume packing all of this info for a group's desired counter > configuration into a single line (with 32 domains per line on many > dual-socket AMD configurations I see) would be difficult to look at, > even if we could settle on a single letter to represent each > universally. > > > > > My goal is for resctrl to have a user interface that can as much as possible > > be ready for whatever may be required from it years down the line. Of course, > > I may be wrong and resctrl would never need to support more than 26 events per > > resource (*). 
The risk is that resctrl *may* need to support more than 26 events > > and how could resctrl support that? > > > > What is the risk of supporting more than 26 events? As I highlighted earlier > > the interface I used as demonstration may become unwieldy to parse on a system > > with many domains that supports many events. This is a concern for me. Any suggestions > > will be appreciated, especially from you since I know that you are very familiar with > > issues related to large scale use of resctrl interfaces. > > It's mainly just the unwieldiness of all the information in one file. > It's already at the limit of what I can visually look through. > > I believe that shared assignments will take care of all the > high-frequency and performance-intensive batch configuration updates I > was originally concerned about, so I no longer see much benefit in > finding ways to textually encode all this information in a single file > when it would be more manageable to distribute it around the > filesystem hierarchy. > > -Peter This was sort of what I had in my mind. I think it may make some sense to support "t" and "l" out of the box, as intuitively backwards-compatible event names, but provide a way to create new "letters" as needed, with well-defined way (customisable or not) of mapping these to event names visible in resctrlfs. I just used the digits for this purpose, but we could have an explicit interface for it. In order for this series to stabilise though, does it make sense to put this out of scope just for now? The current series provides a way to provide the mbm_total_bytes and mbm_local_bytes counters on AMBC and MPAM systems, without having to limit the total number of monitoring groups (MPAM's current approach) or overcommit the counters so that they may not be continuously reliable when there are too many groups (AMD?). That seems immediately useful. The ability to assign arbitrarily many counters to a group is a new feature however. Does it make sense to consider this on its own merits when the baseline ABMC interface has been settled? May main concern right now (from the Arm side) is to be confident that the initial ABMC interface definition doesn't paint us into a corner. Cheers ---Dave
Hi All, On 2/17/25 04:26, Peter Newman wrote: > Hi Reinette, > > On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre > <reinette.chatre@intel.com> wrote: >> >> Hi Babu, >> >> On 2/14/25 10:31 AM, Moger, Babu wrote: >>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: >>>> On 2/13/25 9:37 AM, Dave Martin wrote: >>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: >>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: >>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: >> >> (quoting relevant parts with goal to focus discussion on new possible syntax) >> >>>>>> I see the support for MPAM events distinct from the support of assignable counters. >>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. >>>>>> Please help me understand if you see it differently. >>>>>> >>>>>> Doing so would need to come up with alphabetical letters for these events, >>>>>> which seems to be needed for your proposal also? If we use possible flags of: >>>>>> >>>>>> mbm_local_read_bytes a >>>>>> mbm_local_write_bytes b >>>>>> >>>>>> Then mbm_assign_control can be used as: >>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control >>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes >>>>>> <value> >>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes >>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> >>>>>> >>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), >>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined >>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? >> >> As mentioned above, one possible issue with existing interface is that >> it is limited to 26 events (assuming only lower case letters are used). The limit >> is low enough to be of concern. > > The events which can be monitored by a single counter on ABMC and MPAM > so far are combinable, so 26 counters per group today means it limits > breaking down MBM traffic for each group 26 ways. If a user complained > that a 26-way breakdown of a group's MBM traffic was limiting their > investigation, I would question whether they know what they're looking > for. Based on the discussion so far, it felt like it is not a group level breakdown. It is kind of global level breakdown. I could be wrong here. My understanding so far, MPAM has a number of global counters. It can be assigned to any domain in the system and monitor events. They also have a way to configure the events (read, write or both). Both these feature are inline with current resctrl implementation and can be easily adapted. One thing I am not clear why MPAM implementation plans to create separate files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the events. We already have files in each group to read the events. # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/ total 0 -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes -- Thanks Babu Moger
> Based on the discussion so far, it felt like it is not a group level > breakdown. It is kind of global level breakdown. I could be wrong here. > > My understanding so far, MPAM has a number of global counters. It can be > assigned to any domain in the system and monitor events. > > They also have a way to configure the events (read, write or both). > > Both these feature are inline with current resctrl implementation and can > be easily adapted. > > One thing I am not clear why MPAM implementation plans to create separate > files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the > events. We already have files in each group to read the events. > > # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/ > total 0 > -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy > -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes > -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes It would be nice if the filenames here reflected the reconfigured events. From what I can tell on AMD with BMEC it is possible to change the underlying events so that local b/w is reported in the mbm_total_bytes file, and vice versa. Or an event like: 6 Dirty Victims from the QOS domain to all types of memory is counted. Though maybe we'd need to create a lot of filenames for the 2**6 combinations of bits. -Tony
Hi Tony, On 2/18/25 8:51 AM, Luck, Tony wrote: >> Based on the discussion so far, it felt like it is not a group level >> breakdown. It is kind of global level breakdown. I could be wrong here. >> >> My understanding so far, MPAM has a number of global counters. It can be >> assigned to any domain in the system and monitor events. >> >> They also have a way to configure the events (read, write or both). >> >> Both these feature are inline with current resctrl implementation and can >> be easily adapted. >> >> One thing I am not clear why MPAM implementation plans to create separate >> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the >> events. We already have files in each group to read the events. >> >> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/ >> total 0 >> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes > > It would be nice if the filenames here reflected the reconfigured > events. From what I can tell on AMD with BMEC it is possible to change the > underlying events so that local b/w is reported in the mbm_total_bytes > file, and vice versa. Or an event like: > > 6 Dirty Victims from the QOS domain to all types of memory > > is counted. > > Though maybe we'd need to create a lot of filenames for the 2**6 > combinations of bits. Instead of accommodating all possible names resctrl could support "generic" names as hinted in Dave Martin's proposal. The complication with BMEC is that these are the underlying mbm_local_bytes and mbm_total_bytes events on which configuration was built. Specifically, by default and at hardware reset mbm_local_bytes counts exactly that. The event is fixed if BMEC is not supported and configurable if it is. Reinette [1] https://lore.kernel.org/lkml/Z6zeXby8ajh0ax6i@e133380.arm.com/
> >> Based on the discussion so far, it felt like it is not a group level > >> breakdown. It is kind of global level breakdown. I could be wrong here. > >> > >> My understanding so far, MPAM has a number of global counters. It can be > >> assigned to any domain in the system and monitor events. > >> > >> They also have a way to configure the events (read, write or both). > >> > >> Both these feature are inline with current resctrl implementation and can > >> be easily adapted. > >> > >> One thing I am not clear why MPAM implementation plans to create separate > >> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the > >> events. We already have files in each group to read the events. > >> > >> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/ > >> total 0 > >> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy > >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes > >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes > > > > It would be nice if the filenames here reflected the reconfigured > > events. From what I can tell on AMD with BMEC it is possible to change the > > underlying events so that local b/w is reported in the mbm_total_bytes > > file, and vice versa. Or an event like: > > > > 6 Dirty Victims from the QOS domain to all types of memory > > > > is counted. > > > > Though maybe we'd need to create a lot of filenames for the 2**6 > > combinations of bits. > > Instead of accommodating all possible names resctrl could support > "generic" names as hinted in Dave Martin's proposal. > > The complication with BMEC is that these are the underlying > mbm_local_bytes and mbm_total_bytes events on which configuration > was built. Specifically, by default and at hardware reset mbm_local_bytes > counts exactly that. The event is fixed if BMEC is not supported and > configurable if it is. Would if be possible to rename the files if the config changed? I.e. initially they are named mbm_local_bytes and mbm_total_bytes. But when the user changes the config for mbm_total_bytes using the BMEC config file, that file is renamed everywhere to "user_config1" -Tony
Hi Tony, On 2/18/25 11:08 AM, Luck, Tony wrote: >>>> Based on the discussion so far, it felt like it is not a group level >>>> breakdown. It is kind of global level breakdown. I could be wrong here. >>>> >>>> My understanding so far, MPAM has a number of global counters. It can be >>>> assigned to any domain in the system and monitor events. >>>> >>>> They also have a way to configure the events (read, write or both). >>>> >>>> Both these feature are inline with current resctrl implementation and can >>>> be easily adapted. >>>> >>>> One thing I am not clear why MPAM implementation plans to create separate >>>> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the >>>> events. We already have files in each group to read the events. >>>> >>>> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/ >>>> total 0 >>>> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy >>>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes >>>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes >>> >>> It would be nice if the filenames here reflected the reconfigured >>> events. From what I can tell on AMD with BMEC it is possible to change the >>> underlying events so that local b/w is reported in the mbm_total_bytes >>> file, and vice versa. Or an event like: >>> >>> 6 Dirty Victims from the QOS domain to all types of memory >>> >>> is counted. >>> >>> Though maybe we'd need to create a lot of filenames for the 2**6 >>> combinations of bits. >> >> Instead of accommodating all possible names resctrl could support >> "generic" names as hinted in Dave Martin's proposal. >> >> The complication with BMEC is that these are the underlying >> mbm_local_bytes and mbm_total_bytes events on which configuration >> was built. Specifically, by default and at hardware reset mbm_local_bytes >> counts exactly that. The event is fixed if BMEC is not supported and >> configurable if it is. > > Would if be possible to rename the files if the config changed? > > I.e. initially they are named mbm_local_bytes and mbm_total_bytes. > > But when the user changes the config for mbm_total_bytes using the > BMEC config file, that file is renamed everywhere to "user_config1" > The motivation for doing this to an existing interface is not clear. On its own I think it will add confusion. It sounds to me as though there is some future (similar to BMEC) feature that needs to be supported for which such a change would make things compatible. For this I think it would be easier to discuss that future feature and ensure everybody is clear on what interface would work for that new feature before making changes to existing feature to be compatible with it. Reinette
On Mon, Feb 17, 2025 at 10:45:29AM -0600, Moger, Babu wrote: > Hi All, > > On 2/17/25 04:26, Peter Newman wrote: > > Hi Reinette, > > > > On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre > > <reinette.chatre@intel.com> wrote: > >> > >> Hi Babu, > >> > >> On 2/14/25 10:31 AM, Moger, Babu wrote: > >>> On 2/14/2025 12:26 AM, Reinette Chatre wrote: > >>>> On 2/13/25 9:37 AM, Dave Martin wrote: > >>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote: > >>>>>> On 2/12/25 9:46 AM, Dave Martin wrote: > >>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: > >> > >> (quoting relevant parts with goal to focus discussion on new possible syntax) > >> > >>>>>> I see the support for MPAM events distinct from the support of assignable counters. > >>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface. > >>>>>> Please help me understand if you see it differently. > >>>>>> > >>>>>> Doing so would need to come up with alphabetical letters for these events, > >>>>>> which seems to be needed for your proposal also? If we use possible flags of: > >>>>>> > >>>>>> mbm_local_read_bytes a > >>>>>> mbm_local_write_bytes b > >>>>>> > >>>>>> Then mbm_assign_control can be used as: > >>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control > >>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes > >>>>>> <value> > >>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes > >>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes> > >>>>>> > >>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available), > >>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined > >>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit? > >> > >> As mentioned above, one possible issue with existing interface is that > >> it is limited to 26 events (assuming only lower case letters are used). The limit > >> is low enough to be of concern. > > > > The events which can be monitored by a single counter on ABMC and MPAM > > so far are combinable, so 26 counters per group today means it limits > > breaking down MBM traffic for each group 26 ways. If a user complained > > that a 26-way breakdown of a group's MBM traffic was limiting their > > investigation, I would question whether they know what they're looking > > for. > > Based on the discussion so far, it felt like it is not a group level > breakdown. It is kind of global level breakdown. I could be wrong here. > > My understanding so far, MPAM has a number of global counters. It can be > assigned to any domain in the system and monitor events. > > They also have a way to configure the events (read, write or both). > > Both these feature are inline with current resctrl implementation and can > be easily adapted. > > One thing I am not clear why MPAM implementation plans to create separate > files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the > events. We already have files in each group to read the events. > > # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/ > total 0 > -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy > -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes > -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes To be clear, we have no current plan to do this from the Arm side. 
My sketch was just a thought experiment to test whether we would have
difficulties _if_ a decision were made to extend the interface in that
direction.

But it looks OK to me: the interface proposed in this series seems to
leave enough possibilities for extension open that we could do
something like what I described later in if we decide to.


Overall, the interface proposed in this series seems a reasonable way
to support ABMC systems while keeping the consumer-side interface
(i.e., reading the mbm_total_bytes files etc.) as similar to the
classic / Intel RDT situation as possible.

MPAM can fit in with this approach, as demonstrated by James' past
branches porting the MPAM driver on top of previous versions of the
ABMC series.

As I understand it, he's almost done with porting onto this v11,
with no significant issues.

Cheers
---Dave
Hi All,
On 2/18/25 06:30, Dave Martin wrote:
> On Mon, Feb 17, 2025 at 10:45:29AM -0600, Moger, Babu wrote:
>> Hi All,
>>
>> On 2/17/25 04:26, Peter Newman wrote:
>>> Hi Reinette,
>>>
>>> On Fri, Feb 14, 2025 at 8:18 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>>
>>>> Hi Babu,
>>>>
>>>> On 2/14/25 10:31 AM, Moger, Babu wrote:
>>>>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>>>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>>>>
>>>> (quoting relevant parts with goal to focus discussion on new possible syntax)
>>>>
>>>>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>>>>> Please help me understand if you see it differently.
>>>>>>>>
>>>>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>>>>
>>>>>>>> mbm_local_read_bytes a
>>>>>>>> mbm_local_write_bytes b
>>>>>>>>
>>>>>>>> Then mbm_assign_control can be used as:
>>>>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>>>>> <value>
>>>>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>>>>
>>>>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>>>>
>>>> As mentioned above, one possible issue with existing interface is that
>>>> it is limited to 26 events (assuming only lower case letters are used). The limit
>>>> is low enough to be of concern.
>>>
>>> The events which can be monitored by a single counter on ABMC and MPAM
>>> so far are combinable, so 26 counters per group today means it limits
>>> breaking down MBM traffic for each group 26 ways. If a user complained
>>> that a 26-way breakdown of a group's MBM traffic was limiting their
>>> investigation, I would question whether they know what they're looking
>>> for.
>>
>> Based on the discussion so far, it felt like it is not a group level
>> breakdown. It is kind of global level breakdown. I could be wrong here.
>>
>> My understanding so far, MPAM has a number of global counters. It can be
>> assigned to any domain in the system and monitor events.
>>
>> They also have a way to configure the events (read, write or both).
>>
>> Both these feature are inline with current resctrl implementation and can
>> be easily adapted.
>>
>> One thing I am not clear why MPAM implementation plans to create separate
>> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
>> events. We already have files in each group to read the events.
>>
>> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
>> total 0
>> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
>> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes
>
>
> To be clear, we have no current plan to do this from the Arm side.
>
> My sketch was just a thought experiment to test whether we would have
> difficulties _if_ a decision were made to extend the interface in that
> direction.
>
> But it looks OK to me: the interface proposed in this series seems to
> leave enough possibilities for extension open that we could do
> something like what I described later in if we decide to.
>
>
> Overall, the interface proposed in this series seems a reasonable way
> to support ABMC systems while keeping the consumer-side interface
> (i.e., reading the mbm_total_bytes files etc.) as similar to the
> classic / Intel RDT situation as possible.
>
> MPAM can fit in with this approach, as demonstrated by James' past
> branches porting the MPAM driver on top of previous versions of the
> ABMC series.
Thanks Dave.
>
> As I understand it, he's almost done with porting onto this v11,
> with no significant issues.
>
Good to know. Thanks
I am working on v12 of ABMC with few changes from Reinette's earlier
review comments.
Most of the changes are related to commit message update and user
documentation update.
Introduced couple of new functions resctrl_reset_rmid_all() and
mbm_cntr_free_all() to organize the code better based on the comment.
https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/
On top of that I have a few comments from Dave.
1. Change "mbm_cntr_assign" to "mbm_counter_assign".
This will require me to search and replace lot of places. There are
variables, names like num_mbm_cntrs, mbm_cntr_assignable,
resctrl_arch_mbm_cntr_assign_enabled, resctrl_arch_mbm_cntr_assign_set,
mbm_cntr_assign_enabled, resctrl_num_mbm_cntrs_show, mbm_cntr_cfg and list
goes on.
This is mostly cosmetic and not much value add. Will drop this change if
Dave has no objections.
2. Change /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs to display per-domain
supported counters instead of a single value.
3. Use the actual events instead of flags based on the below comment.
https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
Something like this.
# echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
>/sys/fs/resctrl/info/L3_MON/mbm_assign_control
Are we ready to go with this approach? I am still not clear on this.
Reinette, What do you think?
--
Thanks
Babu Moger
Hi there,
On Tue, Feb 18, 2025 at 09:39:43AM -0600, Moger, Babu wrote:
> Hi All,
>
> On 2/18/25 06:30, Dave Martin wrote:
> > On Mon, Feb 17, 2025 at 10:45:29AM -0600, Moger, Babu wrote:
> >> Hi All,
[...]
> >> One thing I am not clear why MPAM implementation plans to create separate
> >> files(dynamically) in /sys/fs/resctrl/info/L3_MON/ directory to read the
> >> events. We already have files in each group to read the events.
> >>
> >> # ls -l /sys/fs/resctrl/mon_data/mon_L3_00/
> >> total 0
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 llc_occupancy
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_local_bytes
> >> -r--r--r--. 1 root root 0 Feb 17 08:16 mbm_total_bytes
> >
> >
> > To be clear, we have no current plan to do this from the Arm side.
> >
> > My sketch was just a thought experiment to test whether we would have
> > difficulties _if_ a decision were made to extend the interface in that
> > direction.
> >
> > But it looks OK to me: the interface proposed in this series seems to
> > leave enough possibilities for extension open that we could do
> > something like what I described later in if we decide to.
> >
> >
> > Overall, the interface proposed in this series seems a reasonable way
> > to support ABMC systems while keeping the consumer-side interface
> > (i.e., reading the mbm_total_bytes files etc.) as similar to the
> > classic / Intel RDT situation as possible.
> >
> > MPAM can fit in with this approach, as demonstrated by James' past
> > branches porting the MPAM driver on top of previous versions of the
> > ABMC series.
>
> Thanks Dave.
> >
> > As I understand it, he's almost done with porting onto this v11,
> > with no significant issues.
> >
> Good to know. Thanks
>
> I am working on v12 of ABMC with few changes from Reinette's earlier
> review comments.
>
> Most of the changes are related to commit message update and user
> documentation update.
>
> Introduced couple of new functions resctrl_reset_rmid_all() and
> mbm_cntr_free_all() to organize the code better based on the comment.
> https://lore.kernel.org/lkml/b60b4f72-6245-46db-a126-428fb13b6310@intel.com/
>
>
> On top of that I have a few comments from Dave.
>
> 1. Change "mbm_cntr_assign" to "mbm_counter_assign".
>
> This will require me to search and replace lot of places. There are
> variables, names like num_mbm_cntrs, mbm_cntr_assignable,
> resctrl_arch_mbm_cntr_assign_enabled, resctrl_arch_mbm_cntr_assign_set,
> mbm_cntr_assign_enabled, resctrl_num_mbm_cntrs_show, mbm_cntr_cfg and list
> goes on.
>
> This is mostly cosmetic and not much value add. Will drop this change if
> Dave has no objections.
There is no need to change the names of kernel symbols -- this was just
about the interface presented to userspace.
So, if you rename only the affected file names in resctrlfs (I think
there weren't any others) then I'm happy with that.
But if you prefer to avoid this inconsistency, the file name can stay
as-is. It's not a huge deal.
> 2. Change /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs to display per-domain
> supported counters instead of a single value.
Ack; thanks (we could always add it back in later without an ABI break,
if people feel strongly about it and it looks feasible).
> 3. Use the actual events instead of flags based on the below comment.
>
> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
>
> Something like this.
> # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
> >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> Are we ready to go with this approach? I am still not clear on this.
[...]
> --
> Thanks
> Babu Moger
On this point, I'll defer to discussions elsewhere on the thread.
I have a few other minor comments pending to post, but it looks like
there may be a more serious issue with how the mbm_assign_control file
is handled in the kernel -- I'll try to post comments on that today.
Cheers
---Dave
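
If num_mbm_cntrs does become a per-domain listing as discussed in item 2
above, one possible shape (purely illustrative and not a settled format,
just by analogy with the "0=30;1=30" available_mbm_cntrs example quoted
elsewhere in this thread) would be:

# cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs
0=32;1=32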
Hi Babu,
On 2/18/25 7:39 AM, Moger, Babu wrote:
> 3. Use the actual events instead of flags based on the below comment.
>
> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
>
> Something like this.
> # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> Are we ready to go with this approach? I am still not clear on this.
>
> Reinette, What do you think?
I was actually expecting some push back or at least discussion on this interface
because the braces seem difficult to parse when compared to, for example, using
commas to separate the events of a domain. Peter [1] has some reservations about
going this direction and since he would end up using this interface significantly
I would prefer to resolve that first.
Reinette
[1] https://lore.kernel.org/lkml/CALPaoCh7WpohzpXhSAbumjSZBv1_+1bXON7_V1pwG4bdEBr52Q@mail.gmail.com/
Hi Reinette,
On 2/18/25 12:14, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/18/25 7:39 AM, Moger, Babu wrote:
>
>> 3. Use the actual events instead of flags based on the below comment.
>>
>> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
>>
>> Something like this.
>> # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>
>> Are we ready to go with this approach? I am still not clear on this.
>>
>> Reinette, What do you think?
>
> I was actually expecting some push back or at least discussion on this interface
> because the braces seem difficult to parse when compared to, for example, using
I am yet to work on it. Will work on it after confirmation.
Here is the output from a system with 12 domains. I created one "test" group.
Output is definitely harder to parse for human eyes.
#cat info/L3_MON/mbm_assign_control
test//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
It is harder to parse in code also. We should consider only if there is a
value-add with this format.
Otherwise I prefer our current flag format.
# cat info/L3_MON/mbm_assign_control
test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
> commas to separate the events of a domain. Peter [1] has some reservations about
Yes. I would like to hear from Peter.
> going this direction and since he would end up using this interface significantly
> I would prefer to resolve that first.
>
> Reinette
>
>
> [1] https://lore.kernel.org/lkml/CALPaoCh7WpohzpXhSAbumjSZBv1_+1bXON7_V1pwG4bdEBr52Q@mail.gmail.com/
>
>
--
Thanks
Babu Moger
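
As a side note on the readability concern above: even the compact
single-letter listing is straightforward to post-process in userspace.
Below is a minimal, hypothetical sh sketch (the helper itself is not part
of this series; it only assumes the "<CTRL_MON group>/<MON group>/<domain>=<flags>"
syntax shown above) that expands each line into one row per domain:

#!/bin/sh
# Hypothetical helper: expand the compact mbm_assign_control listing into
# one "group-path domain flags" row per line.
while IFS= read -r line; do
        grp=${line%/*}          # "<CTRL_MON group>/<MON group>" part
        assigns=${line##*/}     # "0=tl;1=tl;..." part
        echo "$assigns" | tr ';' '\n' |
        while IFS='=' read -r dom flags; do
                printf '%-24s %4s %s\n' "$grp" "$dom" "$flags"
        done
done < /sys/fs/resctrl/info/L3_MON/mbm_assign_control

For the 12-domain example above this prints one "test/ <domain> tl" row
per domain, which is easier to scan than the single long line.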
Hi Babu,
On 2/18/25 11:32 AM, Moger, Babu wrote:
> Hi Reinette,
>
> On 2/18/25 12:14, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 2/18/25 7:39 AM, Moger, Babu wrote:
>>
>>> 3. Use the actual events instead of flags based on the below comment.
>>>
>>> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
>>>
>>> Something like this.
>>> # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
>>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>
>>> Are we ready to go with this approach? I am still not clear on this.
>>>
>>> Reinette, What do you think?
>>
>> I was actually expecting some push back or at least discussion on this interface
>> because the braces seem difficult to parse when compared to, for example, using
>
> I am yet to work on it. Will work on it after confirmation.
>
> Here is the output from a system with 12 domains. I created one "test" group.
>
> Output is definitely harder to parse for human eyes.
>
> #cat info/L3_MON/mbm_assign_control
> test//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
> //0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
>
> It is harder to parse in code also. We should consider only if there is a
> value-add with this format.
Please see my comments in [2] for some motivations.
>
> Otherwise I prefer our current flag format.
>
> # cat info/L3_MON/mbm_assign_control
> test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
> //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
We could possibly consider some middle ground where flags are separated by
commas and when the number of used flags reaches 26 the interface can use
"two letter flags" or "longer names" or "the actual event name" or ....
>
>
>> commas to separate the events of a domain. Peter [1] has some reservations about
>
> Yes. I would like to hear from Peter.
>
Reinette
[2] https://lore.kernel.org/lkml/ccd9c5d7-0266-4054-879e-e084b6972ad5@intel.com/
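
To make that comparison concrete, the middle ground described above might
look something like this for a two-domain system (purely illustrative;
neither the comma delimiter nor the longer spellings are settled):

# cat info/L3_MON/mbm_assign_control
test//0=t,l;1=l
//0=t,l;1=t,l

and, once single letters run out, the same layout could carry longer names:

test//0=mbm_total_bytes,mbm_local_bytes;1=mbm_local_bytes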
Hi Reinette,
On Tue, Feb 18, 2025 at 01:29:09PM -0800, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/18/25 11:32 AM, Moger, Babu wrote:
> > Hi Reinette,
> >
> > On 2/18/25 12:14, Reinette Chatre wrote:
> >> Hi Babu,
> >>
> >> On 2/18/25 7:39 AM, Moger, Babu wrote:
> >>
> >>> 3. Use the actual events instead of flags based on the below comment.
> >>>
> >>> https://lore.kernel.org/lkml/a07fca4c-c8fa-41a6-b126-59815b9a58f9@intel.com/
> >>>
> >>> Something like this.
> >>> # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}'
> >>>> /sys/fs/resctrl/info/L3_MON/mbm_assign_control
> >>>
> >>> Are we ready to go with this approach? I am still not clear on this.
> >>>
> >>> Reinette, What do you think?
> >>
> >> I was actually expecting some push back or at least discussion on this interface
> >> because the braces seem difficult to parse when compared to, for example, using
> >
> > I am yet to work on it. Will work on it after confirmation.
> >
> > Here is the output from a system with 12 domains. I created one "test" group.
> >
> > Output is definitely harder to parse for human eyes.
> >
> > #cat info/L3_MON/mbm_assign_control
> > test//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
> > //0={mbm_total_bytes}{mbm_local_bytes};1={mbm_total_bytes}{mbm_local_bytes};2={mbm_total_bytes}{mbm_local_bytes};3={mbm_total_bytes}{mbm_local_bytes};4={mbm_total_bytes}{mbm_local_bytes};5={mbm_total_bytes}{mbm_local_bytes};6={mbm_total_bytes}{mbm_local_bytes};7={mbm_total_bytes}{mbm_local_bytes};8={mbm_total_bytes}{mbm_local_bytes};9={mbm_total_bytes}{mbm_local_bytes};10={mbm_total_bytes}{mbm_local_bytes};11={mbm_total_bytes}{mbm_local_bytes}
> >
> > It is harder to parse in code also. We should consider only if there is a
> > value-add with this format.
>
> Please see my comments in [2] for some motivations.
>
> >
> > Otherwise I prefer our current flag format.
> >
> > # cat info/L3_MON/mbm_assign_control
> > test//0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
> > //0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl
>
> We could possibly consider some middle ground where flags are separated by
> commas and when the number of used flags reaches 26 the interface can use
> "two letter flags" or "longer names" or "the actual event name" or ....
>
> >
> >
> >> commas to separate the events of a domain. Peter [1] has some reservations about
> >
> > Yes. I would like to hear from Peter.
> >
>
> Reinette
Ack; see also my reply to Peter on the other subthread.
I think the single-letter names provide a much less cumbersome
interface.
From the Arm side, I'd be happy to see just "t" and "l" for now, with
their current fixed mappings to event names, provided that we are
confident that we can add flexibility later without breaking the ABI.
In case this has got lost in the noise, I still think that the v11
proposal for the ABMC interface looks fine as a first step -- I just
wanted to kick the tires re extensibility.
Cheers
---Dave
Hi Reinette,
On 2/14/2025 1:18 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 2/14/25 10:31 AM, Moger, Babu wrote:
>> On 2/14/2025 12:26 AM, Reinette Chatre wrote:
>>> On 2/13/25 9:37 AM, Dave Martin wrote:
>>>> On Wed, Feb 12, 2025 at 03:33:31PM -0800, Reinette Chatre wrote:
>>>>> On 2/12/25 9:46 AM, Dave Martin wrote:
>>>>>> On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote:
>
> (quoting relevant parts with goal to focus discussion on new possible syntax)
>
>>>>> I see the support for MPAM events distinct from the support of assignable counters.
>>>>> Once the MPAM events are sorted, I think that they can be assigned with existing interface.
>>>>> Please help me understand if you see it differently.
>>>>>
>>>>> Doing so would need to come up with alphabetical letters for these events,
>>>>> which seems to be needed for your proposal also? If we use possible flags of:
>>>>>
>>>>> mbm_local_read_bytes a
>>>>> mbm_local_write_bytes b
>>>>>
>>>>> Then mbm_assign_control can be used as:
>>>>> # echo '//0=ab;1=b' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_read_bytes
>>>>> <value>
>>>>> # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>>>> <sum of mbm_local_read_bytes and mbm_local_write_bytes>
>>>>>
>>>>> One issue would be when resctrl needs to support more than 26 events (no more flags available),
>>>>> assuming that upper case would be used for "shared" counters (unless this interface is defined
>>>>> differently and only few uppercase letters used for it). Would this be too low of a limit?
>
> As mentioned above, one possible issue with existing interface is that
> it is limited to 26 events (assuming only lower case letters are used). The limit
> is low enough to be of concern.
Yes. Agree.
>
> ....
>
>>>>
>>>> Alternatively, if we want to be able to expand beyond single letters,
>>>> could we reserve one or more characters for extension purposes?
>>>>
>>>> If braces are forbidden by the syntax today, could we add support for
>>>> something like the following later on, without breaking anything?
>>>>
>>>> # echo '//0={foo}{bar};1={bar}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>>>>
>>>
>
> Dave proposed a change in syntax that can (a) support unlimited events,
> (b) be more intuitive than the one letter flags that may be hard to match
> to the events they correspond to.
Yea. Sounds good.
>
>>> Thank you for the suggestion. I think we may need something like this.
>>> Babu, what do you think?
>>
>> I'm not quite clear on this. Do we know what 'foo' and 'bar' refer to?
>> It is a random text?
>
> Not random text. It refers to the events.
>
> I do not know if braces is what will be settled on but a slight change in
> example to make it match your series can be:
>
> # echo '//0={mbm_total_bytes}{mbm_local_bytes};1={mbm_local_bytes}' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control
>
> With syntax like above there is no concern that we will run out of
> flags and the events assigned are clear without needing to parse separate flags.
Yes. We need to change our current "flag parsing". It should not be a
problem.
> For a system with a lot of events and domains this will become quite a lot
> to parse though.
>
>>
>> In his example from
>> https://lore.kernel.org/lkml/Z643WdXYARTADSBy@e133380.arm.com/
>> --------------------------------------------------------------
>> The numbers are not supposed to have an hardware significance.
>>
>> '//0=6'
>>
>> just "means assign some unused counter for domain 0, and create files
>> in resctrl so I can configure and read it".
>
> Thanks for pointing this out. I missed that the idea was that the
> configuration files are dynamically created.
>
>>
>> The "6" is really just a tag for labelling the resulting resctrl
>> file names so that the user can tell them apart. It's not supposed
>> to imply any specific hardware counter or event.
>
> Right.
>
>> ------------------------------------------------------------------
>>
>> It seems that 'foo' and 'bar' are tags used to create files in /sys/fs/resctrl/info/L3_MON/.
>>
>> Given that, it looks like we're discussing entirely different things.
>
> I am still trying to understand how MPAM counters can be supported.
>
> Reinette
Thanks
Babu
Hi Dave, Thanks for your help. Reinette has asked few questions already. I have few more questions on top of that. On 2/12/25 11:46, Dave Martin wrote: > Hi there, > > On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: >> >> This series adds the support for Assignable Bandwidth Monitoring Counters >> (ABMC). It is also called QoS RMID Pinning feature >> >> Series is written such that it is easier to support other assignable >> features supported from different vendors. >> >> The feature details are documented in the APM listed below [1]. >> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming >> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth >> Monitoring (ABMC). The documentation is available at >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 >> >> The patches are based on top of commit >> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx' >> >> # Introduction > > [...] > >> # Examples >> >> a. Check if ABMC support is available >> #mount -t resctrl resctrl /sys/fs/resctrl/ >> >> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode >> [mbm_cntr_assign] >> default > > (Nit: can this be called "mbm_counter_assign"? The name is already > long, so I wonder whether anything is gained by using a cryptic > abbreviation for "counter". Same with all the "cntrs" elsewhere. > This is purely cosmetic, though -- the interface works either way.) Yes. We can do that. > >> ABMC feature is detected and it is enabled. >> >> b. Check how many ABMC counters are available. >> >> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs >> 32 > > Is this file needed? > > With MPAM, it is more difficult to promise that the same number of > counters will be available everywhere. > > Rather than lie, or report a "safe" value here that may waste some > counters, can we just allow the number of counters to be be discovered > per domain via available_mbm_cntrs? As Reinette suggested below we can display per domain supported counters here. https://lore.kernel.org/lkml/9e849476-7c4b-478b-bd2a-185024def3a3@intel.com/ > > num_closids and num_rmids are already problematic for MPAM, so it would > be good to avoid any more parameters of this sort from being reported > to userspace unless there is a clear understanding of why they are > needed. > > Reporting number of counters per monitoring domain is a more natural > fit for MPAM, as below: > >> c. Check how many ABMC counters are available in each domain. >> >> # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs >> 0=30;1=30 > > For MPAM, this seems supportable. Each monitoring domain will have > some counters, and a well-defined number of them will be available for > allocation at any one time. > >> d. Create few resctrl groups. >> >> # mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp >> # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp >> # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp >> >> e. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_assign_control >> to list and modify any group's monitoring states. File provides single place >> to list monitoring states of all the resctrl groups. It makes it easier for >> user space to learn about the used counters without needing to traverse all >> the groups thus reducing the number of file system calls. 
>> >> The list follows the following format: >> >> "<CTRL_MON group>/<MON group>/<domain_id>=<flags>" >> >> Format for specific type of groups: >> >> * Default CTRL_MON group: >> "//<domain_id>=<flags>" > > [...] > >> Flags can be one of the following: >> >> t MBM total event is enabled. >> l MBM local event is enabled. >> tl Both total and local MBM events are enabled. >> _ None of the MBM events are enabled >> >> Examples: > > [...] > > I think that this basically works for MPAM. > > The local/total distinction doesn't map in a consistent way onto MPAM, > but this problem is not specific to ABMC. It feels sensible for ABMC > to be built around the same concepts that resctrl already has elsewhere > in the interface. MPAM will do its best to fit (as already). > > Regarding Peter's use case of assiging multiple counters to a > monitoring group [1], I feel that it's probably good enough to make > sure that the ABMC interface can be extended in future in a backwards > compatible way so as to support this, without trying to support it > immediately. > > [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/ > > > For example, if we added new generic "letters" -- say, "0" to "9", > combined with new counter files in resctrlfs, that feels like a > possible approach. ABMC (as in this series) should just reject such > such assignments, and the new counter files wouldn't exist. What is "combined with new counter files"? Does MPAM going to add new files to support counter assignment in ARM? Also what is "0" to "9"? Is this counter ids? > > Availability of this feature could also be reported as a distinct mode > in mbm_assign_mode, say "mbm_cntr_generic", or whatever. Yes. That should be fine. > > > A _sketch_ of this follows. This is NOT a proposal -- the key > question is whether we are confident that we can extend the interface > in this way in the future without breaking anything. > > If "yes", then the ABMC interface (as proposed by this series) works as > a foundation to build on. > > --8<-- > > [artists's impression] > > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode > mbm_cntr_generic > [mbm_cntr_assign] > default Yes. This looks good. > # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode > # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control Looks like you are assigning counter ids to domains here. That is different than ABMC. In ABMC, we assign events (local or total) to the domain. We internally handle the counter ids based on the availability. Can MPAM follow the same concept? It is possible? > # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter0_bytes_type > # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter1_bytes_type > # echo t >/sys/fs/resctrl/info/L3_MON/mbm_counter2_bytes_type > # echo l >/sys/fs/resctrl/info/L3_MON/mbm_counter3_bytes_type This also looks different that we are have right now in resctrl fs. Are you creating separate file for each counter id in /sys/fs/resctrl/info/L3_MON/? > > ... > > # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_counter1_bytes > > etc. > > -->8-- > > Any thoughts on this, Peter? > > [...] > > Cheers > ---Dave > -- Thanks Babu Moger
Hi, On Thu, Feb 13, 2025 at 10:19:29AM -0600, Moger, Babu wrote: > Hi Dave, > > Thanks for your help. Reinette has asked few questions already. I have few > more questions on top of that. > > On 2/12/25 11:46, Dave Martin wrote: > > Hi there, > > > > On Wed, Jan 22, 2025 at 02:20:08PM -0600, Babu Moger wrote: > >> > >> This series adds the support for Assignable Bandwidth Monitoring Counters > >> (ABMC). It is also called QoS RMID Pinning feature [...] > >> a. Check if ABMC support is available > >> #mount -t resctrl resctrl /sys/fs/resctrl/ > >> > >> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode > >> [mbm_cntr_assign] > >> default > > > > (Nit: can this be called "mbm_counter_assign"? The name is already > > long, so I wonder whether anything is gained by using a cryptic > > abbreviation for "counter". Same with all the "cntrs" elsewhere. > > This is purely cosmetic, though -- the interface works either way.) > > Yes. We can do that. Thanks (note, I'm also happy without this change, if you aren't planning do a substantial respin of the series.) [...] > >> b. Check how many ABMC counters are available. > >> > >> # cat /sys/fs/resctrl/info/L3_MON/num_mbm_cntrs > >> 32 > > > > Is this file needed? > > > > With MPAM, it is more difficult to promise that the same number of > > counters will be available everywhere. > > > > Rather than lie, or report a "safe" value here that may waste some > > counters, can we just allow the number of counters to be be discovered > > per domain via available_mbm_cntrs? > > As Reinette suggested below we can display per domain supported counters > here. > https://lore.kernel.org/lkml/9e849476-7c4b-478b-bd2a-185024def3a3@intel.com/ Although I'm still not convinced that this file is necessary, MPAM should be able to work with this. (I'm assuming that ABMC hardware has a set of counters for each monitoring domain, of course -- otherwise this doesn't make sense.) [...] > >> c. Check how many ABMC counters are available in each domain. > >> > >> # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs > >> 0=30;1=30 > > > > For MPAM, this seems supportable. Each monitoring domain will have > > some counters, and a well-defined number of them will be available for > > allocation at any one time. [...] > >> Flags can be one of the following: > >> > >> t MBM total event is enabled. > >> l MBM local event is enabled. > >> tl Both total and local MBM events are enabled. > >> _ None of the MBM events are enabled > >> > >> Examples: > > > > [...] > > > > I think that this basically works for MPAM. > > > > The local/total distinction doesn't map in a consistent way onto MPAM, > > but this problem is not specific to ABMC. It feels sensible for ABMC > > to be built around the same concepts that resctrl already has elsewhere > > in the interface. MPAM will do its best to fit (as already). > > > > Regarding Peter's use case of assiging multiple counters to a > > monitoring group [1], I feel that it's probably good enough to make > > sure that the ABMC interface can be extended in future in a backwards > > compatible way so as to support this, without trying to support it > > immediately. > > > > [1] https://lore.kernel.org/lkml/CALPaoCjY-3f2tWvBjuaQPfoPhxveWxxCxHqQMn4BEaeBXBa0bA@mail.gmail.com/ > > > > > > For example, if we added new generic "letters" -- say, "0" to "9", > > combined with new counter files in resctrlfs, that feels like a > > possible approach. ABMC (as in this series) should just reject such > > such assignments, and the new counter files wouldn't exist. 
> > What is "combined with new counter files"? Does MPAM going to add new > files to support counter assignment in ARM? > > Also what is "0" to "9"? Is this counter ids? > > > > > > Availability of this feature could also be reported as a distinct mode > > in mbm_assign_mode, say "mbm_cntr_generic", or whatever. > > Yes. That should be fine. > > > > > > > A _sketch_ of this follows. This is NOT a proposal -- the key > > question is whether we are confident that we can extend the interface > > in this way in the future without breaking anything. > > > > If "yes", then the ABMC interface (as proposed by this series) works as > > a foundation to build on. > > > > --8<-- > > > > [artists's impression] > > > > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_mode > > mbm_cntr_generic > > [mbm_cntr_assign] > > default > > Yes. This looks good. Good to know, thanks. (Just to be clear, I am *not* suggesting adding anything like this just now -- just checking whether the idea works at all.) > > # echo mbm_cntr_generic >/sys/fs/resctrl/info/L3_MON/mbm_assign_mode > > # echo '//0=01;1=23' >/sys/fs/resctrl/info/L3_MON/mbm_assign_control > > Looks like you are assigning counter ids to domains here. That is > different than ABMC. In ABMC, we assign events (local or total) to the > domain. We internally handle the counter ids based on the availability. The numbers are not supposed to have an hardware significance. '//0=6' just "means assign some unused counter for domain 0, and create files in resctrl so I can configure and read it". The "6" is really just a tag for labelling the resulting resctrl file names so that the user can tell them apart. It's not supposed to imply any specific hardware counter or event. > Can MPAM follow the same concept? It is possible? [...] > Thanks > Babu Moger Yes, although there is some hard-to-avoid fuzz about the precise meaning of "local" and "total". As Reinette pointed out, there is the also the possibility of adding new named events other than "local" and "total" if we find that some kinds of event don't fit these categories. Cheers ---Dave
> Yes, although there is some hard-to-avoid fuzz about the precise
> meaning of "local" and "total".

Things are only getting fuzzier with mixed DDR and CXL memory.

> As Reinette pointed out, there is the also the possibility of adding
> new named events other than "local" and "total" if we find that some
> kinds of event don't fit these categories.

Not just new names, new scopes too. Patches coming later this year
that would present:

$ cd sys/fs/resctrl
$ cat mon_data/mon_PKG_00/llc_stalls
779762866739

I.e. a way to cheaply collect some "perf" like events across all CPUs
on a package that executed jobs with a specific RMID. Of course this
can be done with perf today, but the cost to collect this data from
heavily multi-threaded workloads that context switch rapidly is very
high.

-Tony
Hi Tony,

On 2/13/25 10:39 AM, Luck, Tony wrote:
>> Yes, although there is some hard-to-avoid fuzz about the precise
>> meaning of "local" and "total".
>
> Things are only getting fuzzier with mixed DDR and CXL memory.
>
>> As Reinette pointed out, there is the also the possibility of adding
>> new named events other than "local" and "total" if we find that some
>> kinds of event don't fit these categories.
>
> Not just new names, new scopes too. Patches coming later this year
> that would present:
>
> $ cd sys/fs/resctrl
> $ cat mon_data/mon_PKG_00/llc_stalls
> 779762866739

Thank you for catching this. To support this would not be possible for
the current plan for mbm_assign_control since it does not have a way
to distinguish domain X of the PKG resource from domain X of the L3 resource.
Sounds like we need to include the resource name in the mbm_assign_control
syntax?

Reinette
On 2/13/25 10:34 PM, Reinette Chatre wrote:
> Hi Tony,
>
> On 2/13/25 10:39 AM, Luck, Tony wrote:
>>> Yes, although there is some hard-to-avoid fuzz about the precise
>>> meaning of "local" and "total".
>>
>> Things are only getting fuzzier with mixed DDR and CXL memory.
>>
>>> As Reinette pointed out, there is the also the possibility of adding
>>> new named events other than "local" and "total" if we find that some
>>> kinds of event don't fit these categories.
>>
>> Not just new names, new scopes too. Patches coming later this year
>> that would present:
>>
>> $ cd sys/fs/resctrl
>> $ cat mon_data/mon_PKG_00/llc_stalls
>> 779762866739
>
> Thank you for catching this. To support this would not be possible for
> the current plan for mbm_assign_control since it does not have a way
> to distinguish domain X of the PKG resource from domain X of the L3 resource.
> Sounds like we need to include the resource name in the mbm_assign_control
> syntax?

ugh ... please ignore this message. This is not needed since
mbm_assign_control is already associated with the resource.

Reinette
Hi Babu, On Wed, Jan 22, 2025 at 9:20 PM Babu Moger <babu.moger@amd.com> wrote: > > > This series adds the support for Assignable Bandwidth Monitoring Counters > (ABMC). It is also called QoS RMID Pinning feature > > Series is written such that it is easier to support other assignable > features supported from different vendors. > > The feature details are documented in the APM listed below [1]. > [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming > Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth > Monitoring (ABMC). The documentation is available at > Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 > > The patches are based on top of commit > d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx' > > # Introduction > > Users can create as many monitor groups as RMIDs supported by the hardware. > However, bandwidth monitoring feature on AMD system only guarantees that > RMIDs currently assigned to a processor will be tracked by hardware. > The counters of any other RMIDs which are no longer being tracked will be > reset to zero. The MBM event counters return "Unavailable" for the RMIDs > that are not tracked by hardware. So, there can be only limited number of > groups that can give guaranteed monitoring numbers. With ever changing > configurations there is no way to definitely know which of these groups > are being tracked for certain point of time. Users do not have the option > to monitor a group or set of groups for certain period of time without > worrying about counter being reset in between. > > The ABMC feature provides an option to the user to assign a hardware > counter to an RMID, event pair and monitor the bandwidth as long as it is > assigned. The assigned RMID will be tracked by the hardware until the user > unassigns it manually. There is no need to worry about counters being reset > during this period. Additionally, the user can specify a bitmask identifying > the specific bandwidth types from the given source to track with the counter. > > Without ABMC enabled, monitoring will work in current 'default' mode without > assignment option. > > # Linux Implementation > > Create a generic interface aimed to support user space assignment > of scarce counters used for monitoring. First usage of interface > is by ABMC with option to expand usage to "soft-ABMC" and MPAM > counters in future. As a reminder of the work related to this, please take a look at the thread where Reinette proposed a "shared counters" mode in mbm_assign_control[1]. I am currently working to demonstrate that this combined with the mbm_*_bytes_per_second events discussed earlier in the same thread will address my users' concerns about the overhead of reading a large number of MBM counters, resulting from a maximal number of monitoring groups whose jobs are not isolated to any L3 monitoring domain. ABMC will add to the number of registers which need to be programmed in each domain, so I will need to demonstrate that ABMC combined with these additional features addresses their performance concerns and that the resulting interface is user-friendly enough that they will not need a detailed understanding of the implementation to avoid an unacceptable performance degradation (i.e., needing to understand what conditions will increase the number of IPIs required). If all goes well, soft-ABMC will try to extend this usage model to the existing, pre-ABMC, AMD platforms I support. 
Thanks, -Peter [1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/
Hi Peter, On 2/3/25 08:54, Peter Newman wrote: > Hi Babu, > > On Wed, Jan 22, 2025 at 9:20 PM Babu Moger <babu.moger@amd.com> wrote: >> >> >> This series adds the support for Assignable Bandwidth Monitoring Counters >> (ABMC). It is also called QoS RMID Pinning feature >> >> Series is written such that it is easier to support other assignable >> features supported from different vendors. >> >> The feature details are documented in the APM listed below [1]. >> [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming >> Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth >> Monitoring (ABMC). The documentation is available at >> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 >> >> The patches are based on top of commit >> d361b84d51bfe (tip/master) Merge branch into tip/master: 'x86/tdx' >> >> # Introduction >> >> Users can create as many monitor groups as RMIDs supported by the hardware. >> However, bandwidth monitoring feature on AMD system only guarantees that >> RMIDs currently assigned to a processor will be tracked by hardware. >> The counters of any other RMIDs which are no longer being tracked will be >> reset to zero. The MBM event counters return "Unavailable" for the RMIDs >> that are not tracked by hardware. So, there can be only limited number of >> groups that can give guaranteed monitoring numbers. With ever changing >> configurations there is no way to definitely know which of these groups >> are being tracked for certain point of time. Users do not have the option >> to monitor a group or set of groups for certain period of time without >> worrying about counter being reset in between. >> >> The ABMC feature provides an option to the user to assign a hardware >> counter to an RMID, event pair and monitor the bandwidth as long as it is >> assigned. The assigned RMID will be tracked by the hardware until the user >> unassigns it manually. There is no need to worry about counters being reset >> during this period. Additionally, the user can specify a bitmask identifying >> the specific bandwidth types from the given source to track with the counter. >> >> Without ABMC enabled, monitoring will work in current 'default' mode without >> assignment option. >> >> # Linux Implementation >> >> Create a generic interface aimed to support user space assignment >> of scarce counters used for monitoring. First usage of interface >> is by ABMC with option to expand usage to "soft-ABMC" and MPAM >> counters in future. > > As a reminder of the work related to this, please take a look at the > thread where Reinette proposed a "shared counters" mode in > mbm_assign_control[1]. I am currently working to demonstrate that this > combined with the mbm_*_bytes_per_second events discussed earlier in > the same thread will address my users' concerns about the overhead of > reading a large number of MBM counters, resulting from a maximal > number of monitoring groups whose jobs are not isolated to any L3 > monitoring domain. > > ABMC will add to the number of registers which need to be programmed > in each domain, so I will need to demonstrate that ABMC combined with > these additional features addresses their performance concerns and > that the resulting interface is user-friendly enough that they will > not need a detailed understanding of the implementation to avoid an > unacceptable performance degradation (i.e., needing to understand what > conditions will increase the number of IPIs required). 
> > If all goes well, soft-ABMC will try to extend this usage model to the > existing, pre-ABMC, AMD platforms I support. > > Thanks, > -Peter > > [1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/ > Thanks for the heads-up. I understand what's going on and have an idea of the plan. Please keep us updated on the progress. Also, if any changes are needed in this series to meet your requirements, feel free to share your feedback. -- Thanks Babu Moger
On Mon, Feb 03, 2025 at 02:49:27PM -0600, Moger, Babu wrote:
> Hi Peter,
>
> On 2/3/25 08:54, Peter Newman wrote:

[...]

> >> # Linux Implementation
> >>
> >> Create a generic interface aimed to support user space assignment
> >> of scarce counters used for monitoring. First usage of interface
> >> is by ABMC with option to expand usage to "soft-ABMC" and MPAM
> >> counters in future.
> >
> > As a reminder of the work related to this, please take a look at the
> > thread where Reinette proposed a "shared counters" mode in
> > mbm_assign_control[1]. I am currently working to demonstrate that this
> > combined with the mbm_*_bytes_per_second events discussed earlier in
> > the same thread will address my users' concerns about the overhead of
> > reading a large number of MBM counters, resulting from a maximal
> > number of monitoring groups whose jobs are not isolated to any L3
> > monitoring domain.
> >
> > ABMC will add to the number of registers which need to be programmed
> > in each domain, so I will need to demonstrate that ABMC combined with
> > these additional features addresses their performance concerns and
> > that the resulting interface is user-friendly enough that they will
> > not need a detailed understanding of the implementation to avoid an
> > unacceptable performance degradation (i.e., needing to understand what
> > conditions will increase the number of IPIs required).
> >
> > If all goes well, soft-ABMC will try to extend this usage model to the
> > existing, pre-ABMC, AMD platforms I support.
> >
> > Thanks,
> > -Peter
> >
> > [1] https://lore.kernel.org/lkml/7ee63634-3b55-4427-8283-8e3d38105f41@intel.com/
> >
>
> Thanks for the heads-up. I understand what's going on and have an idea of
> the plan. Please keep us updated on the progress. Also, if any changes are
> needed in this series to meet your requirements, feel free to share your
> feedback.

Playing devil's advocate, I wonder whether there is a point beyond
which it would be better to have an interface to hand over some of the
counters to perf?

The logic for round-robin scheduling of events onto counters, dealing
with overflows etc. has already been invented over there, and it's
fiddly to get right. Ideally resctrl wouldn't have its own special
implementation of that kind of stuff.

(Said by someone who has never tried to hack up an uncore event source
in perf.)

Cheers
---Dave
> Playing devil's advocate, I wonder whether there is a point beyond
> which it would be better to have an interface to hand over some of the
> counters to perf?
>
> The logic for round-robin scheduling of events onto counters, dealing
> with overflows etc. has already been invented over there, and it's
> fiddly to get right. Ideally resctrl wouldn't have its own special
> implementation of that kind of stuff.
>
> (Said by someone who has never tried to hack up an uncore event source
> in perf.)

Initial implementation on Intel RDT tried to use perf ... it all went
badly and was reverted. There are some very un-perf-like properties
that we couldn't find a workaround for at the time. E.g.

1) Cache occupancy counters. These change even when your workload
isn't running (downward due to evictions).

2) Counters based on RMIDs show the aggregated values from multiple
CPUs as tasks are scheduled on cores.

But maybe you meant "don't let resctrl use all those counters" ... hand
some of them to perf to use in some other way?

-Tony