[v5] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

[PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Babu Moger 1 year, 7 months ago

This series adds the support for Assignable Bandwidth Monitoring Counters
(ABMC). It is also called QoS RMID Pinning feature

The feature details are documented in the  APM listed below [1].
[1] AMD64 Architecture Programmer's Manual Volume 2: System Programming
Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth
Monitoring (ABMC). The documentation is available at
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537

The patches are based on top of commit
commit fbfd4bcefc65 ("Merge branch into tip/master: 'x86/vmware'").
whic includes Tony's SNC support.

# Introduction

Users can create as many monitor groups as RMIDs supported by the hardware.
However, bandwidth monitoring feature on AMD system only guarantees that
RMIDs currently assigned to a processor will be tracked by hardware.
The counters of any other RMIDs which are no longer being tracked will be
reset to zero. The MBM event counters return "Unavailable" for the RMIDs
that are not tracked by hardware. So, there can be only limited number of
groups that can give guaranteed monitoring numbers. With ever changing
configurations there is no way to definitely know which of these groups
are being tracked for certain point of time. Users do not have the option
to monitor a group or set of groups for certain period of time without
worrying about RMID being reset in between.
    
The ABMC feature provides an option to the user to assign a hardware
counter to an RMID and monitor the bandwidth as long as it is assigned.
The assigned RMID will be tracked by the hardware until the user unassigns
it manually. There is no need to worry about counters being reset during
this period. Additionally, the user can specify a bitmask identifying the
specific bandwidth types from the given source to track with the counter.

Without ABMC enabled, monitoring will work in current mode without
assignment option.

# Linux Implementation

Linux resctrl subsystem provides the interface to count maximum of two
memory bandwidth events per group, from a combination of available total
and local events. Keeping the current interface, users can enable a maximum
of 2 ABMC counters per group. User will also have the option to enable only
one counter to the group. If the system runs out of assignable ABMC
counters, kernel will display an error. Users need to disable an already
enabled counter to make space for new assignments.


# Examples

a. Check if ABMC support is available
	#mount -t resctrl resctrl /sys/fs/resctrl/

	#cat /sys/fs/resctrl/info/L3_MON/mbm_mode
	[abmc] 
	legacy

	Linux kernel detected ABMC feature and it is enabled.

b. Check how many ABMC counters are available. 

	#cat /sys/fs/resctrl/info/L3_MON/num_cntrs 
	32

c. Create few resctrl groups.

	# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp


d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_control
   to list and modify the group's monitoring states. File provides single place
   to list monitoring states of all the resctrl groups. It makes it easier for
   user space to learn about the counters are used without needing to traverse
   all the groups thus reducing the number of filesystem calls.

	The list follows the following format:

	"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

	Format for specific type of groups:

	* Default CTRL_MON group:
	 "//<domain_id>=<flags>"

       * Non-default CTRL_MON group:
               "<CTRL_MON group>//<domain_id>=<flags>"

       * Child MON group of default CTRL_MON group:
               "/<MON group>/<domain_id>=<flags>"

       * Child MON group of non-default CTRL_MON group:
               "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

       Flags can be one of the following:

        t  MBM total event is enabled.
        l  MBM local event is enabled.
        tl Both total and local MBM events are enabled.
        _  None of the MBM events are enabled

	Examples:

	# cat /sys/fs/resctrl/info/L3_MON/mbm_control 
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
	//0=tl;1=tl;
	/child_default_mon_grp/0=tl;1=tl;
	
	There are four groups and all the groups have local and total
	event enabled on domain 0 and 1.

	=tl means both total and local events are enabled.

	"//" - This is a default CTRL_MON group

	"non_default_ctrl_mon_grp//" - This is non-default CTRL_MON group

	"/child_default_mon_grp/"  - This is Child MON group of the defult group

	"non_default_ctrl_mon_grp/child_non_default_mon_grp/" - This is child
	MON group of the non-default group

e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_control.

	The write format is similar to the above list format with addition of
	op-code for the assignment operation.
	
	* Default CTRL_MON group:
	        "//<domain_id><op-code><flags>"
	
	* Non-default CTRL_MON group:
	        "<CTRL_MON group>//<domain_id><op-code><flags>"
	
	* Child MON group of default CTRL_MON group:
	        "/<MON group>/<domain_id><op-code><flags>"
	
	* Child MON group of non-default CTRL_MON group:
	        "<CTRL_MON group>/<MON group>/<domain_id><op-code><flags>"
	
	Op-code can be one of the following:
	
	= Update the assignment to match the flag.
	+ Assign a new state.
	- Unassign a new state.

	Flags can be one of the following:

        t  MBM total event.
        l  MBM local event.
        tl Both total and local MBM events.
        _  None of the MBM events. Only works with '=' op-code.
	
	Initial group status:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
	//0=tl;1=tl;
	/child_default_mon_grp/0=tl;1=tl;

	To update the default group to enable only total event on domain 0:
	# echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
	//0=t;1=tl;
	/child_default_mon_grp/0=tl;1=tl;

	To update the MON group child_default_mon_grp to remove total event on domain 1:
	# echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control

	Assignment status after the update:
	$ cat /sys/fs/resctrl/info/L3_MON/mbm_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
	//0=t;1=tl;
	/child_default_mon_grp/0=tl;1=l;

	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
	remove both local and total events on domain 1:
	# echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
	       /sys/fs/resctrl/info/L3_MON/mbm_control

	Assignment status after the update:
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
	//0=t;1=tl;
	/child_default_mon_grp/0=tl;1=l;

	To update the default group to add a local event domain 0.
	# echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_control

	Assignment status after the update:
	# cat /sys/fs/resctrl/info/L3_MON/mbm_control
	non_default_ctrl_mon_grp//0=tl;1=tl;
	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
	//0=tl;1=tl;
	/child_default_mon_grp/0=tl;1=l;


f. Read the event mbm_total_bytes and mbm_local_bytes of the default group.
   There is no change in reading the events with ABMC. If the event is unassigned
   when reading, then the read will come back as "Unassigned".
	
	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	779247936
	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes 
	765207488
	
g. Users will have the option to go back to legacy mbm_mode if required.
   This can be done using the following command. Note that switching the
   mbm_mode will reset all the mbm counters of all resctrl groups.

	# echo "legacy" > /sys/fs/resctrl/info/L3_MON/mbm_mode
	# cat /sys/fs/resctrl/info/L3_MON/mbm_mode
	abmc
	[legacy]

h. Check the bandwidth configuration for the group. Note that bandwidth
   configuration has a domain scope. Total event defaults to 0x7F (to
   count all the events) and local event defaults to 0x15 (to count all
   the local numa events). The event bitmap decoding is available at
   https://www.kernel.org/doc/Documentation/x86/resctrl.rst
   in section "mbm_total_bytes_config", "mbm_local_bytes_config":
	
	#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	0=0x7f;1=0x7f
	
	#cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 
	0=0x15;1=0x15
	
j. Change the bandwidth source for domain 0 for the total event to count only reads.
   Note that this change effects total events on the domain 0.
	
	#echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 
	0=0x33;1=0x7F
	
k. Now read the total event again. The first read will come back with "Unavailable"
   status. The subsequent read of mbm_total_bytes will display only the read events.
	
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	Unavailable
	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
	314101
	
l. Unmount the resctrl
	 
	#umount /sys/fs/resctrl/

---
v5:
  Rebase changes (because of SNC support)

  Interface changes.
   /sys/fs/resctrl/mbm_assign to /sys/fs/resctrl/mbm_mode.
   /sys/fs/resctrl/mbm_assign_control to /sys/fs/resctrl/mbm_control.

  Added few arch specific routines.
  resctrl_arch_get_abmc_enabled.
  resctrl_arch_abmc_enable.
  resctrl_arch_abmc_disable.

  Few renames
   num_cntrs_free_map -> mbm_cntrs_free_map
   num_cntrs_init -> mbm_cntrs_init
   arch_domain_mbm_evt_config -> resctrl_arch_mbm_evt_config

  Introduced resctrl_arch_event_config_get and
    resctrl_arch_event_config_set() to update event configuration.

  Removed mon_state field mongroup. Added MON_CNTR_UNSET to initialize counters.

  Renamed ctr_id to cntr_id for the hardware counter.
 
  Report "Unassigned" in case the user attempts to read the events without assigning the counter.
  
  ABMC is enabled during the boot up. Can be enabled or disabled later.

  Fixed opcode and flags combination.
    '=_" is valid.
    "-_" amd "+_" is not valid.

 Added all the comments as far as I know. If I missed something, it is not intentional.

v4: 
  Main change is domain specific event assignment.
  Kept the ABMC feature as a default.
  Dynamcic switching between ABMC and mbm_legacy is still allowed.
  We are still not clear about mount option.
  Moved the monitoring related data in resctrl_mon structure from rdt_resource.
  Fixed the display of legacy and ABMC mode.
  Used bimap APIs when possible.
  Removed event configuration read from MSRs. We can use the
  internal saved data.(patch 12)
  Added more comments about L3_QOS_ABMC_CFG MSR.
  Added IPIs to read the assignment status for each domain (patch 18 and 19)
  More details in each patch.

v3:
   This series adds the support for global assignment mode discussed in
   the thread. https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/
   Removed the individual assignment mode and included the global assignment interface.
   Added following interface files.
   a. /sys/fs/resctrl/info/L3_MON/mbm_assign
      Used for displaying the current assignment mode and switch between
      ABMC and legacy mode.
   b. /sys/fs/resctrl/info/L3_MON/mbm_assign_control
      Used for lising the groups assignment mode and modify the assignment states.
   c. Most of the changes are related to the new interface.
   d. Addressed the comments from Reinette, James and Peter.
   e. Hope I have addressed most of the major feedbacks discussed. If I missed
      something then it is not intentional. Please feel free to comment.
   f. Sending this as an RFC as per Reinette's comment. So, this is still open
      for discussion.

v2:
   a. Major change is the way ABMC is enabled. Earlier, user needed to remount
      with -o abmc to enable ABMC feature. Removed that option now.
      Now users can enable ABMC by "$echo 1 to /sys/fs/resctrl/info/L3_MON/mbm_assign_enable".
     
   b. Added new word 21 to x86/cpufeatures.h.

   c. Display unsupported if user attempts to read the events when ABMC is enabled
      and event is not assigned.

   d. Display monitor_state as "Unsupported" when ABMC is disabled.
  
   e. Text updates and rebase to latest tip tree (as of Jan 18).
 
   f. This series is still work in progress. I am yet to hear from ARM developers. 

v4:
  https://lore.kernel.org/lkml/cover.1716552602.git.babu.moger@amd.com/

v3:
 https://lore.kernel.org/lkml/cover.1711674410.git.babu.moger@amd.com/  

v2:
  https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/

v1 :
   https://lore.kernel.org/lkml/20231201005720.235639-1-babu.moger@amd.com/



Babu Moger (20):
  x86/cpufeatures: Add support for Assignable Bandwidth Monitoring
    Counters (ABMC)
  x86/resctrl: Add ABMC feature in the command line options
  x86/resctrl: Consolidate monitoring related data from rdt_resource
  x86/resctrl: Detect Assignable Bandwidth Monitoring feature details
  x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags
  x86/resctrl: Add support to enable/disable AMD ABMC feature
  x86/resctrl: Introduce the interface to display monitor mode
  x86/resctrl: Introduce interface to display number of monitoring
    counters
  x86/resctrl: Initialize monitor counters bitmap
  x86/resctrl: Introduce mbm_total_cfg and mbm_local_cfg
  x86/resctrl: Remove MSR reading of event configuration value
  x86/resctrl: Add data structures and definitions for ABMC assignment
  x86/resctrl: Add the interface to assign hardware counter
  x86/resctrl: Add the interface to unassign hardware counter
  x86/resctrl: Assign/unassign counters by default when ABMC is enabled
  x86/resctrl: Report "Unassigned" for MBM events in ABMC mode
  x86/resctrl: Introduce the interface switch between monitor modes
  x86/resctrl: Enable AMD ABMC feature by default when supported
  x86/resctrl: Introduce interface to list monitor states of all the
    groups
  x86/resctrl: Introduce interface to modify assignment states of the
    groups

 .../admin-guide/kernel-parameters.txt         |   2 +-
 Documentation/arch/x86/resctrl.rst            | 181 ++++
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/msr-index.h              |   3 +
 arch/x86/kernel/cpu/cpuid-deps.c              |   3 +
 arch/x86/kernel/cpu/resctrl/core.c            |  12 +-
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c     |  19 +-
 arch/x86/kernel/cpu/resctrl/internal.h        |  69 +-
 arch/x86/kernel/cpu/resctrl/monitor.c         |  63 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c        | 930 ++++++++++++++++--
 arch/x86/kernel/cpu/scattered.c               |   1 +
 include/linux/resctrl.h                       |  21 +-
 12 files changed, 1218 insertions(+), 87 deletions(-)

-- 
2.34.1

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Reinette Chatre 1 year, 6 months ago

Hi Babu,

On 7/3/24 2:48 PM, Babu Moger wrote:
> # Linux Implementation
> 
> Linux resctrl subsystem provides the interface to count maximum of two
> memory bandwidth events per group, from a combination of available total
> and local events. Keeping the current interface, users can enable a maximum
> of 2 ABMC counters per group. User will also have the option to enable only
> one counter to the group. If the system runs out of assignable ABMC
> counters, kernel will display an error. Users need to disable an already
> enabled counter to make space for new assignments.

The implementation appears to be converging on an interface that can
be generic enough to be used by other features discussed along the way.
"Linux implementation" summary can thus add:

	Create a generic interface aimed to support user space assignment
	of scarce counters used for monitoring. First usage of interface
	is by ABMC with option to expand usage to "soft-RMID" and MPAM
	counters in future.


> # Examples
> 
> a. Check if ABMC support is available
> 	#mount -t resctrl resctrl /sys/fs/resctrl/
> 
> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_mode
> 	[abmc]
> 	legacy
> 
> 	Linux kernel detected ABMC feature and it is enabled.

How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
info file and be the final step to make this generic so that another architecture
can more easily support assignining hardware counters without needing to call
the feature AMD's "abmc".

Expanding on this it may be possible to add a new "sw_mbm_cntrs" feature that
will be the "soft-RMID" feature while also reflecting the "mbm_cntrs" name
so that when user space enables that feature its properties can be found in
"num_mbm_cntrs".

The "abmc" kernel parameter remains but that does seem separate from this
resctrl fs feature since it is explicitly tied to X86_FEATURE_ABMC surely
making it architecture specific.

> 
> b. Check how many ABMC counters are available.
> 
> 	#cat /sys/fs/resctrl/info/L3_MON/num_cntrs
> 	32

This is now num_mbm_cntrs

> 
> c. Create few resctrl groups.
> 
> 	# mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
> 	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
> 	# mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
> 
> 
> d. This series adds a new interface file /sys/fs/resctrl/info/L3_MON/mbm_control
>     to list and modify the group's monitoring states. File provides single place
>     to list monitoring states of all the resctrl groups. It makes it easier for
>     user space to learn about the counters are used without needing to traverse
>     all the groups thus reducing the number of filesystem calls.
> 
> 	The list follows the following format:
> 
> 	"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
> 
> 	Format for specific type of groups:
> 
> 	* Default CTRL_MON group:
> 	 "//<domain_id>=<flags>"
> 
>         * Non-default CTRL_MON group:
>                 "<CTRL_MON group>//<domain_id>=<flags>"
> 
>         * Child MON group of default CTRL_MON group:
>                 "/<MON group>/<domain_id>=<flags>"
> 
>         * Child MON group of non-default CTRL_MON group:
>                 "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
> 
>         Flags can be one of the following:
> 
>          t  MBM total event is enabled.
>          l  MBM local event is enabled.
>          tl Both total and local MBM events are enabled.
>          _  None of the MBM events are enabled

The language needs to be changed here (and in the many copied places) to
be specific about what setting the flag accomplishes. For example, in
"legacy" mode user space can be expected to find all events enabled, no?
Needing a new feature to set a flag to accomplish something that is
possible in legacy mode can thus cause confusion.

If I understand the implementation reading "mbm_control" will fail
if system is ABMC capable but it is disabled. Why can "mbm_control" not
always be displayed to user space? For example, what if "mbm_control" is
always available to user space and it can provide specific information to
user space. For example:
	t  MBM total event is enabled but may not always be counted.
	T  MBM total event is enabled and being counted.

On AMD systems resource groups will have "t" associated with monitor
groups when ABMC disabled, "T" when ABMC enabled and a counter assigned.
On Intel systems monitor groups will always have "T".

For "soft-RMID" the flag could possible continue to be "T"?

I am trying to find ways to communicate to user space consistently
and clearly and any insights will be appreciated. We really do not want
to add this interface and then find that it just causes confusion.

It is not quite obvious to me when the new files should be visible and
what they should present to the user. "mbm_mode" is now always visible.
Should "num_mbm_cntrs" not also always be visible? Right now "num_mbm_cntrs"
appears to be only associated to ABMC, should it not also, for example,
be the file that "soft-RMID" may use to share how many counters are
available? Its contents will thus be dynamic based on which "MBM mode" is
active, begging the question, what should it contain when "legacy" mode is
enabled, should "num_mbm_cntrs" perhaps show "0" to user space when
"legacy" mode is active?


> 
> 	Examples:
> 
> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_control
> 	non_default_ctrl_mon_grp//0=tl;1=tl;
> 	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> 	//0=tl;1=tl;
> 	/child_default_mon_grp/0=tl;1=tl;
> 	
> 	There are four groups and all the groups have local and total
> 	event enabled on domain 0 and 1.

"local and total event" is vague, can it be made specific with, for example,
"local and total MBM events"

> 
> 	=tl means both total and local events are enabled.

Same here (and all copied places in this series)

> 
> 	"//" - This is a default CTRL_MON group
> 
> 	"non_default_ctrl_mon_grp//" - This is non-default CTRL_MON group
> 
> 	"/child_default_mon_grp/"  - This is Child MON group of the defult group

Same typos as in previous version of cover letter.

> 
> 	"non_default_ctrl_mon_grp/child_non_default_mon_grp/" - This is child
> 	MON group of the non-default group
> 
> e. Update the group assignment states using the interface file /sys/fs/resctrl/info/L3_MON/mbm_control.
> 
> 	The write format is similar to the above list format with addition of
> 	op-code for the assignment operation.
> 	
> 	* Default CTRL_MON group:
> 	        "//<domain_id><op-code><flags>"
> 	
> 	* Non-default CTRL_MON group:
> 	        "<CTRL_MON group>//<domain_id><op-code><flags>"
> 	
> 	* Child MON group of default CTRL_MON group:
> 	        "/<MON group>/<domain_id><op-code><flags>"
> 	
> 	* Child MON group of non-default CTRL_MON group:
> 	        "<CTRL_MON group>/<MON group>/<domain_id><op-code><flags>"
> 	
> 	Op-code can be one of the following:
> 	
> 	= Update the assignment to match the flag.
> 	+ Assign a new state.
> 	- Unassign a new state.

Please be consistent with terminology. Above switches between "flag"
and "state" while it then continues below using "event". Also,
"Unassign a _new_ state" is unexpected, it should probably be an
_existing_ (not "new") state/flag/event?

> 
> 	Flags can be one of the following:
> 
>          t  MBM total event.
>          l  MBM local event.
>          tl Both total and local MBM events.
>          _  None of the MBM events. Only works with '=' op-code.
> 	
> 	Initial group status:
> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_control
> 	non_default_ctrl_mon_grp//0=tl;1=tl;
> 	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> 	//0=tl;1=tl;
> 	/child_default_mon_grp/0=tl;1=tl;
> 
> 	To update the default group to enable only total event on domain 0:
> 	# echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> 
> 	Assignment status after the update:
> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_control
> 	non_default_ctrl_mon_grp//0=tl;1=tl;
> 	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> 	//0=t;1=tl;
> 	/child_default_mon_grp/0=tl;1=tl;
> 
> 	To update the MON group child_default_mon_grp to remove total event on domain 1:
> 	# echo "/child_default_mon_grp/1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> 
> 	Assignment status after the update:
> 	$ cat /sys/fs/resctrl/info/L3_MON/mbm_control
> 	non_default_ctrl_mon_grp//0=tl;1=tl;
> 	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
> 	//0=t;1=tl;
> 	/child_default_mon_grp/0=tl;1=l;
> 
> 	To update the MON group non_default_ctrl_mon_grp/child_non_default_mon_grp to
> 	remove both local and total events on domain 1:
> 	# echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
> 	       /sys/fs/resctrl/info/L3_MON/mbm_control
> 
> 	Assignment status after the update:
> 	non_default_ctrl_mon_grp//0=tl;1=tl;
> 	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
> 	//0=t;1=tl;
> 	/child_default_mon_grp/0=tl;1=l;
> 
> 	To update the default group to add a local event domain 0.
> 	# echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
> 
> 	Assignment status after the update:
> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_control
> 	non_default_ctrl_mon_grp//0=tl;1=tl;
> 	non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
> 	//0=tl;1=tl;
> 	/child_default_mon_grp/0=tl;1=l;
> 
> 
> f. Read the event mbm_total_bytes and mbm_local_bytes of the default group.
>     There is no change in reading the events with ABMC. If the event is unassigned
>     when reading, then the read will come back as "Unassigned".
> 	
> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	779247936
> 	# cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
> 	765207488
> 	
> g. Users will have the option to go back to legacy mbm_mode if required.
>     This can be done using the following command. Note that switching the
>     mbm_mode will reset all the mbm counters of all resctrl groups.

mbm -> MBM (throughout)

> 
> 	# echo "legacy" > /sys/fs/resctrl/info/L3_MON/mbm_mode
> 	# cat /sys/fs/resctrl/info/L3_MON/mbm_mode
> 	abmc
> 	[legacy]
> 
> h. Check the bandwidth configuration for the group. Note that bandwidth
>     configuration has a domain scope. Total event defaults to 0x7F (to
>     count all the events) and local event defaults to 0x15 (to count all
>     the local numa events). The event bitmap decoding is available at
>     https://www.kernel.org/doc/Documentation/x86/resctrl.rst
>     in section "mbm_total_bytes_config", "mbm_local_bytes_config":
> 	
> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> 	0=0x7f;1=0x7f
> 	
> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
> 	0=0x15;1=0x15
> 	
> j. Change the bandwidth source for domain 0 for the total event to count only reads.
>     Note that this change effects total events on the domain 0.
> 	
> 	#echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> 	#cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
> 	0=0x33;1=0x7F
> 	
> k. Now read the total event again. The first read will come back with "Unavailable"
>     status. The subsequent read of mbm_total_bytes will display only the read events.
> 	
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	Unavailable
> 	#cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
> 	314101
> 	
> l. Unmount the resctrl
> 	
> 	#umount /sys/fs/resctrl/
> 

Reinette

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Moger, Babu 1 year, 6 months ago

Hi Reinette,

On 7/12/24 17:03, Reinette Chatre wrote:
> Hi Babu,
> 
> On 7/3/24 2:48 PM, Babu Moger wrote:
>> # Linux Implementation
>>
>> Linux resctrl subsystem provides the interface to count maximum of two
>> memory bandwidth events per group, from a combination of available total
>> and local events. Keeping the current interface, users can enable a maximum
>> of 2 ABMC counters per group. User will also have the option to enable only
>> one counter to the group. If the system runs out of assignable ABMC
>> counters, kernel will display an error. Users need to disable an already
>> enabled counter to make space for new assignments.
> 
> The implementation appears to be converging on an interface that can
> be generic enough to be used by other features discussed along the way.
> "Linux implementation" summary can thus add:
> 
>     Create a generic interface aimed to support user space assignment
>     of scarce counters used for monitoring. First usage of interface
>     is by ABMC with option to expand usage to "soft-RMID" and MPAM
>     counters in future.

Sure.

> 
> 
>> # Examples
>>
>> a. Check if ABMC support is available
>>     #mount -t resctrl resctrl /sys/fs/resctrl/
>>
>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>     [abmc]
>>     legacy
>>
>>     Linux kernel detected ABMC feature and it is enabled.
> 
> How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
> info file and be the final step to make this generic so that another
> architecture
> can more easily support assignining hardware counters without needing to call
> the feature AMD's "abmc".

I think we aleady settled this with "mbm_cntr_assignable".

For soft-RMID" it will be mbm_sw_assignable.


> 
> Expanding on this it may be possible to add a new "sw_mbm_cntrs" feature that
> will be the "soft-RMID" feature while also reflecting the "mbm_cntrs" name
> so that when user space enables that feature its properties can be found in
> "num_mbm_cntrs".
> 
> The "abmc" kernel parameter remains but that does seem separate from this
> resctrl fs feature since it is explicitly tied to X86_FEATURE_ABMC surely
> making it architecture specific.
> 
>>
>> b. Check how many ABMC counters are available.
>>
>>     #cat /sys/fs/resctrl/info/L3_MON/num_cntrs
>>     32
> 
> This is now num_mbm_cntrs

Sure.

> 
>>
>> c. Create few resctrl groups.
>>
>>     # mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
>>     # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
>>     # mkdir
>> /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
>>
>>
>> d. This series adds a new interface file
>> /sys/fs/resctrl/info/L3_MON/mbm_control
>>     to list and modify the group's monitoring states. File provides
>> single place
>>     to list monitoring states of all the resctrl groups. It makes it
>> easier for
>>     user space to learn about the counters are used without needing to
>> traverse
>>     all the groups thus reducing the number of filesystem calls.
>>
>>     The list follows the following format:
>>
>>     "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
>>
>>     Format for specific type of groups:
>>
>>     * Default CTRL_MON group:
>>      "//<domain_id>=<flags>"
>>
>>         * Non-default CTRL_MON group:
>>                 "<CTRL_MON group>//<domain_id>=<flags>"
>>
>>         * Child MON group of default CTRL_MON group:
>>                 "/<MON group>/<domain_id>=<flags>"
>>
>>         * Child MON group of non-default CTRL_MON group:
>>                 "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
>>
>>         Flags can be one of the following:
>>
>>          t  MBM total event is enabled.
>>          l  MBM local event is enabled.
>>          tl Both total and local MBM events are enabled.
>>          _  None of the MBM events are enabled
> 
> The language needs to be changed here (and in the many copied places) to
> be specific about what setting the flag accomplishes. For example, in
> "legacy" mode user space can be expected to find all events enabled, no?
> Needing a new feature to set a flag to accomplish something that is
> possible in legacy mode can thus cause confusion.

Yes. It is possible to do it. But I feel unnessassary.

> 
> If I understand the implementation reading "mbm_control" will fail
> if system is ABMC capable but it is disabled. Why can "mbm_control" not
> always be displayed to user space? For example, what if "mbm_control" is
> always available to user space and it can provide specific information to
> user space. For example:
>     t  MBM total event is enabled but may not always be counted.
>     T  MBM total event is enabled and being counted.
> 
> On AMD systems resource groups will have "t" associated with monitor
> groups when ABMC disabled, "T" when ABMC enabled and a counter assigned.
> On Intel systems monitor groups will always have "T".

I think more flags will add more confusion.

> 
> For "soft-RMID" the flag could possible continue to be "T"?
> 
> I am trying to find ways to communicate to user space consistently
> and clearly and any insights will be appreciated. We really do not want
> to add this interface and then find that it just causes confusion.
> 
> It is not quite obvious to me when the new files should be visible and
> what they should present to the user. "mbm_mode" is now always visible.
> Should "num_mbm_cntrs" not also always be visible? Right now "num_mbm_cntrs"
> appears to be only associated to ABMC, should it not also, for example,
> be the file that "soft-RMID" may use to share how many counters are
> available? Its contents will thus be dynamic based on which "MBM mode" is
> active, begging the question, what should it contain when "legacy" mode is
> enabled, should "num_mbm_cntrs" perhaps show "0" to user space when
> "legacy" mode is active?

Its good we have this discussion.

How about we go with simple way for now. The mbm_mode will only available
when ABMC or Soft_RMID(MPAM feature) is supported. Same way for the
num_mbm_cntrs.


> 
>>
>>     Examples:
>>
>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>     non_default_ctrl_mon_grp//0=tl;1=tl;
>>     non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
>>     //0=tl;1=tl;
>>     /child_default_mon_grp/0=tl;1=tl;
>>     
>>     There are four groups and all the groups have local and total
>>     event enabled on domain 0 and 1.
> 
> "local and total event" is vague, can it be made specific with, for example,
> "local and total MBM events"

Sure.

> 
>>
>>     =tl means both total and local events are enabled.
> 
> Same here (and all copied places in this series)

Sure.

> 
>>
>>     "//" - This is a default CTRL_MON group
>>
>>     "non_default_ctrl_mon_grp//" - This is non-default CTRL_MON group
>>
>>     "/child_default_mon_grp/"  - This is Child MON group of the defult
>> group
> 
> Same typos as in previous version of cover letter.

Oh. no. Will fix it.

> 
>>
>>     "non_default_ctrl_mon_grp/child_non_default_mon_grp/" - This is child
>>     MON group of the non-default group
>>
>> e. Update the group assignment states using the interface file
>> /sys/fs/resctrl/info/L3_MON/mbm_control.
>>
>>     The write format is similar to the above list format with addition of
>>     op-code for the assignment operation.
>>     
>>     * Default CTRL_MON group:
>>             "//<domain_id><op-code><flags>"
>>     
>>     * Non-default CTRL_MON group:
>>             "<CTRL_MON group>//<domain_id><op-code><flags>"
>>     
>>     * Child MON group of default CTRL_MON group:
>>             "/<MON group>/<domain_id><op-code><flags>"
>>     
>>     * Child MON group of non-default CTRL_MON group:
>>             "<CTRL_MON group>/<MON group>/<domain_id><op-code><flags>"
>>     
>>     Op-code can be one of the following:
>>     
>>     = Update the assignment to match the flag.
>>     + Assign a new state.
>>     - Unassign a new state.
> 
> Please be consistent with terminology. Above switches between "flag"
> and "state" while it then continues below using "event". Also,
> "Unassign a _new_ state" is unexpected, it should probably be an
> _existing_ (not "new") state/flag/event?

I will use event consistantly.

> 
>>
>>     Flags can be one of the following:
>>
>>          t  MBM total event.
>>          l  MBM local event.
>>          tl Both total and local MBM events.
>>          _  None of the MBM events. Only works with '=' op-code.
>>     
>>     Initial group status:
>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>     non_default_ctrl_mon_grp//0=tl;1=tl;
>>     non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
>>     //0=tl;1=tl;
>>     /child_default_mon_grp/0=tl;1=tl;
>>
>>     To update the default group to enable only total event on domain 0:
>>     # echo "//0=t" > /sys/fs/resctrl/info/L3_MON/mbm_control
>>
>>     Assignment status after the update:
>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>     non_default_ctrl_mon_grp//0=tl;1=tl;
>>     non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
>>     //0=t;1=tl;
>>     /child_default_mon_grp/0=tl;1=tl;
>>
>>     To update the MON group child_default_mon_grp to remove total event
>> on domain 1:
>>     # echo "/child_default_mon_grp/1-t" >
>> /sys/fs/resctrl/info/L3_MON/mbm_control
>>
>>     Assignment status after the update:
>>     $ cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>     non_default_ctrl_mon_grp//0=tl;1=tl;
>>     non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl;
>>     //0=t;1=tl;
>>     /child_default_mon_grp/0=tl;1=l;
>>
>>     To update the MON group
>> non_default_ctrl_mon_grp/child_non_default_mon_grp to
>>     remove both local and total events on domain 1:
>>     # echo "non_default_ctrl_mon_grp/child_non_default_mon_grp/1=_" >
>>            /sys/fs/resctrl/info/L3_MON/mbm_control
>>
>>     Assignment status after the update:
>>     non_default_ctrl_mon_grp//0=tl;1=tl;
>>     non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
>>     //0=t;1=tl;
>>     /child_default_mon_grp/0=tl;1=l;
>>
>>     To update the default group to add a local event domain 0.
>>     # echo "//0+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
>>
>>     Assignment status after the update:
>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>     non_default_ctrl_mon_grp//0=tl;1=tl;
>>     non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=_;
>>     //0=tl;1=tl;
>>     /child_default_mon_grp/0=tl;1=l;
>>
>>
>> f. Read the event mbm_total_bytes and mbm_local_bytes of the default group.
>>     There is no change in reading the events with ABMC. If the event is
>> unassigned
>>     when reading, then the read will come back as "Unassigned".
>>     
>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>     779247936
>>     # cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
>>     765207488
>>     
>> g. Users will have the option to go back to legacy mbm_mode if required.
>>     This can be done using the following command. Note that switching the
>>     mbm_mode will reset all the mbm counters of all resctrl groups.
> 
> mbm -> MBM (throughout)

Sure.

> 
>>
>>     # echo "legacy" > /sys/fs/resctrl/info/L3_MON/mbm_mode
>>     # cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>     abmc
>>     [legacy]
>>
>> h. Check the bandwidth configuration for the group. Note that bandwidth
>>     configuration has a domain scope. Total event defaults to 0x7F (to
>>     count all the events) and local event defaults to 0x15 (to count all
>>     the local numa events). The event bitmap decoding is available at
>>     https://www.kernel.org/doc/Documentation/x86/resctrl.rst
>>     in section "mbm_total_bytes_config", "mbm_local_bytes_config":
>>     
>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
>>     0=0x7f;1=0x7f
>>     
>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
>>     0=0x15;1=0x15
>>     
>> j. Change the bandwidth source for domain 0 for the total event to count
>> only reads.
>>     Note that this change effects total events on the domain 0.
>>     
>>     #echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
>>     #cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
>>     0=0x33;1=0x7F
>>     
>> k. Now read the total event again. The first read will come back with
>> "Unavailable"
>>     status. The subsequent read of mbm_total_bytes will display only the
>> read events.
>>     
>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>     Unavailable
>>     #cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
>>     314101
>>     
>> l. Unmount the resctrl
>>     
>>     #umount /sys/fs/resctrl/
>>
> 
> Reinette
> 

-- 
Thanks
Babu Moger

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Reinette Chatre 1 year, 6 months ago

Hi Babu,

On 7/17/24 10:19 AM, Moger, Babu wrote:
> Hi Reinette,
> 
> On 7/12/24 17:03, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 7/3/24 2:48 PM, Babu Moger wrote:
>>> # Linux Implementation
>>>
>>> Linux resctrl subsystem provides the interface to count maximum of two
>>> memory bandwidth events per group, from a combination of available total
>>> and local events. Keeping the current interface, users can enable a maximum
>>> of 2 ABMC counters per group. User will also have the option to enable only
>>> one counter to the group. If the system runs out of assignable ABMC
>>> counters, kernel will display an error. Users need to disable an already
>>> enabled counter to make space for new assignments.
>>
>> The implementation appears to be converging on an interface that can
>> be generic enough to be used by other features discussed along the way.
>> "Linux implementation" summary can thus add:
>>
>>      Create a generic interface aimed to support user space assignment
>>      of scarce counters used for monitoring. First usage of interface
>>      is by ABMC with option to expand usage to "soft-RMID" and MPAM
>>      counters in future.
> 
> Sure.
> 
>>
>>
>>> # Examples
>>>
>>> a. Check if ABMC support is available
>>>      #mount -t resctrl resctrl /sys/fs/resctrl/
>>>
>>>      #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>>      [abmc]
>>>      legacy
>>>
>>>      Linux kernel detected ABMC feature and it is enabled.
>>
>> How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
>> info file and be the final step to make this generic so that another
>> architecture
>> can more easily support assignining hardware counters without needing to call
>> the feature AMD's "abmc".
> 
> I think we aleady settled this with "mbm_cntr_assignable".
> 
> For soft-RMID" it will be mbm_sw_assignable.

Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to match
with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?

>> Expanding on this it may be possible to add a new "sw_mbm_cntrs" feature that
>> will be the "soft-RMID" feature while also reflecting the "mbm_cntrs" name
>> so that when user space enables that feature its properties can be found in
>> "num_mbm_cntrs".
>>
>> The "abmc" kernel parameter remains but that does seem separate from this
>> resctrl fs feature since it is explicitly tied to X86_FEATURE_ABMC surely
>> making it architecture specific.
>>
>>>
>>> b. Check how many ABMC counters are available.
>>>
>>>      #cat /sys/fs/resctrl/info/L3_MON/num_cntrs
>>>      32
>>
>> This is now num_mbm_cntrs
> 
> Sure.
> 
>>
>>>
>>> c. Create few resctrl groups.
>>>
>>>      # mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
>>>      # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
>>>      # mkdir
>>> /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
>>>
>>>
>>> d. This series adds a new interface file
>>> /sys/fs/resctrl/info/L3_MON/mbm_control
>>>      to list and modify the group's monitoring states. File provides
>>> single place
>>>      to list monitoring states of all the resctrl groups. It makes it
>>> easier for
>>>      user space to learn about the counters are used without needing to
>>> traverse
>>>      all the groups thus reducing the number of filesystem calls.
>>>
>>>      The list follows the following format:
>>>
>>>      "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
>>>
>>>      Format for specific type of groups:
>>>
>>>      * Default CTRL_MON group:
>>>       "//<domain_id>=<flags>"
>>>
>>>          * Non-default CTRL_MON group:
>>>                  "<CTRL_MON group>//<domain_id>=<flags>"
>>>
>>>          * Child MON group of default CTRL_MON group:
>>>                  "/<MON group>/<domain_id>=<flags>"
>>>
>>>          * Child MON group of non-default CTRL_MON group:
>>>                  "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
>>>
>>>          Flags can be one of the following:
>>>
>>>           t  MBM total event is enabled.
>>>           l  MBM local event is enabled.
>>>           tl Both total and local MBM events are enabled.
>>>           _  None of the MBM events are enabled
>>
>> The language needs to be changed here (and in the many copied places) to
>> be specific about what setting the flag accomplishes. For example, in
>> "legacy" mode user space can be expected to find all events enabled, no?
>> Needing a new feature to set a flag to accomplish something that is
>> possible in legacy mode can thus cause confusion.
> 
> Yes. It is possible to do it. But I feel unnessassary.
> 
>>
>> If I understand the implementation reading "mbm_control" will fail
>> if system is ABMC capable but it is disabled. Why can "mbm_control" not
>> always be displayed to user space? For example, what if "mbm_control" is
>> always available to user space and it can provide specific information to
>> user space. For example:
>>      t  MBM total event is enabled but may not always be counted.
>>      T  MBM total event is enabled and being counted.
>>
>> On AMD systems resource groups will have "t" associated with monitor
>> groups when ABMC disabled, "T" when ABMC enabled and a counter assigned.
>> On Intel systems monitor groups will always have "T".
> 
> I think more flags will add more confusion.
> 
>>
>> For "soft-RMID" the flag could possible continue to be "T"?
>>
>> I am trying to find ways to communicate to user space consistently
>> and clearly and any insights will be appreciated. We really do not want
>> to add this interface and then find that it just causes confusion.
>>
>> It is not quite obvious to me when the new files should be visible and
>> what they should present to the user. "mbm_mode" is now always visible.
>> Should "num_mbm_cntrs" not also always be visible? Right now "num_mbm_cntrs"
>> appears to be only associated to ABMC, should it not also, for example,
>> be the file that "soft-RMID" may use to share how many counters are
>> available? Its contents will thus be dynamic based on which "MBM mode" is
>> active, begging the question, what should it contain when "legacy" mode is
>> enabled, should "num_mbm_cntrs" perhaps show "0" to user space when
>> "legacy" mode is active?
> 
> Its good we have this discussion.
> 
> How about we go with simple way for now. The mbm_mode will only available
> when ABMC or Soft_RMID(MPAM feature) is supported. Same way for the
> num_mbm_cntrs.

If ABMC or Soft_RMID is supported then user can still enable "legacy" instead.
What will num_mbm_cntrs and mbm_control display when user enables
"legacy"?

Reinette

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Peter Newman 1 year, 6 months ago

Hi Reinette and Babu,

On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Babu,
>
> On 7/17/24 10:19 AM, Moger, Babu wrote:
> > Hi Reinette,
> >
> > On 7/12/24 17:03, Reinette Chatre wrote:
> >> Hi Babu,
> >>
> >> On 7/3/24 2:48 PM, Babu Moger wrote:
> >>> # Linux Implementation
> >>>
> >>> Linux resctrl subsystem provides the interface to count maximum of two
> >>> memory bandwidth events per group, from a combination of available total
> >>> and local events. Keeping the current interface, users can enable a maximum
> >>> of 2 ABMC counters per group. User will also have the option to enable only
> >>> one counter to the group. If the system runs out of assignable ABMC
> >>> counters, kernel will display an error. Users need to disable an already
> >>> enabled counter to make space for new assignments.
> >>
> >> The implementation appears to be converging on an interface that can
> >> be generic enough to be used by other features discussed along the way.
> >> "Linux implementation" summary can thus add:
> >>
> >>      Create a generic interface aimed to support user space assignment
> >>      of scarce counters used for monitoring. First usage of interface
> >>      is by ABMC with option to expand usage to "soft-RMID" and MPAM
> >>      counters in future.
> >
> > Sure.
> >
> >>
> >>
> >>> # Examples
> >>>
> >>> a. Check if ABMC support is available
> >>>      #mount -t resctrl resctrl /sys/fs/resctrl/
> >>>
> >>>      #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
> >>>      [abmc]
> >>>      legacy
> >>>
> >>>      Linux kernel detected ABMC feature and it is enabled.
> >>
> >> How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
> >> info file and be the final step to make this generic so that another
> >> architecture
> >> can more easily support assignining hardware counters without needing to call
> >> the feature AMD's "abmc".
> >
> > I think we aleady settled this with "mbm_cntr_assignable".
> >
> > For soft-RMID" it will be mbm_sw_assignable.
>
> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to match
> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?

My users are pushing for a consistent interface regardless of whether
counter assignment is implemented in hardware or software, so I would
like to avoid exposing implementation differences in the interface
where possible.

The main semantic difference with SW assignments is that it is not
possible to assign counters to individual events. Because the
implementation is assigning RMIDs to groups, assignment results in all
events being counted.

I was considering introducing a boolean mbm_assign_events node to
indicate whether assigning individual events is supported. If true,
num_mbm_cntrs indicates the number of events which can be counted,
otherwise it indicates the number of groups to which counters can be
assigned and attempting to assign a single event is silently upgraded
to assigning counters to all events in the group.

However, If we don't expect to see these semantics in any other
implementation, these semantics could be implicit in the definition of
a SW assignable counter.

-Peter

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Reinette Chatre 1 year, 6 months ago

Hi Peter,

On 8/1/24 3:45 PM, Peter Newman wrote:
> On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 7/17/24 10:19 AM, Moger, Babu wrote:
>>> On 7/12/24 17:03, Reinette Chatre wrote:
>>>> On 7/3/24 2:48 PM, Babu Moger wrote:

>>>>> # Examples
>>>>>
>>>>> a. Check if ABMC support is available
>>>>>       #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>
>>>>>       #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>>>>       [abmc]
>>>>>       legacy
>>>>>
>>>>>       Linux kernel detected ABMC feature and it is enabled.
>>>>
>>>> How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
>>>> info file and be the final step to make this generic so that another
>>>> architecture
>>>> can more easily support assignining hardware counters without needing to call
>>>> the feature AMD's "abmc".
>>>
>>> I think we aleady settled this with "mbm_cntr_assignable".
>>>
>>> For soft-RMID" it will be mbm_sw_assignable.
>>
>> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to match
>> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?
> 
> My users are pushing for a consistent interface regardless of whether
> counter assignment is implemented in hardware or software, so I would
> like to avoid exposing implementation differences in the interface
> where possible.

This seems a reasonable ask but can we be confident that if hardware
supports assignable counters then there will never be a reason to use
software assignable counters? (This needs to also consider how/if Arm
may use this feature.)

I am of course not familiar with details of the software implementation
- could there be benefits to using it even if hardware counters are
supported?

What I would like to avoid is future complexity of needing a new mount/config
option that user space needs to use to select if a single "mbm_cntr_assignable"
is backed by hardware or software.

> The main semantic difference with SW assignments is that it is not
> possible to assign counters to individual events. Because the
> implementation is assigning RMIDs to groups, assignment results in all
> events being counted.
> 
> I was considering introducing a boolean mbm_assign_events node to
> indicate whether assigning individual events is supported. If true,
> num_mbm_cntrs indicates the number of events which can be counted,
> otherwise it indicates the number of groups to which counters can be
> assigned and attempting to assign a single event is silently upgraded
> to assigning counters to all events in the group.

How were you envisioning your users using the control file ("mbm_control")
in these scenarios? Does this file's interface even work for SW assignment
scenarios?

Users should expect consistent interface for "mbm_control" also.

It sounds to me that a potential "mbm_assign_events" will be false for SW
assignments. That would mean that "num_mbm_cntrs" will
contain the number of groups to which counters can be assigned?
Would user space be required to always enable all flags (enable all events) of
all domains to the same values ... or would enabling of one flag (one event)
in one domain automatically result in all flags (all events) enabled for all
domains ... or would enabling of one flag (one event) in one domain only appear
to user space to be enabled while in reality all flags/events are actually enabled?

> However, If we don't expect to see these semantics in any other
> implementation, these semantics could be implicit in the definition of
> a SW assignable counter.

It is not clear to me how implementation differences between hardware
and software assignment can be hidden from user space. It is possible
to let user space enable individual events and then silently upgrade it
to all events. I see two options here, either "mbm_control" needs to
explicitly show this "silent upgrade" so that user space knows which
events are actually enabled, or "mbm_control" only shows flags/events enabled
from user space perspective. In the former scenario, this needs more
user space support since a generic user space cannot be confident which
flags are set after writing to "mbm_control". In the latter scenario,
meaning of "num_mbm_cntrs" becomes unclear since user space is expected
to rely on it to know which events can be enabled and if some are
actually "silently enabled" when user space still thinks it needs to be
enabled the number of available counters becomes vague.

It is not clear to me how to present hardware and software assignable
counters with a single consistent interface. Actually, what if the
"mbm_mode" is what distinguishes how counters are assigned instead of how
it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
"mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
and "mbm_cntr_group_assignable" is used? Could that replace a
potential "mbm_assign_events" while also supporting user space in
interactions with "mbm_control"?

Reinette

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Peter Newman 1 year, 6 months ago

Hi Reinette,

On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 8/1/24 3:45 PM, Peter Newman wrote:
> > On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >> On 7/17/24 10:19 AM, Moger, Babu wrote:
> >>> On 7/12/24 17:03, Reinette Chatre wrote:
> >>>> On 7/3/24 2:48 PM, Babu Moger wrote:
>
> >>>>> # Examples
> >>>>>
> >>>>> a. Check if ABMC support is available
> >>>>>       #mount -t resctrl resctrl /sys/fs/resctrl/
> >>>>>
> >>>>>       #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
> >>>>>       [abmc]
> >>>>>       legacy
> >>>>>
> >>>>>       Linux kernel detected ABMC feature and it is enabled.
> >>>>
> >>>> How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
> >>>> info file and be the final step to make this generic so that another
> >>>> architecture
> >>>> can more easily support assignining hardware counters without needing to call
> >>>> the feature AMD's "abmc".
> >>>
> >>> I think we aleady settled this with "mbm_cntr_assignable".
> >>>
> >>> For soft-RMID" it will be mbm_sw_assignable.
> >>
> >> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to match
> >> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?
> >
> > My users are pushing for a consistent interface regardless of whether
> > counter assignment is implemented in hardware or software, so I would
> > like to avoid exposing implementation differences in the interface
> > where possible.
>
> This seems a reasonable ask but can we be confident that if hardware
> supports assignable counters then there will never be a reason to use
> software assignable counters? (This needs to also consider how/if Arm
> may use this feature.)
>
> I am of course not familiar with details of the software implementation
> - could there be benefits to using it even if hardware counters are
> supported?

I can't see any situation where the user would want to choose software
over hardware counters. The number of groups which can be monitored by
software assignable counters will always be less than with hardware,
due to the need for consuming one RMID (and the counters automatically
allocated to it by the AMD hardware) for all unassigned groups.

I consider software assignable a workaround to enable measuring
bandwidth reliably on a large number of groups on pre-ABMC AMD
hardware, or rather salvaging MBM on pre-ABMC hardware making use of
our users' effort to adapt to counter assignment in resctrl. We hope
no future implementations will choose to silently drop bandwidth
counts, so fingers crossed, the software implementation can be phased
out when these generations of AMD hardware are decommissioned.

The MPAM specification natively supports (or requires) counter
assignment in hardware. From what I recall in the last of James'
prototypes I looked at, MBM was only supported if the implementation
provided as many bandwidth counters as there were possible monitoring
groups, so that it could assume a monitor IDs for every PARTID:PMG
combination.

>
> What I would like to avoid is future complexity of needing a new mount/config
> option that user space needs to use to select if a single "mbm_cntr_assignable"
> is backed by hardware or software.

In my testing so far, automatically enabling counter assignment and
automatically allocating counters for all events in new groups works
well enough.

The only configuration I need is the ability to disable the automatic
counter allocation so that a userspace agent can have control of where
all the counters are assigned at all times. It's easy to implement
this as a simple flag if the user accepts that they need to manually
deallocate any automatically-allocated counters from groups created
before the flag was cleared.

>
> > The main semantic difference with SW assignments is that it is not
> > possible to assign counters to individual events. Because the
> > implementation is assigning RMIDs to groups, assignment results in all
> > events being counted.
> >
> > I was considering introducing a boolean mbm_assign_events node to
> > indicate whether assigning individual events is supported. If true,
> > num_mbm_cntrs indicates the number of events which can be counted,
> > otherwise it indicates the number of groups to which counters can be
> > assigned and attempting to assign a single event is silently upgraded
> > to assigning counters to all events in the group.
>
> How were you envisioning your users using the control file ("mbm_control")
> in these scenarios? Does this file's interface even work for SW assignment
> scenarios?
>
> Users should expect consistent interface for "mbm_control" also.
>
> It sounds to me that a potential "mbm_assign_events" will be false for SW
> assignments. That would mean that "num_mbm_cntrs" will
> contain the number of groups to which counters can be assigned?
> Would user space be required to always enable all flags (enable all events) of
> all domains to the same values ... or would enabling of one flag (one event)
> in one domain automatically result in all flags (all events) enabled for all
> domains ... or would enabling of one flag (one event) in one domain only appear
> to user space to be enabled while in reality all flags/events are actually enabled?

I believe mbm_control should always accurately reflect which events
are being counted.

The behavior as I've implemented today is:

# cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
0

# cat /sys/fs/resctrl/info/L3_MON/mbm_control
test//0=_;1=_;
//0=_;1=_;

# echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
# cat /sys/fs/resctrl/info/L3_MON/mbm_control
test//0=_;1=tl;
//0=_;1=_;

# echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
# cat /sys/fs/resctrl/info/L3_MON/mbm_control
test//0=_;1=_;
//0=_;1=_;

>
> > However, If we don't expect to see these semantics in any other
> > implementation, these semantics could be implicit in the definition of
> > a SW assignable counter.
>
> It is not clear to me how implementation differences between hardware
> and software assignment can be hidden from user space. It is possible
> to let user space enable individual events and then silently upgrade it
> to all events. I see two options here, either "mbm_control" needs to
> explicitly show this "silent upgrade" so that user space knows which
> events are actually enabled, or "mbm_control" only shows flags/events enabled
> from user space perspective. In the former scenario, this needs more
> user space support since a generic user space cannot be confident which
> flags are set after writing to "mbm_control". In the latter scenario,
> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
> to rely on it to know which events can be enabled and if some are
> actually "silently enabled" when user space still thinks it needs to be
> enabled the number of available counters becomes vague.
>
> It is not clear to me how to present hardware and software assignable
> counters with a single consistent interface. Actually, what if the
> "mbm_mode" is what distinguishes how counters are assigned instead of how
> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
> and "mbm_cntr_group_assignable" is used? Could that replace a
> potential "mbm_assign_events" while also supporting user space in
> interactions with "mbm_control"?

If I understand this correctly, is this a preference that the info
node be named differently if its value will have different units,
rather than a second node to indicate what the value of num_mbm_cntrs
actually means? This sounds reasonable to me.

I think it's also important to note that in MPAM, the MBWU (memory
bandwidth usage) monitors don't have a concept of local versus total
bandwidth, so event assignment would likely not apply there either.
What the counted bandwidth actually represents is more implicit in the
monitor's position in the memory system in the particular
implementation. On a theoretical multi-socket system, resctrl would
require knowledge about the system's architecture to stitch together
the counts from different types of monitors to produce a local and
total value. I don't know if we'd program this SoC-specific knowledge
into the kernel to produce a unified MBM resource like we're
accustomed to now or if we'd present multiple MBM resources, each only
providing an mbm_total_bytes event. In this case, the counters would
have to be assigned separately in each MBM resource, especially if the
different MBM resources support a different number of counters.

Thanks,
-Peter

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Reinette Chatre 1 year, 6 months ago

Hi Peter,

On 8/2/24 11:49 AM, Peter Newman wrote:
> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 8/1/24 3:45 PM, Peter Newman wrote:
>>> On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>> On 7/17/24 10:19 AM, Moger, Babu wrote:
>>>>> On 7/12/24 17:03, Reinette Chatre wrote:
>>>>>> On 7/3/24 2:48 PM, Babu Moger wrote:
>>
>>>>>>> # Examples
>>>>>>>
>>>>>>> a. Check if ABMC support is available
>>>>>>>        #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>
>>>>>>>        #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>>>>>>        [abmc]
>>>>>>>        legacy
>>>>>>>
>>>>>>>        Linux kernel detected ABMC feature and it is enabled.
>>>>>>
>>>>>> How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
>>>>>> info file and be the final step to make this generic so that another
>>>>>> architecture
>>>>>> can more easily support assignining hardware counters without needing to call
>>>>>> the feature AMD's "abmc".
>>>>>
>>>>> I think we aleady settled this with "mbm_cntr_assignable".
>>>>>
>>>>> For soft-RMID" it will be mbm_sw_assignable.
>>>>
>>>> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to match
>>>> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?
>>>
>>> My users are pushing for a consistent interface regardless of whether
>>> counter assignment is implemented in hardware or software, so I would
>>> like to avoid exposing implementation differences in the interface
>>> where possible.
>>
>> This seems a reasonable ask but can we be confident that if hardware
>> supports assignable counters then there will never be a reason to use
>> software assignable counters? (This needs to also consider how/if Arm
>> may use this feature.)
>>
>> I am of course not familiar with details of the software implementation
>> - could there be benefits to using it even if hardware counters are
>> supported?
> 
> I can't see any situation where the user would want to choose software
> over hardware counters. The number of groups which can be monitored by
> software assignable counters will always be less than with hardware,
> due to the need for consuming one RMID (and the counters automatically
> allocated to it by the AMD hardware) for all unassigned groups.

Thank you for clarifying. This seems specific to this software implementation,
and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I remember
correctly this depends on undocumented hardware specific knowledge.
  
> I consider software assignable a workaround to enable measuring
> bandwidth reliably on a large number of groups on pre-ABMC AMD
> hardware, or rather salvaging MBM on pre-ABMC hardware making use of
> our users' effort to adapt to counter assignment in resctrl. We hope
> no future implementations will choose to silently drop bandwidth
> counts, so fingers crossed, the software implementation can be phased
> out when these generations of AMD hardware are decommissioned.

That sounds ideal.

> 
> The MPAM specification natively supports (or requires) counter
> assignment in hardware. From what I recall in the last of James'
> prototypes I looked at, MBM was only supported if the implementation
> provided as many bandwidth counters as there were possible monitoring
> groups, so that it could assume a monitor IDs for every PARTID:PMG
> combination.

Thank you for this insight.

> 
>>
>> What I would like to avoid is future complexity of needing a new mount/config
>> option that user space needs to use to select if a single "mbm_cntr_assignable"
>> is backed by hardware or software.
> 
> In my testing so far, automatically enabling counter assignment and
> automatically allocating counters for all events in new groups works
> well enough.
> 
> The only configuration I need is the ability to disable the automatic
> counter allocation so that a userspace agent can have control of where
> all the counters are assigned at all times. It's easy to implement
> this as a simple flag if the user accepts that they need to manually
> deallocate any automatically-allocated counters from groups created
> before the flag was cleared.
> 
>>
>>> The main semantic difference with SW assignments is that it is not
>>> possible to assign counters to individual events. Because the
>>> implementation is assigning RMIDs to groups, assignment results in all
>>> events being counted.
>>>
>>> I was considering introducing a boolean mbm_assign_events node to
>>> indicate whether assigning individual events is supported. If true,
>>> num_mbm_cntrs indicates the number of events which can be counted,
>>> otherwise it indicates the number of groups to which counters can be
>>> assigned and attempting to assign a single event is silently upgraded
>>> to assigning counters to all events in the group.
>>
>> How were you envisioning your users using the control file ("mbm_control")
>> in these scenarios? Does this file's interface even work for SW assignment
>> scenarios?
>>
>> Users should expect consistent interface for "mbm_control" also.
>>
>> It sounds to me that a potential "mbm_assign_events" will be false for SW
>> assignments. That would mean that "num_mbm_cntrs" will
>> contain the number of groups to which counters can be assigned?
>> Would user space be required to always enable all flags (enable all events) of
>> all domains to the same values ... or would enabling of one flag (one event)
>> in one domain automatically result in all flags (all events) enabled for all
>> domains ... or would enabling of one flag (one event) in one domain only appear
>> to user space to be enabled while in reality all flags/events are actually enabled?
> 
> I believe mbm_control should always accurately reflect which events
> are being counted.

I agree.

> 
> The behavior as I've implemented today is:
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
> 0
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=_;
> //0=_;1=_;
> 
> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=tl;
> //0=_;1=_;
> 
> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=_;
> //0=_;1=_;
> 
> 

This highlights how there cannot be a generic/consistent interface between hardware
and software implementation. If resctrl implements something like above without any
other hints to user space then it will push complexity to user space since user space
would not know if setting one flag results in setting more than that flag, which may
force a user space implementation to always follow a write with a read that
needs to confirm what actually resulted from the write. Similarly, that removing a
flag impacts other flags needs to be clear without user space needing to "try and
see what happens".

It is not clear to me how to interpret the above example when it comes to the
RMID management though. If the RMID assignment is per group then I expected all
the domains of a group to have the same flag(s)?

>>
>>> However, If we don't expect to see these semantics in any other
>>> implementation, these semantics could be implicit in the definition of
>>> a SW assignable counter.
>>
>> It is not clear to me how implementation differences between hardware
>> and software assignment can be hidden from user space. It is possible
>> to let user space enable individual events and then silently upgrade it
>> to all events. I see two options here, either "mbm_control" needs to
>> explicitly show this "silent upgrade" so that user space knows which
>> events are actually enabled, or "mbm_control" only shows flags/events enabled
>> from user space perspective. In the former scenario, this needs more
>> user space support since a generic user space cannot be confident which
>> flags are set after writing to "mbm_control". In the latter scenario,
>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
>> to rely on it to know which events can be enabled and if some are
>> actually "silently enabled" when user space still thinks it needs to be
>> enabled the number of available counters becomes vague.
>>
>> It is not clear to me how to present hardware and software assignable
>> counters with a single consistent interface. Actually, what if the
>> "mbm_mode" is what distinguishes how counters are assigned instead of how
>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
>> and "mbm_cntr_group_assignable" is used? Could that replace a
>> potential "mbm_assign_events" while also supporting user space in
>> interactions with "mbm_control"?
> 
> If I understand this correctly, is this a preference that the info
> node be named differently if its value will have different units,
> rather than a second node to indicate what the value of num_mbm_cntrs
> actually means? This sounds reasonable to me.

Indeed. As you highlighted, user space may not need to know if
counters are backed by hardware or software, but user space needs to
know what to expect from (how to interact with) interface.

> I think it's also important to note that in MPAM, the MBWU (memory
> bandwidth usage) monitors don't have a concept of local versus total
> bandwidth, so event assignment would likely not apply there either.
> What the counted bandwidth actually represents is more implicit in the
> monitor's position in the memory system in the particular
> implementation. On a theoretical multi-socket system, resctrl would
> require knowledge about the system's architecture to stitch together
> the counts from different types of monitors to produce a local and
> total value. I don't know if we'd program this SoC-specific knowledge
> into the kernel to produce a unified MBM resource like we're
> accustomed to now or if we'd present multiple MBM resources, each only
> providing an mbm_total_bytes event. In this case, the counters would
> have to be assigned separately in each MBM resource, especially if the
> different MBM resources support a different number of counters.
> 

"total" and "local" bandwidth is already in grey area after the
introduction of mbm_total_bytes_config/mbm_local_bytes_config where
user space could set values reported to not be constrained by the
"total" and "local" terms. We keep sticking with it though, even in
this implementation that uses the "t" and "l" flags, knowing that
what is actually monitored when "l" is set is just what the user
configured via mbm_local_bytes_config, which theoretically
can be "total" bandwidth.

Reinette

ps. I will be offline next week.

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Moger, Babu 1 year, 6 months ago

Hi Peter/Reinette,

On 8/2/2024 3:55 PM, Reinette Chatre wrote:
> Hi Peter,
> 
> On 8/2/24 11:49 AM, Peter Newman wrote:
>> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
>> <reinette.chatre@intel.com> wrote:
>>> On 8/1/24 3:45 PM, Peter Newman wrote:
>>>> On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
>>>> <reinette.chatre@intel.com> wrote:
>>>>> On 7/17/24 10:19 AM, Moger, Babu wrote:
>>>>>> On 7/12/24 17:03, Reinette Chatre wrote:
>>>>>>> On 7/3/24 2:48 PM, Babu Moger wrote:
>>>
>>>>>>>> # Examples
>>>>>>>>
>>>>>>>> a. Check if ABMC support is available
>>>>>>>>        #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>>
>>>>>>>>        #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>>>>>>>        [abmc]
>>>>>>>>        legacy
>>>>>>>>
>>>>>>>>        Linux kernel detected ABMC feature and it is enabled.
>>>>>>>
>>>>>>> How about renaming "abmc" to "mbm_cntrs"? This will match the 
>>>>>>> num_mbm_cntrs
>>>>>>> info file and be the final step to make this generic so that another
>>>>>>> architecture
>>>>>>> can more easily support assignining hardware counters without 
>>>>>>> needing to call
>>>>>>> the feature AMD's "abmc".
>>>>>>
>>>>>> I think we aleady settled this with "mbm_cntr_assignable".
>>>>>>
>>>>>> For soft-RMID" it will be mbm_sw_assignable.
>>>>>
>>>>> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to 
>>>>> match
>>>>> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?
>>>>
>>>> My users are pushing for a consistent interface regardless of whether
>>>> counter assignment is implemented in hardware or software, so I would
>>>> like to avoid exposing implementation differences in the interface
>>>> where possible.
>>>
>>> This seems a reasonable ask but can we be confident that if hardware
>>> supports assignable counters then there will never be a reason to use
>>> software assignable counters? (This needs to also consider how/if Arm
>>> may use this feature.)
>>>
>>> I am of course not familiar with details of the software implementation
>>> - could there be benefits to using it even if hardware counters are
>>> supported?
>>
>> I can't see any situation where the user would want to choose software
>> over hardware counters. The number of groups which can be monitored by
>> software assignable counters will always be less than with hardware,
>> due to the need for consuming one RMID (and the counters automatically
>> allocated to it by the AMD hardware) for all unassigned groups.
> 
> Thank you for clarifying. This seems specific to this software 
> implementation,
> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I 
> remember
> correctly this depends on undocumented hardware specific knowledge.
> 
>> I consider software assignable a workaround to enable measuring
>> bandwidth reliably on a large number of groups on pre-ABMC AMD
>> hardware, or rather salvaging MBM on pre-ABMC hardware making use of
>> our users' effort to adapt to counter assignment in resctrl. We hope
>> no future implementations will choose to silently drop bandwidth
>> counts, so fingers crossed, the software implementation can be phased
>> out when these generations of AMD hardware are decommissioned.
> 
> That sounds ideal.
> 
>>
>> The MPAM specification natively supports (or requires) counter
>> assignment in hardware. From what I recall in the last of James'
>> prototypes I looked at, MBM was only supported if the implementation
>> provided as many bandwidth counters as there were possible monitoring
>> groups, so that it could assume a monitor IDs for every PARTID:PMG
>> combination.
> 
> Thank you for this insight.
> 
>>
>>>
>>> What I would like to avoid is future complexity of needing a new 
>>> mount/config
>>> option that user space needs to use to select if a single 
>>> "mbm_cntr_assignable"
>>> is backed by hardware or software.
>>
>> In my testing so far, automatically enabling counter assignment and
>> automatically allocating counters for all events in new groups works
>> well enough.
>>
>> The only configuration I need is the ability to disable the automatic
>> counter allocation so that a userspace agent can have control of where
>> all the counters are assigned at all times. It's easy to implement
>> this as a simple flag if the user accepts that they need to manually
>> deallocate any automatically-allocated counters from groups created
>> before the flag was cleared.
>>
>>>
>>>> The main semantic difference with SW assignments is that it is not
>>>> possible to assign counters to individual events. Because the
>>>> implementation is assigning RMIDs to groups, assignment results in all
>>>> events being counted.
>>>>
>>>> I was considering introducing a boolean mbm_assign_events node to
>>>> indicate whether assigning individual events is supported. If true,
>>>> num_mbm_cntrs indicates the number of events which can be counted,
>>>> otherwise it indicates the number of groups to which counters can be
>>>> assigned and attempting to assign a single event is silently upgraded
>>>> to assigning counters to all events in the group.
>>>
>>> How were you envisioning your users using the control file 
>>> ("mbm_control")
>>> in these scenarios? Does this file's interface even work for SW 
>>> assignment
>>> scenarios?
>>>
>>> Users should expect consistent interface for "mbm_control" also.
>>>
>>> It sounds to me that a potential "mbm_assign_events" will be false 
>>> for SW
>>> assignments. That would mean that "num_mbm_cntrs" will
>>> contain the number of groups to which counters can be assigned?
>>> Would user space be required to always enable all flags (enable all 
>>> events) of
>>> all domains to the same values ... or would enabling of one flag (one 
>>> event)
>>> in one domain automatically result in all flags (all events) enabled 
>>> for all
>>> domains ... or would enabling of one flag (one event) in one domain 
>>> only appear
>>> to user space to be enabled while in reality all flags/events are 
>>> actually enabled?
>>
>> I believe mbm_control should always accurately reflect which events
>> are being counted.
> 
> I agree.
> 
>>
>> The behavior as I've implemented today is:
>>
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
>> 0
>>
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>> test//0=_;1=_;
>> //0=_;1=_;
>>
>> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>> test//0=_;1=tl;
>> //0=_;1=_;
>>
>> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>> test//0=_;1=_;
>> //0=_;1=_;
>>
>>
> 
> This highlights how there cannot be a generic/consistent interface 
> between hardware
> and software implementation. If resctrl implements something like above 
> without any
> other hints to user space then it will push complexity to user space 
> since user space
> would not know if setting one flag results in setting more than that 
> flag, which may
> force a user space implementation to always follow a write with a read that
> needs to confirm what actually resulted from the write. Similarly, that 
> removing a
> flag impacts other flags needs to be clear without user space needing to 
> "try and
> see what happens".
> 
> It is not clear to me how to interpret the above example when it comes 
> to the
> RMID management though. If the RMID assignment is per group then I 
> expected all
> the domains of a group to have the same flag(s)?
> 
>>>
>>>> However, If we don't expect to see these semantics in any other
>>>> implementation, these semantics could be implicit in the definition of
>>>> a SW assignable counter.
>>>
>>> It is not clear to me how implementation differences between hardware
>>> and software assignment can be hidden from user space. It is possible
>>> to let user space enable individual events and then silently upgrade it
>>> to all events. I see two options here, either "mbm_control" needs to
>>> explicitly show this "silent upgrade" so that user space knows which
>>> events are actually enabled, or "mbm_control" only shows flags/events 
>>> enabled
>>> from user space perspective. In the former scenario, this needs more
>>> user space support since a generic user space cannot be confident which
>>> flags are set after writing to "mbm_control". In the latter scenario,
>>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
>>> to rely on it to know which events can be enabled and if some are
>>> actually "silently enabled" when user space still thinks it needs to be
>>> enabled the number of available counters becomes vague.
>>>
>>> It is not clear to me how to present hardware and software assignable
>>> counters with a single consistent interface. Actually, what if the
>>> "mbm_mode" is what distinguishes how counters are assigned instead of 
>>> how
>>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
>>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
>>> and "mbm_cntr_group_assignable" is used? Could that replace a
>>> potential "mbm_assign_events" while also supporting user space in
>>> interactions with "mbm_control"?
>>
>> If I understand this correctly, is this a preference that the info
>> node be named differently if its value will have different units,
>> rather than a second node to indicate what the value of num_mbm_cntrs
>> actually means? This sounds reasonable to me.
> 
> Indeed. As you highlighted, user space may not need to know if
> counters are backed by hardware or software, but user space needs to
> know what to expect from (how to interact with) interface.
> 
>> I think it's also important to note that in MPAM, the MBWU (memory
>> bandwidth usage) monitors don't have a concept of local versus total
>> bandwidth, so event assignment would likely not apply there either.
>> What the counted bandwidth actually represents is more implicit in the
>> monitor's position in the memory system in the particular
>> implementation. On a theoretical multi-socket system, resctrl would
>> require knowledge about the system's architecture to stitch together
>> the counts from different types of monitors to produce a local and
>> total value. I don't know if we'd program this SoC-specific knowledge
>> into the kernel to produce a unified MBM resource like we're
>> accustomed to now or if we'd present multiple MBM resources, each only
>> providing an mbm_total_bytes event. In this case, the counters would
>> have to be assigned separately in each MBM resource, especially if the
>> different MBM resources support a different number of counters.
>>
> 
> "total" and "local" bandwidth is already in grey area after the
> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
> user space could set values reported to not be constrained by the
> "total" and "local" terms. We keep sticking with it though, even in
> this implementation that uses the "t" and "l" flags, knowing that
> what is actually monitored when "l" is set is just what the user
> configured via mbm_local_bytes_config, which theoretically
> can be "total" bandwidth.
> 
> Reinette
> 
> ps. I will be offline next week.

Thanks for heads up.

Looks like we still need to figure out few things about the interface.

However, I need resolve few issues with v5. I can go ahead and post v6 
next week. We can continue our discussion. That way we are making some 
forward progress in the series. Let me know  what do you think.

Thanks
- Babu Moger

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Peter Newman 1 year, 6 months ago

Hi Reinette,

On Fri, Aug 2, 2024 at 1:55 PM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 8/2/24 11:49 AM, Peter Newman wrote:
> > On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
> >> I am of course not familiar with details of the software implementation
> >> - could there be benefits to using it even if hardware counters are
> >> supported?
> >
> > I can't see any situation where the user would want to choose software
> > over hardware counters. The number of groups which can be monitored by
> > software assignable counters will always be less than with hardware,
> > due to the need for consuming one RMID (and the counters automatically
> > allocated to it by the AMD hardware) for all unassigned groups.
>
> Thank you for clarifying. This seems specific to this software implementation,
> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I remember
> correctly this depends on undocumented hardware specific knowledge.

For the benefit of anyone else who needs to monitor bandwidth on a
large number of monitoring groups on pre-ABMC AMD implementations,
hopefully a future AMD publication will clarify, at least on some
existing, pre-ABMC models, exactly when the QM_CTR.U bit is set.


> >
> > The behavior as I've implemented today is:
> >
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
> > 0
> >
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> > test//0=_;1=_;
> > //0=_;1=_;
> >
> > # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> > test//0=_;1=tl;
> > //0=_;1=_;
> >
> > # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> > # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> > test//0=_;1=_;
> > //0=_;1=_;
> >
> >
>
> This highlights how there cannot be a generic/consistent interface between hardware
> and software implementation. If resctrl implements something like above without any
> other hints to user space then it will push complexity to user space since user space
> would not know if setting one flag results in setting more than that flag, which may
> force a user space implementation to always follow a write with a read that
> needs to confirm what actually resulted from the write. Similarly, that removing a
> flag impacts other flags needs to be clear without user space needing to "try and
> see what happens".

I'll return to this topic in the context of MPAM below...

> It is not clear to me how to interpret the above example when it comes to the
> RMID management though. If the RMID assignment is per group then I expected all
> the domains of a group to have the same flag(s)?

The group RMIDs are never programmed into any MSRs and the RMID space
is independent in each domain, so it is still possible to do
per-domain assignment. (and like with soft RMIDs, this enables us to
create unlimited groups, but we've never been limited by the size of
the RMID space)

However, in our use cases, jobs are not confined to any domain, so
bandwidth measurements must be done simultaneously in all domains, so
we have no current use for per-domain assignment. But if any Google
users did begin to see value in confining jobs to domains, this could
change.

>
> >>
> >>> However, If we don't expect to see these semantics in any other
> >>> implementation, these semantics could be implicit in the definition of
> >>> a SW assignable counter.
> >>
> >> It is not clear to me how implementation differences between hardware
> >> and software assignment can be hidden from user space. It is possible
> >> to let user space enable individual events and then silently upgrade it
> >> to all events. I see two options here, either "mbm_control" needs to
> >> explicitly show this "silent upgrade" so that user space knows which
> >> events are actually enabled, or "mbm_control" only shows flags/events enabled
> >> from user space perspective. In the former scenario, this needs more
> >> user space support since a generic user space cannot be confident which
> >> flags are set after writing to "mbm_control". In the latter scenario,
> >> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
> >> to rely on it to know which events can be enabled and if some are
> >> actually "silently enabled" when user space still thinks it needs to be
> >> enabled the number of available counters becomes vague.
> >>
> >> It is not clear to me how to present hardware and software assignable
> >> counters with a single consistent interface. Actually, what if the
> >> "mbm_mode" is what distinguishes how counters are assigned instead of how
> >> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
> >> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
> >> and "mbm_cntr_group_assignable" is used? Could that replace a
> >> potential "mbm_assign_events" while also supporting user space in
> >> interactions with "mbm_control"?
> >
> > If I understand this correctly, is this a preference that the info
> > node be named differently if its value will have different units,
> > rather than a second node to indicate what the value of num_mbm_cntrs
> > actually means? This sounds reasonable to me.
>
> Indeed. As you highlighted, user space may not need to know if
> counters are backed by hardware or software, but user space needs to
> know what to expect from (how to interact with) interface.
>
> > I think it's also important to note that in MPAM, the MBWU (memory
> > bandwidth usage) monitors don't have a concept of local versus total
> > bandwidth, so event assignment would likely not apply there either.
> > What the counted bandwidth actually represents is more implicit in the
> > monitor's position in the memory system in the particular
> > implementation. On a theoretical multi-socket system, resctrl would
> > require knowledge about the system's architecture to stitch together
> > the counts from different types of monitors to produce a local and
> > total value. I don't know if we'd program this SoC-specific knowledge
> > into the kernel to produce a unified MBM resource like we're
> > accustomed to now or if we'd present multiple MBM resources, each only
> > providing an mbm_total_bytes event. In this case, the counters would
> > have to be assigned separately in each MBM resource, especially if the
> > different MBM resources support a different number of counters.
> >
>
> "total" and "local" bandwidth is already in grey area after the
> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
> user space could set values reported to not be constrained by the
> "total" and "local" terms. We keep sticking with it though, even in
> this implementation that uses the "t" and "l" flags, knowing that
> what is actually monitored when "l" is set is just what the user
> configured via mbm_local_bytes_config, which theoretically
> can be "total" bandwidth.

If it makes sense to support a separate, group-assignment interface at
least for MPAM, this would be a better fit for soft-ABMC, even if it
does have to stay downstream.

Thanks,
-Peter

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Reinette Chatre 1 year, 5 months ago

Hi Peter,

On 8/2/24 3:50 PM, Peter Newman wrote:
> On Fri, Aug 2, 2024 at 1:55 PM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>> On 8/2/24 11:49 AM, Peter Newman wrote:
>>> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
>>>> I am of course not familiar with details of the software implementation
>>>> - could there be benefits to using it even if hardware counters are
>>>> supported?
>>>
>>> I can't see any situation where the user would want to choose software
>>> over hardware counters. The number of groups which can be monitored by
>>> software assignable counters will always be less than with hardware,
>>> due to the need for consuming one RMID (and the counters automatically
>>> allocated to it by the AMD hardware) for all unassigned groups.
>>
>> Thank you for clarifying. This seems specific to this software implementation,
>> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I remember
>> correctly this depends on undocumented hardware specific knowledge.
> 
> For the benefit of anyone else who needs to monitor bandwidth on a
> large number of monitoring groups on pre-ABMC AMD implementations,
> hopefully a future AMD publication will clarify, at least on some
> existing, pre-ABMC models, exactly when the QM_CTR.U bit is set.
> 
> 
>>>
>>> The behavior as I've implemented today is:
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
>>> 0
>>>
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>> test//0=_;1=_;
>>> //0=_;1=_;
>>>
>>> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>> test//0=_;1=tl;
>>> //0=_;1=_;
>>>
>>> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>> test//0=_;1=_;
>>> //0=_;1=_;
>>>
>>>
>>
>> This highlights how there cannot be a generic/consistent interface between hardware
>> and software implementation. If resctrl implements something like above without any
>> other hints to user space then it will push complexity to user space since user space
>> would not know if setting one flag results in setting more than that flag, which may
>> force a user space implementation to always follow a write with a read that
>> needs to confirm what actually resulted from the write. Similarly, that removing a
>> flag impacts other flags needs to be clear without user space needing to "try and
>> see what happens".
> 
> I'll return to this topic in the context of MPAM below...
> 
>> It is not clear to me how to interpret the above example when it comes to the
>> RMID management though. If the RMID assignment is per group then I expected all
>> the domains of a group to have the same flag(s)?
> 
> The group RMIDs are never programmed into any MSRs and the RMID space
> is independent in each domain, so it is still possible to do
> per-domain assignment. (and like with soft RMIDs, this enables us to
> create unlimited groups, but we've never been limited by the size of
> the RMID space)
> 
> However, in our use cases, jobs are not confined to any domain, so
> bandwidth measurements must be done simultaneously in all domains, so
> we have no current use for per-domain assignment. But if any Google
> users did begin to see value in confining jobs to domains, this could
> change.
> 
>>
>>>>
>>>>> However, If we don't expect to see these semantics in any other
>>>>> implementation, these semantics could be implicit in the definition of
>>>>> a SW assignable counter.
>>>>
>>>> It is not clear to me how implementation differences between hardware
>>>> and software assignment can be hidden from user space. It is possible
>>>> to let user space enable individual events and then silently upgrade it
>>>> to all events. I see two options here, either "mbm_control" needs to
>>>> explicitly show this "silent upgrade" so that user space knows which
>>>> events are actually enabled, or "mbm_control" only shows flags/events enabled
>>>> from user space perspective. In the former scenario, this needs more
>>>> user space support since a generic user space cannot be confident which
>>>> flags are set after writing to "mbm_control". In the latter scenario,
>>>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
>>>> to rely on it to know which events can be enabled and if some are
>>>> actually "silently enabled" when user space still thinks it needs to be
>>>> enabled the number of available counters becomes vague.
>>>>
>>>> It is not clear to me how to present hardware and software assignable
>>>> counters with a single consistent interface. Actually, what if the
>>>> "mbm_mode" is what distinguishes how counters are assigned instead of how
>>>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
>>>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
>>>> and "mbm_cntr_group_assignable" is used? Could that replace a
>>>> potential "mbm_assign_events" while also supporting user space in
>>>> interactions with "mbm_control"?
>>>
>>> If I understand this correctly, is this a preference that the info
>>> node be named differently if its value will have different units,
>>> rather than a second node to indicate what the value of num_mbm_cntrs
>>> actually means? This sounds reasonable to me.
>>
>> Indeed. As you highlighted, user space may not need to know if
>> counters are backed by hardware or software, but user space needs to
>> know what to expect from (how to interact with) interface.
>>
>>> I think it's also important to note that in MPAM, the MBWU (memory
>>> bandwidth usage) monitors don't have a concept of local versus total
>>> bandwidth, so event assignment would likely not apply there either.
>>> What the counted bandwidth actually represents is more implicit in the
>>> monitor's position in the memory system in the particular
>>> implementation. On a theoretical multi-socket system, resctrl would
>>> require knowledge about the system's architecture to stitch together
>>> the counts from different types of monitors to produce a local and
>>> total value. I don't know if we'd program this SoC-specific knowledge
>>> into the kernel to produce a unified MBM resource like we're
>>> accustomed to now or if we'd present multiple MBM resources, each only
>>> providing an mbm_total_bytes event. In this case, the counters would
>>> have to be assigned separately in each MBM resource, especially if the
>>> different MBM resources support a different number of counters.
>>>
>>
>> "total" and "local" bandwidth is already in grey area after the
>> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
>> user space could set values reported to not be constrained by the
>> "total" and "local" terms. We keep sticking with it though, even in
>> this implementation that uses the "t" and "l" flags, knowing that
>> what is actually monitored when "l" is set is just what the user
>> configured via mbm_local_bytes_config, which theoretically
>> can be "total" bandwidth.
> 
> If it makes sense to support a separate, group-assignment interface at
> least for MPAM, this would be a better fit for soft-ABMC, even if it
> does have to stay downstream.

(apologies for the delay)

Could we please take a step back and confirm/agree what is meant with "group-
assignment"? In a previous message [1] I latched onto the statement
"the implementation is assigning RMIDs to groups, assignment results in all
events being counted.". In this I understood "groups" to be resctrl groups
and I understood this to mean that when a (soft-ABMC) counter is assigned
it applies to the entire resctrl group (all domains, all events). The
subsequent example in [2] was thus unexpected to me when the interface
was used to assign a (soft-ABMC) counter to the group but not all domains
were impacted.

Considering this, could you please elaborate what is meant with
"group assignment"?

Thank you

Reinette

[1] https://lore.kernel.org/lkml/CALPaoCi_TBZnULHQpYns+H+30jODZvyQpUHJRDHNwjQzajrD=A@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CALPaoCi1CwLy_HbFNOxPfdReEJstd3c+DvOMJHb5P9jBP+iatw@mail.gmail.com/

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Peter Newman 1 year, 5 months ago

Hi Reinette,

On Wed, Aug 14, 2024 at 10:37 AM Reinette Chatre
<reinette.chatre@intel.com> wrote:
>
> Hi Peter,
>
> On 8/2/24 3:50 PM, Peter Newman wrote:
> > On Fri, Aug 2, 2024 at 1:55 PM Reinette Chatre
> > <reinette.chatre@intel.com> wrote:
> >> On 8/2/24 11:49 AM, Peter Newman wrote:
> >>> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
> >>>> I am of course not familiar with details of the software implementation
> >>>> - could there be benefits to using it even if hardware counters are
> >>>> supported?
> >>>
> >>> I can't see any situation where the user would want to choose software
> >>> over hardware counters. The number of groups which can be monitored by
> >>> software assignable counters will always be less than with hardware,
> >>> due to the need for consuming one RMID (and the counters automatically
> >>> allocated to it by the AMD hardware) for all unassigned groups.
> >>
> >> Thank you for clarifying. This seems specific to this software implementation,
> >> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I remember
> >> correctly this depends on undocumented hardware specific knowledge.
> >
> > For the benefit of anyone else who needs to monitor bandwidth on a
> > large number of monitoring groups on pre-ABMC AMD implementations,
> > hopefully a future AMD publication will clarify, at least on some
> > existing, pre-ABMC models, exactly when the QM_CTR.U bit is set.
> >
> >
> >>>
> >>> The behavior as I've implemented today is:
> >>>
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
> >>> 0
> >>>
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=_;
> >>> //0=_;1=_;
> >>>
> >>> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=tl;
> >>> //0=_;1=_;
> >>>
> >>> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> >>> test//0=_;1=_;
> >>> //0=_;1=_;
> >>>
> >>>
> >>
> >> This highlights how there cannot be a generic/consistent interface between hardware
> >> and software implementation. If resctrl implements something like above without any
> >> other hints to user space then it will push complexity to user space since user space
> >> would not know if setting one flag results in setting more than that flag, which may
> >> force a user space implementation to always follow a write with a read that
> >> needs to confirm what actually resulted from the write. Similarly, that removing a
> >> flag impacts other flags needs to be clear without user space needing to "try and
> >> see what happens".
> >
> > I'll return to this topic in the context of MPAM below...
> >
> >> It is not clear to me how to interpret the above example when it comes to the
> >> RMID management though. If the RMID assignment is per group then I expected all
> >> the domains of a group to have the same flag(s)?
> >
> > The group RMIDs are never programmed into any MSRs and the RMID space
> > is independent in each domain, so it is still possible to do
> > per-domain assignment. (and like with soft RMIDs, this enables us to
> > create unlimited groups, but we've never been limited by the size of
> > the RMID space)
> >
> > However, in our use cases, jobs are not confined to any domain, so
> > bandwidth measurements must be done simultaneously in all domains, so
> > we have no current use for per-domain assignment. But if any Google
> > users did begin to see value in confining jobs to domains, this could
> > change.
> >
> >>
> >>>>
> >>>>> However, If we don't expect to see these semantics in any other
> >>>>> implementation, these semantics could be implicit in the definition of
> >>>>> a SW assignable counter.
> >>>>
> >>>> It is not clear to me how implementation differences between hardware
> >>>> and software assignment can be hidden from user space. It is possible
> >>>> to let user space enable individual events and then silently upgrade it
> >>>> to all events. I see two options here, either "mbm_control" needs to
> >>>> explicitly show this "silent upgrade" so that user space knows which
> >>>> events are actually enabled, or "mbm_control" only shows flags/events enabled
> >>>> from user space perspective. In the former scenario, this needs more
> >>>> user space support since a generic user space cannot be confident which
> >>>> flags are set after writing to "mbm_control". In the latter scenario,
> >>>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
> >>>> to rely on it to know which events can be enabled and if some are
> >>>> actually "silently enabled" when user space still thinks it needs to be
> >>>> enabled the number of available counters becomes vague.
> >>>>
> >>>> It is not clear to me how to present hardware and software assignable
> >>>> counters with a single consistent interface. Actually, what if the
> >>>> "mbm_mode" is what distinguishes how counters are assigned instead of how
> >>>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
> >>>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
> >>>> and "mbm_cntr_group_assignable" is used? Could that replace a
> >>>> potential "mbm_assign_events" while also supporting user space in
> >>>> interactions with "mbm_control"?
> >>>
> >>> If I understand this correctly, is this a preference that the info
> >>> node be named differently if its value will have different units,
> >>> rather than a second node to indicate what the value of num_mbm_cntrs
> >>> actually means? This sounds reasonable to me.
> >>
> >> Indeed. As you highlighted, user space may not need to know if
> >> counters are backed by hardware or software, but user space needs to
> >> know what to expect from (how to interact with) interface.
> >>
> >>> I think it's also important to note that in MPAM, the MBWU (memory
> >>> bandwidth usage) monitors don't have a concept of local versus total
> >>> bandwidth, so event assignment would likely not apply there either.
> >>> What the counted bandwidth actually represents is more implicit in the
> >>> monitor's position in the memory system in the particular
> >>> implementation. On a theoretical multi-socket system, resctrl would
> >>> require knowledge about the system's architecture to stitch together
> >>> the counts from different types of monitors to produce a local and
> >>> total value. I don't know if we'd program this SoC-specific knowledge
> >>> into the kernel to produce a unified MBM resource like we're
> >>> accustomed to now or if we'd present multiple MBM resources, each only
> >>> providing an mbm_total_bytes event. In this case, the counters would
> >>> have to be assigned separately in each MBM resource, especially if the
> >>> different MBM resources support a different number of counters.
> >>>
> >>
> >> "total" and "local" bandwidth is already in grey area after the
> >> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
> >> user space could set values reported to not be constrained by the
> >> "total" and "local" terms. We keep sticking with it though, even in
> >> this implementation that uses the "t" and "l" flags, knowing that
> >> what is actually monitored when "l" is set is just what the user
> >> configured via mbm_local_bytes_config, which theoretically
> >> can be "total" bandwidth.
> >
> > If it makes sense to support a separate, group-assignment interface at
> > least for MPAM, this would be a better fit for soft-ABMC, even if it
> > does have to stay downstream.
>
> (apologies for the delay)
>
> Could we please take a step back and confirm/agree what is meant with "group-
> assignment"? In a previous message [1] I latched onto the statement
> "the implementation is assigning RMIDs to groups, assignment results in all
> events being counted.". In this I understood "groups" to be resctrl groups
> and I understood this to mean that when a (soft-ABMC) counter is assigned
> it applies to the entire resctrl group (all domains, all events). The
> subsequent example in [2] was thus unexpected to me when the interface
> was used to assign a (soft-ABMC) counter to the group but not all domains
> were impacted.
>
> Considering this, could you please elaborate what is meant with
> "group assignment"?

By "group assignment", I just mean assigning counters to individual
MBM events is not possible, or that assignment results in counters
being assigned to all MBM events for a group in a domain.

I only omitted per-domain assignment in soft-ABMC before because
Google doesn't have a use-case for it. I started the prototype before
Babu's proposed interface required domain-scoped assignments[1]. Now
that some sort of domain selector is required, I'm reconsidering.

-Peter

[1] https://lore.kernel.org/lkml/cover.1705688538.git.babu.moger@amd.com/

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Reinette Chatre 1 year, 5 months ago

Hi Peter,

On 8/15/24 4:06 PM, Peter Newman wrote:
> On Wed, Aug 14, 2024 at 10:37 AM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Peter,
>>
>> On 8/2/24 3:50 PM, Peter Newman wrote:
>>> On Fri, Aug 2, 2024 at 1:55 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>> On 8/2/24 11:49 AM, Peter Newman wrote:
>>>>> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
>>>>>> I am of course not familiar with details of the software implementation
>>>>>> - could there be benefits to using it even if hardware counters are
>>>>>> supported?
>>>>>
>>>>> I can't see any situation where the user would want to choose software
>>>>> over hardware counters. The number of groups which can be monitored by
>>>>> software assignable counters will always be less than with hardware,
>>>>> due to the need for consuming one RMID (and the counters automatically
>>>>> allocated to it by the AMD hardware) for all unassigned groups.
>>>>
>>>> Thank you for clarifying. This seems specific to this software implementation,
>>>> and I missed that there was a shift from soft-RMIDs to soft-ABMC. If I remember
>>>> correctly this depends on undocumented hardware specific knowledge.
>>>
>>> For the benefit of anyone else who needs to monitor bandwidth on a
>>> large number of monitoring groups on pre-ABMC AMD implementations,
>>> hopefully a future AMD publication will clarify, at least on some
>>> existing, pre-ABMC models, exactly when the QM_CTR.U bit is set.
>>>
>>>
>>>>>
>>>>> The behavior as I've implemented today is:
>>>>>
>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
>>>>> 0
>>>>>
>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>>>> test//0=_;1=_;
>>>>> //0=_;1=_;
>>>>>
>>>>> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>>>> test//0=_;1=tl;
>>>>> //0=_;1=_;
>>>>>
>>>>> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
>>>>> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
>>>>> test//0=_;1=_;
>>>>> //0=_;1=_;
>>>>>
>>>>>
>>>>
>>>> This highlights how there cannot be a generic/consistent interface between hardware
>>>> and software implementation. If resctrl implements something like above without any
>>>> other hints to user space then it will push complexity to user space since user space
>>>> would not know if setting one flag results in setting more than that flag, which may
>>>> force a user space implementation to always follow a write with a read that
>>>> needs to confirm what actually resulted from the write. Similarly, that removing a
>>>> flag impacts other flags needs to be clear without user space needing to "try and
>>>> see what happens".
>>>
>>> I'll return to this topic in the context of MPAM below...
>>>
>>>> It is not clear to me how to interpret the above example when it comes to the
>>>> RMID management though. If the RMID assignment is per group then I expected all
>>>> the domains of a group to have the same flag(s)?
>>>
>>> The group RMIDs are never programmed into any MSRs and the RMID space
>>> is independent in each domain, so it is still possible to do
>>> per-domain assignment. (and like with soft RMIDs, this enables us to
>>> create unlimited groups, but we've never been limited by the size of
>>> the RMID space)
>>>
>>> However, in our use cases, jobs are not confined to any domain, so
>>> bandwidth measurements must be done simultaneously in all domains, so
>>> we have no current use for per-domain assignment. But if any Google
>>> users did begin to see value in confining jobs to domains, this could
>>> change.
>>>
>>>>
>>>>>>
>>>>>>> However, If we don't expect to see these semantics in any other
>>>>>>> implementation, these semantics could be implicit in the definition of
>>>>>>> a SW assignable counter.
>>>>>>
>>>>>> It is not clear to me how implementation differences between hardware
>>>>>> and software assignment can be hidden from user space. It is possible
>>>>>> to let user space enable individual events and then silently upgrade it
>>>>>> to all events. I see two options here, either "mbm_control" needs to
>>>>>> explicitly show this "silent upgrade" so that user space knows which
>>>>>> events are actually enabled, or "mbm_control" only shows flags/events enabled
>>>>>> from user space perspective. In the former scenario, this needs more
>>>>>> user space support since a generic user space cannot be confident which
>>>>>> flags are set after writing to "mbm_control". In the latter scenario,
>>>>>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
>>>>>> to rely on it to know which events can be enabled and if some are
>>>>>> actually "silently enabled" when user space still thinks it needs to be
>>>>>> enabled the number of available counters becomes vague.
>>>>>>
>>>>>> It is not clear to me how to present hardware and software assignable
>>>>>> counters with a single consistent interface. Actually, what if the
>>>>>> "mbm_mode" is what distinguishes how counters are assigned instead of how
>>>>>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
>>>>>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
>>>>>> and "mbm_cntr_group_assignable" is used? Could that replace a
>>>>>> potential "mbm_assign_events" while also supporting user space in
>>>>>> interactions with "mbm_control"?
>>>>>
>>>>> If I understand this correctly, is this a preference that the info
>>>>> node be named differently if its value will have different units,
>>>>> rather than a second node to indicate what the value of num_mbm_cntrs
>>>>> actually means? This sounds reasonable to me.
>>>>
>>>> Indeed. As you highlighted, user space may not need to know if
>>>> counters are backed by hardware or software, but user space needs to
>>>> know what to expect from (how to interact with) interface.
>>>>
>>>>> I think it's also important to note that in MPAM, the MBWU (memory
>>>>> bandwidth usage) monitors don't have a concept of local versus total
>>>>> bandwidth, so event assignment would likely not apply there either.
>>>>> What the counted bandwidth actually represents is more implicit in the
>>>>> monitor's position in the memory system in the particular
>>>>> implementation. On a theoretical multi-socket system, resctrl would
>>>>> require knowledge about the system's architecture to stitch together
>>>>> the counts from different types of monitors to produce a local and
>>>>> total value. I don't know if we'd program this SoC-specific knowledge
>>>>> into the kernel to produce a unified MBM resource like we're
>>>>> accustomed to now or if we'd present multiple MBM resources, each only
>>>>> providing an mbm_total_bytes event. In this case, the counters would
>>>>> have to be assigned separately in each MBM resource, especially if the
>>>>> different MBM resources support a different number of counters.
>>>>>
>>>>
>>>> "total" and "local" bandwidth is already in grey area after the
>>>> introduction of mbm_total_bytes_config/mbm_local_bytes_config where
>>>> user space could set values reported to not be constrained by the
>>>> "total" and "local" terms. We keep sticking with it though, even in
>>>> this implementation that uses the "t" and "l" flags, knowing that
>>>> what is actually monitored when "l" is set is just what the user
>>>> configured via mbm_local_bytes_config, which theoretically
>>>> can be "total" bandwidth.
>>>
>>> If it makes sense to support a separate, group-assignment interface at
>>> least for MPAM, this would be a better fit for soft-ABMC, even if it
>>> does have to stay downstream.
>>
>> (apologies for the delay)
>>
>> Could we please take a step back and confirm/agree what is meant with "group-
>> assignment"? In a previous message [1] I latched onto the statement
>> "the implementation is assigning RMIDs to groups, assignment results in all
>> events being counted.". In this I understood "groups" to be resctrl groups
>> and I understood this to mean that when a (soft-ABMC) counter is assigned
>> it applies to the entire resctrl group (all domains, all events). The
>> subsequent example in [2] was thus unexpected to me when the interface
>> was used to assign a (soft-ABMC) counter to the group but not all domains
>> were impacted.
>>
>> Considering this, could you please elaborate what is meant with
>> "group assignment"?
> 
> By "group assignment", I just mean assigning counters to individual
> MBM events is not possible, or that assignment results in counters
> being assigned to all MBM events for a group in a domain.

Thank you for clarifying. I still think it is possible to use an entry
in "mbm_mode" to indicate to user space what to expect from the mbm_control
interface but I withdraw my original naming suggestions since it would create
confusion about what is meant by "group".

> 
> I only omitted per-domain assignment in soft-ABMC before because
> Google doesn't have a use-case for it. I started the prototype before
> Babu's proposed interface required domain-scoped assignments[1]. Now
> that some sort of domain selector is required, I'm reconsidering.

Could you please elaborate what you mean with the required "domain selector"?
The latest ABMC version (v6) added support for assigning all domains using '*'.

Reinette

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Moger, Babu 1 year, 6 months ago

Hi Peter/Reinette,

On 8/2/2024 1:49 PM, Peter Newman wrote:
> Hi Reinette,
> 
> On Fri, Aug 2, 2024 at 9:14 AM Reinette Chatre
> <reinette.chatre@intel.com> wrote:
>>
>> Hi Peter,
>>
>> On 8/1/24 3:45 PM, Peter Newman wrote:
>>> On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
>>> <reinette.chatre@intel.com> wrote:
>>>> On 7/17/24 10:19 AM, Moger, Babu wrote:
>>>>> On 7/12/24 17:03, Reinette Chatre wrote:
>>>>>> On 7/3/24 2:48 PM, Babu Moger wrote:
>>
>>>>>>> # Examples
>>>>>>>
>>>>>>> a. Check if ABMC support is available
>>>>>>>        #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>>
>>>>>>>        #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>>>>>>        [abmc]
>>>>>>>        legacy
>>>>>>>
>>>>>>>        Linux kernel detected ABMC feature and it is enabled.
>>>>>>
>>>>>> How about renaming "abmc" to "mbm_cntrs"? This will match the num_mbm_cntrs
>>>>>> info file and be the final step to make this generic so that another
>>>>>> architecture
>>>>>> can more easily support assignining hardware counters without needing to call
>>>>>> the feature AMD's "abmc".
>>>>>
>>>>> I think we aleady settled this with "mbm_cntr_assignable".
>>>>>
>>>>> For soft-RMID" it will be mbm_sw_assignable.
>>>>
>>>> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to match
>>>> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?
>>>
>>> My users are pushing for a consistent interface regardless of whether
>>> counter assignment is implemented in hardware or software, so I would
>>> like to avoid exposing implementation differences in the interface
>>> where possible.
>>
>> This seems a reasonable ask but can we be confident that if hardware
>> supports assignable counters then there will never be a reason to use
>> software assignable counters? (This needs to also consider how/if Arm
>> may use this feature.)
>>
>> I am of course not familiar with details of the software implementation
>> - could there be benefits to using it even if hardware counters are
>> supported?
> 
> I can't see any situation where the user would want to choose software
> over hardware counters. The number of groups which can be monitored by
> software assignable counters will always be less than with hardware,
> due to the need for consuming one RMID (and the counters automatically
> allocated to it by the AMD hardware) for all unassigned groups.
> 
> I consider software assignable a workaround to enable measuring
> bandwidth reliably on a large number of groups on pre-ABMC AMD
> hardware, or rather salvaging MBM on pre-ABMC hardware making use of
> our users' effort to adapt to counter assignment in resctrl. We hope
> no future implementations will choose to silently drop bandwidth
> counts, so fingers crossed, the software implementation can be phased
> out when these generations of AMD hardware are decommissioned.
> 
> The MPAM specification natively supports (or requires) counter
> assignment in hardware. From what I recall in the last of James'
> prototypes I looked at, MBM was only supported if the implementation
> provided as many bandwidth counters as there were possible monitoring
> groups, so that it could assume a monitor IDs for every PARTID:PMG
> combination.
> 
>>
>> What I would like to avoid is future complexity of needing a new mount/config
>> option that user space needs to use to select if a single "mbm_cntr_assignable"
>> is backed by hardware or software.
> 
> In my testing so far, automatically enabling counter assignment and
> automatically allocating counters for all events in new groups works
> well enough.
> 
> The only configuration I need is the ability to disable the automatic
> counter allocation so that a userspace agent can have control of where
> all the counters are assigned at all times. It's easy to implement
> this as a simple flag if the user accepts that they need to manually
> deallocate any automatically-allocated counters from groups created
> before the flag was cleared.
> 
>>
>>> The main semantic difference with SW assignments is that it is not
>>> possible to assign counters to individual events. Because the
>>> implementation is assigning RMIDs to groups, assignment results in all
>>> events being counted.
>>>
>>> I was considering introducing a boolean mbm_assign_events node to
>>> indicate whether assigning individual events is supported. If true,
>>> num_mbm_cntrs indicates the number of events which can be counted,
>>> otherwise it indicates the number of groups to which counters can be
>>> assigned and attempting to assign a single event is silently upgraded
>>> to assigning counters to all events in the group.
>>
>> How were you envisioning your users using the control file ("mbm_control")
>> in these scenarios? Does this file's interface even work for SW assignment
>> scenarios?
>>
>> Users should expect consistent interface for "mbm_control" also.
>>
>> It sounds to me that a potential "mbm_assign_events" will be false for SW
>> assignments. That would mean that "num_mbm_cntrs" will
>> contain the number of groups to which counters can be assigned?
>> Would user space be required to always enable all flags (enable all events) of
>> all domains to the same values ... or would enabling of one flag (one event)
>> in one domain automatically result in all flags (all events) enabled for all
>> domains ... or would enabling of one flag (one event) in one domain only appear
>> to user space to be enabled while in reality all flags/events are actually enabled?
> 
> I believe mbm_control should always accurately reflect which events
> are being counted.
> 
> The behavior as I've implemented today is:
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_events
> 0
> 
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=_;
> //0=_;1=_;
> 
> # echo "test//1+l" > /sys/fs/resctrl/info/L3_MON/mbm_control
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=tl;
> //0=_;1=_;
> 
> # echo "test//1-t" > /sys/fs/resctrl/info/L3_MON/mbm_control
> # cat /sys/fs/resctrl/info/L3_MON/mbm_control
> test//0=_;1=_;
> //0=_;1=_;

It enables/disables the events automatically("silent upgrade/degrade").
This looks good to me.

> 
> 
>>
>>> However, If we don't expect to see these semantics in any other
>>> implementation, these semantics could be implicit in the definition of
>>> a SW assignable counter.
>>
>> It is not clear to me how implementation differences between hardware
>> and software assignment can be hidden from user space. It is possible
>> to let user space enable individual events and then silently upgrade it
>> to all events. I see two options here, either "mbm_control" needs to
>> explicitly show this "silent upgrade" so that user space knows which
>> events are actually enabled, or "mbm_control" only shows flags/events enabled
>> from user space perspective. In the former scenario, this needs more
>> user space support since a generic user space cannot be confident which
>> flags are set after writing to "mbm_control". In the latter scenario,
>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
>> to rely on it to know which events can be enabled and if some are
>> actually "silently enabled" when user space still thinks it needs to be
>> enabled the number of available counters becomes vague.
>>
>> It is not clear to me how to present hardware and software assignable
>> counters with a single consistent interface. Actually, what if the
>> "mbm_mode" is what distinguishes how counters are assigned instead of how
>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
>> and "mbm_cntr_group_assignable" is used? Could that replace a
>> potential "mbm_assign_events" while also supporting user space in
>> interactions with "mbm_control"?
> 
> If I understand this correctly, is this a preference that the info
> node be named differently if its value will have different units,
> rather than a second node to indicate what the value of num_mbm_cntrs
> actually means? This sounds reasonable to me.

Looks like we are agreeing with "silent upgrade/degrade" option.

"mbm_mode" will look like below(Replaced event with evt and group with grp).

#cat /sys/fs/resctrl/infor/L3_MON/mbm_mode
[mbm_cntr_evt_assignable]
mbm_cntr_grp_assignable
legacy

Does that look ok?

I am not clear on num_mbm_cntrs in case of mbm_cntr_grp_assignable.

Peter, How do you figure out how many counters are available in soft-ABMC?


> 
> I think it's also important to note that in MPAM, the MBWU (memory
> bandwidth usage) monitors don't have a concept of local versus total
> bandwidth, so event assignment would likely not apply there either.
> What the counted bandwidth actually represents is more implicit in the
> monitor's position in the memory system in the particular
> implementation. On a theoretical multi-socket system, resctrl would
> require knowledge about the system's architecture to stitch together
> the counts from different types of monitors to produce a local and
> total value. I don't know if we'd program this SoC-specific knowledge
> into the kernel to produce a unified MBM resource like we're
> accustomed to now or if we'd present multiple MBM resources, each only
> providing an mbm_total_bytes event. In this case, the counters would
> have to be assigned separately in each MBM resource, especially if the
> different MBM resources support a different number of counters.
> 
> Thanks,
> -Peter
> 

-- 
- Babu Moger

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Moger, Babu 1 year, 6 months ago

Hi Peter/Reinette,

On 8/2/2024 11:13 AM, Reinette Chatre wrote:
> Hi Peter,
> 
> On 8/1/24 3:45 PM, Peter Newman wrote:
>> On Thu, Aug 1, 2024 at 2:50 PM Reinette Chatre
>> <reinette.chatre@intel.com> wrote:
>>> On 7/17/24 10:19 AM, Moger, Babu wrote:
>>>> On 7/12/24 17:03, Reinette Chatre wrote:
>>>>> On 7/3/24 2:48 PM, Babu Moger wrote:
> 
>>>>>> # Examples
>>>>>>
>>>>>> a. Check if ABMC support is available
>>>>>>       #mount -t resctrl resctrl /sys/fs/resctrl/
>>>>>>
>>>>>>       #cat /sys/fs/resctrl/info/L3_MON/mbm_mode
>>>>>>       [abmc]
>>>>>>       legacy
>>>>>>
>>>>>>       Linux kernel detected ABMC feature and it is enabled.
>>>>>
>>>>> How about renaming "abmc" to "mbm_cntrs"? This will match the 
>>>>> num_mbm_cntrs
>>>>> info file and be the final step to make this generic so that another
>>>>> architecture
>>>>> can more easily support assignining hardware counters without 
>>>>> needing to call
>>>>> the feature AMD's "abmc".
>>>>
>>>> I think we aleady settled this with "mbm_cntr_assignable".
>>>>
>>>> For soft-RMID" it will be mbm_sw_assignable.
>>>
>>> Maybe getting a bit long but how about "mbm_cntr_sw_assignable" to match
>>> with the term "mbm_cntr" in accompanying "num_mbm_cntrs"?
>>
>> My users are pushing for a consistent interface regardless of whether
>> counter assignment is implemented in hardware or software, so I would
>> like to avoid exposing implementation differences in the interface
>> where possible.
> 
> This seems a reasonable ask but can we be confident that if hardware
> supports assignable counters then there will never be a reason to use
> software assignable counters? (This needs to also consider how/if Arm
> may use this feature.)
> 
> I am of course not familiar with details of the software implementation
> - could there be benefits to using it even if hardware counters are
> supported?
> 
> What I would like to avoid is future complexity of needing a new 
> mount/config
> option that user space needs to use to select if a single 
> "mbm_cntr_assignable"
> is backed by hardware or software.
> 
>> The main semantic difference with SW assignments is that it is not
>> possible to assign counters to individual events. Because the
>> implementation is assigning RMIDs to groups, assignment results in all
>> events being counted.
>>
>> I was considering introducing a boolean mbm_assign_events node to
>> indicate whether assigning individual events is supported. If true,
>> num_mbm_cntrs indicates the number of events which can be counted,
>> otherwise it indicates the number of groups to which counters can be
>> assigned and attempting to assign a single event is silently upgraded
>> to assigning counters to all events in the group.
> 
> How were you envisioning your users using the control file ("mbm_control")
> in these scenarios? Does this file's interface even work for SW assignment
> scenarios?
> 
> Users should expect consistent interface for "mbm_control" also.
> 
> It sounds to me that a potential "mbm_assign_events" will be false for SW
> assignments. That would mean that "num_mbm_cntrs" will
> contain the number of groups to which counters can be assigned?
> Would user space be required to always enable all flags (enable all 
> events) of
> all domains to the same values ... or would enabling of one flag (one 
> event)
> in one domain automatically result in all flags (all events) enabled for 
> all
> domains ... or would enabling of one flag (one event) in one domain only 
> appear
> to user space to be enabled while in reality all flags/events are 
> actually enabled?
> 
>> However, If we don't expect to see these semantics in any other
>> implementation, these semantics could be implicit in the definition of
>> a SW assignable counter.
> 
> It is not clear to me how implementation differences between hardware
> and software assignment can be hidden from user space. It is possible
> to let user space enable individual events and then silently upgrade it
> to all events. I see two options here, either "mbm_control" needs to
> explicitly show this "silent upgrade" so that user space knows which
> events are actually enabled, or "mbm_control" only shows flags/events 
> enabled
> from user space perspective. In the former scenario, this needs more
> user space support since a generic user space cannot be confident which
> flags are set after writing to "mbm_control". In the latter scenario,
> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
> to rely on it to know which events can be enabled and if some are
> actually "silently enabled" when user space still thinks it needs to be
> enabled the number of available counters becomes vague.
> 
> It is not clear to me how to present hardware and software assignable
> counters with a single consistent interface. Actually, what if the
> "mbm_mode" is what distinguishes how counters are assigned instead of how
> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
> and "mbm_cntr_group_assignable" is used? Could that replace a
> potential "mbm_assign_events" while also supporting user space in
> interactions with "mbm_control"?

If I understand correctly, current interface might work for both the sw 
and hw assignments.

In case of SW assignment, you need to manage two counters at context 
switch time. One for total event and one for local event. Basically, you 
need to calculate delta for both events. You need to do rmid read for 
both events and then calculate the delta.

If the user assigns only one event you do the calculations only for the 
event user is interested in. That will save cycles as well. In this case 
"mbm_control" will report as one one event is assigned.

In many cases user will not interested in both the events. Also events 
are configurable so users can get what they want with just one event.

Does that make sense?

-- 
- Babu Moger

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Peter Newman 1 year, 6 months ago

Hi Babu,

On Fri, Aug 2, 2024 at 11:49 AM Moger, Babu <bmoger@amd.com> wrote:
>
> Hi Peter/Reinette,
>
> On 8/2/2024 11:13 AM, Reinette Chatre wrote:
> > Hi Peter,
> >
> > On 8/1/24 3:45 PM, Peter Newman wrote:
> >> However, If we don't expect to see these semantics in any other
> >> implementation, these semantics could be implicit in the definition of
> >> a SW assignable counter.
> >
> > It is not clear to me how implementation differences between hardware
> > and software assignment can be hidden from user space. It is possible
> > to let user space enable individual events and then silently upgrade it
> > to all events. I see two options here, either "mbm_control" needs to
> > explicitly show this "silent upgrade" so that user space knows which
> > events are actually enabled, or "mbm_control" only shows flags/events
> > enabled
> > from user space perspective. In the former scenario, this needs more
> > user space support since a generic user space cannot be confident which
> > flags are set after writing to "mbm_control". In the latter scenario,
> > meaning of "num_mbm_cntrs" becomes unclear since user space is expected
> > to rely on it to know which events can be enabled and if some are
> > actually "silently enabled" when user space still thinks it needs to be
> > enabled the number of available counters becomes vague.
> >
> > It is not clear to me how to present hardware and software assignable
> > counters with a single consistent interface. Actually, what if the
> > "mbm_mode" is what distinguishes how counters are assigned instead of how
> > it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
> > "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
> > and "mbm_cntr_group_assignable" is used? Could that replace a
> > potential "mbm_assign_events" while also supporting user space in
> > interactions with "mbm_control"?
>
> If I understand correctly, current interface might work for both the sw
> and hw assignments.
>
> In case of SW assignment, you need to manage two counters at context
> switch time. One for total event and one for local event. Basically, you
> need to calculate delta for both events. You need to do rmid read for
> both events and then calculate the delta.
>
> If the user assigns only one event you do the calculations only for the
> event user is interested in. That will save cycles as well. In this case
> "mbm_control" will report as one one event is assigned.
>
> In many cases user will not interested in both the events. Also events
> are configurable so users can get what they want with just one event.
>
> Does that make sense?

I think you've confused soft-RMID for soft-ABMC. Or more likely I've
confused you by not using consistent terminology.

soft-RMIDs are simulated by reading the counters of HW RMIDs
permanently assigned to each CPU at context switch. We found the
context switch cost of this approach unacceptable.

soft-ABMC is permanently associating an RMID with the local and total
counter-pair that will be automatically associated with it when it is
first loaded into a PQR_ASSOC MSR in a domain, then using the
mbm_control interface to choose which group to associate with these
RMIDs. This does not require any context switching work. This
technique is specific to the behavior of AMD hardware.

-Peter

Re: [PATCH v5 00/20] x86/resctrl : Support AMD Assignable Bandwidth Monitoring Counters (ABMC)

Posted by Moger, Babu 1 year, 6 months ago

Hi  Peter,

On 8/2/2024 2:13 PM, Peter Newman wrote:
> Hi Babu,
> 
> On Fri, Aug 2, 2024 at 11:49 AM Moger, Babu <bmoger@amd.com> wrote:
>>
>> Hi Peter/Reinette,
>>
>> On 8/2/2024 11:13 AM, Reinette Chatre wrote:
>>> Hi Peter,
>>>
>>> On 8/1/24 3:45 PM, Peter Newman wrote:
>>>> However, If we don't expect to see these semantics in any other
>>>> implementation, these semantics could be implicit in the definition of
>>>> a SW assignable counter.
>>>
>>> It is not clear to me how implementation differences between hardware
>>> and software assignment can be hidden from user space. It is possible
>>> to let user space enable individual events and then silently upgrade it
>>> to all events. I see two options here, either "mbm_control" needs to
>>> explicitly show this "silent upgrade" so that user space knows which
>>> events are actually enabled, or "mbm_control" only shows flags/events
>>> enabled
>>> from user space perspective. In the former scenario, this needs more
>>> user space support since a generic user space cannot be confident which
>>> flags are set after writing to "mbm_control". In the latter scenario,
>>> meaning of "num_mbm_cntrs" becomes unclear since user space is expected
>>> to rely on it to know which events can be enabled and if some are
>>> actually "silently enabled" when user space still thinks it needs to be
>>> enabled the number of available counters becomes vague.
>>>
>>> It is not clear to me how to present hardware and software assignable
>>> counters with a single consistent interface. Actually, what if the
>>> "mbm_mode" is what distinguishes how counters are assigned instead of how
>>> it is backed (hw vs sw)? What if, instead of "mbm_cntr_assignable" and
>>> "mbm_cntr_sw_assignable" MBM modes the terms "mbm_cntr_event_assignable"
>>> and "mbm_cntr_group_assignable" is used? Could that replace a
>>> potential "mbm_assign_events" while also supporting user space in
>>> interactions with "mbm_control"?
>>
>> If I understand correctly, current interface might work for both the sw
>> and hw assignments.
>>
>> In case of SW assignment, you need to manage two counters at context
>> switch time. One for total event and one for local event. Basically, you
>> need to calculate delta for both events. You need to do rmid read for
>> both events and then calculate the delta.
>>
>> If the user assigns only one event you do the calculations only for the
>> event user is interested in. That will save cycles as well. In this case
>> "mbm_control" will report as one one event is assigned.
>>
>> In many cases user will not interested in both the events. Also events
>> are configurable so users can get what they want with just one event.
>>
>> Does that make sense?
> 
> I think you've confused soft-RMID for soft-ABMC. Or more likely I've
> confused you by not using consistent terminology.
> 
> soft-RMIDs are simulated by reading the counters of HW RMIDs
> permanently assigned to each CPU at context switch. We found the
> context switch cost of this approach unacceptable.
> 
> soft-ABMC is permanently associating an RMID with the local and total
> counter-pair that will be automatically associated with it when it is
> first loaded into a PQR_ASSOC MSR in a domain, then using the
> mbm_control interface to choose which group to associate with these
> RMIDs. This does not require any context switching work. This
> technique is specific to the behavior of AMD hardware.

Got it.

I assume you have not posted the patches for this yet right?

thanks

Babu Moger