[PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Babu Moger 1 year ago
Provide the interface to list the assignment states of all the resctrl
groups in mbm_cntr_assign mode.

Example:
$ mount -t resctrl resctrl /sys/fs/resctrl/
$ cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
//0=tl;1=tl

The list has the following format:

"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

Format for specific types of groups:

- Default CTRL_MON group:
  "//<domain_id>=<flags>"

- Non-default CTRL_MON group:
  "<CTRL_MON group>//<domain_id>=<flags>"

- Child MON group of default CTRL_MON group:
  "/<MON group>/<domain_id>=<flags>"

- Child MON group of non-default CTRL_MON group:
  "<CTRL_MON group>/<MON group>/<domain_id>=<flags>"

Flags can be one of the following:
t  MBM total event is assigned
l  MBM local event is assigned
tl Both total and local MBM events are assigned
_  None of the MBM events are assigned

Signed-off-by: Babu Moger <babu.moger@amd.com>
---
v11: Fixed printing the separator after each domain while listing the group assignments.
     Renamed rdtgroup_mbm_assign_control_show() to resctrl_mbm_assign_control_show().

v10: Changes mostly due to domain specific counter assignment.

v9: Minor parameter update in resctrl_mbm_event_assigned().

v8: Moved resctrl_mbm_event_assigned() in here as it is first used here.
    Moved rdt_last_cmd_clear() before making any call.
    Updated the commit log.
    Corrected the doc format.

v7: Renamed the interface name from 'mbm_control' to 'mbm_assign_control'
    to match 'mbm_assign_mode'.
    Removed Arch references from FS code.
    Added rdt_last_cmd_clear() before the command processing.
    Added rdtgroup_mutex before all the calls.
    Removed references of ABMC from FS code.

v6: The domain specific assignment can be determined by looking at mbm_cntr_map.
    Removed rdtgroup_abmc_dom_cfg() and rdtgroup_abmc_dom_state().
    Removed the switch statement for the domain_state detection.
    Determined the flags incrementally.
    Removed special handling of default group while printing.

v5: Replaced "assignment flags" with "flags".
    Changes related to the mon structure.
    Changes related to renaming the interface from mbm_assign_control to
    mbm_control.

v4: Added functionality to query domain specific assignment in
    rdtgroup_abmc_dom_state().

v3: New patch.
    Addresses the feedback to provide the global assignment interface.
    https://lore.kernel.org/lkml/c73f444b-83a1-4e9a-95d3-54c5165ee782@intel.com/
---
 Documentation/arch/x86/resctrl.rst     | 44 ++++++++++++++
 arch/x86/kernel/cpu/resctrl/monitor.c  |  1 +
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 81 ++++++++++++++++++++++++++
 3 files changed, 126 insertions(+)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 5d18c4c8bc48..3040e5c4cd76 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -330,6 +330,50 @@ with the following files:
 	 # cat /sys/fs/resctrl/info/L3_MON/available_mbm_cntrs
 	 0=30;1=30
 
+"mbm_assign_control":
+	Reports the MBM event assignment status of each CTRL_MON and MON group.
+
+	The list has the following format:
+		"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
+
+	Format for specific types of groups:
+
+	* Default CTRL_MON group:
+		"//<domain_id>=<flags>"
+
+	* Non-default CTRL_MON group:
+		"<CTRL_MON group>//<domain_id>=<flags>"
+
+	* Child MON group of default CTRL_MON group:
+		"/<MON group>/<domain_id>=<flags>"
+
+	* Child MON group of non-default CTRL_MON group:
+		"<CTRL_MON group>/<MON group>/<domain_id>=<flags>"
+
+	Flags can be one of the following:
+	::
+
+	 t  MBM total event is assigned.
+	 l  MBM local event is assigned.
+	 tl Both MBM total and local events are assigned.
+	 _  None of the MBM events are assigned.
+
+	Examples:
+	::
+
+	 # mkdir /sys/fs/resctrl/mon_groups/child_default_mon_grp
+	 # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp
+	 # mkdir /sys/fs/resctrl/non_default_ctrl_mon_grp/mon_groups/child_non_default_mon_grp
+
+	 # cat /sys/fs/resctrl/info/L3_MON/mbm_assign_control
+	 non_default_ctrl_mon_grp//0=tl;1=tl
+	 non_default_ctrl_mon_grp/child_non_default_mon_grp/0=tl;1=tl
+	 //0=tl;1=tl
+	 /child_default_mon_grp/0=tl;1=tl
+
+	There are four resctrl groups. All the groups have total and local MBM events
+	assigned on domains 0 and 1.
+
 "max_threshold_occupancy":
 		Read/write file provides the largest value (in
 		bytes) at which a previously used LLC_occupancy
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 024aabbecbb5..2dd6c47c9276 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -1236,6 +1236,7 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
 			hw_res->mbm_cntr_assign_enabled = true;
 			resctrl_file_fflags_init("num_mbm_cntrs", RFTYPE_MON_INFO);
 			resctrl_file_fflags_init("available_mbm_cntrs", RFTYPE_MON_INFO);
+			resctrl_file_fflags_init("mbm_assign_control", RFTYPE_MON_INFO);
 		}
 	}
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5d305d0ac053..6e29827239e0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -975,6 +975,81 @@ static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static char *rdtgroup_mon_state_to_str(struct rdt_resource *r,
+				       struct rdt_mon_domain *d,
+				       struct rdtgroup *rdtgrp, char *str)
+{
+	char *tmp = str;
+
+	/* Query the total and local event flags for the domain */
+	if (mbm_cntr_get(r, d, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID) != -ENOENT)
+		*tmp++ = 't';
+
+	if (mbm_cntr_get(r, d, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID) != -ENOENT)
+		*tmp++ = 'l';
+
+	if (tmp == str)
+		*tmp++ = '_';
+
+	*tmp = '\0';
+	return str;
+}
+
+static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
+					   struct seq_file *s, void *v)
+{
+	struct rdt_resource *r = of->kn->parent->priv;
+	struct rdtgroup *rdtg, *crg;
+	struct rdt_mon_domain *dom;
+	char str[10];
+	bool sep;
+
+	cpus_read_lock();
+	mutex_lock(&rdtgroup_mutex);
+	rdt_last_cmd_clear();
+
+	if (!resctrl_arch_mbm_cntr_assign_enabled(r)) {
+		rdt_last_cmd_puts("mbm_cntr_assign mode is not enabled\n");
+		mutex_unlock(&rdtgroup_mutex);
+		cpus_read_unlock();
+		return -EINVAL;
+	}
+
+	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+		seq_printf(s, "%s//", rdtg->kn->name);
+
+		sep = false;
+		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+			if (sep)
+				seq_puts(s, ";");
+
+			seq_printf(s, "%d=%s", dom->hdr.id,
+				   rdtgroup_mon_state_to_str(r, dom, rdtg, str));
+
+			sep = true;
+		}
+		seq_putc(s, '\n');
+
+		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
+			seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);
+
+			sep = false;
+			list_for_each_entry(dom, &r->mon_domains, hdr.list) {
+				if (sep)
+					seq_puts(s, ";");
+				seq_printf(s, "%d=%s", dom->hdr.id,
+					   rdtgroup_mon_state_to_str(r, dom, crg, str));
+				sep = true;
+			}
+			seq_putc(s, '\n');
+		}
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+	cpus_read_unlock();
+	return 0;
+}
+
 #ifdef CONFIG_PROC_CPU_RESCTRL
 
 /*
@@ -1996,6 +2071,12 @@ static struct rftype res_common_files[] = {
 		.seq_show	= mbm_local_bytes_config_show,
 		.write		= mbm_local_bytes_config_write,
 	},
+	{
+		.name		= "mbm_assign_control",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= resctrl_mbm_assign_control_show,
+	},
 	{
 		.name		= "mbm_assign_mode",
 		.mode		= 0644,
-- 
2.34.1
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Dave Martin 11 months, 3 weeks ago
On Wed, Jan 22, 2025 at 02:20:30PM -0600, Babu Moger wrote:
> Provide the interface to list the assignment states of all the resctrl
> groups in mbm_cntr_assign mode.

[...]

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5d305d0ac053..6e29827239e0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -975,6 +975,81 @@ static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,

[...]

> +static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
> +					   struct seq_file *s, void *v)
> +{

[...]

> +	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
> +		seq_printf(s, "%s//", rdtg->kn->name);
> +
> +		sep = false;
> +		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> +			if (sep)
> +				seq_puts(s, ";");
> +
> +			seq_printf(s, "%d=%s", dom->hdr.id,
> +				   rdtgroup_mon_state_to_str(r, dom, rdtg, str));
> +
> +			sep = true;
> +		}
> +		seq_putc(s, '\n');
> +
> +		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
> +			seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);
> +
> +			sep = false;
> +			list_for_each_entry(dom, &r->mon_domains, hdr.list) {
> +				if (sep)
> +					seq_puts(s, ";");
> +				seq_printf(s, "%d=%s", dom->hdr.id,
> +					   rdtgroup_mon_state_to_str(r, dom, crg, str));

Unlike the other resctrl files, it looks like the total size of this
data will scale up with the number of existing monitoring groups
and the lengths of the group names (in addition to the number of
monitoring domains).

So, this can easily be more than a page, overflowing internal limits
in the seq_file and kernfs code.

Do we need to track some state between read() calls?  This can be done
by overriding the kernfs .open() and .release() methods and hanging
some state data (or an rdtgroup_file pointer) on of->priv.

Also, if we allow the data to be read out in chunks, then we would
either have to snapshot all the data in one go and stash the unread
tail in the kernel, or we would need to move over to RCU-based
enumeration or similar -- otherwise releasing rdtgroup_mutex in the
middle of the enumeration in order to return data to userspace is going
to be a problem...
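
(For illustration, a rough sketch of the of->priv idea -- the names and
structure below are hypothetical, not tested:

	struct mbm_assign_snapshot {
		char	*buf;
		size_t	len;
	};

	static int mbm_assign_control_open(struct kernfs_open_file *of)
	{
		struct mbm_assign_snapshot *snap;

		snap = kzalloc(sizeof(*snap), GFP_KERNEL);
		if (!snap)
			return -ENOMEM;

		mutex_lock(&rdtgroup_mutex);
		/* ...format the whole dump into snap->buf here... */
		mutex_unlock(&rdtgroup_mutex);

		of->priv = snap;
		return 0;
	}

	static void mbm_assign_control_release(struct kernfs_open_file *of)
	{
		struct mbm_assign_snapshot *snap = of->priv;

		kfree(snap->buf);
		kfree(snap);
	}

Subsequent read() calls could then copy out of the snapshot instead of
re-walking the group lists under the mutex.)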

> +				sep = true;
> +			}
> +			seq_putc(s, '\n');
> +		}
> +	}
> +
> +	mutex_unlock(&rdtgroup_mutex);
> +	cpus_read_unlock();
> +	return 0;
> +}

[...]

Cheers
---Dave
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Moger, Babu 11 months, 3 weeks ago
Hi Dave,

On 2/19/25 07:53, Dave Martin wrote:
> On Wed, Jan 22, 2025 at 02:20:30PM -0600, Babu Moger wrote:
>> Provide the interface to list the assignment states of all the resctrl
>> groups in mbm_cntr_assign mode.
> 
> [...]
> 
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 5d305d0ac053..6e29827239e0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -975,6 +975,81 @@ static ssize_t resctrl_mbm_assign_mode_write(struct kernfs_open_file *of,
> 
> [...]
> 
>> +static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>> +					   struct seq_file *s, void *v)
>> +{
> 
> [...]
> 
>> +	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
>> +		seq_printf(s, "%s//", rdtg->kn->name);
>> +
>> +		sep = false;
>> +		list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>> +			if (sep)
>> +				seq_puts(s, ";");
>> +
>> +			seq_printf(s, "%d=%s", dom->hdr.id,
>> +				   rdtgroup_mon_state_to_str(r, dom, rdtg, str));
>> +
>> +			sep = true;
>> +		}
>> +		seq_putc(s, '\n');
>> +
>> +		list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
>> +			seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);
>> +
>> +			sep = false;
>> +			list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>> +				if (sep)
>> +					seq_puts(s, ";");
>> +				seq_printf(s, "%d=%s", dom->hdr.id,
>> +					   rdtgroup_mon_state_to_str(r, dom, crg, str));
> 
> Unlike the other resctrl files, it looks like the total size of this
> data will scale up with the number of existing monitoring groups
> and the lengths of the group names (in addition to the number of
> monitoring domains).
> 
> So, this can easily be more than a page, overflowing internal limits
> in the seq_file and kernfs code.
> 
> Do we need to track some state between read() calls?  This can be done
> by overriding the kernfs .open() and .release() methods and hanging
> some state data (or an rdtgroup_file pointer) on of->priv.
> 
> Also, if we allow the data to be read out in chunks, then we would
> either have to snapshot all the data in one go and stash the unread
> tail in the kernel, or we would need to move over to RCU-based
> enumeration or similar -- otherwise releasing rdtgroup_mutex in the
> middle of the enumeration in order to return data to userspace is going
> to be a problem...

Good catch.

I see similar buffer overflow is handled by calling seq_buf_clear()
(look at process_durations() or in show_user_instructions()).

How about handling this by calling rdt_last_cmd_clear() before printing
each group?

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 484d6009869f..1828f59eb723 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -1026,6 +1026,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
        }

        list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
+               rdt_last_cmd_clear();
                seq_printf(s, "%s//", rdtg->kn->name);

                sep = false;
@@ -1041,6 +1042,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
                seq_putc(s, '\n');

                list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
+                       rdt_last_cmd_clear();
                        seq_printf(s, "%s/%s/", rdtg->kn->name, crg->kn->name);

                        sep = false;


> 
>> +				sep = true;
>> +			}
>> +			seq_putc(s, '\n');
>> +		}
>> +	}
>> +
>> +	mutex_unlock(&rdtgroup_mutex);
>> +	cpus_read_unlock();
>> +	return 0;
>> +}
> 
> [...]
> 
> Cheers
> ---Dave
> 

-- 
Thanks
Babu Moger
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Dave Martin 11 months, 3 weeks ago
Hi,

On Wed, Feb 19, 2025 at 03:09:51PM -0600, Moger, Babu wrote:
> Hi Dave,
> 
> On 2/19/25 07:53, Dave Martin wrote:
> > On Wed, Jan 22, 2025 at 02:20:30PM -0600, Babu Moger wrote:
> >> Provide the interface to list the assignment states of all the resctrl
> >> groups in mbm_cntr_assign mode.

[...]

> >> +static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
> >> +					   struct seq_file *s, void *v)
> >> +{

[...]

> > Unlike the other resctrl files, it looks like the total size of this
> > data will scale up with the number of existing monitoring groups
> > and the lengths of the group names (in addition to the number of
> > monitoring domains).
> > 
> > So, this can easily be more than a page, overflowing internal limits
> > in the seq_file and kernfs code.
> > 
> > Do we need to track some state between read() calls?  This can be done
> > by overriding the kernfs .open() and .release() methods and hanging
> > some state data (or an rdtgroup_file pointer) on of->priv.
> > 
> > Also, if we allow the data to be read out in chunks, then we would
> > either have to snapshot all the data in one go and stash the unread
> > tail in the kernel, or we would need to move over to RCU-based
> > enumeration or similar -- otherwise releasing rdtgroup_mutex in the
> > middle of the enumeration in order to return data to userspace is going
> > to be a problem...
> 
> Good catch.
> 
> I see similar buffer overflow is handled by calling seq_buf_clear()
> (look at process_durations() or in show_user_instructions()).
> 
> How about handling this by calling rdt_last_cmd_clear() before printing
> each group?

Does this work?

Once seq_buf_has_overflowed() returns nonzero, data has been lost, no?
So far as I can see, show_user_instructions() just gives up on printing
the affected line, while process_durations() tries to anticipate
overflow and prints out the accumulated text to dmesg before clearing
the buffer.
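
(A minimal illustration of the seq_buf behaviour -- not resctrl code,
and some_long_string is a stand-in for the formatted group data:

	char buf[64];
	struct seq_buf sb;

	seq_buf_init(&sb, buf, sizeof(buf));
	seq_buf_printf(&sb, "%s", some_long_string);
	if (seq_buf_has_overflowed(&sb)) {
		/* anything beyond 64 bytes was silently dropped */
	}

Clearing the buffer at this point only discards data; it cannot get the
lost tail back.)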

In our case, we cannot send more data to userspace than was requested
in the read() call, so we might have nowhere to drain the seq_buf
contents to in order to free up space.

sysfs "expects" userspace to do a big enough read() that this problem
doesn't happen.  In practice this is OK because people usually read
through a buffered I/O layer like stdio, and in realistic
implementations the user-side I/O buffer is large enough to hide this
issue.

But mbm_assign_control data is dynamically generated and potentially
much bigger than a typical sysfs file.

> 
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 484d6009869f..1828f59eb723 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -1026,6 +1026,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>         }
> 
>         list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
> +               rdt_last_cmd_clear();
>                 seq_printf(s, "%s//", rdtg->kn->name);
> 
>                 sep = false;
> @@ -1041,6 +1042,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>                 seq_putc(s, '\n');
> 
>                 list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
> +                       rdt_last_cmd_clear();

I don't see how this helps.

Surely last_cmd_status has nothing to do with s?

[...]

Cheers
---Dave
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Moger, Babu 11 months, 3 weeks ago
Hi Dave,

On 2/20/25 09:44, Dave Martin wrote:
> Hi,
> 
> On Wed, Feb 19, 2025 at 03:09:51PM -0600, Moger, Babu wrote:
>> Hi Dave,
>>
>> On 2/19/25 07:53, Dave Martin wrote:
>>> On Wed, Jan 22, 2025 at 02:20:30PM -0600, Babu Moger wrote:
>>>> Provide the interface to list the assignment states of all the resctrl
>>>> groups in mbm_cntr_assign mode.
> 
> [...]
> 
>>>> +static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>>>> +					   struct seq_file *s, void *v)
>>>> +{
> 
> [...]
> 
>>> Unlike the other resctrl files, it looks like the total size of this
>>> data will scale up with the number of existing monitoring groups
>>> and the lengths of the group names (in addition to the number of
>>> monitoring domains).
>>>
>>> So, this can easily be more than a page, overflowing internal limits
>>> in the seq_file and kernfs code.
>>>
>>> Do we need to track some state between read() calls?  This can be done
>>> by overriding the kernfs .open() and .release() methods and hanging
>>> some state data (or an rdtgroup_file pointer) on of->priv.
>>>
>>> Also, if we allow the data to be read out in chunks, then we would
>>> either have to snapshot all the data in one go and stash the unread
>>> tail in the kernel, or we would need to move over to RCU-based
>>> enumeration or similar -- otherwise releasing rdtgroup_mutex in the
>>> middle of the enumeration in order to return data to userspace is going
>>> to be a problem...
>>
>> Good catch.
>>
>> I see similar buffer overflow is handled by calling seq_buf_clear()
>> (look at process_durations() or in show_user_instructions()).
>>
>> How about handling this by calling rdt_last_cmd_clear() before printing
>> each group?
> 
> Does this work?
> 
> Once seq_buf_has_overflowed() returns nonzero, data has been lost, no?
> So far as I can see, show_user_instructions() just gives up on printing
> the affected line, while process_durations() tries to anticipate
> overflow and prints out the accumulated text to dmesg before clearing
> the buffer.

Yea. Agree,

> 
> In our case, we cannot send more data to userspace than was requested
> in the read() call, so we might have nowhere to drain the seq_buf
> contents to in order to free up space.
> 
> sysfs "expects" userspace to do a big enough read() that this problem
> doesn't happen.  In practice this is OK because people usually read
> through a buffered I/O layer like stdio, and in realistic
> implementations the user-side I/O buffer is large enough to hide this
> issue.
> 
> But mbm_assign_control data is dynamically generated and potentially
> much bigger than a typical sysfs file.

I have no idea how to handle this case. We may have to live with this
problem. Let us know if there are any ideas.

> 
>>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 484d6009869f..1828f59eb723 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -1026,6 +1026,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>>         }
>>
>>         list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
>> +               rdt_last_cmd_clear();
>>                 seq_printf(s, "%s//", rdtg->kn->name);
>>
>>                 sep = false;
>> @@ -1041,6 +1042,7 @@ static int resctrl_mbm_assign_control_show(struct kernfs_open_file *of,
>>                 seq_putc(s, '\n');
>>
>>                 list_for_each_entry(crg, &rdtg->mon.crdtgrp_list, mon.crdtgrp_list) {
>> +                       rdt_last_cmd_clear();
> 
> I don't see how this helps.
> 
> Surely last_cmd_status has nothing to do with s?

Correct. Clearly, I misunderstood the problem.

> 
> [...]
> 
> Cheers
> ---Dave
> 

-- 
Thanks
Babu Moger
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Dave Martin 11 months, 3 weeks ago
On Thu, Feb 20, 2025 at 03:29:12PM -0600, Moger, Babu wrote:
> Hi Dave,
> 
> On 2/20/25 09:44, Dave Martin wrote:
> > Hi,
> > 
> > On Wed, Feb 19, 2025 at 03:09:51PM -0600, Moger, Babu wrote:

[...]

> >> Good catch.
> >>
> >> I see similar buffer overflow is handled by calling seq_buf_clear()
> >> (look at process_durations() or in show_user_instructions()).
> >>
> >> How about handling this by calling rdt_last_cmd_clear() before printing
> >> each group?
> > 
> > Does this work?
> > 
> > Once seq_buf_has_overflowed() returns nonzero, data has been lost, no?
> > So far as I can see, show_user_instructions() just gives up on printing
> > the affected line, while process_durations() tries to anticipate
> > overflow and prints out the accumulated text to dmesg before clearing
> > the buffer.
> 
> Yea. Agree,
> 
> > 
> > In our case, we cannot send more data to userspace than was requested
> > in the read() call, so we might have nowhere to drain the seq_buf
> > contents to in order to free up space.
> > 
> > sysfs "expects" userspace to do a big enough read() that this problem
> > doesn't happen.  In practice this is OK because people usually read
> > through a buffered I/O layer like stdio, and in realistic
> > implementations the user-side I/O buffer is large enough to hide this
> > issue.
> > 
> > But mbm_assign_control data is dynamically generated and potentially
> > much bigger than a typical sysfs file.
> 
> I have no idea how to handle this case. We may have to live with this
> problem. Let us know if there are any ideas.

I think the current implication is that this will work for now provided
that the generated text fits in a page.


Reinette, what's your view on accepting this limitation in the interest
of stabilising this series, and tidying up this corner case later?

As for writes to this file, we're unlikely to hit the limit unless
there are a lot of RMIDs available and many groups with excessively
long names.

This looks perfectly fixable, but it might be better to settle the
design of this series first before we worry too much about it.

[...]

Cheers
---Dave
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Reinette Chatre 11 months, 3 weeks ago
Hi Dave,

On 2/21/25 8:00 AM, Dave Martin wrote:
> On Thu, Feb 20, 2025 at 03:29:12PM -0600, Moger, Babu wrote:
>> Hi Dave,
>>
>> On 2/20/25 09:44, Dave Martin wrote:
>>> Hi,
>>>
>>> On Wed, Feb 19, 2025 at 03:09:51PM -0600, Moger, Babu wrote:
> 
> [...]
> 
>>>> Good catch.
>>>>
>>>> I see similar buffer overflow is handled by calling seq_buf_clear()
>>>> (look at process_durations() or in show_user_instructions()).
>>>>
>>>> How about handling this by calling rdt_last_cmd_clear() before printing
>>>> each group?
>>>
>>> Does this work?
>>>
>>> Once seq_buf_has_overflowed() returns nonzero, data has been lost, no?
>>> So far as I can see, show_user_instructions() just gives up on printing
>>> the affected line, while process_durations() tries to anticipate
>>> overflow and prints out the accumulated text to dmesg before clearing
>>> the buffer.
>>
>> Yea. Agree,
>>
>>>
>>> In our case, we cannot send more data to userspace than was requested
>>> in the read() call, so we might have nowhere to drain the seq_buf
>>> contents to in order to free up space.
>>>
>>> sysfs "expects" userspace to do a big enough read() that this problem
>>> doesn't happen.  In practice this is OK because people usually read
>>> through a buffered I/O layer like stdio, and in realistic
>>> implementations the user-side I/O buffer is large enough to hide this
>>> issue.
>>>
>>> But mbm_assign_control data is dynamically generated and potentially
>>> much bigger than a typical sysfs file.
>>
>> I have no idea how to handle this case. We may have to live with this
>> problem. Let us know if there are any ideas.
> 
> I think the current implication is that this will work for now provided
> that the generated text fits in a page.
> 
> 
> Reinette, what's your view on accepting this limitation in the interest
> of stabilising this series, and tidying up this corner case later?
> 
> As for writes to this file, we're unlikely to hit the limit unless
> there are a lot of RMIDs available and many groups with excessively
> long names.

I am more concerned about reads to this file. If only 4K writes are
supported then user space can reconfigure the system in page-sized
portions. It may not be efficient if the user wants to reconfigure the
entire system but it will work. The problem with reads is that if
larger than 4K reads are required but not supported then it will be
impossible for user space to learn the state of the system.

We may already be at that limit. Peter described [1] how AMD systems
already have 32 L3 monitoring domains and 256 RMIDs. With conservative
resource group names of 10 characters I see one line per monitoring group
that could look like below and thus easily be above 200 characters:

resgroupAA/mongroupAA/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl;12=tl;13=tl;14=tl;15=tl;16=tl;17=tl;18=tl;19=tl;20=tl;21=tl;22=tl;23=tl;24=tl;25=tl;26=tl;27=tl;28=tl;29=tl;30=tl;31=tl;32=tl

Multiplying that by the 256 monitor groups already possible, the limit
is exceeded today.

I understand that all domains having "tl" flags are not possible today, but
even if that is changed to "_" the resulting display still looks to
easily exceed a page when many RMIDs are in use.
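
(Rough arithmetic: the example line above is about 210 characters, so
256 such lines come to roughly 53KB -- more than a dozen 4KiB pages.)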

> 
> This looks perfectly fixable, but it might be better to settle the
> design of this series first before we worry too much about it.

I think it fair to delay support of writing more than a page of
data but it looks to me like we need a solution to support displaying
more than a page of data to user space.

Reinette

[1] https://lore.kernel.org/lkml/20241106154306.2721688-2-peternewman@google.com/
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Dave Martin 11 months, 3 weeks ago
On Fri, Feb 21, 2025 at 12:10:44PM -0800, Reinette Chatre wrote:
> Hi Dave,
> 
> On 2/21/25 8:00 AM, Dave Martin wrote:
> > On Thu, Feb 20, 2025 at 03:29:12PM -0600, Moger, Babu wrote:
> >> Hi Dave,
> >>
> >> On 2/20/25 09:44, Dave Martin wrote:

[...]

> >>> But mbm_assign_control data is dynamically generated and potentially
> >>> much bigger than a typical sysfs file.
> >>
> >> I have no idea how to handle this case. We may have to live with this
> >> problem. Let us know if there are any ideas.
> > 
> > I think the current implication is that this will work for now provided
> > that the generated text fits in a page.
> > 
> > 
> > Reinette, what's your view on accepting this limitation in the interest
> > of stabilising this series, and tidying up this corner case later?
> > 
> > As for writes to this file, we're unlikely to hit the limit unless
> > there are a lot of RMIDs available and many groups with excessively
> > long names.
> 
> I am more concerned about reads to this file. If only 4K writes are
> supported then user space can reconfigure the system in page-sized
> portions. It may not be efficient if the user wants to reconfigure the
> entire system but it will work. The problem with reads is that if
> larger than 4K reads are required but not supported then it will be
> impossible for user space to learn the state of the system.
> 
> We may already be at that limit. Peter described [1] how AMD systems
> already have 32 L3 monitoring domains and 256 RMIDs. With conservative
> resource group names of 10 characters I see one line per monitoring group
> that could look like below and thus easily be above 200 characters:
> 
> resgroupAA/mongroupAA/0=tl;1=tl;2=tl;3=tl;4=tl;5=tl;6=tl;7=tl;8=tl;9=tl;10=tl;11=tl;12=tl;13=tl;14=tl;15=tl;16=tl;17=tl;18=tl;19=tl;20=tl;21=tl;22=tl;23=tl;24=tl;25=tl;26=tl;27=tl;28=tl;29=tl;30=tl;31=tl;32=tl
> 
> Multiplying that by the 256 monitor groups already possible, the limit
> is exceeded today.

That's useful to know.  I guess we probably shouldn't just kick this
issue down the road, then -- at least on the read side (as you say).

> I understand that all domains having "tl" flags are not possible today, but
> even if that is changed to "_" the resulting display still looks to
> easily exceed a page when many RMIDs are in use.
> 
> > 
> > This looks perfectly fixable, but it might be better to settle the
> > design of this series first before we worry too much about it.
> 
> I think it fair to delay support of writing more than a page of
> data but it looks to me like we need a solution to support displaying
> more than a page of data to user space.
> 
> Reinette
> 
> [1] https://lore.kernel.org/lkml/20241106154306.2721688-2-peternewman@google.com/

Ack; if I can't find an off-the-shelf solution for this, I'll try to
hack something as minimal as possible to provide the required
behaviour, but I won't try to make it optimal or pretty for now.

It has just occurred to me that ftrace has large, multi-line text files
in sysfs, so I'll try to find out how they handle that there.  Maybe
there is some infrastructure we can re-use.

Either way, hopefully that will move the discussion forward (unless
someone else comes up with a better idea first!)

Cheers
---Dave
RE: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Luck, Tony 11 months, 3 weeks ago
> It has just occurred to me that ftrace has large, multi-line text files
> in sysfs, so I'll try to find out how they handle that there.  Maybe
> there is some infrastructure we can re-use.

Resctrl was built on top of "kernfs" because that was a simple base
that met needs at the time.

Do we need to look at either extending capabilities of kernfs? Or
move to sysfs?

-Tony
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Dave Martin 11 months, 2 weeks ago
Hi,

On Tue, Feb 25, 2025 at 12:39:04PM +0000, Dave Martin wrote:
> Hi Tony,
> 
> On Mon, Feb 24, 2025 at 05:23:06PM +0000, Luck, Tony wrote:
> > > It has just occurred to me that ftrace has large, multi-line text files
> > > in sysfs, so I'll try to find out how they handle that there.  Maybe
> > > there is some infrastructure we can re-use.
> > 
> > Resctrl was built on top of "kernfs" because that was a simple base
> > that met needs at the time.
> > 
> > Do we need to look at either extending capabilities of kernfs? Or
> > move to sysfs?
> > 
> > -Tony
> 
> I took a look at what ftrace does: it basically rolls its own buffering
> implementation, sufficient for its needs.
> 
> The ftrace code is internal and not trivial to pick up and plonk into
> resctrl.  We also have another possible requirement that ftrace doesn't
> have (whole-file atomicity).  But ftrace's example at least confirms
> that there is probably no off-the-shelf implementation for this in the
> kernel.

[...]

After having spent a bit of time looking into this, I think we are probably
OK, at least for reading these files.

seq_file will loop over the file's show() callback, growing the seq_file
buffer until show() can run without overrunning the buffer.

This means that the show() callback receives a buffer that is magically big
enough, but there may be some "speculative" calls whose output never goes
to userspace.  Once seq_file has the data, it deals with the userspace-
facing I/O buffering internally, so we shouldn't have to worry about that.

I'll try to hack up a test next week to confirm that this works.

The seq_file approach appears sound, but may be inefficient if the initial
guess at the buffer size (= PAGE_SIZE) is frequently too small.  (There is
single_open_size() though, which allows the buffer to be preallocated with
a specified size and may be useful.)
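
(For illustration, single_open_size() with ordinary file_operations
looks roughly like this -- resctrl would need the kernfs equivalent,
and the example_* names are made up:

	static int example_show(struct seq_file *s, void *v)
	{
		/* emit the potentially multi-page dump with seq_printf() etc. */
		return 0;
	}

	static int example_open(struct inode *inode, struct file *file)
	{
		/* preallocate a 64KiB seq_file buffer up front */
		return single_open_size(file, example_show, inode->i_private,
					64 * 1024);
	}

	static const struct file_operations example_fops = {
		.open		= example_open,
		.read		= seq_read,
		.llseek		= seq_lseek,
		.release	= single_release,
	};

which avoids the grow-and-retry loop when the dump is known to be
large.)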


seq_file doesn't help with the write side at all, but I think we agreed
that handling large file writes properly is less urgent.

Cheers
---Dave
RE: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Luck, Tony 11 months, 2 weeks ago
> After having spent a bit of time looking into this, I think we are probably
> OK, at least for reading these files.
>
> seq_file will loop over the file's show() callback, growing the seq_file
> buffer until show() can run without overrunning the buffer.
>
> This means that the show() callback receives a buffer that is magically big
> enough, but there may be some "speculative" calls whose output never goes
> to userspace.  Once seq_file has the data, it deals with the userspace-
> facing I/O buffering internally, so we shouldn't have to worry about that.

Doesn't this depend on the size of the user read(2) syscall request?

If the total size of the resctrl file is very large, we have a potential issue:

1) User asks for 4KB, owns the resctrl mutex.

2) resctrl uses seq_file and fills with more than 4KB

3) User gets the first 4KB, releases the resctrl mutex

4) Some other pending resctrl operation now gets the mutex and makes changes that affect the contents of this file

5) User asks for next 4K (when it reacquires resctrl mutex)

6) resctrl uses seq_file() to construct new image of file incorporating changes because of step 4

7) User gets the second 4KB from the seq_file buffer (which doesn't fit cleanly next to data it got in step 3).

-Tony
Re: [PATCH v11 22/23] x86/resctrl: Introduce interface to list assignment states of all the groups
Posted by Dave Martin 11 months, 1 week ago
On Mon, Mar 03, 2025 at 07:30:48PM +0000, Luck, Tony wrote:
> > After having spent a bit of time looking into this, I think we are probably
> > OK, at least for reading these files.
> >
> > seq_file will loop over the file's show() callback, growing the seq_file
> > buffer until show() can run without overrunning the buffer.
> >
> > This means that the show() callback receives a buffer that is magically big
> > enough, but there may be some "speculative" calls whose output never goes
> > to userspace.  Once seq_file has the data, it deals with the userspace-
> > facing I/O buffering internally, so we shouldn't have to worry about that.
> 
> Doesn't this depend on the size of the user read(2) syscall request?

Yes and no.

If I've understood correctly:

To service a given read() call, seq_file calls down into the backend to
generate some whole record, then copies it out to userspace, then
repeats this process so long as there is any space left in the user
buffer.

For resctrl files, we don't implement a seq_file iterator: there is no
.next(), no .llseek(), and we don't implement any notion of file
position.  So our _show() functions generate a single big record that
contains the whole dump -- frequently multiple lines of text.

(This might or might not be desirable, but it is at least simple.)

If a _show() function in resctrl holds rdtgroup_mutex throughout, then
whatever it dumps will be dumped atomically with respect to other
resctrl operations that take this mutex.
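
(For contrast, the iterator shape that resctrl does *not* implement
would look roughly like this, emitting one record per group -- names
hypothetical:

	static void *grp_start(struct seq_file *s, loff_t *pos)
	{
		mutex_lock(&rdtgroup_mutex);
		return seq_list_start(&rdt_all_groups, *pos);
	}

	static void *grp_next(struct seq_file *s, void *v, loff_t *pos)
	{
		return seq_list_next(v, &rdt_all_groups, pos);
	}

	static void grp_stop(struct seq_file *s, void *v)
	{
		mutex_unlock(&rdtgroup_mutex);
	}

	static int grp_show(struct seq_file *s, void *v)
	{
		struct rdtgroup *rdtg = list_entry(v, struct rdtgroup,
						   rdtgroup_list);

		seq_printf(s, "%s//...\n", rdtg->kn->name);
		return 0;
	}

With that shape the mutex is only held per read() batch, which is
exactly why whole-file atomicity would be lost.)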


So, to flesh out your scenario:

> 
> If the total size of the resctrl file is very large, we have a potential issue:

Let's say it's 5KB.

> 1) User asks for 4KB, owns the resctrl mutex.

(Note, rdtgroup_mutex is only held temporarily inside the resctrlfs
backend to these operations; at the start of the process, it is not
held.)

> 2) resctrl uses seq_file and fills with more than 4KB

(It's actually seq_file that uses resctrl here via callbacks: seq_file
sits in between the VFS layer and resctrl.)

When a .show() callback is called, resctrl doesn't know how much data
to generate; it just writes stuff out with seq_printf() etc.

If there's too much to fit in the default seq_file buffer, the data
gets truncated and the seq_file will get internally marked as having
overflowed.  resctrl could check for this condition in order to avoid
formatting text that will get thrown away due to truncation, but this
is not required.  When the .show() callback returns, the seq_file
implementation will respond to the overflow by growing the buffer and
retrying the whole thing until this doesn't occur (see the loop
preceding the "Fill" label in seq_file.c:seq_read_iter().)

This terminates with a seq_file buffer that contains all the output
(untruncated), or with an -ENOMEM failure (which would be punted to
userspace).
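
(The optional check mentioned above would sit inside the show()
callback, something like this -- illustrative only:

	list_for_each_entry(rdtg, &rdt_all_groups, rdtgroup_list) {
		seq_printf(s, "%s//...\n", rdtg->kn->name);
		/*
		 * Truncation is already certain for this buffer size;
		 * stop formatting, seq_read_iter() will grow and retry.
		 */
		if (seq_has_overflowed(s))
			break;
	}

but, as described, seq_file copes either way.)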

So, assuming nothing went wrong, the seq_file buffer now has the 5KB of
data.  rdtgroup_mutex is not held (it was only held in the _show()
callback).

> 3) User gets the first 4KB, releases the resctrl mutex

Userspace gets the first 4KB, and seq_file's notion of the file
position is advanced by this amount, and the generated text is
kept in the seq_file's buffer.

> 4) Some other pending resctrl operation now gets the mutex and makes changes that affect the contents of this file

The un-read data remains buffered in seq_file.  Other resctrl
operations can happen, so the buffered data may become stale, but it is
still an atomic snapshot.

> 5) User asks for next 4K (when it reacquires resctrl mutex)

If an iterator is implemented, seq_file might try to generate another
record to fill the requested space.  But we don't have an iterator, so
the generated data remains as-is.

> 6) resctrl uses seq_file() to construct new image of file incorporating changes because of step 4

I think this happens only if the file is reopened or lseek()'d, and
only if .llseek() is wired up in struct file_operations.  Resctrl
doesn't seem to do this (whether by accident or by design).

So userspace just sees a non-seekable file.

> 7) User gets the second 4KB from the seq_file buffer (which doesn't fit cleanly next to data it got in step 3).

Userspace gets the final 1K of the data that was generated in response
to the original read() call.

If userspace tries to read again, it will get EOF (again, because we
don't have an iterator -- meaning that no additional records can be
generated).


I haven't traced in detail through the code, but that's my
understanding.

Cheers
---Dave