[PATCH v7 13/24] x86/resctrl: Queue mon_event_read() instead of sending an IPI

James Morse posted 24 patches 2 years, 1 month ago
There is a newer version of this series
[PATCH v7 13/24] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Posted by James Morse 2 years, 1 month ago
Intel is blessed with an abundance of monitors, one per RMID, that can be
read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
the number implemented is up to the manufacturer. This means when there are
fewer monitors than needed, they need to be allocated and freed.

MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
CSU counter is allowed to return 'not ready' for a small number of
micro-seconds after programming. To allow one CSU hardware monitor to be
used for multiple control or monitor groups, the CPU accessing the
monitor needs to be able to block when configuring and reading the
counter.

Worse, the domain may be broken up into slices, and the MMIO accesses
for each slice may need performing from different CPUs.

These two details mean MPAMs monitor code needs to be able to sleep, and
IPI another CPU in the domain to read from a resource that has been sliced.

mon_event_read() already invokes mon_event_count() via IPI, which means
this isn't possible. On systems using nohz-full, some CPUs need to be
interrupted to run kernel work as they otherwise stay in user-space
running realtime workloads. Interrupting these CPUs should be avoided,
and scheduling work on them may never complete.

Change mon_event_read() to pick a housekeeping CPU, (one that is not using
nohz_full) and schedule mon_event_count() and wait. If all the CPUs
in a domain are using nohz-full, then an IPI is used as the fallback.

This function is only used in response to a user-space filesystem request
(not the timing sensitive overflow code).

This allows MPAM to hide the slice behaviour from resctrl, and to keep
the monitor-allocation in monitor.c. When the IPI fallback is used on
machines where MPAM needs to make an access on multiple CPUs, the counter
read will always fail.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
Changes since v2:
 * Use cpumask_any_housekeeping() and fallback to an IPI if needed.

Changes since v3:
 * Actually include the IPI fallback code.

Changes since v4:
 * Tinkered with existing capitalisation.

Changes since v5:
 * Added a newline.

Changes since v6:
 * Moved lockdep annotations to a later patch.
---
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 26 +++++++++++++++++++++--
 arch/x86/kernel/cpu/resctrl/monitor.c     |  2 +-
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index beccb0e87ba7..d07f99245851 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -19,6 +19,8 @@
 #include <linux/kernfs.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
+#include <linux/tick.h>
+
 #include "internal.h"
 
 /*
@@ -522,12 +524,21 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
 	return ret;
 }
 
+static int smp_mon_event_count(void *arg)
+{
+	mon_event_count(arg);
+
+	return 0;
+}
+
 void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
 		    int evtid, int first)
 {
+	int cpu;
+
 	/*
-	 * setup the parameters to send to the IPI to read the data.
+	 * Setup the parameters to pass to mon_event_count() to read the data.
 	 */
 	rr->rgrp = rdtgrp;
 	rr->evtid = evtid;
@@ -536,7 +547,18 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	rr->val = 0;
 	rr->first = first;
 
-	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	cpu = cpumask_any_housekeeping(&d->cpu_mask);
+
+	/*
+	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but
+	 * are all the CPUs nohz_full? If yes, pick a CPU to IPI.
+	 * MPAM's resctrl_arch_rmid_read() is unable to read the
+	 * counters on some platforms if its called in irq context.
+	 */
+	if (tick_nohz_full_cpu(cpu))
+		smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+	else
+		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
 }
 
 int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 718770aea2af..fa3319021881 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -588,7 +588,7 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
 }
 
 /*
- * This is called via IPI to read the CQM/MBM counters
+ * This is scheduled by mon_event_read() to read the CQM/MBM counters
  * on a domain.
  */
 void mon_event_count(void *info)
-- 
2.39.2
Re: [PATCH v7 13/24] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Posted by Moger, Babu 2 years, 1 month ago
Hi James,

On 10/25/23 13:03, James Morse wrote:
> Intel is blessed with an abundance of monitors, one per RMID, that can be
> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
> 
> MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
> CSU counter is allowed to return 'not ready' for a small number of
> micro-seconds after programming. To allow one CSU hardware monitor to be
> used for multiple control or monitor groups, the CPU accessing the
> monitor needs to be able to block when configuring and reading the
> counter.
> 
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
> 
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.
> 
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible. On systems using nohz-full, some CPUs need to be
> interrupted to run kernel work as they otherwise stay in user-space
> running realtime workloads. Interrupting these CPUs should be avoided,
> and scheduling work on them may never complete.
> 
> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
> in a domain are using nohz-full, then an IPI is used as the fallback.
> 
> This function is only used in response to a user-space filesystem request
> (not the timing sensitive overflow code).
> 
> This allows MPAM to hide the slice behaviour from resctrl, and to keep
> the monitor-allocation in monitor.c. When the IPI fallback is used on
> machines where MPAM needs to make an access on multiple CPUs, the counter
> read will always fail.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Tested-by: Peter Newman <peternewman@google.com>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> Changes since v2:
>  * Use cpumask_any_housekeeping() and fallback to an IPI if needed.
> 
> Changes since v3:
>  * Actually include the IPI fallback code.
> 
> Changes since v4:
>  * Tinkered with existing capitalisation.
> 
> Changes since v5:
>  * Added a newline.
> 
> Changes since v6:
>  * Moved lockdep annotations to a later patch.
> ---
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 26 +++++++++++++++++++++--
>  arch/x86/kernel/cpu/resctrl/monitor.c     |  2 +-
>  2 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index beccb0e87ba7..d07f99245851 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -19,6 +19,8 @@
>  #include <linux/kernfs.h>
>  #include <linux/seq_file.h>
>  #include <linux/slab.h>
> +#include <linux/tick.h>
> +
>  #include "internal.h"
>  
>  /*
> @@ -522,12 +524,21 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
>  	return ret;
>  }
>  
> +static int smp_mon_event_count(void *arg)
> +{
> +	mon_event_count(arg);
> +
> +	return 0;
> +}

Shouldn't this function defined as "void" similar to mon_event_count?
Return code is not used anywhere.

> +
>  void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  		    struct rdt_domain *d, struct rdtgroup *rdtgrp,
>  		    int evtid, int first)
>  {
> +	int cpu;
> +
>  	/*
> -	 * setup the parameters to send to the IPI to read the data.
> +	 * Setup the parameters to pass to mon_event_count() to read the data.
>  	 */
>  	rr->rgrp = rdtgrp;
>  	rr->evtid = evtid;
> @@ -536,7 +547,18 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>  	rr->val = 0;
>  	rr->first = first;
>  
> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> +	cpu = cpumask_any_housekeeping(&d->cpu_mask);
> +
> +	/*
> +	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but
> +	 * are all the CPUs nohz_full? If yes, pick a CPU to IPI.
> +	 * MPAM's resctrl_arch_rmid_read() is unable to read the
> +	 * counters on some platforms if its called in irq context.
> +	 */
> +	if (tick_nohz_full_cpu(cpu))
> +		smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> +	else
> +		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>  }
>  
>  int rdtgroup_mondata_show(struct seq_file *m, void *arg)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 718770aea2af..fa3319021881 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -588,7 +588,7 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
>  }
>  
>  /*
> - * This is called via IPI to read the CQM/MBM counters
> + * This is scheduled by mon_event_read() to read the CQM/MBM counters
>   * on a domain.
>   */
>  void mon_event_count(void *info)

-- 
Thanks
Babu Moger
Re: [PATCH v7 13/24] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Posted by James Morse 2 years ago
Hi Babu,

On 09/11/2023 20:40, Moger, Babu wrote:
> On 10/25/23 13:03, James Morse wrote:
>> Intel is blessed with an abundance of monitors, one per RMID, that can be
>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>> the number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
>> CSU counter is allowed to return 'not ready' for a small number of
>> micro-seconds after programming. To allow one CSU hardware monitor to be
>> used for multiple control or monitor groups, the CPU accessing the
>> monitor needs to be able to block when configuring and reading the
>> counter.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses
>> for each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and
>> IPI another CPU in the domain to read from a resource that has been sliced.
>>
>> mon_event_read() already invokes mon_event_count() via IPI, which means
>> this isn't possible. On systems using nohz-full, some CPUs need to be
>> interrupted to run kernel work as they otherwise stay in user-space
>> running realtime workloads. Interrupting these CPUs should be avoided,
>> and scheduling work on them may never complete.
>>
>> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
>> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
>> in a domain are using nohz-full, then an IPI is used as the fallback.
>>
>> This function is only used in response to a user-space filesystem request
>> (not the timing sensitive overflow code).
>>
>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>> the monitor-allocation in monitor.c. When the IPI fallback is used on
>> machines where MPAM needs to make an access on multiple CPUs, the counter
>> read will always fail.

>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> index beccb0e87ba7..d07f99245851 100644
>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c

>> @@ -522,12 +524,21 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
>>  	return ret;
>>  }
>>  
>> +static int smp_mon_event_count(void *arg)
>> +{
>> +	mon_event_count(arg);
>> +
>> +	return 0;
>> +}
> 
> Shouldn't this function defined as "void" similar to mon_event_count?
> Return code is not used anywhere.

smp_call_on_cpu() requires it to return an int, even if the value is not used.

This wrapper only exists because smp_call_on_cpu() takes a different prototype to
smp_call_function_any().


Thanks,

James

>> @@ -536,7 +547,18 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>  	rr->val = 0;
>>  	rr->first = first;
>>  
>> -	smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>> +	cpu = cpumask_any_housekeeping(&d->cpu_mask);
>> +
>> +	/*
>> +	 * cpumask_any_housekeeping() prefers housekeeping CPUs, but
>> +	 * are all the CPUs nohz_full? If yes, pick a CPU to IPI.
>> +	 * MPAM's resctrl_arch_rmid_read() is unable to read the
>> +	 * counters on some platforms if its called in irq context.
>> +	 */
>> +	if (tick_nohz_full_cpu(cpu))
>> +		smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>> +	else
>> +		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
>>  }
>>
Re: [PATCH v7 13/24] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Posted by Reinette Chatre 2 years, 1 month ago
Hi James,

On 10/25/2023 11:03 AM, James Morse wrote:
> Intel is blessed with an abundance of monitors, one per RMID, that can be
> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
> 
> MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
> CSU counter is allowed to return 'not ready' for a small number of
> micro-seconds after programming. To allow one CSU hardware monitor to be
> used for multiple control or monitor groups, the CPU accessing the
> monitor needs to be able to block when configuring and reading the
> counter.
> 
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
> 
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.
> 
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible. On systems using nohz-full, some CPUs need to be
> interrupted to run kernel work as they otherwise stay in user-space
> running realtime workloads. Interrupting these CPUs should be avoided,
> and scheduling work on them may never complete.
> 
> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
> in a domain are using nohz-full, then an IPI is used as the fallback.
> 
> This function is only used in response to a user-space filesystem request
> (not the timing sensitive overflow code).
> 
> This allows MPAM to hide the slice behaviour from resctrl, and to keep
> the monitor-allocation in monitor.c. When the IPI fallback is used on
> machines where MPAM needs to make an access on multiple CPUs, the counter
> read will always fail.
> 
> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Tested-by: Peter Newman <peternewman@google.com>
> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
> Reviewed-by: Peter Newman <peternewman@google.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---

Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>

Reinette