MPAM's cache occupancy counters can take a little while to settle once
the monitor has been configured. The maximum settling time is described
to the driver via a firmware table. The value could be large enough
that it makes sense to sleep. To avoid exposing this to resctrl, it
should be hidden behind MPAM's resctrl_arch_rmid_read().
resctrl_arch_rmid_read() may be called via IPI meaning it is unable
to sleep. In this case resctrl_arch_rmid_read() should return an error
if it needs to sleep. This will only affect MPAM platforms where
the cache occupancy counter isn't available immediately, nohz_full is
in use, and there are no housekeeping CPUs in the necessary
domain.
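To sketch what that contract looks like on the arm64 side (the mpam_*()
helpers below are hypothetical, for illustration only):

	int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
				   u32 closid, u32 rmid,
				   enum resctrl_event_id eventid, u64 *val)
	{
		/* Counters can report 'not ready' just after being programmed. */
		while (mpam_monitor_not_ready(d, closid, rmid, eventid)) {
			/* Called via IPI: sleeping is impossible, give up. */
			if (irqs_disabled())
				return -EBUSY;
			/* The maximum settling time comes from a firmware table. */
			msleep(mpam_max_settle_time_ms());
		}
		return mpam_read_counter(d, closid, rmid, eventid, val);
	}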
There are three callers of resctrl_arch_rmid_read():
__mon_event_count() and __check_limbo() are both called from a
non-migrateable context. mon_event_read() invokes __mon_event_count()
using smp_call_on_cpu(), which adds work to the target CPU's workqueue.
rdtgroup_mutex is held, meaning this cannot race with the resctrl
cpuhp callback. __check_limbo() is invoked via schedule_delayed_work_on(),
which also adds work to a per-cpu workqueue.
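As a condensed sketch of those two paths (the names follow the resctrl
code, but the details are elided):

	/* mon_event_read(): run the read on a CPU belonging to the domain. */
	static int smp_mon_event_count(void *arg)
	{
		mon_event_count(arg);	/* ends up in __mon_event_count() */
		return 0;
	}
	/* The work runs in a kworker on @cpu, where it is free to sleep. */
	smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);

	/* cqm_setup_limbo_handler(): __check_limbo() runs from this worker. */
	schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);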
The remaining call is add_rmid_to_limbo(), which is called in response
to a user-space syscall that frees an RMID. This opportunistically
reads the LLC occupancy counter on the current domain to see if the
RMID is over the dirty threshold. This has to disable preemption to
avoid reading the wrong domain's value. Disabling preemption here
prevents resctrl_arch_rmid_read() from sleeping.
add_rmid_to_limbo() walks each domain, but only reads the counter
on one domain. If the system has more than one domain, the RMID will
always be added to the limbo list. If the RMID's usage was not over the
threshold, it will be removed from the list when __check_limbo() runs.
Make this the default behaviour: free RMIDs are always added to the
limbo list for each domain.
The user-visible effect of this is that a clean RMID is not available
for re-allocation immediately after 'rmdir()' completes. This behaviour
was never portable, as it never happened on a machine with multiple
domains.
Removing this path allows resctrl_arch_rmid_read() to sleep if it is called
with interrupts unmasked. Document that this is the expected behaviour, and
add a might_sleep() annotation to catch changes that won't work on arm64.
Signed-off-by: James Morse <james.morse@arm.com>
---
The previous version allowed resctrl_arch_rmid_read() to be called on the
wrong CPUs, but now that this needs to take nohz_full and housekeeping into
account, it's too complex.
Changes since v3:
* Removed error handling for smp_call_function_any(); this can't race
with the cpuhp callbacks as both hold rdtgroup_mutex.
* Switched to the alternative of removing the counter read, which simplifies
things dramatically.
Changes since v4:
* Messed with capitalisation.
* Removed some dead code now that entry->busy will never be zero in
add_rmid_to_limbo().
* Rephrased the comment above resctrl_arch_rmid_read_context_check().
---
arch/x86/kernel/cpu/resctrl/monitor.c | 24 +++++-------------------
include/linux/resctrl.h | 18 +++++++++++++++++-
2 files changed, 22 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 32569354c4f1..08e3307863c3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -283,6 +283,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
struct arch_mbm_state *am;
int ret = 0;
+ resctrl_arch_rmid_read_context_check();
+
if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
return -EINVAL;
@@ -470,8 +472,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_domain *d;
- int cpu, err;
- u64 val = 0;
u32 idx;
lockdep_assert_held(&rdtgroup_mutex);
@@ -479,17 +479,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
entry->busy = 0;
- cpu = get_cpu();
list_for_each_entry(d, &r->domains, list) {
- if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
- err = resctrl_arch_rmid_read(r, d, entry->closid,
- entry->rmid,
- QOS_L3_OCCUP_EVENT_ID,
- &val);
- if (err || val <= resctrl_rmid_realloc_threshold)
- continue;
- }
-
/*
* For the first limbo RMID in the domain,
* setup up the limbo worker.
@@ -499,14 +489,10 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
- put_cpu();
- if (entry->busy) {
- rmid_limbo_count++;
- if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
- closid_num_dirty_rmid[entry->closid]++;
- } else
- list_add_tail(&entry->list, &rmid_free_lru);
+ rmid_limbo_count++;
+ if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID))
+ closid_num_dirty_rmid[entry->closid]++;
}
void free_rmid(u32 closid, u32 rmid)
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 660752406174..f7311102e94c 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -236,7 +236,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
* @eventid: eventid to read, e.g. L3 occupancy.
* @val: result of the counter read in bytes.
*
- * Call from process context on a CPU that belongs to domain @d.
+ * Some architectures need to sleep when first programming some of the counters.
+ * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
+ * for a short period of time). Call from a non-migrateable process context on
+ * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
+ * schedule_work_on(). This function can be called with interrupts masked,
+ * e.g. using smp_call_function_any(), but may consistently return an error.
*
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
@@ -245,6 +250,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val);
+/**
+ * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
+ *
+ * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
+ * resctrl_arch_rmid_read() is called with preemption disabled.
+ */
+static inline void resctrl_arch_rmid_read_context_check(void)
+{
+ if (!irqs_disabled())
+ might_sleep();
+}
/**
* resctrl_arch_reset_rmid() - Reset any private state associated with rmid
--
2.39.2
Hi James,
On 7/28/2023 9:42 AM, James Morse wrote:
> [...]
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 660752406174..f7311102e94c 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -236,7 +236,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> * @eventid: eventid to read, e.g. L3 occupancy.
> * @val: result of the counter read in bytes.
> *
> - * Call from process context on a CPU that belongs to domain @d.
> + * Some architectures need to sleep when first programming some of the counters.
> + * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
> + * for a short period of time). Call from a non-migrateable process context on
> + * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
> + * schedule_work_on(). This function can be called with interrupts masked,
> + * e.g. using smp_call_function_any(), but may consistently return an error.
Considering that smp_call_function_any() explicitly disables preemption, I
would like to learn more about why you chose the wording "interrupts masked" vs
"preemption disabled"?
> *
> * Return:
> * 0 on success, or -EIO, -EINVAL etc on error.
> @@ -245,6 +250,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> u32 closid, u32 rmid, enum resctrl_event_id eventid,
> u64 *val);
>
> +/**
> + * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
> + *
> + * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
> + * resctrl_arch_rmid_read() is called with preemption disabled.
> + */
> +static inline void resctrl_arch_rmid_read_context_check(void)
> +{
> + if (!irqs_disabled())
> + might_sleep();
> +}
Apologies but even after rereading the patch as well as your response to
the previous patch version several times I am not able to understand why the
code looks like the above. If, as the comment above says, a
warning should be generated with preemption disabled, then should it not
just be "might_sleep()" without the "!irqs_disabled()" check?
I understand how for MPAM you want its code to be called in two different
contexts so I assume that the MPAM code would have two different paths,
one that can sleep and the other that cannot, both valid. It thus sounds
as though you want the x86 code to have context checks so that any issues
that could impact arm can be caught on x86? In that case, should the
x86 code also rather have two paths (one unused and the other with the
context check)?
>
> /**
> * resctrl_arch_reset_rmid() - Reset any private state associated with rmid
Reinette
Hi Reinette,
On 09/08/2023 23:36, Reinette Chatre wrote:
> On 7/28/2023 9:42 AM, James Morse wrote:
>> [...]
>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index 660752406174..f7311102e94c 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -236,7 +236,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
>> * @eventid: eventid to read, e.g. L3 occupancy.
>> * @val: result of the counter read in bytes.
>> *
>> - * Call from process context on a CPU that belongs to domain @d.
>> + * Some architectures need to sleep when first programming some of the counters.
>> + * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
>> + * for a short period of time). Call from a non-migrateable process context on
>> + * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
>> + * schedule_work_on(). This function can be called with interrupts masked,
>> + * e.g. using smp_call_function_any(), but may consistently return an error.
>
> Considering that smp_call_function_any() explicitly disables preemption, I
> would like to learn more about why you chose the wording "interrupts masked" vs
> "preemption disabled"?
smp_call_function_any() disables preemption while it works out which CPU to run on, which
may be this CPU; the caller can't be migrated once it has picked the CPU to run on. But
actually doing the work is done by generic_exec_single(). This masks interrupts if calling
locally, or invokes __smp_call_single_queue() to raise the IPI. Obviously the other end of
an IPI runs with interrupts masked.
(If you wanted to schedule work on a remote CPU, that would be smp_call_on_cpu())
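To sketch the resulting contexts (the read_cntr_*() handlers are made-up
names):

	/* Runs locally with interrupts masked, or at the receiving end of
	 * an IPI: either way it must not sleep. */
	static void read_cntr_ipi(void *info) { }

	/* Runs from a kworker in process context: it may sleep. */
	static int read_cntr_worker(void *info) { return 0; }

	/* Picks a CPU in @mask and runs the function there, IPI-style. */
	smp_call_function_any(&d->cpu_mask, read_cntr_ipi, NULL, true);

	/* Queues the function on @cpu's workqueue and waits for it. */
	smp_call_on_cpu(cpu, read_cntr_worker, NULL, false);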
>> *
>> * Return:
>> * 0 on success, or -EIO, -EINVAL etc on error.
>> @@ -245,6 +250,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>> u32 closid, u32 rmid, enum resctrl_event_id eventid,
>> u64 *val);
>>
>> +/**
>> + * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
>> + *
>> + * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
>> + * resctrl_arch_rmid_read() is called with preemption disabled.
>> + */
>> +static inline void resctrl_arch_rmid_read_context_check(void)
>> +{
>> + if (!irqs_disabled())
>> + might_sleep();
>> +}
> Apologies but even after rereading the patch as well as your response to
> the previous patch version several times I am not able to understand why the
> code looks like the above. If, as the comment above says, a
> warning should be generated with preemption disabled, then should it not
> just be "might_sleep()" without the "!irqs_disabled()" check?
This would be simpler. But for NOHZ_FULL you wanted to keep the IPI, so the contract with
resctrl_arch_rmid_read() is that if interrupts are unmasked, it can sleep.
If it needs to sleep, the arch code has to check.
A bare might_sleep() would fire when called via IPI when NOHZ_FULL is enabled.
This check is about ensuring all code paths get checked for this condition, as the
condition doesn't otherwise matter on x86.
This results in MPAM's implementation of resctrl_arch_rmid_read() checking if interrupts
are masked before sending an IPI when it has to read the counters from a set of CPUs. In
the NOHZ_FULL case it can't do this, so it will always return an error.
Platforms needing this should be few and far between; I'm hoping people running NOHZ_FULL
on them are even rarer... they'd need to carefully select their housekeeping CPUs to make
this work.
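Roughly this shape, where mpam_read_cntr_ipi() stands in for the real MPAM
handler:

	/* Hypothetical sketch: the counter must be read on a CPU in @mask. */
	static int mpam_read_counter_on(const struct cpumask *mask, void *arg)
	{
		/* Can't send an IPI and wait for it with interrupts masked. */
		if (irqs_disabled())
			return -EIO;
		smp_call_function_any(mask, mpam_read_cntr_ipi, arg, true);
		return 0;
	}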
> I understand how for MPAM you want its code to be called in two different
> contexts so I assume that the MPAM code would have two different paths,
> one that can sleep and the other that cannot, both valid. It thus sounds
> as though you want the x86 code to have context checks so that any issues
> that could impact arm can be caught on x86? In that case, should the
> x86 code also rather have two paths (one unused and the other with the
> context check)?
I did toy with having resctrl_arch_rmid_read_nosleep() and resctrl_arch_rmid_read(). But
this resulted in more code for both architectures; I felt it was simpler to just document
this requirement with this check. It's what resctrl is already doing.
resctrl_arch_rmid_read_nosleep() could be called from irq context.
resctrl_arch_rmid_read() can sleep.
On x86 resctrl_arch_rmid_read() would call resctrl_arch_rmid_read_nosleep() ... and on
arm64 the exact same thing would happen, as the irqs_disabled() check is buried deep
in the mpam driver; the resctrl glue code doesn't need to check for this.
The split approach would be simpler to document - but much more confusing as both
architectures call one helper from the other.
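For reference, the rejected split would have looked something like this on
x86 (neither function exists in this form):

	int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
				   u32 closid, u32 rmid,
				   enum resctrl_event_id eventid, u64 *val)
	{
		/* x86 counter reads never sleep, so this is just a wrapper. */
		return resctrl_arch_rmid_read_nosleep(r, d, closid, rmid,
						      eventid, val);
	}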
Thanks,
James
Hi James,
On 8/24/2023 9:56 AM, James Morse wrote:
> On 09/08/2023 23:36, Reinette Chatre wrote:
>> On 7/28/2023 9:42 AM, James Morse wrote:
>>> [...]
>
>
>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>> index 660752406174..f7311102e94c 100644
>>> --- a/include/linux/resctrl.h
>>> +++ b/include/linux/resctrl.h
>>> @@ -236,7 +236,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
>>> * @eventid: eventid to read, e.g. L3 occupancy.
>>> * @val: result of the counter read in bytes.
>>> *
>>> - * Call from process context on a CPU that belongs to domain @d.
>>> + * Some architectures need to sleep when first programming some of the counters.
>>> + * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
>>> + * for a short period of time). Call from a non-migrateable process context on
>>> + * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
>>> + * schedule_work_on(). This function can be called with interrupts masked,
>>> + * e.g. using smp_call_function_any(), but may consistently return an error.
>>
>> Considering that smp_call_function_any() explicitly disables preemption, I
>> would like to learn more about why you chose the wording "interrupts masked" vs
>> "preemption disabled"?
>
> smp_call_function_any() disables preemption while it works out which CPU to run on, which
> may be this CPU; the caller can't be migrated once it has picked the CPU to run on. But
> actually doing the work is done by generic_exec_single(). This masks interrupts if calling
> locally, or invokes __smp_call_single_queue() to raise the IPI. Obviously the other end of
> an IPI runs with interrupts masked.
I see, thank you for the detailed explanation.
>
> (If you wanted to schedule work on a remote CPU, that would be smp_call_on_cpu())
>
>
>>> *
>>> * Return:
>>> * 0 on success, or -EIO, -EINVAL etc on error.
>>> @@ -245,6 +250,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>>> u32 closid, u32 rmid, enum resctrl_event_id eventid,
>>> u64 *val);
>>>
>>> +/**
>>> + * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
>>> + *
>>> + * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
>>> + * resctrl_arch_rmid_read() is called with preemption disabled.
>>> + */
>>> +static inline void resctrl_arch_rmid_read_context_check(void)
>>> +{
>>> + if (!irqs_disabled())
>>> + might_sleep();
>>> +}
>
>> Apologies but even after rereading the patch as well as your response to
>> the previous patch version several times I am not able to understand why the
>> code looks like the above. If, as the comment above says, a
>> warning should be generated with preemption disabled, then should it not
>> just be "might_sleep()" without the "!irqs_disabled()" check?
>
> This would be simpler. But for NOHZ_FULL you wanted to keep the IPI, so the contract with
> resctrl_arch_rmid_read() is that if interrupts are unmasked, it can sleep.
Thank you. This appears to be the key. Could you please add this
information to resctrl_arch_rmid_read_context_check()'s description?
> [...]
> I did toy with having resctrl_arch_rmid_read_nosleep() and resctrl_arch_rmid_read(). But
> this resulted in more code for both architectures; I felt it was simpler to just document
> this requirement with this check. It's what resctrl is already doing.
>
> resctrl_arch_rmid_read_nosleep() could be called from irq context.
> resctrl_arch_rmid_read() can sleep.
>
> On x86 resctrl_arch_rmid_read() would call resctrl_arch_rmid_read_nosleep() ... and on
> arm64 the exact same thing would happen, as the irqs_disabled() check is buried deep
> in the mpam driver; the resctrl glue code doesn't need to check for this.
>
> The split approach would be simpler to document - but much more confusing as both
> architectures call one helper from the other.
I see. Thank you for considering the idea.
Reinette
Hi Reinette,
On 8/25/23 00:02, Reinette Chatre wrote:
> On 8/24/2023 9:56 AM, James Morse wrote:
>> On 09/08/2023 23:36, Reinette Chatre wrote:
>>> On 7/28/2023 9:42 AM, James Morse wrote:
>>>> [...]
>>
>>
>>>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>>>> index 660752406174..f7311102e94c 100644
>>>> --- a/include/linux/resctrl.h
>>>> +++ b/include/linux/resctrl.h
>>>> @@ -245,6 +250,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>>>> u32 closid, u32 rmid, enum resctrl_event_id eventid,
>>>> u64 *val);
>>>>
>>>> +/**
>>>> + * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
>>>> + *
>>>> + * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
>>>> + * resctrl_arch_rmid_read() is called with preemption disabled.
>>>> + */
>>>> +static inline void resctrl_arch_rmid_read_context_check(void)
>>>> +{
>>>> + if (!irqs_disabled())
>>>> + might_sleep();
>>>> +}
>>
>>> Apologies but even after rereading the patch as well as your response to
>>> the previous patch version several times I am not able to understand why the
>>> code looks like the above. If, as the comment above says, a
>>> warning should be generated with preemption disabled, then should it not
>>> just be "might_sleep()" without the "!irqs_disabled()" check?
>>
>> This would be simpler. But for NOHZ_FULL you wanted to keep the IPI, so the contract with
>> resctrl_arch_rmid_read() is that if interrupts are unmasked, it can sleep.
>
> Thank you. This appears to be the key. Could you please add this
> information to resctrl_arch_rmid_read_context_check()'s description?
That comment now reads:
* resctrl_arch_rmid_read_context_check() - warn about invalid contexts
*
* When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
* resctrl_arch_rmid_read() is called with preemption disabled.
*
* The contract with resctrl_arch_rmid_read() is that if interrupts
* are unmasked, it can sleep. This allows NOHZ_FULL systems to use an
* IPI (and fail if the call needed to sleep), while most of the time
* the work is scheduled, allowing the call to sleep.
Thanks,
James
Hi James,
On 9/8/2023 8:58 AM, James Morse wrote:
>
> That comment now reads:
> * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
> *
> * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
> * resctrl_arch_rmid_read() is called with preemption disabled.
> *
> * The contract with resctrl_arch_rmid_read() is that if interrupts
> * are unmasked, it can sleep. This allows NOHZ_FULL systems to use an
> * IPI (and fail if the call needed to sleep), while most of the time
> * the work is scheduled, allowing the call to sleep.
>
Thank you very much.
Reinette