[PATCH] xen/smp: Speed up on_selected_cpus()

Andrew Cooper posted 1 patch 2 years, 3 months ago
Test gitlab-ci passed
Patches applied successfully (tree, apply log)
git fetch https://gitlab.com/xen-project/patchew/xen tags/patchew/20220204203115.13290-1-andrew.cooper3@citrix.com
xen/common/smp.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
[PATCH] xen/smp: Speed up on_selected_cpus()
Posted by Andrew Cooper 2 years, 3 months ago
cpumask_weight() is a horribly expensive way to find if no bits are set, made
worse by the fact that the calculation is performed with the global call_lock
held.

Switch to using cpumask_empty() instead, which will short circuit as soon as
it find any set bit in the cpumask.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Juergen Gross <jgross@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>

I have not done performance testing, but I would be surprised if this cannot
be measured on a busy or large box.
---
 xen/common/smp.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/xen/common/smp.c b/xen/common/smp.c
index 781bcf2c246c..a011f541f1ea 100644
--- a/xen/common/smp.c
+++ b/xen/common/smp.c
@@ -50,8 +50,6 @@ void on_selected_cpus(
     void *info,
     int wait)
 {
-    unsigned int nr_cpus;
-
     ASSERT(local_irq_is_enabled());
     ASSERT(cpumask_subset(selected, &cpu_online_map));
 
@@ -59,8 +57,7 @@ void on_selected_cpus(
 
     cpumask_copy(&call_data.selected, selected);
 
-    nr_cpus = cpumask_weight(&call_data.selected);
-    if ( nr_cpus == 0 )
+    if ( cpumask_empty(&call_data.selected) )
         goto out;
 
     call_data.func = func;
-- 
2.11.0


Re: [PATCH] xen/smp: Speed up on_selected_cpus()
Posted by Jan Beulich 2 years, 2 months ago
On 04.02.2022 21:31, Andrew Cooper wrote:
> cpumask_weight() is a horribly expensive way to find if no bits are set, made
> worse by the fact that the calculation is performed with the global call_lock
> held.
> 
> Switch to using cpumask_empty() instead, which will short circuit as soon as
> it find any set bit in the cpumask.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

May I suggest to drop "horribly"? How expensive one is compared to the other
depends on the number of CPUs actually enumerated in the system. (And of
course I still have that conversion to POPCNT alternatives patching pending,
where Roger did ask for some re-work in reply to v2, but where it has
remained unclear whether investing time into that wouldn't be in vein,
considering some of your replies on v1. Thus would have further shrunk the
difference, without me meaning to say the change here isn't a good one.)

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


Re: [PATCH] xen/smp: Speed up on_selected_cpus()
Posted by Andrew Cooper 2 years, 2 months ago
On 07/02/2022 08:11, Jan Beulich wrote:
> On 04.02.2022 21:31, Andrew Cooper wrote:
>> cpumask_weight() is a horribly expensive way to find if no bits are set, made
>> worse by the fact that the calculation is performed with the global call_lock
>> held.
>>
>> Switch to using cpumask_empty() instead, which will short circuit as soon as
>> it find any set bit in the cpumask.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> May I suggest to drop "horribly"? How expensive one is compared to the other
> depends on the number of CPUs actually enumerated in the system.

In absolute terms perhaps, but they both scale as O(nr_cpus).  Hamming
weight has a far larger constant.

>  (And of
> course I still have that conversion to POPCNT alternatives patching pending,
> where Roger did ask for some re-work in reply to v2, but where it has
> remained unclear whether investing time into that wouldn't be in vein,
> considering some of your replies on v1. Thus would have further shrunk the
> difference, without me meaning to say the change here isn't a good one.)

There is a perfectly clear and simple way forward.  It's the one which
doesn't fight the optimiser and actively regress the code generation in
the calling functions, and add an unreasonable quantity technical debt
into the marginal paths.

I will ack a version where you're not adding complexity for negative gains.

~Andrew
Re: [PATCH] xen/smp: Speed up on_selected_cpus()
Posted by Jan Beulich 2 years, 2 months ago
On 07.02.2022 18:06, Andrew Cooper wrote:
> On 07/02/2022 08:11, Jan Beulich wrote:
>>  (And of
>> course I still have that conversion to POPCNT alternatives patching pending,
>> where Roger did ask for some re-work in reply to v2, but where it has
>> remained unclear whether investing time into that wouldn't be in vein,
>> considering some of your replies on v1. Thus would have further shrunk the
>> difference, without me meaning to say the change here isn't a good one.)
> 
> There is a perfectly clear and simple way forward.  It's the one which
> doesn't fight the optimiser and actively regress the code generation in
> the calling functions, and add an unreasonable quantity technical debt
> into the marginal paths.
> 
> I will ack a version where you're not adding complexity for negative gains.

Thanks, at least some form of a reply. I'm afraid I can't really translate
this to which parts of the change you'd be okay with and which parts need
changing. I didn't think I would "fight the optimiser and actively regress
the code generation in the calling functions" in v2 (this may have been
different in v1, but I haven't gone back to check; I wonder though whether
you're mixing this with e.g. the BMI patching series I've long given up on).

Jan


Re: [PATCH] xen/smp: Speed up on_selected_cpus()
Posted by Julien Grall 2 years, 2 months ago
Hi,

On 04/02/2022 20:31, Andrew Cooper wrote:
> cpumask_weight() is a horribly expensive way to find if no bits are set, made
> worse by the fact that the calculation is performed with the global call_lock
> held.

I looked at the archive because I was wondering why we were using 
cpumask_weight here. It looks like this was a left-over of the rework in 
ac3fc35d919c "x86: Fix flush_area_mask() and on_selected_cpus() to not 
race updates".

> 
> Switch to using cpumask_empty() instead, which will short circuit as soon as
> it find any set bit in the cpumask.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Julien Grall <jgrall@amazon.com>

> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Juergen Gross <jgross@suse.com>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Julien Grall <julien@xen.org>
> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
> CC: Bertrand Marquis <bertrand.marquis@arm.com>
> 
> I have not done performance testing, but I would be surprised if this cannot
> be measured on a busy or large box.
> ---
>   xen/common/smp.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/xen/common/smp.c b/xen/common/smp.c
> index 781bcf2c246c..a011f541f1ea 100644
> --- a/xen/common/smp.c
> +++ b/xen/common/smp.c
> @@ -50,8 +50,6 @@ void on_selected_cpus(
>       void *info,
>       int wait)
>   {
> -    unsigned int nr_cpus;
> -
>       ASSERT(local_irq_is_enabled());
>       ASSERT(cpumask_subset(selected, &cpu_online_map));
>   
> @@ -59,8 +57,7 @@ void on_selected_cpus(
>   
>       cpumask_copy(&call_data.selected, selected);
>   
> -    nr_cpus = cpumask_weight(&call_data.selected);
> -    if ( nr_cpus == 0 )
> +    if ( cpumask_empty(&call_data.selected) )
>           goto out;
>   
>       call_data.func = func;

-- 
Julien Grall

Re: [PATCH] xen/smp: Speed up on_selected_cpus()
Posted by Andrew Cooper 2 years, 2 months ago
On 06/02/2022 19:40, Julien Grall wrote:
> Hi,
>
> On 04/02/2022 20:31, Andrew Cooper wrote:
>> cpumask_weight() is a horribly expensive way to find if no bits are
>> set, made
>> worse by the fact that the calculation is performed with the global
>> call_lock
>> held.
>
> I looked at the archive because I was wondering why we were using
> cpumask_weight here. It looks like this was a left-over of the rework
> in ac3fc35d919c "x86: Fix flush_area_mask() and on_selected_cpus() to
> not race updates".

That change shuffled the code, but didn't introduce the problem.

I'm pretty sure it was 433f14699d48 which dropped the !=0 user of nr_cpus.


Talking of, there is more efficiency to be gained by reworking the
second cpumask_empty() call to not restart from 0 on failure, because
that removes useless reads.


>
>>
>> Switch to using cpumask_empty() instead, which will short circuit as
>> soon as
>> it find any set bit in the cpumask.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> Reviewed-by: Julien Grall <jgrall@amazon.com>

Thanks.

~Andrew