[v4] sched, net: NUMA-aware CPU spreading interface

[PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface

Posted by Valentin Schneider 1 year, 7 months ago

Hi folks,

Tariq pointed out in [1] that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness (cpumask_local_spread() doesn't quite cut
it).

The interface proposed there involved an array of CPUs and a temporary
cpumask. Being my difficult self, what I'm proposing here instead is an
interface that doesn't require any temporary storage other than some stack
variables (at the cost of one wild macro).
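
To make the "no temporary storage" idea concrete, here is a minimal userspace sketch, not the kernel code from this series. It assumes a single uint64_t can stand in for a cpumask and that the per-hop masks are cumulative (each one contains all CPUs within that many hops). The names spread_cpus(), MODEL_NR_CPUS and hop_masks are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64	/* assumption: one uint64_t models a cpumask */

/*
 * Fill out[] with up to max CPU ids, nearest hops first.
 *
 * hop_masks[i] is assumed to hold every CPU within i hops of the node of
 * interest, so each mask is a superset of the previous one.
 * Returns the number of CPU ids written.
 */
static unsigned int spread_cpus(const uint64_t *hop_masks, unsigned int nr_hops,
				unsigned int *out, unsigned int max)
{
	static const uint64_t none = 0;	/* "previous" mask for hop 0 */
	const uint64_t *prev = &none;
	unsigned int n = 0;

	assert(hop_masks && out);

	for (unsigned int hop = 0; hop < nr_hops; hop++) {
		const uint64_t *cur = &hop_masks[hop];

		/* Only visit CPUs that are new at this distance: cur AND NOT prev. */
		for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
			if (!(((*cur & ~*prev) >> cpu) & 1))
				continue;
			if (n == max)
				return n;
			out[n++] = cpu;
		}
		/* Remember the previous mask by pointer; nothing is copied. */
		prev = cur;
	}
	return n;
}
```

On a two-node model where node 0 owns CPUs 0-3 and node 1 owns CPUs 4-7, calling spread_cpus() with the masks { 0x0F, 0xFF } hands out the local CPUs first and only then the remote ones. The only state kept across hops is a pointer to the previous mask.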

Please note that this is based on top of Yury's bitmap-for-next [2] to leverage
his fancy new FIND_NEXT_BIT() macro.

[1]: https://lore.kernel.org/all/20220728191203.4055-1-tariqt@nvidia.com/
[2]: https://github.com/norov/linux.git/ -b bitmap-for-next

A note on treewide use of for_each_cpu_andnot()
===============================================

I've used the below coccinelle script to find places that could be patched (I
couldn't figure out the valid syntax to patch from coccinelle itself):

,-----
@tmpandnot@
expression tmpmask;
iterator for_each_cpu;
position p;
statement S;
@@
cpumask_andnot(tmpmask, ...);

...

(
for_each_cpu@p(..., tmpmask, ...)
	S
|
for_each_cpu@p(..., tmpmask, ...)
{
	...
}
)

@script:python depends on tmpandnot@
p << tmpandnot.p;
@@
coccilib.report.print_report(p[0], "andnot loop here")
'-----

Which yields (against c40e8341e3b3):

.//arch/powerpc/kernel/smp.c:1587:1-13: andnot loop here
.//arch/powerpc/kernel/smp.c:1530:1-13: andnot loop here
.//arch/powerpc/kernel/smp.c:1440:1-13: andnot loop here
.//arch/powerpc/platforms/powernv/subcore.c:306:2-14: andnot loop here
.//arch/x86/kernel/apic/x2apic_cluster.c:62:1-13: andnot loop here
.//drivers/acpi/acpi_pad.c:110:1-13: andnot loop here
.//drivers/cpufreq/armada-8k-cpufreq.c:148:1-13: andnot loop here
.//drivers/cpufreq/powernv-cpufreq.c:931:1-13: andnot loop here
.//drivers/net/ethernet/sfc/efx_channels.c:73:1-13: andnot loop here
.//drivers/net/ethernet/sfc/siena/efx_channels.c:73:1-13: andnot loop here
.//kernel/sched/core.c:345:1-13: andnot loop here
.//kernel/sched/core.c:366:1-13: andnot loop here
.//net/core/dev.c:3058:1-13: andnot loop here

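A lot of those are false positives, of the shape

  for_each_cpu(cpu, mask) {
      ...
      cpumask_andnot(mask, ...);
  }

i.e. the cpumask_andnot() is inside the loop body and modifies the mask being
iterated, which is not the pattern for_each_cpu_andnot() replaces. I think
*some* of the powerpc ones would be a match for for_each_cpu_andnot(), but I
decided to just stick to the one obvious one in __sched_core_flip().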
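
For readers unfamiliar with the pattern being merged, below is a rough userspace model of the before/after shapes. It is an illustration under stated assumptions, not the kernel implementation: mask_t, next_andnot_bit() and for_each_cpu_andnot_model() are hypothetical stand-ins, and a single 64-bit word plays the role of a cpumask.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t mask_t;	/* assumption: one word models a cpumask */
#define MODEL_NR_CPUS 64

/*
 * Return the first bit index >= start that is set in a and clear in b,
 * or MODEL_NR_CPUS if there is no such bit.
 */
static unsigned int next_andnot_bit(mask_t a, mask_t b, unsigned int start)
{
	for (unsigned int bit = start; bit < MODEL_NR_CPUS; bit++)
		if (((a & ~b) >> bit) & 1)
			return bit;
	return MODEL_NR_CPUS;
}

#define for_each_cpu_andnot_model(cpu, a, b)				\
	for ((cpu) = next_andnot_bit((a), (b), 0);			\
	     (cpu) < MODEL_NR_CPUS;					\
	     (cpu) = next_andnot_bit((a), (b), (cpu) + 1))

/* Before: build a temporary mask, then walk it. */
static unsigned int collect_with_tmp(mask_t a, mask_t b, unsigned int *out)
{
	mask_t tmp = a & ~b;	/* models cpumask_andnot(tmp, a, b) */
	unsigned int n = 0;

	assert(out);
	for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if ((tmp >> cpu) & 1)
			out[n++] = cpu;
	return n;
}

/* After: same CPUs, same order, no temporary mask. */
static unsigned int collect_andnot(mask_t a, mask_t b, unsigned int *out)
{
	unsigned int cpu, n = 0;

	assert(out);
	for_each_cpu_andnot_model(cpu, a, b)
		out[n++] = cpu;
	return n;
}
```

In this toy a temporary costs nothing, but a real cpumask can be too large to put on the stack comfortably, which is where folding the andnot into the iterator pays off.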
  
Revisions
=========

v3 -> v4
++++++++

o Rebased on top of Yury's bitmap-for-next
o Added Tariq's mlx5e patch
o Made sched_numa_hop_mask() return cpu_online_mask for the NUMA_NO_NODE &&
  hops=0 case
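
As an illustration of that last item only, here is a small userspace sketch of a hop-mask lookup with the "no node, zero hops" fallback. It is not the sched_numa_hop_mask() code from the series: the struct layout, the model_hop_mask() name and the choice to return NULL for out-of-range arguments are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NO_NODE (-1)	/* plays the role of NUMA_NO_NODE */

struct model_topology {
	int nr_nodes;
	int nr_hops;
	const uint64_t *masks;	/* indexed as masks[hops * nr_nodes + node] */
	uint64_t online;	/* plays the role of cpu_online_mask */
};

/*
 * Look up the mask of CPUs within 'hops' hops of 'node'.
 * With no node preference and hops == 0, fall back to all online CPUs.
 * Returns NULL for out-of-range arguments (an assumption of this sketch).
 */
static const uint64_t *model_hop_mask(const struct model_topology *t,
				      int node, int hops)
{
	assert(t);

	if (node == MODEL_NO_NODE && hops == 0)
		return &t->online;

	if (node < 0 || node >= t->nr_nodes || hops < 0 || hops >= t->nr_hops)
		return NULL;

	return &t->masks[hops * t->nr_nodes + node];
}
```

A caller that has no NUMA node to prefer can then start from hop 0 like everyone else and simply gets every online CPU.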

v2 -> v3
++++++++

o Added for_each_cpu_and() and for_each_cpu_andnot() tests (Yury)
o New patches to fix issues raised by running the above
o New patch to use for_each_cpu_andnot() in sched/core.c (Yury)

v1 -> v2
++++++++

o Split _find_next_bit() @invert into @invert1 and @invert2 (Yury)
o Rebase onto v6.0-rc1

Cheers,
Valentin

Tariq Toukan (1):
  net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
    hints

Valentin Schneider (6):
  lib/find_bit: Introduce find_next_andnot_bit()
  cpumask: Introduce for_each_cpu_andnot()
  lib/test_cpumask: Add for_each_cpu_and(not) tests
  sched/core: Merge cpumask_andnot()+for_each_cpu() into
    for_each_cpu_andnot()
  sched/topology: Introduce sched_numa_hop_mask()
  sched/topology: Introduce for_each_numa_hop_cpu()

 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 +++++-
 include/linux/cpumask.h                      | 39 ++++++++++++++++
 include/linux/find.h                         | 33 +++++++++++++
 include/linux/topology.h                     | 49 ++++++++++++++++++++
 kernel/sched/core.c                          |  5 +-
 kernel/sched/topology.c                      | 31 +++++++++++++
 lib/cpumask_kunit.c                          | 19 ++++++++
 lib/find_bit.c                               |  9 ++++
 8 files changed, 192 insertions(+), 6 deletions(-)

--
2.31.1

Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface

Posted by Yury Norov 1 year, 7 months ago

On Fri, Sep 23, 2022 at 02:25:20PM +0100, Valentin Schneider wrote:
> Hi folks,

Hi,

I received only the 1st patch of the series. Could you give me a link to
the full series so that I can see how the new API is used?

Thanks,
Yury
 

Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface

Posted by Valentin Schneider 1 year, 7 months ago

On 23/09/22 08:44, Yury Norov wrote:
> On Fri, Sep 23, 2022 at 02:25:20PM +0100, Valentin Schneider wrote:
>> Hi folks,
>
> Hi,
>
> I received only the 1st patch of the series. Could you give me a link to
> the full series so that I can see how the new API is used?
>

Sigh, I got this when sending these out:

  4.3.0 Temporary System Problem.  Try again later (10)

I'm going to re-send the missing ones and *hopefully* have them thread
properly. Sorry about that.

Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface

Posted by Tariq Toukan 1 year, 7 months ago


On 9/23/2022 4:25 PM, Valentin Schneider wrote:
> Hi folks,

Valentin, thank you for investing your time here.

Acked-by: Tariq Toukan <tariqt@nvidia.com>

Tested on my mlx5 environment.
It works as expected, including the case of node == NUMA_NO_NODE.

Regards,
Tariq

Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface

Posted by Tariq Toukan 1 year, 6 months ago

On 9/23/2022 4:25 PM, Valentin Schneider wrote:
> Hi folks,

Hi,

What's the status of this?
Do we have agreement on the changes needed for the next respin?

Regards,
Tariq

Re: [PATCH v4 0/7] sched, net: NUMA-aware CPU spreading interface

Posted by Valentin Schneider 1 year, 6 months ago

On 18/10/22 09:36, Tariq Toukan wrote:
>
> Hi,
>
> What's the status of this?
> Do we have agreement on the changes needed for the next respin?
>

Yep, the bitmap patches are in 6.1-rc1; I still need to respin the topology
ones to address Yury's comments. It's on my todo list, I'll get to it soonish.

> Regards,
> Tariq