sched/topology: Optimized copy default topology in sched_init_numa()

[PATCH] sched/topology: Optimized copy default topology in sched_init_numa()

Posted by Hao Jia 3 years, 9 months ago

The size of struct sched_domain_topology_level is 64 bytes.
For NUMA platforms, almost all are multi-core (enable CONFIG_SCHED_MC),
That is to say, the default_topology array has at least 128 bytes that
need to be copied in sched_init_numa(). For most x86 platforms,
CONFIG_SCHED_SMT will be enabled, so more copies will be required.

And memcpy() will be optimized under different architectures.
Fortunately, for platforms with CONFIG_NUMA enabled,
these optimizations are likely to be used.
So, let's use memcpy to copy default topology in sched_init_numa().

Tests are done in an Intel Xeon(R) Platinum 8260 CPU@2.40GHz machine
with 2 NUMA nodes each of which has 24 cores with SMT2 enabled, so 96
CPUs in total.

Use RDTSC to count time-consuming, and based on 5.19-rc4.

Enable CONFIG_SCHED_SMT && CONFIG_SCHED_CLUSTER && CONFIG_SCHED_MC,
So the default_topology array has 256 bytes that need to be copied
in sched_init_numa().
                     5.19-rc4   5.19-rc4 with patch
average tsc ticks    516.57      85.33   （-83.48%*）

Enable CONFIG_SCHED_MC, So the default_topology array has
128 bytes that need to be copied in sched_init_numa().
                     5.19-rc4   5.19-rc4 with patch
average tsc ticks    65.71       55.00   （-16.30%*）

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
---
 kernel/sched/topology.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05b6c2ad90b9..c6f497d263cd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1918,8 +1918,7 @@ void sched_init_numa(int offline_node)
 	/*
 	 * Copy the default topology bits..
 	 */
-	for (i = 0; sched_domain_topology[i].mask; i++)
-		tl[i] = sched_domain_topology[i];
+	memcpy(tl, sched_domain_topology, sizeof(struct sched_domain_topology_level) * i);
 
 	/*
 	 * Add the NUMA identity distance, aka single NODE.
-- 
2.32.0

Re: [PATCH] sched/topology: Optimized copy default topology in sched_init_numa()

Posted by Valentin Schneider 3 years, 9 months ago

On 27/06/22 18:53, Hao Jia wrote:
> The size of struct sched_domain_topology_level is 64 bytes.
> For NUMA platforms, almost all are multi-core (enable CONFIG_SCHED_MC),
> That is to say, the default_topology array has at least 128 bytes that
> need to be copied in sched_init_numa(). For most x86 platforms,
> CONFIG_SCHED_SMT will be enabled, so more copies will be required.
>
> And memcpy() will be optimized under different architectures.
> Fortunately, for platforms with CONFIG_NUMA enabled,
> these optimizations are likely to be used.
> So, let's use memcpy to copy default topology in sched_init_numa().
>
> Tests are done in an Intel Xeon(R) Platinum 8260 CPU@2.40GHz machine
> with 2 NUMA nodes each of which has 24 cores with SMT2 enabled, so 96
> CPUs in total.
>
> Use RDTSC to count time-consuming, and based on 5.19-rc4.
>
> Enable CONFIG_SCHED_SMT && CONFIG_SCHED_CLUSTER && CONFIG_SCHED_MC,
> So the default_topology array has 256 bytes that need to be copied
> in sched_init_numa().
>                      5.19-rc4   5.19-rc4 with patch
> average tsc ticks    516.57      85.33   （-83.48%*）
>
> Enable CONFIG_SCHED_MC, So the default_topology array has
> 128 bytes that need to be copied in sched_init_numa().
>                      5.19-rc4   5.19-rc4 with patch
> average tsc ticks    65.71       55.00   （-16.30%*）
>
> Signed-off-by: Hao Jia <jiahao.os@bytedance.com>

It's not a very hot path but I guess this lets you shave off a bit of boot
time... While you're at it, you could add an early

  if (nr_node_ids == 1)
          return;

since !NUMA systems still go through sched_init_numa() if they have a
kernel with CONFIG_NUMA (which should be most of them nowdays) and IIRC
they end up with an unused NODE topology level.

Regardless:

Reviewed-by: Valentin Schneider <vschneid@redhat.com>

Re: [External] Re: [PATCH] sched/topology: Optimized copy default topology in sched_init_numa()

Posted by Hao Jia 3 years, 9 months ago

On 2022/7/4 Valentin Schneider wrote:
> 
> It's not a very hot path but I guess this lets you shave off a bit of boot
> time... While you're at it, you could add an early
Thanks for your time and suggestion.
> 
>    if (nr_node_ids == 1)
>            return;
> 

This will cause the values of sched_domains_numa_levels and 
sched_max_numa_distance to be different from before, and 
sched_domains_numa_levels may cause the return value of 
sched_numa_find_closest() to be different.
I'm not sure if it will cause problems.

> since !NUMA systems still go through sched_init_numa() if they have a
> kernel with CONFIG_NUMA (which should be most of them nowdays) and IIRC
> they end up with an unused NODE topology level.
> 

I'm confused why most !NUMA systems enable CONFIG_NUMA in the kernel?
Maybe for scalability?

> Regardless:
> 
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> 

How about this?

The size of struct sched_domain_topology_level is 64 bytes.
For NUMA platforms, almost all are multi-core (enable CONFIG_SCHED_MC),
That is to say, the default_topology array has at least 128 bytes that
need to be copied in sched_init_numa(). For most x86 platforms,
CONFIG_SCHED_SMT will be enabled, so more copies will be required.

And memcpy() will be optimized under different architectures.
Fortunately, for platforms with CONFIG_NUMA enabled,
these optimizations are likely to be used.
So, let's use memcpy() to copy default topology in sched_init_numa().

Tests are done in an Intel Xeon(R) Platinum 8260 CPU@2.40GHz machine
with 2 NUMA nodes each of which has 24 cores with SMT2 enabled, so 96
CPUs in total.

Use RDTSC to count time-consuming, and based on 5.19-rc4.

Enable CONFIG_SCHED_SMT && CONFIG_SCHED_CLUSTER && CONFIG_SCHED_MC,
So the default_topology array has 256 bytes that need to be copied
in sched_init_numa().
                      5.19-rc4   5.19-rc4 with patch
average tsc ticks    516.57      85.33   （-83.48%*）

Enable CONFIG_SCHED_MC, So the default_topology array has
128 bytes that need to be copied in sched_init_numa().
                      5.19-rc4   5.19-rc4 with patch
average tsc ticks    65.71       55.00   （-16.30%*）

since !NUMA systems still go through sched_init_numa() if they have a
kernel with CONFIG_NUMA (which should be most of them nowdays) and we
can skip copying and memory allocation of useless default topology.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
---
  kernel/sched/topology.c | 7 +++++--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05b6c2ad90b9..c439e58f22b9 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1907,6 +1907,9 @@ void sched_init_numa(int offline_node)
  	}
  	rcu_assign_pointer(sched_domains_numa_masks, masks);

+	if (nr_node_ids == 1)
+		goto skip;
+
  	/* Compute default topology size */
  	for (i = 0; sched_domain_topology[i].mask; i++);

@@ -1918,8 +1921,7 @@ void sched_init_numa(int offline_node)
  	/*
  	 * Copy the default topology bits..
  	 */
-	for (i = 0; sched_domain_topology[i].mask; i++)
-		tl[i] = sched_domain_topology[i];
+	memcpy(tl, sched_domain_topology, sizeof(struct 
sched_domain_topology_level) * i);

  	/*
  	 * Add the NUMA identity distance, aka single NODE.
@@ -1946,6 +1948,7 @@ void sched_init_numa(int offline_node)
  	sched_domain_topology_saved = sched_domain_topology;
  	sched_domain_topology = tl;

+skip:
  	sched_domains_numa_levels = nr_levels;
  	WRITE_ONCE(sched_max_numa_distance, 
sched_domains_numa_distance[nr_levels - 1]);

thanks,
Hao

Re: [External] Re: [PATCH] sched/topology: Optimized copy default topology in sched_init_numa()

Posted by Valentin Schneider 3 years, 9 months ago

On 11/07/22 18:28, Hao Jia wrote:
> On 2022/7/4 Valentin Schneider wrote:
>>
>> It's not a very hot path but I guess this lets you shave off a bit of boot
>> time... While you're at it, you could add an early
> Thanks for your time and suggestion.
>>
>>    if (nr_node_ids == 1)
>>            return;
>>
>
> This will cause the values of sched_domains_numa_levels and
> sched_max_numa_distance to be different from before, and
> sched_domains_numa_levels may cause the return value of
> sched_numa_find_closest() to be different.
> I'm not sure if it will cause problems.
>

True, we need to be careful here, but those are all static so they get
initialized to sensible defaults (zero / NULL pointer).

sched_numa_find_closest() will return nr_cpu_ids which make sense, so I
think we can get away with an early return

>> since !NUMA systems still go through sched_init_numa() if they have a
>> kernel with CONFIG_NUMA (which should be most of them nowdays) and IIRC
>> they end up with an unused NODE topology level.
>>
>
> I'm confused why most !NUMA systems enable CONFIG_NUMA in the kernel?
> Maybe for scalability?
>

It just makes things easier on a distribution point of view - just ship a
single kernel image everyone can use, rather than N different images for N
different types of systems.

AFAIA having CONFIG_NUMA on an UMA (!NUMA) system isn't bad, it just adds
more things in the sched_domain_topology during boot time which end up
being unused.

Re: [External] Re: [PATCH] sched/topology: Optimized copy default topology in sched_init_numa()

Posted by Hao Jia 3 years, 9 months ago


On 2022/7/12 Valentin Schneider wrote:
> On 11/07/22 18:28, Hao Jia wrote:
>> On 2022/7/4 Valentin Schneider wrote:
>>>
>>> It's not a very hot path but I guess this lets you shave off a bit of boot
>>> time... While you're at it, you could add an early
>> Thanks for your time and suggestion.
>>>
>>>     if (nr_node_ids == 1)
>>>             return;
>>>
>>
>> This will cause the values of sched_domains_numa_levels and
>> sched_max_numa_distance to be different from before, and
>> sched_domains_numa_levels may cause the return value of
>> sched_numa_find_closest() to be different.
>> I'm not sure if it will cause problems.
>>
> 
> True, we need to be careful here, but those are all static so they get
> initialized to sensible defaults (zero / NULL pointer).
> 
> sched_numa_find_closest() will return nr_cpu_ids which make sense, so I
> think we can get away with an early return

Yes, this may affect the return value of housekeeping anycpu(), which 
doesn't seem to be a problem.

> 
>>> since !NUMA systems still go through sched_init_numa() if they have a
>>> kernel with CONFIG_NUMA (which should be most of them nowdays) and IIRC
>>> they end up with an unused NODE topology level.
>>>
>>
>> I'm confused why most !NUMA systems enable CONFIG_NUMA in the kernel?
>> Maybe for scalability?
>>
> 
> It just makes things easier on a distribution point of view - just ship a
> single kernel image everyone can use, rather than N different images for N
> different types of systems.
> 
> AFAIA having CONFIG_NUMA on an UMA (!NUMA) system isn't bad, it just adds
> more things in the sched_domain_topology during boot time which end up
> being unused.

Thank you very much for your answer.

Thinks,
Hao

>