[PATCH v3] sched/topology: improve topology_span_sane speed
Posted by Steve Wahl 10 months, 1 week ago
Use a different approach to topology_span_sane(), that checks for the
same constraint of no partial overlaps for any two CPU sets for
non-NUMA topology levels, but does so in a way that is O(N) rather
than O(N^2).

Instead of comparing with all other masks to detect collisions, keep
one mask that includes all CPUs seen so far and detect collisions with
a single cpumask_intersects test.

If the current mask has no collisions with previously seen masks, it
should be a new mask, which can be uniquely identified by the lowest
bit set in this mask.  Keep a pointer to this mask for future
reference (in an array indexed by the lowest bit set), and add the
CPUs in this mask to the list of those seen.

If the current mask does collide with previously seen masks, it should
be exactly equal to a mask seen before, looked up in the same array
indexed by the lowest bit set in the mask, a single comparison.

Move the topology_span_sane() check out of the existing topology level
loop, let it use its own loop so that the array allocation can be done
only once, shared across levels.

On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
the average time to take one processor offline is reduced from 2.18
seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
34m49.765s without this change, 16m10.038s with this change in place.)

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
---

Version 3: While the intent of this patch is no functional change, I
discovered that version 2 had conditions where it would give different
results than the original code.  Version 3 returns to the V1 approach,
additionally correcting the handling of masks with no bits set and
fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
stand-alone test program that used all possible sets of four 4-bit
masks, this algorithm matched the original code in all cases, where
the others did not.
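
For reference, here is a rough userspace sketch of that exhaustive
comparison (illustrative only, not the actual test program; the function
names and fixed array sizes are assumptions of the sketch).  It walks
every possible set of four 4-bit masks and asserts that the original
pairwise rule and the single-pass covered/id approach agree:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NCPUS 4

    /* Original rule: any two masks must be either equal or disjoint. */
    static bool sane_pairwise(const unsigned int *mask)
    {
            for (int cpu = 0; cpu < NCPUS; cpu++)
                    for (int i = cpu + 1; i < NCPUS; i++)
                            if (mask[cpu] != mask[i] && (mask[cpu] & mask[i]))
                                    return false;
            return true;
    }

    /* Single pass: one "covered" mask plus a per-id record of masks seen. */
    static bool sane_single_pass(const unsigned int *mask)
    {
            unsigned int covered = 0;
            int seen[NCPUS] = { -1, -1, -1, -1 }; /* seen[id] = mask recorded under id */

            for (int cpu = 0; cpu < NCPUS; cpu++) {
                    unsigned int m = mask[cpu];
                    int id;

                    if (!m)                         /* empty masks cannot collide */
                            continue;
                    id = __builtin_ctz(m);          /* lowest set bit is the unique id */

                    if (!(m & covered)) {           /* no collision: record this mask */
                            if (seen[id] != -1)
                                    return false;   /* would be an algorithm error */
                            seen[id] = (int)m;
                            covered |= m;
                    } else if (seen[id] == -1 || (unsigned int)seen[id] != m) {
                            return false;           /* partial overlap */
                    }
            }
            return true;
    }

    int main(void)
    {
            unsigned int mask[NCPUS];

            for (unsigned int combo = 0; combo < (1u << (4 * NCPUS)); combo++) {
                    for (int cpu = 0; cpu < NCPUS; cpu++)
                            mask[cpu] = (combo >> (4 * cpu)) & 0xf;
                    assert(sane_pairwise(mask) == sane_single_pass(mask));
            }

            printf("all %u combinations agree\n", 1u << (4 * NCPUS));
            return 0;
    }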

Version 2 discussion:
    https://lore.kernel.org/all/20241031200431.182443-1-steve.wahl@hpe.com/

Version 2: Adopted suggestion by K Prateek Nayak that removes an array and
simplifies the code, and eliminates the erroneous use of
num_possible_cpus() that Peter Zijlstra noted.

Version 1 discussion:
    https://lore.kernel.org/all/20241010155111.230674-1-steve.wahl@hpe.com/

 kernel/sched/topology.c | 83 ++++++++++++++++++++++++++++-------------
 1 file changed, 58 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..3fb834301315 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2356,36 +2356,69 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 
 /*
  * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
- * any two given CPUs at this (non-NUMA) topology level.
+ * any two given CPUs on non-NUMA topology levels.
  */
-static bool topology_span_sane(struct sched_domain_topology_level *tl,
-			      const struct cpumask *cpu_map, int cpu)
+static bool topology_span_sane(const struct cpumask *cpu_map)
 {
-	int i = cpu + 1;
+	struct sched_domain_topology_level *tl;
+	const struct cpumask **masks;
+	struct cpumask *covered;
+	int cpu, id;
+	bool ret = false;
 
-	/* NUMA levels are allowed to overlap */
-	if (tl->flags & SDTL_OVERLAP)
-		return true;
+	lockdep_assert_held(&sched_domains_mutex);
+	covered = sched_domains_tmpmask;
+
+	masks = kmalloc_array(nr_cpu_ids, sizeof(struct cpumask *), GFP_KERNEL);
+	if (!masks)
+		return ret;
+
+	for_each_sd_topology(tl) {
+
+		/* NUMA levels are allowed to overlap */
+		if (tl->flags & SDTL_OVERLAP)
+			continue;
+
+		cpumask_clear(covered);
+		memset(masks, 0, nr_cpu_ids * sizeof(struct cpumask *));
 
-	/*
-	 * Non-NUMA levels cannot partially overlap - they must be either
-	 * completely equal or completely disjoint. Otherwise we can end up
-	 * breaking the sched_group lists - i.e. a later get_group() pass
-	 * breaks the linking done for an earlier span.
-	 */
-	for_each_cpu_from(i, cpu_map) {
 		/*
-		 * We should 'and' all those masks with 'cpu_map' to exactly
-		 * match the topology we're about to build, but that can only
-		 * remove CPUs, which only lessens our ability to detect
-		 * overlaps
+		 * Non-NUMA levels cannot partially overlap - they must be either
+		 * completely equal or completely disjoint. Otherwise we can end up
+		 * breaking the sched_group lists - i.e. a later get_group() pass
+		 * breaks the linking done for an earlier span.
 		 */
-		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
-		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
-			return false;
+		for_each_cpu(cpu, cpu_map) {
+			/* lowest bit set in this mask is used as a unique id */
+			id = cpumask_first(tl->mask(cpu));
+
+			/* zeroed masks cannot possibly collide */
+			if (id >= nr_cpu_ids)
+				continue;
+
+			/* if this mask doesn't collide with what we've already seen */
+			if (!cpumask_intersects(tl->mask(cpu), covered)) {
+				/* this failing would be an error in this algorithm */
+				if (WARN_ON(masks[id]))
+					goto notsane;
+
+				/* record the mask we saw for this id */
+				masks[id] = tl->mask(cpu);
+				cpumask_or(covered, tl->mask(cpu), covered);
+			} else if ((!masks[id]) || !cpumask_equal(masks[id], tl->mask(cpu))) {
+				/*
+				 * a collision with covered should have exactly matched
+				 * a previously seen mask with the same id
+				 */
+				goto notsane;
+			}
+		}
 	}
+	ret = true;
 
-	return true;
+ notsane:
+	kfree(masks);
+	return ret;
 }
 
 /*
@@ -2417,9 +2450,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		sd = NULL;
 		for_each_sd_topology(tl) {
 
-			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
-				goto error;
-
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 
 			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
@@ -2433,6 +2463,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	if (WARN_ON(!topology_span_sane(cpu_map)))
+		goto error;
+
 	/* Build the groups for the domains */
 	for_each_cpu(i, cpu_map) {
 		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
-- 
2.26.2
Re: [PATCH v3] sched/topology: improve topology_span_sane speed
Posted by Valentin Schneider 10 months ago
On 10/02/25 09:42, Steve Wahl wrote:
> Use a different approach to topology_span_sane(), that checks for the
> same constraint of no partial overlaps for any two CPU sets for
> non-NUMA topology levels, but does so in a way that is O(N) rather
> than O(N^2).
>
> Instead of comparing with all other masks to detect collisions, keep
> one mask that includes all CPUs seen so far and detect collisions with
> a single cpumask_intersects test.
>
> If the current mask has no collisions with previously seen masks, it
> should be a new mask, which can be uniquely identified by the lowest
> bit set in this mask.  Keep a pointer to this mask for future
> reference (in an array indexed by the lowest bit set), and add the
> CPUs in this mask to the list of those seen.
>
> If the current mask does collide with previously seen masks, it should
> be exactly equal to a mask seen before, looked up in the same array
> indexed by the lowest bit set in the mask, a single comparison.
>
> Move the topology_span_sane() check out of the existing topology level
> loop, let it use its own loop so that the array allocation can be done
> only once, shared across levels.
>
> On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> the average time to take one processor offline is reduced from 2.18
> seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
> 34m49.765s without this change, 16m10.038s with this change in place.)
>
> Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
> ---
>
> Version 3: While the intent of this patch is no functional change, I
> discovered that version 2 had conditions where it would give different
> results than the original code.  Version 3 returns to the V1 approach,
> additionally correcting the handling of masks with no bits set and
> fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
> stand-alone test program that used all possible sets of four 4-bit
> masks, this algorithm matched the original code in all cases, where
> the others did not.
>

So looking at my notes from v2 I was under the impression the array-less
approach worked, do you have an example topology where the array-less
approach fails? I usually poke at topology stuff via QEMU so if you have a
virtual topology description I'd be happy to give that a span.

> -	for_each_cpu_from(i, cpu_map) {
>               /*
> -		 * We should 'and' all those masks with 'cpu_map' to exactly
> -		 * match the topology we're about to build, but that can only
> -		 * remove CPUs, which only lessens our ability to detect
> -		 * overlaps
> +		 * Non-NUMA levels cannot partially overlap - they must be either
> +		 * completely equal or completely disjoint. Otherwise we can end up
> +		 * breaking the sched_group lists - i.e. a later get_group() pass
> +		 * breaks the linking done for an earlier span.
>                */
> -		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> -		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> -			return false;
> +		for_each_cpu(cpu, cpu_map) {
> +			/* lowest bit set in this mask is used as a unique id */
> +			id = cpumask_first(tl->mask(cpu));
> +
> +			/* zeroed masks cannot possibly collide */
> +			if (id >= nr_cpu_ids)
> +				continue;
> +

Is it even legal for an online CPU's topology mask to be empty?! I would
assume it should *at least* contain itself.
Re: [PATCH v3] sched/topology: improve topology_span_sane speed
Posted by Steve Wahl 10 months ago
On Fri, Feb 14, 2025 at 03:25:31PM +0100, Valentin Schneider wrote:
> On 10/02/25 09:42, Steve Wahl wrote:
> > Use a different approach to topology_span_sane(), that checks for the
> > same constraint of no partial overlaps for any two CPU sets for
> > non-NUMA topology levels, but does so in a way that is O(N) rather
> > than O(N^2).
> >
> > Instead of comparing with all other masks to detect collisions, keep
> > one mask that includes all CPUs seen so far and detect collisions with
> > a single cpumask_intersects test.
> >
> > If the current mask has no collisions with previously seen masks, it
> > should be a new mask, which can be uniquely identified by the lowest
> > bit set in this mask.  Keep a pointer to this mask for future
> > reference (in an array indexed by the lowest bit set), and add the
> > CPUs in this mask to the list of those seen.
> >
> > If the current mask does collide with previously seen masks, it should
> > be exactly equal to a mask seen before, looked up in the same array
> > indexed by the lowest bit set in the mask, a single comparison.
> >
> > Move the topology_span_sane() check out of the existing topology level
> > loop, let it use its own loop so that the array allocation can be done
> > only once, shared across levels.
> >
> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> > the average time to take one processor offline is reduced from 2.18
> > seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
> > 34m49.765s without this change, 16m10.038s with this change in place.)
> >
> > Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
> > ---
> >
> > Version 3: While the intent of this patch is no functional change, I
> > discovered that version 2 had conditions where it would give different
> > results than the original code.  Version 3 returns to the V1 approach,
> > additionally correcting the handling of masks with no bits set and
> > fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
> > stand-alone test program that used all possible sets of four 4-bit
> > masks, this algorithm matched the original code in all cases, where
> > the others did not.
> >
> 
> So looking at my notes from v2 I was under the impression the array-less
> approach worked, do you have an example topology where the array-less
> approach fails? I usually poke at topology stuff via QEMU so if you have a
> virtual topology description I'd be happy to give that a span.

Valentin, thank you for your time looking at this patch.

Note that I'm trying to make this patch function exactly as the code
did before, only faster, regardless of the inputs.  No functional
change.

Your statement below about assuming a mask should at least contain the
cpu itself is intertwined with finding differences.  This code is
checking the validity of the masks.  If we can't already trust that
the masks are disjoint, why can we trust they at least have the cpu
itself set?

Where the assumption that a cpu's mask contains itself holds true, it
appears v2 and v3 agree.

An example of where the two disagree would be a set of four masks,
0010 0001 0001 0001.  These match the stated criteria being checked
(no overlap between masks that don't exactly match), yet the v2
algorithm would mark it insane, where the original code doesn't.

If it's valid to mark some additional conditions as insane, I'd rather
see that in a different patch, because I'm not comfortable enough with
the ramifications of possibly marking as insane a system that current
code reports as sane.
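
To make the four-mask example above concrete, here is a minimal
userspace sketch of the original pairwise rule applied to it (which CPU
gets which mask is an arbitrary choice for illustration):

    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
            /* 0010 0001 0001 0001: one mask disjoint from three identical ones */
            unsigned int mask[4] = { 0x2, 0x1, 0x1, 0x1 };
            bool sane = true;

            for (int cpu = 0; cpu < 4; cpu++)
                    for (int i = cpu + 1; i < 4; i++)
                            if (mask[cpu] != mask[i] && (mask[cpu] & mask[i]))
                                    sane = false;   /* partial overlap */

            /* Every pair is either equal or disjoint, so this prints "sane",
             * even though none of these masks contains its own CPU. */
            printf("%s\n", sane ? "sane" : "insane");
            return 0;
    }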

> > -	for_each_cpu_from(i, cpu_map) {
> >               /*
> > -		 * We should 'and' all those masks with 'cpu_map' to exactly
> > -		 * match the topology we're about to build, but that can only
> > -		 * remove CPUs, which only lessens our ability to detect
> > -		 * overlaps
> > +		 * Non-NUMA levels cannot partially overlap - they must be either
> > +		 * completely equal or completely disjoint. Otherwise we can end up
> > +		 * breaking the sched_group lists - i.e. a later get_group() pass
> > +		 * breaks the linking done for an earlier span.
> >                */
> > -		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> > -		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> > -			return false;
> > +		for_each_cpu(cpu, cpu_map) {
> > +			/* lowest bit set in this mask is used as a unique id */
> > +			id = cpumask_first(tl->mask(cpu));
> > +
> > +			/* zeroed masks cannot possibly collide */
> > +			if (id >= nr_cpu_ids)
> > +				continue;
> > +
> 
> Is it even legal for an online CPU's topology mask to be empty?! I would
> assume it should *at least* contain itself.

Again, this is sanity checking untrusted inputs.  I realized that if a
mask was zero, the value of id would overindex the array.  The check
is cheap.

--> Steve

-- 
Steve Wahl, Hewlett Packard Enterprise
Re: [PATCH v3] sched/topology: improve topology_span_sane speed
Posted by Valentin Schneider 10 months ago
On 14/02/25 09:42, Steve Wahl wrote:
> On Fri, Feb 14, 2025 at 03:25:31PM +0100, Valentin Schneider wrote:
>> On 10/02/25 09:42, Steve Wahl wrote:
>> > Use a different approach to topology_span_sane(), that checks for the
>> > same constraint of no partial overlaps for any two CPU sets for
>> > non-NUMA topology levels, but does so in a way that is O(N) rather
>> > than O(N^2).
>> >
>> > Instead of comparing with all other masks to detect collisions, keep
>> > one mask that includes all CPUs seen so far and detect collisions with
>> > a single cpumask_intersects test.
>> >
>> > If the current mask has no collisions with previously seen masks, it
>> > should be a new mask, which can be uniquely identified by the lowest
>> > bit set in this mask.  Keep a pointer to this mask for future
>> > reference (in an array indexed by the lowest bit set), and add the
>> > CPUs in this mask to the list of those seen.
>> >
>> > If the current mask does collide with previously seen masks, it should
>> > be exactly equal to a mask seen before, looked up in the same array
>> > indexed by the lowest bit set in the mask, a single comparison.
>> >
>> > Move the topology_span_sane() check out of the existing topology level
>> > loop, let it use its own loop so that the array allocation can be done
>> > only once, shared across levels.
>> >
>> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
>> > the average time to take one processor offline is reduced from 2.18
>> > seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
>> > 34m49.765s without this change, 16m10.038s with this change in place.)
>> >
>> > Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
>> > ---
>> >
>> > Version 3: While the intent of this patch is no functional change, I
>> > discovered that version 2 had conditions where it would give different
>> > results than the original code.  Version 3 returns to the V1 approach,
>> > additionally correcting the handling of masks with no bits set and
>> > fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
>> > stand-alone test program that used all possible sets of four 4-bit
>> > masks, this algorithm matched the original code in all cases, where
>> > the others did not.
>> >
>>
>> So looking at my notes from v2 I was under the impression the array-less
>> approach worked, do you have an example topology where the array-less
>> approach fails? I usually poke at topology stuff via QEMU so if you have a
>> virtual topology description I'd be happy to give that a span.
>
> Valentin, thank you for your time looking at this patch.
>
> Note that I'm trying to make this patch function exactly as the code
> did before, only faster, regardless of the inputs.  No functional
> change.
>
> Your statement below about assuming a mask should at least contain the
> cpu itself is intertwined with finding differences.  This code is
> checking the validity of the masks.  If we can't already trust that
> the masks are disjoint, why can we trust they at least have the cpu
> itself set?
>

True... Though I think this would already be caught by the sched_domain
debugging infra we have, see sched_domain_debug_one().

So roughly speaking I'd say SCHED_DEBUG+sched_verbose gives you basic
sanity checks on individual sched_domain's (and so indirectly topology
levels), while topology_span_sane() looks at the intersections between the
spans to check it all makes sense as a whole.

Arguably there is some intersection (!) between these two debugging
features, but I think it still makes sense to keep them separate.

> Where the assumption that a cpu's mask contains itself holds true, it
> appears v2 and v3 agree.
>
> An example of where the two disagree would be a set of four masks,
> 0010 0001 0001 0001.  These match the stated criteria being checked
> (no overlap between masks that don't exactly match), yet the v2
> algorithm would mark it insane, where the original code doesn't.
>
> If it's valid to mark some additional conditions as insane, I'd rather
> see that in a different patch, because I'm not comfortable enough with
> the ramifications of possibly marking as insane a system that current
> code reports as sane.
>

Per the above I think it's fairly safe to call that setup insane,
sched_domain_debug_one() would call it so.

IMO your v2 had sufficient checks and was quite neat without the
additional array. Unless I'm missing something I don't see why we couldn't
stick with that approach.
Re: [PATCH v3] sched/topology: improve topology_span_sane speed
Posted by Steve Wahl 9 months, 4 weeks ago
On Mon, Feb 17, 2025 at 11:11:36AM +0100, Valentin Schneider wrote:
> On 14/02/25 09:42, Steve Wahl wrote:
> > On Fri, Feb 14, 2025 at 03:25:31PM +0100, Valentin Schneider wrote:
> >> On 10/02/25 09:42, Steve Wahl wrote:
> >> > Use a different approach to topology_span_sane(), that checks for the
> >> > same constraint of no partial overlaps for any two CPU sets for
> >> > non-NUMA topology levels, but does so in a way that is O(N) rather
> >> > than O(N^2).
> >> >
> >> > Instead of comparing with all other masks to detect collisions, keep
> >> > one mask that includes all CPUs seen so far and detect collisions with
> >> > a single cpumask_intersects test.
> >> >
> >> > If the current mask has no collisions with previously seen masks, it
> >> > should be a new mask, which can be uniquely identified by the lowest
> >> > bit set in this mask.  Keep a pointer to this mask for future
> >> > reference (in an array indexed by the lowest bit set), and add the
> >> > CPUs in this mask to the list of those seen.
> >> >
> >> > If the current mask does collide with previously seen masks, it should
> >> > be exactly equal to a mask seen before, looked up in the same array
> >> > indexed by the lowest bit set in the mask, a single comparison.
> >> >
> >> > Move the topology_span_sane() check out of the existing topology level
> >> > loop, let it use its own loop so that the array allocation can be done
> >> > only once, shared across levels.
> >> >
> >> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> >> > the average time to take one processor offline is reduced from 2.18
> >> > seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
> >> > 34m49.765s without this change, 16m10.038s with this change in place.)
> >> >
> >> > Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
> >> > ---
> >> >
> >> > Version 3: While the intent of this patch is no functional change, I
> >> > discovered that version 2 had conditions where it would give different
> >> > results than the original code.  Version 3 returns to the V1 approach,
> >> > additionally correcting the handling of masks with no bits set and
> >> > fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
> >> > stand-alone test program that used all possible sets of four 4-bit
> >> > masks, this algorithm matched the original code in all cases, where
> >> > the others did not.
> >> >
> >>
> >> So looking at my notes from v2 I was under the impression the array-less
> >> approach worked, do you have an example topology where the array-less
> >> approach fails? I usually poke at topology stuff via QEMU so if you have a
> >> virtual topology description I'd be happy to give that a span.
> >
> > Valentin, thank you for your time looking at this patch.
> >
> > Note that I'm trying to make this patch function exactly as the code
> > did before, only faster, regardless of the inputs.  No functional
> > change.
> >
> > Your statement below about assuming a mask should at least contain the
> > cpu itself is intertwined with finding differences.  This code is
> > checking the validity of the masks.  If we can't already trust that
> > the masks are disjoint, why can we trust they at least have the cpu
> > itself set?
> >
> 
> True... Though I think this would already be caught by the sched_domain
> debugging infra we have, see sched_domain_debug_one().

Note that a previous patch of mine was reverted because it allowed
another problem to surface on a small number of machines (and was
later re-instated after fixing the other problem).

Reference: https://lore.kernel.org/all/20240717213121.3064030-1-steve.wahl@hpe.com

So, I am quite sensitive to introducing unintended behavior changes.

Anyway, sched_domain_debug_one() is only called when
sched_debug_verbose is set.  But, at least as it sits currently,
topology_span_sane() is run at all times, and its return code is acted
on to change system behavior.

If there's a system out there where the cpu masks are buggy but
currently accepted, I don't want this patch to cause that system to
degrade by declaring it insane.

I don't fully understand all the code that sets up masks, as there's a
lot of it.  But as an example, I think I see in
arch/s390/kernel/topology.c, that update_cpu_masks() uses
cpu_group_map() to update masks, and that function zeros the mask and
then returns if the cpu is not set in cpu_setup_mask.  So potentially
there can be some zeroed masks.

[Why am I looking at s390 code? Simply because a 'sort | uniq' on the
possible tl->mask() functions took me to cpu_book_mask() first.]

> So roughly speaking I'd say SCHED_DEBUG+sched_verbose gives you basic
> sanity checks on individual sched_domain's (and so indirectly topology
> levels), while topology_span_sane() looks at the intersections between the
> spans to check it all makes sense as a whole.
> 
> Arguably there is some intersection (!) between these two debugging
> features, but I think it still makes sense to keep them separate.
> 
> > Where the assumption that a cpu's mask contains itself holds true, it
> > appears v2 and v3 agree.
> >
> > An example of where the two disagree would be a set of four masks,
> > 0010 0001 0001 0001.  These match the stated criteria being checked
> > (no overlap between masks that don't exactly match), yet the v2
> > algorithm would mark it insane, where the original code doesn't.
> >
> > If it's valid to mark some additional conditions as insane, I'd rather
> > see that in a different patch, because I'm not comfortable enough with
> > the ramifications of possibly marking as insane a system that current
> > code reports as sane.
> >
> 
> Per the above I think it's fairly safe to call that setup insane,
> sched_domain_debug_one() would call it so.

But this isn't just debug code, at least as it sits now, and system
operation changes based on what it returns.  It's not just a printk.

> IMO your v2 had sufficient checks and was quite neat without the
> additional array. Unless I'm missing something I don't see why we couldn't
> stick with that approach.

I won't deny I liked the appearance of v2.  As a separate follow on
patch I certainly wouldn't object, especially if it came from someone
working on improving the scheduling code, instead of someone who's
just here because this code slows down large machines significantly.

But I would first like to see the speed-up in a low-risk form without
possible functional changes.

--> Steve
-- 
Steve Wahl, Hewlett Packard Enterprise
Re: [PATCH v3] sched/topology: improve topology_span_sane speed
Posted by Valentin Schneider 9 months, 3 weeks ago
On 20/02/25 13:59, Steve Wahl wrote:
> On Mon, Feb 17, 2025 at 11:11:36AM +0100, Valentin Schneider wrote:
>> On 14/02/25 09:42, Steve Wahl wrote:
>> >
>> > Valentin, thank you for your time looking at this patch.
>> >
>> > Note that I'm trying to make this patch function exactly as the code
>> > did before, only faster, regardless of the inputs.  No functional
>> > change.
>> >
>> > Your statement below about assuming a mask should at least contain the
>> > cpu itself is intertwined with finding differences.  This code is
>> > checking the validity of the masks.  If we can't already trust that
>> > the masks are disjoint, why can we trust they at least have the cpu
>> > itself set?
>> >
>>
>> True... Though I think this would already be caught by the sched_domain
>> debugging infra we have, see sched_domain_debug_one().
>
> Note that a previous patch of mine was reverted because it allowed
> another problem to surface on a small number of machines (and was
> later re-instated after fixing the other problem).
>
> Reference: https://lore.kernel.org/all/20240717213121.3064030-1-steve.wahl@hpe.com
>
> So, I am quite sensitive to introducing unintended behavior changes.
>
> Anyway, sched_domain_debug_one() is only called when
> sched_debug_verbose is set.  But, at least as it sits currently,
> topology_span_sane() is run at all times, and its return code is acted
> on to change system behavior.
>
> If there's a system out there where the cpu masks are buggy but
> currently accepted, I don't want this patch to cause that system to
> degrade by declaring it insane.
>
> I don't fully understand all the code that sets up masks, as there's a
> lot of it.  But as an example, I think I see in
> arch/s390/kernel/topology.c, that update_cpu_masks() uses
> cpu_group_map() to update masks, and that function zeros the mask and
> then returns if the cpu is not set in cpu_setup_mask.  So potentially
> there can be some zeroed masks.
>
> [Why am I looking at s390 code? Simply because a 'sort | uniq' on the
> possible tl->mask() functions took me to cpu_book_mask() first.]
>

IIUC that cpu_setup_mask is pretty much cpu_online_mask:

smp_start_secondary(void *cpuvoid)
  cpumask_set_cpu(cpu, &cpu_setup_mask);
  set_cpu_online(cpu, true);

int __cpu_disable(void)
  set_cpu_online(cpu, false);
  cpumask_clear_cpu(cpu, &cpu_setup_mask);

IOW, topology code won't build something for a CPU that isn't in
cpu_setup_mask, thus won't get an empty mask.
Re: [PATCH v3] sched/topology: improve topology_span_sane speed
Posted by Steve Wahl 9 months, 3 weeks ago
On Thu, Feb 20, 2025 at 01:59:20PM -0600, Steve Wahl wrote:
> On Mon, Feb 17, 2025 at 11:11:36AM +0100, Valentin Schneider wrote:
> > On 14/02/25 09:42, Steve Wahl wrote:
> > > On Fri, Feb 14, 2025 at 03:25:31PM +0100, Valentin Schneider wrote:
> > >> On 10/02/25 09:42, Steve Wahl wrote:
> > >> > Use a different approach to topology_span_sane(), that checks for the
> > >> > same constraint of no partial overlaps for any two CPU sets for
> > >> > non-NUMA topology levels, but does so in a way that is O(N) rather
> > >> > than O(N^2).
> > >> >
> > >> > Instead of comparing with all other masks to detect collisions, keep
> > >> > one mask that includes all CPUs seen so far and detect collisions with
> > >> > a single cpumask_intersects test.
> > >> >
> > >> > If the current mask has no collisions with previously seen masks, it
> > >> > should be a new mask, which can be uniquely identified by the lowest
> > >> > bit set in this mask.  Keep a pointer to this mask for future
> > >> > reference (in an array indexed by the lowest bit set), and add the
> > >> > CPUs in this mask to the list of those seen.
> > >> >
> > >> > If the current mask does collide with previously seen masks, it should
> > >> > be exactly equal to a mask seen before, looked up in the same array
> > >> > indexed by the lowest bit set in the mask, a single comparison.
> > >> >
> > >> > Move the topology_span_sane() check out of the existing topology level
> > >> > loop, let it use its own loop so that the array allocation can be done
> > >> > only once, shared across levels.
> > >> >
> > >> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> > >> > the average time to take one processor offline is reduced from 2.18
> > >> > seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
> > >> > 34m49.765s without this change, 16m10.038s with this change in place.)
> > >> >
> > >> > Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
> > >> > ---
> > >> >
> > >> > Version 3: While the intent of this patch is no functional change, I
> > >> > discovered that version 2 had conditions where it would give different
> > >> > results than the original code.  Version 3 returns to the V1 approach,
> > >> > additionally correcting the handling of masks with no bits set and
> > >> > fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
> > >> > stand-alone test program that used all possible sets of four 4-bit
> > >> > masks, this algorithm matched the original code in all cases, where
> > >> > the others did not.
> > >> >
> > >>
> > >> So looking at my notes from v2 I was under the impression the array-less
> > >> approach worked, do you have an example topology where the array-less
> > >> approach fails? I usually poke at topology stuff via QEMU so if you have a
> > >> virtual topology description I'd be happy to give that a span.
> > >
> > > Valentin, thank you for your time looking at this patch.
> > >
> > > Note that I'm trying to make this patch function exactly as the code
> > > did before, only faster, regardless of the inputs.  No functional
> > > change.
> > >
> > > Your statement below about assuming a mask should at least contain the
> > > cpu itself is intertwined with finding differences.  This code is
> > > checking the validity of the masks.  If we can't already trust that
> > > the masks are disjoint, why can we trust they at least have the cpu
> > > itself set?
> > >
> > 
> > True... Though I think this would already be caught by the sched_domain
> > debugging infra we have, see sched_domain_debug_one().
> 
> Note that a previous patch of mine was reverted because it allowed
> another problem to surface on a small number of machines (and was
> later re-instated after fixing the other problem).
> 
> Reference: https://lore.kernel.org/all/20240717213121.3064030-1-steve.wahl@hpe.com
> 
> So, I am quite sensitive to introducing unintended behavior changes.
> 
> Anyway, sched_domain_debug_one() is only called when
> sched_debug_verbose is set.  But, at least as it sits currently,
> topology_span_sane() is run at all times, and its return code is acted
> on to change system behavior.
> 
> If there's a system out there where the cpu masks are buggy but
> currently accepted, I don't want this patch to cause that system to
> degrade by declaring it insane.
> 
> I don't fully understand all the code that sets up masks, as there's a
> lot of it.  But as an example, I think I see in
> arch/s390/kernel/topology.c, that update_cpu_masks() uses
> cpu_group_map() to update masks, and that function zeros the mask and
> then returns if the cpu is not set in cpu_setup_mask.  So potentially
> there can be some zeroed masks.
> 
> [Why am I looking at s390 code? Simply because a 'sort | uniq' on the
> possible tl->mask() functions took me to cpu_book_mask() first.]
> 
> > So roughly speaking I'd say SCHED_DEBUG+sched_verbose gives you basic
> > sanity checks on individual sched_domain's (and so indirectly topology
> > levels), while topology_span_sane() looks at the intersections between the
> > spans to check it all makes sense as a whole.
> > 
> > Arguably there is some intersection (!) between these two debugging
> > features, but I think it still makes sense to keep them separate.
> > 
> > > Where the assumption that a cpu's mask contains itself holds true, it
> > > appears v2 and v3 agree.
> > >
> > > An example of where the two disagree would be a set of four masks,
> > > 0010 0001 0001 0001.  These match the stated criteria being checked
> > > (no overlap between masks that don't exactly match), yet the v2
> > > algorithm would mark it insane, where the original code doesn't.
> > >
> > > If it's valid to mark some additional conditions as insane, I'd rather
> > > see that in a different patch, because I'm not comfortable enough with
> > > the ramifications of possibly marking as insane a system that current
> > > code reports as sane.
> > >
> > 
> > Per the above I think it's fairly safe to call that setup insane,
> > sched_domain_debug_one() would call it so.
> 
> But this isn't just debug code, at least as it sits now, and system
> operation changes based on what it returns.  It's not just a printk.
> 
> > IMO your v2 had sufficient checks and was quite neat without the
> > additional array. Unless I'm missing something I don't see why we couldn't
> > stick with that approach.
> 
> I won't deny I liked the appearance of v2.  As a separate follow on
> patch I certainly wouldn't object, especially if it came from someone
> working on improving the scheduling code, instead of someone who's
> just here because this code slows down large machines significantly.
> 
> But I would first like to see the speed-up in a low-risk form without
> possible functional changes.

Valentin,

How would you feel about a two part series, where part one is my v1
patch, and part 2 is the v2 improvements?  Then if there was a
problem with the v2 improvements, that could be reverted and we'd
still have the speed improvement.

Thanks for considering,

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise
Re: [PATCH v3] sched/topology: improve topology_span_sane speed
Posted by Valentin Schneider 9 months, 3 weeks ago
On 25/02/25 14:02, Steve Wahl wrote:
> On Thu, Feb 20, 2025 at 01:59:20PM -0600, Steve Wahl wrote:
>> On Mon, Feb 17, 2025 at 11:11:36AM +0100, Valentin Schneider wrote:
>> > On 14/02/25 09:42, Steve Wahl wrote:
>> > > On Fri, Feb 14, 2025 at 03:25:31PM +0100, Valentin Schneider wrote:
>> > >> On 10/02/25 09:42, Steve Wahl wrote:
>> > >> > Use a different approach to topology_span_sane(), that checks for the
>> > >> > same constraint of no partial overlaps for any two CPU sets for
>> > >> > non-NUMA topology levels, but does so in a way that is O(N) rather
>> > >> > than O(N^2).
>> > >> >
>> > >> > Instead of comparing with all other masks to detect collisions, keep
>> > >> > one mask that includes all CPUs seen so far and detect collisions with
>> > >> > a single cpumask_intersects test.
>> > >> >
>> > >> > If the current mask has no collisions with previously seen masks, it
>> > >> > should be a new mask, which can be uniquely identified by the lowest
>> > >> > bit set in this mask.  Keep a pointer to this mask for future
>> > >> > reference (in an array indexed by the lowest bit set), and add the
>> > >> > CPUs in this mask to the list of those seen.
>> > >> >
>> > >> > If the current mask does collide with previously seen masks, it should
>> > >> > be exactly equal to a mask seen before, looked up in the same array
>> > >> > indexed by the lowest bit set in the mask, a single comparison.
>> > >> >
>> > >> > Move the topology_span_sane() check out of the existing topology level
>> > >> > loop, let it use its own loop so that the array allocation can be done
>> > >> > only once, shared across levels.
>> > >> >
>> > >> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
>> > >> > the average time to take one processor offline is reduced from 2.18
>> > >> > seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
>> > >> > 34m49.765s without this change, 16m10.038s with this change in place.)
>> > >> >
>> > >> > Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
>> > >> > ---
>> > >> >
>> > >> > Version 3: While the intent of this patch is no functional change, I
>> > >> > discovered that version 2 had conditions where it would give different
>> > >> > results than the original code.  Version 3 returns to the V1 approach,
>> > >> > additionally correcting the handling of masks with no bits set and
>> > >> > fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
>> > >> > stand-alone test program that used all possible sets of four 4-bit
>> > >> > masks, this algorithm matched the original code in all cases, where
>> > >> > the others did not.
>> > >> >
>> > >>
>> > >> So looking at my notes from v2 I was under the impression the array-less
>> > >> approach worked, do you have an example topology where the array-less
>> > >> approach fails? I usually poke at topology stuff via QEMU so if you have a
>> > >> virtual topology description I'd be happy to give that a span.
>> > >
>> > > Valentin, thank you for your time looking at this patch.
>> > >
>> > > Note that I'm trying to make this patch function exactly as the code
>> > > did before, only faster, regardless of the inputs.  No functional
>> > > change.
>> > >
>> > > Your statement below about assuming a mask should at least contain the
>> > > cpu itself is intertwined with finding differences.  This code is
>> > > checking the validity of the masks.  If we can't already trust that
>> > > the masks are disjoint, why can we trust they at least have the cpu
>> > > itself set?
>> > >
>> >
>> > True... Though I think this would already be caught by the sched_domain
>> > debugging infra we have, see sched_domain_debug_one().
>>
>> Note that a previous patch of mine was reverted because it allowed
>> another problem to surface on a small number of machines (and was
>> later re-instated after fixing the other problem).
>>
>> Reference: https://lore.kernel.org/all/20240717213121.3064030-1-steve.wahl@hpe.com
>>
>> So, I am quite sensitive to introducing unintended behavior changes.
>>
>> Anyway, sched_domain_debug_one() is only called when
>> sched_debug_verbose is set.  But, at least as it sits currently,
>> topology_span_sane() is run at all times, and its return code is acted
>> on to change system behavior.
>>
>> If there's a system out there where the cpu masks are buggy but
>> currently accepted, I don't want this patch to cause that system to
>> degrade by declaring it insane.
>>
>> I don't fully understand all the code that sets up masks, as there's a
>> lot of it.  But as an example, I think I see in
>> arch/s390/kernel/topology.c, that update_cpu_masks() uses
>> cpu_group_map() to update masks, and that function zeros the mask and
>> then returns if the cpu is not set in cpu_setup_mask.  So potentially
>> there can be some zeroed masks.
>>
>> [Why am I looking at s390 code? Simply because a 'sort | uniq' on the
>> possible tl->mask() functions took me to cpu_book_mask() first.]
>>
>> > So roughly speaking I'd say SCHED_DEBUG+sched_verbose gives you basic
>> > sanity checks on individual sched_domain's (and so indirectly topology
>> > levels), while topology_span_sane() looks at the intersections between the
>> > spans to check it all makes sense as a whole.
>> >
>> > Arguably there is some intersection (!) between these two debugging
>> > features, but I think it still makes sense to keep them separate.
>> >
>> > > Where the assumption that a cpu's mask contains itself holds true, it
>> > > appears v2 and v3 agree.
>> > >
>> > > An example of where the two disagree would be a set of four masks,
>> > > 0010 0001 0001 0001.  These match the stated criteria being checked
>> > > (no overlap between masks that don't exactly match), yet the v2
>> > > algorithm would mark it insane, where the original code doesn't.
>> > >
>> > > If it's valid to mark some additional conditions as insane, I'd rather
>> > > see that in a different patch, because I'm not comfortable enough with
>> > > the ramifications of possibly marking as insane a system that current
>> > > code reports as sane.
>> > >
>> >
>> > Per the above I think it's fairly safe to call that setup insane,
>> > sched_domain_debug_one() would call it so.
>>
>> But this isn't just debug code, at least as it sits now, and system
>> operation changes based on what it returns.  It's not just a printk.
>>
>> > IMO your v2 had sufficient checks and was quite neat without the
>> > additional array. Unless I'm missing something I don't see why we couldn't
>> > stick with that approach.
>>
>> I won't deny I liked the appearance of v2.  As a separate follow on
>> patch I certainly wouldn't object, especially if it came from someone
>> working on improving the scheduling code, instead of someone who's
>> just here because this code slows down large machines significantly.
>>
>> But I would first like to see the speed-up in a low-risk form without
>> possible functional changes.
>
> Valentin,
>
> How would you feel about a two part series, where part one is my v1
> patch, and part 2 is the v2 improvements?  Then if there was a
> problem with the v2 improvements, that could be reverted and we'd
> still have the speed improvement.
>
> Thanks for considering,

Sounds good to me! If it helps, I'll be happy to test that under QEMU with
various fake topologies.