Use a different approach to topology_span_sane(), one that checks for the
same constraint of no partial overlaps for any two CPU sets for
non-NUMA topology levels, but does so in a way that is O(N) rather
than O(N^2).
Instead of comparing with all other masks to detect collisions, keep
one mask that includes all CPUs seen so far and detect collisions with
a single cpumask_intersects test.
If the current mask has no collisions with previously seen masks, it
should be a new mask, which can be uniquely identified ("id") by the
lowest bit set in this mask. Mark that we've seen a mask with this
id, and add the CPUs in this mask to the list of those seen.
If the current mask does collide with previously seen masks, it should
be exactly equal to a mask seen before, identified once again by the
lowest bit the current mask has set. It's an error if we haven't seen
a mask with that id, or if the current mask doesn't match the one we
get by looking up that id.
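
To illustrate the idea outside the kernel, a minimal stand-alone C sketch
of the same single-pass check is below, with plain unsigned-long bitmasks
standing in for cpumasks.  It is illustrative only: it omits the extra
"cpu != id" sanity check from the patch and assumes every mask is
non-empty.

#include <stdbool.h>
#include <stdio.h>

static bool spans_sane(const unsigned long *mask, int nr_cpus)
{
	unsigned long covered = 0;	/* union of all masks seen so far */
	unsigned long id_seen = 0;	/* ids (lowest set bits) already claimed */

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned long m = mask[cpu];	/* assumed non-zero */
		int id = __builtin_ctzl(m);	/* lowest set bit is the id */

		if (!(m & covered)) {
			/* a new span: its id must not already be claimed */
			if (id_seen & (1UL << id))
				return false;
			id_seen |= 1UL << id;
			covered |= m;
		} else if (!(id_seen & (1UL << id)) || mask[id] != m) {
			/* a collision must exactly repeat the span that owns 'id' */
			return false;
		}
	}
	return true;
}

int main(void)
{
	unsigned long sane[]    = { 0x3, 0x3, 0xc, 0xc };	/* {0,1} {0,1} {2,3} {2,3} */
	unsigned long partial[] = { 0x3, 0x6, 0xc, 0xc };	/* {0,1} and {1,2} overlap */

	printf("%d %d\n", spans_sane(sane, 4), spans_sane(partial, 4));	/* prints "1 0" */
	return 0;
}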
Move the topology_span_sane() check out of the existing topology level
loop and let it do its own looping to match the needs of this algorithm.
On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
the average time to take one processor offline is reduced from 2.18
seconds to 1.01 seconds. (Off-lining 959 of 1920 processors took
34m49.765s without this change, 16m10.038s with this change in place.)
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
---
Version 2: Adopted suggestion by K Prateek Nayak that removes an array and
simplifies the code, and eliminates the erroneous use of
num_possible_cpus() that Peter Zijlstra noted.
Version 1 discussion:
https://lore.kernel.org/all/20241010155111.230674-1-steve.wahl@hpe.com/
kernel/sched/topology.c | 73 +++++++++++++++++++++++++++--------------
1 file changed, 48 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..6a2a3e91d59e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2356,35 +2356,58 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
/*
* Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
- * any two given CPUs at this (non-NUMA) topology level.
+ * any two given CPUs on non-NUMA topology levels.
*/
-static bool topology_span_sane(struct sched_domain_topology_level *tl,
- const struct cpumask *cpu_map, int cpu)
+static bool topology_span_sane(const struct cpumask *cpu_map)
{
- int i = cpu + 1;
+ struct sched_domain_topology_level *tl;
+ struct cpumask *covered, *id_seen;
+ int cpu;
- /* NUMA levels are allowed to overlap */
- if (tl->flags & SDTL_OVERLAP)
- return true;
+ lockdep_assert_held(&sched_domains_mutex);
+ covered = sched_domains_tmpmask;
+ id_seen = sched_domains_tmpmask2;
+
+ for_each_sd_topology(tl) {
+
+ /* NUMA levels are allowed to overlap */
+ if (tl->flags & SDTL_OVERLAP)
+ continue;
+
+ cpumask_clear(covered);
+ cpumask_clear(id_seen);
- /*
- * Non-NUMA levels cannot partially overlap - they must be either
- * completely equal or completely disjoint. Otherwise we can end up
- * breaking the sched_group lists - i.e. a later get_group() pass
- * breaks the linking done for an earlier span.
- */
- for_each_cpu_from(i, cpu_map) {
/*
- * We should 'and' all those masks with 'cpu_map' to exactly
- * match the topology we're about to build, but that can only
- * remove CPUs, which only lessens our ability to detect
- * overlaps
+ * Non-NUMA levels cannot partially overlap - they must be either
+ * completely equal or completely disjoint. Otherwise we can end up
+ * breaking the sched_group lists - i.e. a later get_group() pass
+ * breaks the linking done for an earlier span.
*/
- if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
- cpumask_intersects(tl->mask(cpu), tl->mask(i)))
- return false;
+ for_each_cpu(cpu, cpu_map) {
+ const struct cpumask *tl_cpu_mask = tl->mask(cpu);
+ int id;
+
+ /* lowest bit set in this mask is used as a unique id */
+ id = cpumask_first(tl_cpu_mask);
+
+ /* if this mask doesn't collide with what we've already seen */
+ if (!cpumask_intersects(tl_cpu_mask, covered)) {
+ /* Really odd case when cpu != id, likely not sane */
+ if ((cpu != id) && !cpumask_equal(tl_cpu_mask, tl->mask(id)))
+ return false;
+ if (cpumask_test_and_set_cpu(id, id_seen))
+ return false;
+ cpumask_or(covered, tl_cpu_mask, covered);
+ } else if ((!cpumask_test_cpu(id, id_seen)) ||
+ !cpumask_equal(tl->mask(id), tl_cpu_mask)) {
+ /*
+ * a collision with covered should have exactly matched
+ * a previously seen mask with the same id
+ */
+ return false;
+ }
+ }
}
-
return true;
}
@@ -2417,9 +2440,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
sd = NULL;
for_each_sd_topology(tl) {
- if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
- goto error;
-
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
@@ -2433,6 +2453,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
}
+ if (WARN_ON(!topology_span_sane(cpu_map)))
+ goto error;
+
/* Build the groups for the domains */
for_each_cpu(i, cpu_map) {
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
--
2.26.2
On 31/10/24 15:04, Steve Wahl wrote:
> [...]

Ok so you're speeding it up, but you still get a O(nr_cpu_ids) walk every
hotplug when the check itself only needs to be done at most once per
possible online CPU combination (~ 2^(nr_cpu_ids)). If all CPUs are kicked
to life at boot, then the check only needs to be done once. If you only
boot with a subset of present CPUs to speed things up, the check still
becomes irrelevant once you've kicked the rest to life.

I would reiterate my suggestion to get to a state where the check can be
entirely short-circuited [1].

[1]: http://lore.kernel.org/r/xhsmh8quc5ca4.mognet@vschneid-thinkpadt14sgen2i.remote.csb
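
For readers following along: the earlier proposal referenced in [1] is not
reproduced in this thread, but the general shape of such a short-circuit
might look roughly like the sketch below.  This is a hypothetical
illustration only, not Valentin's actual patch; the tracking-mask name is
made up, the per-level walk is elided, and a real version would also need
to invalidate the tracking mask if the topology masks themselves change.

/* Hypothetical sketch of a short-circuit; not the patch referenced in [1]. */
static struct cpumask topology_checked_mask;	/* made-up name */

static bool topology_span_sane(const struct cpumask *cpu_map)
{
	lockdep_assert_held(&sched_domains_mutex);

	/* A superset of these CPUs was already validated; the result holds. */
	if (cpumask_subset(cpu_map, &topology_checked_mask))
		return true;

	/* ... the per-topology-level walk from the patch above goes here ... */

	/* Remember what has been validated so far. */
	cpumask_or(&topology_checked_mask, &topology_checked_mask, cpu_map);
	return true;
}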
On Tue, Nov 12, 2024 at 05:15:47PM +0100, Valentin Schneider wrote:
> On 31/10/24 15:04, Steve Wahl wrote:
> [...]
>
> Ok so you're speeding it up, but you still get a O(nr_cpu_ids) walk every
> hotplug when the check itself only needs to be done at most once per
> possible online CPU combination (~ 2^(nr_cpu_ids)). If all CPUs are kicked
> to life at boot, then the check only needs to be done once. If you only
> boot with a subset of present CPUs to speed things up, the check still
> becomes irrelevant once you've kicked the rest to life.
>
> I would reiterate my suggestion to get to a state where the check can be
> entirely short-circuited [1].
>
> [1]: http://lore.kernel.org/r/xhsmh8quc5ca4.mognet@vschneid-thinkpadt14sgen2i.remote.csb

Bringing forward a bit of that conversation:

> > I tried adding this, surprisingly I saw no effect on the time taken,
> > perhaps even a small slowdown, when combined with my patch.  So at
> > this point I don't intend to add it to v2 of the patch.
>
> Thanks for testing, I assume your cpu_possible_mask reports more CPUs than
> you have physically plugged in...

That assumption is wrong.  I started with all CPUs enabled.  Disabled
and re-enabled cpus from there.  The timings I got were as I stated,
no effect, perhaps a small slowdown.

> I guess it would make sense to short-circuit the function when
> cpu_map is a subset of what we've previously checked, and then
> re-kick the testing once new CPU(s) are plugged in.  Something like
> the untested below?
>
> Optimisations notwithstanding, IMO we shouldn't be repeating checks if we
> can avoid it.

I will attempt to measure it once more.  I was surprised at my
measured results, but that's why we take them, right?

If I can't measure a difference, though, I am not sure it's
appropriate to include the change with this patch, the point of which
*is* optimization.

--> Steve Wahl

--
Steve Wahl, Hewlett Packard Enterprise
On Wed, Nov 13, 2024 at 09:42:34AM -0600, Steve Wahl wrote:
> On Tue, Nov 12, 2024 at 05:15:47PM +0100, Valentin Schneider wrote:
> > On 31/10/24 15:04, Steve Wahl wrote:
> > [...]
> >
> > I would reiterate my suggestion to get to a state where the check can be
> > entirely short-circuited [1].
> [...]
>
> I will attempt to measure it once more.  I was surprised at my
> measured results, but that's why we take them, right?
>
> If I can't measure a difference, though, I am not sure it's
> appropriate to include the change with this patch, the point of which
> *is* optimization.

I completed timing tests; the test system has 1920 logical CPUs, 2 threads
per core, 60 cores per NUMA node, and the script offlined then onlined
both threads of half the cores in each node.  Four runs each were
timed.  Times in seconds.

Unpatched kernel:
          Offline    Online
  min     3991.57    3967.22
  max     4025.59    4028.66
  avg     4013.79    3996.75
  stddev    15.28      29.90

With my patch:
          Offline    Online
  min     1987.97    2130.7
  max     2032.25    2150.93
  avg     2017.43    2140.18
  stddev    20.12      10.25

With my patch + Valentin's extra short-circuit patch:
          Offline    Online
  min     2019.58    2137.43
  max     2056.89    2223.86
  avg     2033.04    2171.91
  stddev    16.83      37.73

I'm at a loss as to why adding the short circuit slowed things down,
but it's in the noise.  If you want a new version of the patch
incorporating both changes after viewing these timings, I'm willing to
make that change.  Let me know how you feel.

Thanks for your time,

--> Steve Wahl

--
Steve Wahl, Hewlett Packard Enterprise
On 19/11/24 15:03, Steve Wahl wrote:
> On Wed, Nov 13, 2024 at 09:42:34AM -0600, Steve Wahl wrote:
> [...]
>
> I completed timing tests; the test system has 1920 logical CPUs, 2 threads
> per core, 60 cores per NUMA node, and the script offlined then onlined
> both threads of half the cores in each node.  Four runs each were
> timed.  Times in seconds.
> [...]
>
> I'm at a loss as to why adding the short circuit slowed things down,
> but it's in the noise.  If you want a new version of the patch
> incorporating both changes after viewing these timings, I'm willing to
> make that change.  Let me know how you feel.

Huh, weird. Obviously your patch improves the situation and the
short-circuit as proposed doesn't, so I'd say go forth with your patch and
I'll poke around to figure out what's going on and submit a hopefully
working short-circuit.
On Tue, Nov 19, 2024 at 11:54:33PM +0100, Valentin Schneider wrote:
> On 19/11/24 15:03, Steve Wahl wrote:
> [...]
> > I'm at a loss as to why adding the short circuit slowed things down,
> > but it's in the noise.  If you want a new version of the patch
> > incorporating both changes after viewing these timings, I'm willing to
> > make that change.  Let me know how you feel.
>
> Huh, weird. Obviously your patch improves the situation and the
> short-circuit as proposed doesn't, so I'd say go forth with your patch and
> I'll poke around to figure out what's going on and submit a hopefully
> working short-circuit.

I think the short-circuit is actually working.  In the previous email
iteration on this, after I did the timings saying it was a tad slower,
I added some printk's to make sure the short circuit was actually
happening, and it appeared to be true.

And this time around, I came up with a version of your patch that
functions without mine; it's not as pretty before combining all the
topology levels into one call to topology_span_sane, but the result of
the bypass is a similar time saving.

I am thinking that the memory touched by the full topology_span_sane
run may be pre-loading the cache for some of the subsequent
operations.  That's the best I can come up with, at least for the
moment.

--> Steve

--
Steve Wahl, Hewlett Packard Enterprise
Hello Steve,

On 11/1/2024 1:34 AM, Steve Wahl wrote:
> [...]
> +		for_each_cpu(cpu, cpu_map) {
> +			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
> +			int id;
> +
> +			/* lowest bit set in this mask is used as a unique id */
> +			id = cpumask_first(tl_cpu_mask);
> +
> +			/* if this mask doesn't collide with what we've already seen */
> +			if (!cpumask_intersects(tl_cpu_mask, covered)) {
> +				/* Really odd case when cpu != id, likely not sane */
> +				if ((cpu != id) && !cpumask_equal(tl_cpu_mask, tl->mask(id)))

I was wondering, since we are doing a "for_each_cpu(cpu, cpu_map)",
wouldn't we always see the "id" cpu first since "id" is nothing but the
first CPU of the topology level mask? Maybe I'm not thinking creatively
enough but I cannot see a scenario where the sanity check will trip
here and not in the "else if" clause down below :)

I've done some basic testing on my system on all NPS modes and also
with "numa=fake" cmdline and I do not see any splats in my case. Please
feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

I'll try to hack the kernel to trip the topology check, with and without
the patch, and also try to get back with some hotplug perf numbers but
for the time being I do not see any false negatives with the approach on
my dual socket 3rd Generation EPYC system (2 x 64C/128T)

--
Thanks and Regards,
Prateek
On Wed, Nov 06, 2024 at 10:19:13AM +0530, K Prateek Nayak wrote:
> Hello Steve,

Hi, Prateek; thanks for looking at this patch.

> On 11/1/2024 1:34 AM, Steve Wahl wrote:
> > [...]
> > +			/* if this mask doesn't collide with what we've already seen */
> > +			if (!cpumask_intersects(tl_cpu_mask, covered)) {
> > +				/* Really odd case when cpu != id, likely not sane */
> > +				if ((cpu != id) && !cpumask_equal(tl_cpu_mask, tl->mask(id)))
>
> I was wondering, since we are doing a "for_each_cpu(cpu, cpu_map)",
> wouldn't we always see the "id" cpu first since "id" is nothing but the
> first CPU of the topology level mask? Maybe I'm not thinking creatively
> enough but I cannot see a scenario where the sanity check will trip
> here and not in the "else if" clause down below :)

The scenario I was thinking of that could trip this is if tl->mask(cpu)
does not actually include this CPU.  It should not happen, but we are
looking to detect a malformed set of masks, so I added the check.

> I've done some basic testing on my system on all NPS modes and also
> with "numa=fake" cmdline and I do not see any splats in my case. Please
> feel free to include:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> I'll try to hack the kernel to trip the topology check, with and without
> the patch, and also try to get back with some hotplug perf numbers but
> for the time being I do not see any false negatives with the approach on
> my dual socket 3rd Generation EPYC system (2 x 64C/128T)

Thanks so much!

--> Steve Wahl

--
Steve Wahl, Hewlett Packard Enterprise
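
To make the malformed-mask scenario above concrete, here is a hypothetical
toy example (values invented purely for illustration) that would trip the
(cpu != id) check in the patch:

/*
 * Suppose, at some non-NUMA level:
 *
 *	tl->mask(0) = { 0 }
 *	tl->mask(1) = { 2 }		<- bogus: does not contain CPU 1
 *	tl->mask(2) = { 2, 3 }
 *	tl->mask(3) = { 2, 3 }
 *
 * Walking cpu_map = { 0, 1, 2, 3 }: CPU 0 covers { 0 }.  For CPU 1, the
 * mask { 2 } does not intersect covered, its id is 2, cpu != id, and { 2 }
 * is not equal to tl->mask(2) = { 2, 3 }, so the check returns false.  If
 * the bogus mask happened to equal tl->mask(2) exactly, this particular
 * check would not fire and the walk would continue.
 */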