[PATCH v2 6/7] sched: Shard per-LLC shared runqueues

Posted by David Vernet 2 years, 7 months ago
The SHARED_RUNQ scheduler feature creates a FIFO queue per LLC that
tasks are put into on enqueue, and pulled from when a core in that LLC
would otherwise go idle. For CPUs with large LLCs, this can sometimes
cause significant contention, as illustrated in [0].

[0]: https://lore.kernel.org/all/c8419d9b-2b31-2190-3058-3625bdbcb13d@meta.com/

To try to mitigate this contention, we can instead shard the per-LLC
runqueue into multiple smaller shards.
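
Concretely, the single per-LLC queue becomes an array of shards, each
with its own list and lock (a condensed view of the structures
introduced further down in this patch):

	struct shared_runq_shard {
		struct list_head list;
		spinlock_t lock;
	} ____cacheline_aligned;

	struct shared_runq {
		u32 num_shards;
		struct shared_runq_shard shards[];
	} ____cacheline_aligned;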

While this doesn't outright prevent all contention, it does mitigate it
somewhat. For example, if we run the following schbench command, which
does almost nothing other than pound the runqueue:

schbench -L -m 52 -p 512 -r 10 -t 1

we observe in the lock stats that sharding significantly decreases
contention.
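
For reference, lock stats like the ones below can be collected on a
kernel built with CONFIG_LOCK_STAT=y, e.g. (assuming the same schbench
invocation as above):

	echo 0 > /proc/lock_stat		# clear any old statistics
	echo 1 > /proc/sys/kernel/lock_stat	# enable collection
	schbench -L -m 52 -p 512 -r 10 -t 1
	echo 0 > /proc/sys/kernel/lock_stat	# stop collection
	cat /proc/lock_stat			# dump per-lock-class stats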

3 shards:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions       waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      31510503       31510711           0.08          19.98        168932319.64     5.36            31700383      31843851       0.03           17.50        10273968.33      0.32
------------
&shard->lock       15731657          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock       15756516          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock          21766          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock            772          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540
------------
&shard->lock          23458          [<00000000126ec6ab>] newidle_balance+0x45a/0x650
&shard->lock       16505108          [<000000001faf84f9>] enqueue_task_fair+0x459/0x530
&shard->lock       14981310          [<0000000068c0fd75>] pick_next_task_fair+0x4dd/0x510
&shard->lock            835          [<000000002886c365>] dequeue_task_fair+0x4c9/0x540

No sharding:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name        con-bounces    contentions         waittime-min   waittime-max waittime-total         waittime-avg    acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:     117868635      118361486           0.09           393.01       1250954097.25          10.57           119345882     119780601      0.05          343.35       38313419.51      0.32
------------
&shard->lock       59169196          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock       59084239          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0
&shard->lock         108051          [<00000000084a6193>] newidle_balance+0x45a/0x650
------------
&shard->lock       60028355          [<0000000060507011>] __enqueue_entity+0xdc/0x110
&shard->lock         119882          [<00000000084a6193>] newidle_balance+0x45a/0x650
&shard->lock       58213249          [<00000000f1c67316>] __dequeue_entity+0x78/0xa0

The contention is ~3-4x worse if we don't shard at all, which roughly
matches the fact that there were 3 shards on the host where this data
was collected. Tuning the sharding granularity could be addressed in
future patch sets by adding a debugfs knob. If we make the shards even
smaller (which is what this patch does, i.e. a shard size of 6), the
contention goes away almost entirely:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min  holdtime-max holdtime-total   holdtime-avg
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&shard->lock:      13839849       13877596      0.08          13.23        5389564.95       0.39           46910241      48069307       0.06          16.40        16534469.35      0.34
------------
&shard->lock           3559          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        6992418          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock        6881619          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        6640140          [<000000002266f400>] __dequeue_entity+0x78/0xa0
&shard->lock           3523          [<00000000ea455dcc>] newidle_balance+0x45a/0x650
&shard->lock        7233933          [<000000002a62f2e0>] __enqueue_entity+0xdc/0x110

Interestingly, SHARED_RUNQ performs worse than NO_SHARED_RUNQ on the schbench
benchmark on Milan, where we end up contending even more heavily on the rq lock:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name         con-bounces    contentions   waittime-min  waittime-max waittime-total   waittime-avg   acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:       9617614        9656091       0.10          79.64        69665812.00      7.21           18092700      67652829       0.11           82.38        344524858.87     5.09
-----------
&rq->__lock        6301611          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        2530807          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock         109360          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         178218          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock        3245506          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock        1294355          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
&rq->__lock        2837804          [<000000003e63bf26>] task_rq_lock+0x43/0xe0
&rq->__lock        1627866          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10

..................................................................................................................................................................................................

&shard->lock:       7338558       7343244       0.10          35.97        7173949.14       0.98           30200858      32679623       0.08           35.59        16270584.52      0.50
------------
&shard->lock        2004142          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2611264          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        2727838          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110
------------
&shard->lock        2737232          [<00000000473978cc>] newidle_balance+0x45a/0x650
&shard->lock        1693341          [<00000000f8aa2c91>] __dequeue_entity+0x78/0xa0
&shard->lock        2912671          [<0000000028f55bb5>] __enqueue_entity+0xdc/0x110

...................................................................................................................................................................................................

If we look at the lock stats with SHARED_RUNQ disabled, the rq lock is still
the most contended lock, but contention is significantly lower than with it
enabled:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name          con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&rq->__lock:        791277         791690        0.12           110.54       4889787.63       6.18            1575996       62390275       0.13           112.66       316262440.56     5.07
-----------
&rq->__lock         263343          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock          19394          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock           4143          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          51094          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170
-----------
&rq->__lock          23756          [<0000000011be1562>] raw_spin_rq_lock_nested+0xa/0x10
&rq->__lock         379048          [<00000000516703f0>] __schedule+0x72/0xaa0
&rq->__lock            677          [<000000003b542e83>] __task_rq_lock+0x51/0xf0
&rq->__lock          47962          [<00000000c38a30f9>] sched_ttwu_pending+0x3d/0x170

In general, the takeaway here is that sharding does help with
contention, but it's not one-size-fits-all, and the benefit is workload
dependent. For now, let's include sharding to try to avoid contention,
and because it doesn't seem to regress CPUs that don't need it, such as
the AMD 7950X.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/sched/fair.c  | 139 +++++++++++++++++++++++++++++--------------
 kernel/sched/sched.h |   3 +-
 2 files changed, 96 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff2491387201..97985f28a627 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -143,21 +143,28 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
  * struct shared_runq - Per-LLC queue structure for enqueuing and pulling
  * waking tasks.
  *
+ * struct shared_runq_shard - A structure containing a task list and a spinlock
+ * for a subset of cores in a struct shared_runq.
+ *
  * WHAT
  * ====
  *
  * This structure enables the scheduler to be more aggressively work
- * conserving, by placing waking tasks on a per-LLC FIFO queue that can then be
- * pulled from when another core in the LLC is going to go idle.
- *
- * struct rq stores a pointer to its LLC's shared_runq via struct cfs_rq.
- * Waking tasks are enqueued in a shared_runq at the end of
- * enqueue_task_fair(), and are opportunistically pulled from the shared_runq
- * in newidle_balance(). Tasks enqueued in a shared_runq may be scheduled prior
- * to being pulled from the shared_runq, in which case they're simply dequeued
- * from the shared_runq. A waking task is only enqueued to a shared_runq when
- * it was _not_ manually migrated to the current runqueue by
- * select_task_rq_fair().
+ * conserving, by placing waking tasks on a per-LLC FIFO queue shard that can
+ * then be pulled from when another core in the LLC is going to go idle.
+ *
+ * struct rq stores two pointers in its struct cfs_rq:
+ *
+ * 1. The per-LLC struct shared_runq which contains one or more shards of
+ *    enqueued tasks.
+ *
+ * 2. The shard inside of the per-LLC struct shared_runq which contains the
+ *    list of runnable tasks for that shard.
+ *
+ * Waking tasks are enqueued in the calling CPU's struct shared_runq_shard at
+ * the end of enqueue_task_fair(), and are opportunistically pulled from the
+ * shared_runq in newidle_balance(). Pulling from shards is an O(# shards)
+ * operation.
  *
  * There is currently no task-stealing between shared_runqs in different LLCs,
  * which means that shared_runq is not fully work conserving. This could be
@@ -167,11 +174,12 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
  * HOW
  * ===
  *
- * An shared_runq is comprised of a list, and a spinlock for synchronization.
- * Given that the critical section for a shared_runq is typically a fast list
- * operation, and that the shared_runq is localized to a single LLC, the
- * spinlock will typically only be contended on workloads that do little else
- * other than hammer the runqueue.
+ * A struct shared_runq_shard is comprised of a list, and a spinlock for
+ * synchronization.  Given that the critical section for a shared_runq is
+ * typically a fast list operation, and that the shared_runq_shard is localized
+ * to a subset of cores on a single LLC (plus other cores in the LLC that pull
+ * from the shard in newidle_balance()), the spinlock will typically only be
+ * contended on workloads that do little else other than hammer the runqueue.
  *
  * WHY
  * ===
@@ -185,48 +193,64 @@ __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
  * it, as well as to strike a balance between work conservation, and L3 cache
  * locality.
  */
-struct shared_runq {
+struct shared_runq_shard {
 	struct list_head list;
 	spinlock_t lock;
 } ____cacheline_aligned;
 
+struct shared_runq {
+	u32 num_shards;
+	struct shared_runq_shard shards[];
+} ____cacheline_aligned;
+
+/* This would likely work better as a configurable knob via debugfs */
+#define SHARED_RUNQ_SHARD_SZ 6
+
 #ifdef CONFIG_SMP
 static struct shared_runq *rq_shared_runq(struct rq *rq)
 {
 	return rq->cfs.shared_runq;
 }
 
-static struct task_struct *shared_runq_pop_task(struct rq *rq)
+static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq)
+{
+	return rq->cfs.shard;
+}
+
+static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
+{
+	return cpu % runq->num_shards;
+}
+
+static struct task_struct *
+shared_runq_pop_task(struct shared_runq_shard *shard, int target)
 {
 	unsigned long flags;
 	struct task_struct *p;
-	struct shared_runq *shared_runq;
 
-	shared_runq = rq_shared_runq(rq);
-	if (list_empty(&shared_runq->list))
+	if (list_empty(&shard->list))
 		return NULL;
 
-	spin_lock_irqsave(&shared_runq->lock, flags);
-	p = list_first_entry_or_null(&shared_runq->list, struct task_struct,
+	spin_lock_irqsave(&shard->lock, flags);
+	p = list_first_entry_or_null(&shard->list, struct task_struct,
 				     shared_runq_node);
-	if (p && is_cpu_allowed(p, cpu_of(rq)))
+	if (p && is_cpu_allowed(p, target))
 		list_del_init(&p->shared_runq_node);
 	else
 		p = NULL;
-	spin_unlock_irqrestore(&shared_runq->lock, flags);
+	spin_unlock_irqrestore(&shard->lock, flags);
 
 	return p;
 }
 
-static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
+static void shared_runq_push_task(struct shared_runq_shard *shard,
+				  struct task_struct *p)
 {
 	unsigned long flags;
-	struct shared_runq *shared_runq;
 
-	shared_runq = rq_shared_runq(rq);
-	spin_lock_irqsave(&shared_runq->lock, flags);
-	list_add_tail(&p->shared_runq_node, &shared_runq->list);
-	spin_unlock_irqrestore(&shared_runq->lock, flags);
+	spin_lock_irqsave(&shard->lock, flags);
+	list_add_tail(&p->shared_runq_node, &shard->list);
+	spin_unlock_irqrestore(&shard->lock, flags);
 }
 
 static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
@@ -247,7 +271,7 @@ static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p,
 	if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
 		return;
 
-	shared_runq_push_task(rq, p);
+	shared_runq_push_task(rq_shared_runq_shard(rq), p);
 }
 
 static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
@@ -256,8 +280,21 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 	struct rq *src_rq;
 	struct rq_flags src_rf;
 	int ret;
+	struct shared_runq *shared_runq;
+	struct shared_runq_shard *shard;
+	u32 i, starting_idx, curr_idx, num_shards;
 
-	p = shared_runq_pop_task(rq);
+	shared_runq = rq_shared_runq(rq);
+	starting_idx = shared_runq_shard_idx(shared_runq, cpu_of(rq));
+	num_shards = shared_runq->num_shards;
+	for (i = 0; i < num_shards; i++) {
+		curr_idx = (starting_idx + i) % num_shards;
+		shard = &shared_runq->shards[curr_idx];
+
+		p = shared_runq_pop_task(shard, cpu_of(rq));
+		if (p)
+			break;
+	}
 	if (!p)
 		return 0;
 
@@ -287,13 +324,13 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
 static void shared_runq_dequeue_task(struct task_struct *p)
 {
 	unsigned long flags;
-	struct shared_runq *shared_runq;
+	struct shared_runq_shard *shard;
 
 	if (!list_empty(&p->shared_runq_node)) {
-		shared_runq = rq_shared_runq(task_rq(p));
-		spin_lock_irqsave(&shared_runq->lock, flags);
+		shard = rq_shared_runq_shard(task_rq(p));
+		spin_lock_irqsave(&shard->lock, flags);
 		list_del_init(&p->shared_runq_node);
-		spin_unlock_irqrestore(&shared_runq->lock, flags);
+		spin_unlock_irqrestore(&shard->lock, flags);
 	}
 }
 
@@ -13003,19 +13040,31 @@ __init void init_sched_fair_class(void)
 __init void init_sched_fair_class_late(void)
 {
 #ifdef CONFIG_SMP
-	int i;
+	int i, j;
 	struct shared_runq *shared_runq;
+	struct shared_runq_shard *shard;
 	struct rq *rq;
 	struct rq *llc_rq;
+	size_t shared_runq_size;
+	u32 num_shards, shard_idx;
 
 	for_each_possible_cpu(i) {
 		if (per_cpu(sd_llc_id, i) == i) {
 			llc_rq = cpu_rq(i);
-
-			shared_runq = kzalloc_node(sizeof(struct shared_runq),
-					       GFP_KERNEL, cpu_to_node(i));
-			INIT_LIST_HEAD(&shared_runq->list);
-			spin_lock_init(&shared_runq->lock);
+			num_shards = max(per_cpu(sd_llc_size, i) /
+					 SHARED_RUNQ_SHARD_SZ, 1);
+			shared_runq_size = sizeof(struct shared_runq) +
+				num_shards * sizeof(struct shared_runq_shard);
+
+			shared_runq = kzalloc_node(shared_runq_size,
+						   GFP_KERNEL, cpu_to_node(i));
+			shared_runq->num_shards = num_shards;
+			for (j = 0; j < num_shards; j++) {
+				shard = &shared_runq->shards[j];
+
+				INIT_LIST_HEAD(&shard->list);
+				spin_lock_init(&shard->lock);
+			}
 			llc_rq->cfs.shared_runq = shared_runq;
 		}
 	}
@@ -13024,9 +13073,9 @@ __init void init_sched_fair_class_late(void)
 		rq = cpu_rq(i);
 		llc_rq = cpu_rq(per_cpu(sd_llc_id, i));
 
-		if (rq == llc_rq)
-			continue;
 		rq->cfs.shared_runq = llc_rq->cfs.shared_runq;
+		shard_idx = shared_runq_shard_idx(rq->cfs.shared_runq, i);
+		rq->cfs.shard = &rq->cfs.shared_runq->shards[shard_idx];
 	}
 #endif /* SMP */
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8b573dfaba33..ca56a8120088 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -576,7 +576,8 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-	struct shared_runq	*shared_runq;
+	struct shared_runq	 *shared_runq;
+	struct shared_runq_shard *shard;
 	/*
 	 * CFS load tracking
 	 */
-- 
2.40.1
Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues
Posted by Peter Zijlstra 2 years, 7 months ago
On Mon, Jul 10, 2023 at 03:03:41PM -0500, David Vernet wrote:

> +struct shared_runq_shard {
>  	struct list_head list;
>  	spinlock_t lock;
>  } ____cacheline_aligned;
>  
> +struct shared_runq {
> +	u32 num_shards;
> +	struct shared_runq_shard shards[];
> +} ____cacheline_aligned;
> +
> +/* This would likely work better as a configurable knob via debugfs */
> +#define SHARED_RUNQ_SHARD_SZ 6
> +
>  #ifdef CONFIG_SMP
>  static struct shared_runq *rq_shared_runq(struct rq *rq)
>  {
>  	return rq->cfs.shared_runq;
>  }
>  
> -static struct task_struct *shared_runq_pop_task(struct rq *rq)
> +static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq)
> +{
> +	return rq->cfs.shard;
> +}
> +
> +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
> +{
> +	return cpu % runq->num_shards;

I would suggest either:

	(cpu >> 1) % num_shards

or keeping num_shards even, to give SMT siblings a fighting chance to
hit the same bucket.

(I've no idea how SMT4 (or worse SMT8) is typically enumerated, so
someone from the Power/Sparc/MIPS world would have to go play with that
if they so care)

> +}

> +			num_shards = max(per_cpu(sd_llc_size, i) /
> +					 SHARED_RUNQ_SHARD_SZ, 1);

> +			shared_runq->num_shards = num_shards;
Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues
Posted by David Vernet 2 years, 7 months ago
On Tue, Jul 11, 2023 at 12:49:58PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 10, 2023 at 03:03:41PM -0500, David Vernet wrote:
> 
> > +struct shared_runq_shard {
> >  	struct list_head list;
> >  	spinlock_t lock;
> >  } ____cacheline_aligned;
> >  
> > +struct shared_runq {
> > +	u32 num_shards;
> > +	struct shared_runq_shard shards[];
> > +} ____cacheline_aligned;
> > +
> > +/* This would likely work better as a configurable knob via debugfs */
> > +#define SHARED_RUNQ_SHARD_SZ 6
> > +
> >  #ifdef CONFIG_SMP
> >  static struct shared_runq *rq_shared_runq(struct rq *rq)
> >  {
> >  	return rq->cfs.shared_runq;
> >  }
> >  
> > -static struct task_struct *shared_runq_pop_task(struct rq *rq)
> > +static struct shared_runq_shard *rq_shared_runq_shard(struct rq *rq)
> > +{
> > +	return rq->cfs.shard;
> > +}
> > +
> > +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
> > +{
> > +	return cpu % runq->num_shards;
> 
> I would suggest either:
> 
> 	(cpu >> 1) % num_shards
>
> or keeping num_shards even, to give SMT siblings a fighting chance to
> hit the same bucket.

Given that neither of these approaches guarantees that the SMT siblings
are in the same bucket, I'll just go with your suggestion, which is
simpler.

Seems inevitable that we'll want to have another debugfs knob to adjust
the number of shards, but IMO it's preferable to just apply your
suggestion in v3 and hold off on adding that complexity until we know we
need it.
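
Concretely, something like the following is what I have in mind for v3
(a sketch only; the rest of the code stays as in v2):

	static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
	{
		/*
		 * Pair up adjacent CPU ids so that {2n, 2n + 1} SMT siblings
		 * tend to land in the same shard, per Peter's suggestion.
		 */
		return (cpu >> 1) % runq->num_shards;
	}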

> (I've no idea how SMT4 (or worse SMT8) is typically enumerated, so
> someone from the Power/Sparc/MIPS world would have to go play with that
> if they so care)

Yeah, no idea either. If these things end up varying a lot across
different architectures then we can look into making shard assignment
architecture specific.

> 
> > +}
> 
> > +			num_shards = max(per_cpu(sd_llc_size, i) /
> > +					 SHARED_RUNQ_SHARD_SZ, 1);
> 
> > +			shared_runq->num_shards = num_shards;
> 
>
Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues
Posted by Gautham R. Shenoy 2 years, 7 months ago
On Tue, Jul 11, 2023 at 02:57:57PM -0500, David Vernet wrote:
> On Tue, Jul 11, 2023 at 12:49:58PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 10, 2023 at 03:03:41PM -0500, David Vernet wrote:

[..snip..]

> > > +static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
> > > +{
> > > +	return cpu % runq->num_shards;
> > 
> > I would suggest either:
> > 
> > 	(cpu >> 1) % num_shards
> >
> > or keeping num_shards even, to give SMT siblings a fighting chance to
> > hit the same bucket.
> 
> Given that neither of these approaches guarantees that the SMT siblings
> are in the same bucket, I'll just go with your suggestion which is
> simpler.
> 
> Seems inevitable that we'll want to have another debugfs knob to adjust
> the number of shards, but IMO it's preferable to just apply your
> suggestion in v3 and hold off on adding that complexity until we know we
> need it.
> 
> > (I've no idea how SMT4 (or worse SMT8) is typically enumerated, so
> > someone from the Power/Sparc/MIPS world would have to go play with that
> > if they so care)
> 
> Yeah, no idea either. If these things end up varying a lot across
> different architectures then we can look into making shard assignment
> architecture specific.

On POWER, the SMT siblings are enumerated in a sequential fashion, i.e.

CPU id of a thread = Core_id * threads_per_core + thread_id_within_core.

But IIRC, POWER sets the L2 domain as the LLC. On POWER8 (with SMT8) and
POWER9 (with SMT4 on Baremetal and SMT8 on VMs), LLC size is 8. Even
with SHARED_RUNQ_SHARD_SZ = 6, there will only be 1 shard with the
current formula:

	num_shards = max(per_cpu(sd_llc_size, i)/SHARED_RUNQ_SHARD_SZ, 1);

(Aside: with the above formula, on a topology with 6 < sd_llc_size <
12, num_shards will remain 1, with the shard size exceeding the
intended SHARD_SZ. Was this the intention?)
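
To spell out the integer arithmetic behind that aside, with
SHARED_RUNQ_SHARD_SZ = 6:

	sd_llc_size =  8:  max( 8 / 6, 1) = 1 shard of 8 CPUs
	sd_llc_size = 11:  max(11 / 6, 1) = 1 shard of 11 CPUs
	sd_llc_size = 12:  max(12 / 6, 1) = 2 shards of 6 CPUs each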

Even on x86, there is no uniformity in how the SMT threads are
numbered. On AMD EPYC Baremetal, the first threads of all the cores
are enumerated first and then the sibling threads. So, on an EPYC
server with 128 cores in total, the SMT siblings are {0,128}, {1, 129}, ...

With SHARED_RUNQ_SHARD_SZ = 6,

On Zen2 EPYC Baremetal, with LLC size = 8, num_shards = 1.  This
simplifies stuff!

On Zen3, Zen4 EPYC Baremetal, with LLC size = 16, num_shards = 2.

Here, (cpu % num_shards) ensures that the SMT siblings belong to the
same shard along with 3 other cores.

On some Intel servers, it is possible that the CPU numbers are
interleaved across the two sockets. On my 2-socket, 32-cores-per-socket
Ice Lake server, all the even-numbered CPUs are in one socket and all
the odd-numbered CPUs are in the other socket.

The SMT siblings are {0,64}, {2, 66}, .... on one socket and {1, 65},
{3, 67}, .. on the other.

On this system, LLC size = 64. With SHARED_RUNQ_SHARD_SZ = 6,
num_shards = 10.

So with (cpu % num_shards) the siblings {0, 64} ... will belong to
different shards.
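
(Concretely, 0 % 10 == 0 while 64 % 10 == 4, so the pair {0, 64} lands
in shards 0 and 4.)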

What would be good to have is

1. shard_size determined by individual architectures. If none is
   provided, we pick the default shard_size.

2. A sharding scheme which guarantees that SMT siblings will belong
   to the same shard as long as shard_size is at least as big as the SMT
   size (a sketch of one possible scheme follows below).
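
A minimal sketch of what (2) could look like, purely as an illustration
(not something from the posted series), assuming topology_sibling_cpumask()
reflects the SMT siblings on the architecture in question:

	static int shared_runq_shard_idx(const struct shared_runq *runq, int cpu)
	{
		const struct cpumask *smt_mask = topology_sibling_cpumask(cpu);
		unsigned int nr_smt = max(cpumask_weight(smt_mask), 1U);
		unsigned int first = cpumask_first(smt_mask);

		/*
		 * All siblings share the same first CPU and SMT weight, so
		 * they always hash to the same shard; dividing by the SMT
		 * weight also keeps the index space reasonably dense for
		 * sequentially enumerated topologies.
		 */
		return (first / nr_smt) % runq->num_shards;
	}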

--
Thanks and Regards
gautham.
Re: [PATCH v2 6/7] sched: Shard per-LLC shared runqueues
Posted by Peter Zijlstra 2 years, 7 months ago
On Wed, Jul 12, 2023 at 03:36:27PM +0530, Gautham R. Shenoy wrote:

> On some Intel servers, it is possible that the CPU numbers are
> interleaved across the two sockets. On my 2 socket, 32Cores per socket
> Ice Lake Server, all the even numbered CPUs are in one socket and all
> the odd numbered CPUs in the other socket.
> 
> The SMT siblings are {0,64}, {2, 66}, .... on one socket and {1, 65},
> {3, 67}, .. on the other.

Yeah, Intel SMT enumeration is a mess. There's a random mix of {n,n+1}
and {0..n-1} {n..2n-1}. And then there's the fun hybrid stuff.  Those
appear to do {n,n+1} for the big cores and then continue with the small
cores in a dense set. My 8+8 ADL has:

{0,1} {2,3} {4,5} {6,7} {8,9} {10,11} {12,13} {14,15} {16} {17} {18} {19} {20} {21} {22} {23}

I suspect it might be easier to re-number the whole show at boot to fit
a sane pattern rather than trying to match the various random garbage
gifted to us by the BIOS.


I wouldn't worry about it too much at this point.