sched/fair: Enhance sync wakeup for short duration tasks

[PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Chen Yu 1 year, 7 months ago

Record the average duration of a task, as there is a requirement
to leverage this information for better task placement.

At first thought the (p->se.sum_exec_runtime / p->nvcsw)
can be used to measure the task duration. However, the
history long past was factored too heavily in such a formula.
Ideally, the old activity should decay and not affect
the current status too much.

Although something based on PELT can be used, se.util_avg might
not be appropriate to describe the task duration:
Task p1 and task p2 are doing frequent ping-pong scheduling on
one CPU, both p1 and p2 have a short duration, but the util_avg
of each task can be up to 50%, which is inconsistent with the
short task duration.

Here's an example to show what the average duration is. Suppose
on CPUx, task p1 and p2 run alternatively:

 --------------------> time

 | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
               ^               ^                                     ^
 |_____________|               |_____________________________________|
                                                                     ^
                                                                     |
                                                                  p1 dequeued

p1's duration is (1 + 0.5)ms. Because if p2 does not preempt p1, p1 can run 1.5ms.
This reflects the nature of a task: how long it wishes to run at most.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h |  3 +++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 12 ++++++++++++
 3 files changed, 17 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..78747d3954fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
 	struct callback_head		cid_work;
 #endif
 
+	u64				prev_sleep_sum_runtime;
+	u64				duration_avg;
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..7399c4143528 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	p->prev_sleep_sum_runtime = 0;
+	p->duration_avg = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..445877069fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6833,6 +6833,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p)
+{
+	u64 dur;
+
+	dur = p->se.sum_exec_runtime - p->prev_sleep_sum_runtime;
+	p->prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->duration_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	if (task_sleep)
+		dur_avg_update(p);
+
 	hrtick_update(rq);
 }
 
-- 
2.25.1

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Mike Galbraith 1 year, 7 months ago

On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..7399c4143528 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>         p->migration_pending = NULL;
>  #endif
>         init_sched_mm_cid(p);
> +       p->prev_sleep_sum_runtime = 0;
> +       p->duration_avg = 0;
>  }

Beginning life biased toward stacking?

>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 41b58387023d..445877069fbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
>
> @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  dequeue_throttle:
>         util_est_update(&rq->cfs, p, task_sleep);
> +       if (task_sleep)
> +               dur_avg_update(p);
> +
>         hrtick_update(rq);
>  }
>

That qualifier looks a bit dangerous.  Microbench components tend to
have only one behavior, but the real world goes through all kinds of
nutty gyrations, intentional and otherwise.

The heuristics in the next patch seem to exhibit a healthy level of
paranoia, but these bits could perhaps use a tad more.  Bad experiences
springs to mind when I stare at that - sleepers going hog, hogs meet
sleeping lock contention, preemption, sync hint not meaning much...

	-Mike

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Chen Yu 1 year, 7 months ago

Hi Mike,

Thanks for your time and giving the insights.

On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0935f9d4bb7b..7399c4143528 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> >         p->migration_pending = NULL;
> >  #endif
> >         init_sched_mm_cid(p);
> > +       p->prev_sleep_sum_runtime = 0;
> > +       p->duration_avg = 0;
> >  }
> 
> Beginning life biased toward stacking?
>

OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
task stacking in the beginning.

> >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 41b58387023d..445877069fbf 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> >
> > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  dequeue_throttle:
> >         util_est_update(&rq->cfs, p, task_sleep);
> > +       if (task_sleep)
> > +               dur_avg_update(p);
> > +
> >         hrtick_update(rq);
> >  }
> >
> 
> That qualifier looks a bit dangerous.  Microbench components tend to
> have only one behavior, but the real world goes through all kinds of
> nutty gyrations, intentional and otherwise.
>

Understand. Unfortunately I don't have access to production environment
so I have to rely on microbenchmarks and a OLTP to check the result. I
get feedback from Abel that the former version of this patch brought
benefit to some short tasks like Redis in the production environment[1].
https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/

I can launch a combination of microbenchmarks in parallel to check the impact.

> The heuristics in the next patch seem to exhibit a healthy level of
> paranoia, but these bits could perhaps use a tad more.  Bad experiences
> springs to mind when I stare at that - sleepers going hog, hogs meet
> sleeping lock contention, preemption, sync hint not meaning much...
>

I see. If I understand correctly, the scenario mentioned above
could bring a false positive of 'short task', which causes
task stacking.

If the sleeper task:
1. is preempted frequently. This should not be a problem, because
   the task duration is unlikely to be shorten by preemption, and
   the short_task() is unlikely to return true.
   Because the task duration will span the time from the task is
   scheduled in, to finally scheduled out due to sleeping(dequeue_task_fair()),
   but not by preemption.
   This time duration should be long enough to not trigger the 'short task'
   case. But since there is a delay queue mechanism under development,
   calculating the duration in dequeue_task_fair() in current patch might not
   be a proper method anymore.

2. meets sleeping lock contention. This would be a false positive 'short task',
   which bring unexpected task stacking.

So consider 1 and 2, I'm thinking of moving the calculating of task duration from
dequeue_task_fair() to wait_woken(). The reason to update the task's duration in
wait_woken() rather than dequeue_task_fair() is that, the former is aware of the
scenario that the task is waiting for the real resource, rather than blocking
on a random sleeping lock. And the wait_woken() is widely used by the driver to
indicate that task is waiting for the resource. For example, the netperf calltrace:

    schedule_timeout+222
    wait_woken+84
    sk_wait_data+378
    tcp_recvmsg_locked+507
    tcp_recvmsg+115
    inet_recvmsg+90
    sock_recvmsg+150

In the future, if there is requirement other scenario could also invoke the newly
introduced update_cur_duration() when needed. For example, the pipe_read() could
use it if the task is going to sleep due to the empty pipe buffer. I change the
code as below, may I have your suggestion on this?

thanks,
Chenyu

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..78747d3954fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
 	struct callback_head		cid_work;
 #endif

+	u64				prev_sleep_sum_runtime;
+	u64				duration_avg;
+
 	struct tlbflush_unmap_batch	tlb_ubc;

 	/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..7399c4143528 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	p->prev_sleep_sum_runtime = 0;
+	p->duration_avg = 0;
 }

 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..bbeba36d0145 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -744,6 +744,19 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }

+void update_curr_duration(void)
+{
+	struct sched_entity *curr = &current->se;
+	unsigned long flags;
+	u64 dur;
+
+	local_irq_save(flags);
+	dur = curr->sum_exec_runtime - current->prev_sleep_sum_runtime;
+	current->prev_sleep_sum_runtime = curr->sum_exec_runtime;
+	update_avg(&current->duration_avg, dur);
+	local_irq_restore(flags);
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62fd8bc6fd08..7beb604ca76b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3574,6 +3574,7 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }

 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+extern void update_curr_duration(void);

 #ifdef CONFIG_RT_MUTEXES

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..a0004cc7454f 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -419,8 +419,10 @@ long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout)
 	 * or woken_wake_function() sees our store to current->state.
 	 */
 	set_current_state(mode); /* A */
-	if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park())
+	if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park()) {
+		update_curr_duration();
 		timeout = schedule_timeout(timeout);
+	}
 	__set_current_state(TASK_RUNNING);

 	/*
-- 
2.25.1

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Madadi Vineeth Reddy 1 year, 6 months ago

Hi Chen Yu,

On 30/06/24 18:39, Chen Yu wrote:
> Hi Mike,
> 
> Thanks for your time and giving the insights.
> 
> On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
>> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 0935f9d4bb7b..7399c4143528 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>>>         p->migration_pending = NULL;
>>>  #endif
>>>         init_sched_mm_cid(p);
>>> +       p->prev_sleep_sum_runtime = 0;
>>> +       p->duration_avg = 0;
>>>  }
>>
>> Beginning life biased toward stacking?
>>
> 
> OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> task stacking in the beginning.
>    
>>>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 41b58387023d..445877069fbf 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>>
>>> @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>  
>>>  dequeue_throttle:
>>>         util_est_update(&rq->cfs, p, task_sleep);
>>> +       if (task_sleep)
>>> +               dur_avg_update(p);
>>> +
>>>         hrtick_update(rq);
>>>  }
>>>
>>
>> That qualifier looks a bit dangerous.  Microbench components tend to
>> have only one behavior, but the real world goes through all kinds of
>> nutty gyrations, intentional and otherwise.
>>
> 
> Understand. Unfortunately I don't have access to production environment
> so I have to rely on microbenchmarks and a OLTP to check the result. I
> get feedback from Abel that the former version of this patch brought
> benefit to some short tasks like Redis in the production environment[1].
> https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/

Since the discussion was about real-life workload performance, I ran the DayTrader
workload with three users and three instances. The results show no performance
regression, and a 1% performance gain was observed, which is within the standard
deviation.

Thanks and Regards
Madadi Vineeth Reddy

> 
> I can launch a combination of microbenchmarks in parallel to check the impact.
> 
>> The heuristics in the next patch seem to exhibit a healthy level of
>> paranoia, but these bits could perhaps use a tad more.  Bad experiences
>> springs to mind when I stare at that - sleepers going hog, hogs meet
>> sleeping lock contention, preemption, sync hint not meaning much...

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Chen Yu 1 year, 6 months ago

On 2024-08-05 at 10:08:35 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> On 30/06/24 18:39, Chen Yu wrote:
> > Hi Mike,
> > 
> > Thanks for your time and giving the insights.
> > 
> > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> >> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> >>>
> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>> index 0935f9d4bb7b..7399c4143528 100644
> >>> --- a/kernel/sched/core.c
> >>> +++ b/kernel/sched/core.c
> >>> @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> >>>         p->migration_pending = NULL;
> >>>  #endif
> >>>         init_sched_mm_cid(p);
> >>> +       p->prev_sleep_sum_runtime = 0;
> >>> +       p->duration_avg = 0;
> >>>  }
> >>
> >> Beginning life biased toward stacking?
> >>
> > 
> > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> > task stacking in the beginning.
> >    
> >>>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 41b58387023d..445877069fbf 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>>
> >>> @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>>  
> >>>  dequeue_throttle:
> >>>         util_est_update(&rq->cfs, p, task_sleep);
> >>> +       if (task_sleep)
> >>> +               dur_avg_update(p);
> >>> +
> >>>         hrtick_update(rq);
> >>>  }
> >>>
> >>
> >> That qualifier looks a bit dangerous.  Microbench components tend to
> >> have only one behavior, but the real world goes through all kinds of
> >> nutty gyrations, intentional and otherwise.
> >>
> > 
> > Understand. Unfortunately I don't have access to production environment
> > so I have to rely on microbenchmarks and a OLTP to check the result. I
> > get feedback from Abel that the former version of this patch brought
> > benefit to some short tasks like Redis in the production environment[1].
> > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/
> 
> Since the discussion was about real-life workload performance, I ran the DayTrader
> workload with three users and three instances. The results show no performance
> regression, and a 1% performance gain was observed, which is within the standard
> deviation.
>

Thank you Madadi for your test, let me respin this version and send it out with the
fixes after talked with Mike.

thanks,
Chenyu

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Mike Galbraith 1 year, 7 months ago

On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> Hi Mike,
>
> Thanks for your time and giving the insights.
>
> On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0935f9d4bb7b..7399c4143528 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > >         p->migration_pending = NULL;
> > >  #endif
> > >         init_sched_mm_cid(p);
> > > +       p->prev_sleep_sum_runtime = 0;
> > > +       p->duration_avg = 0;
> > >  }
> >
> > Beginning life biased toward stacking?
> >
>
> OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> task stacking in the beginning.

Or something, definitely.

> > >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 41b58387023d..445877069fbf 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > >
> > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  
> > >  dequeue_throttle:
> > >         util_est_update(&rq->cfs, p, task_sleep);
> > > +       if (task_sleep)
> > > +               dur_avg_update(p);
> > > +
> > >         hrtick_update(rq);
> > >  }
> > >
> >
> > That qualifier looks a bit dangerous.  Microbench components tend to
> > have only one behavior, but the real world goes through all kinds of
> > nutty gyrations, intentional and otherwise.
> >
>
> Understand. Unfortunately I don't have access to production environment
> so I have to rely on microbenchmarks and a OLTP to check the result. I
> get feedback from Abel that the former version of this patch brought
> benefit to some short tasks like Redis in the production environment[1].
> https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/

Here's hoping you get a lot more.

> I can launch a combination of microbenchmarks in parallel to check the impact.
>
> > The heuristics in the next patch seem to exhibit a healthy level of
> > paranoia, but these bits could perhaps use a tad more.  Bad experiences
> > springs to mind when I stare at that - sleepers going hog, hogs meet
> > sleeping lock contention, preemption, sync hint not meaning much...
> >
>
> I see. If I understand correctly, the scenario mentioned above
> could bring a false positive of 'short task', which causes
> task stacking.

Yeah.

> If the sleeper task:
> 1. is preempted frequently. This should not be a problem, because
>    the task duration is unlikely to be shorten by preemption, and
>    the short_task() is unlikely to return true.

Ah, but it not being a problem would beg the question why bother?
There are consequences either way, the dark side is just scarier.

> 2. meets sleeping lock contention. This would be a false positive 'short task',
>    which bring unexpected task stacking.

Yeah.

> So consider 1 and 2, I'm thinking of moving the calculating of task duration from
> dequeue_task_fair() to wait_woken(). The reason to update the task's duration in
> wait_woken() rather than dequeue_task_fair() is that, the former is aware of the
> scenario that the task is waiting for the real resource, rather than blocking
> on a random sleeping lock. And the wait_woken() is widely used by the driver to
> indicate that task is waiting for the resource. For example, the netperf calltrace:
>
>     schedule_timeout+222
>     wait_woken+84
>     sk_wait_data+378
>     tcp_recvmsg_locked+507
>     tcp_recvmsg+115
>     inet_recvmsg+90
>     sock_recvmsg+150
>
> In the future, if there is requirement other scenario could also invoke the newly
> introduced update_cur_duration() when needed. For example, the pipe_read() could
> use it if the task is going to sleep due to the empty pipe buffer. I change the
> code as below, may I have your suggestion on this?

I don't have any suggestions that will help plug the holes, heck, I
squabbled in this arena quite a bit some years ago, and did not win.
Frankly I don't think the scheduler has the information necessary to do
so, it'll always be a case of this will likely do less harm than good,
but will certainly leave at least an occasional mark.

Just take a look at the high speed ping-pong thing (not a benchmark,
that's a box full of tape measures, rather silly, but..).  TCP_RR IS
1:1, has as short a duration as network stack plus scheduler can
possibly make it, and is nearly synchronous to boot, two halves of a
whole, the ONLY thing you can certainly safely stack.. but a shared L2
box still takes a wee hit when you do so.  1:N or M:N tasks can
approach its wakeup frequency range, and there's nothing you can do
about the very same cache to cache latency you're trying to duck, it
just is what it is, and is considered perfectly fine as it is.  That's
a bit of a red flag, but worse is the lack of knowledge wrt what tasks
are actually up to at any given time.  We rashly presume that tasks
waking one another implies a 1:1 relationship, we routinely call them
buddies and generally get away with it.. but during any overlap they
can be doing anything including N way data share, and regardless of
what that is and section size, needless stacking flushes concurrency,
injecting service latency in its place, cost unknown.

Intentional stacking can be jokingly equated to injecting just a wee
bit of SMP kryptonite.. it'll be fine.. at least until it's not ;-)

	-Mike

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Chen Yu 1 year, 7 months ago

Hi Mike,

On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> > Hi Mike,
> >
> > Thanks for your time and giving the insights.
> >
> > On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> > > On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> > > >
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 0935f9d4bb7b..7399c4143528 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > > >         p->migration_pending = NULL;
> > > >  #endif
> > > >         init_sched_mm_cid(p);
> > > > +       p->prev_sleep_sum_runtime = 0;
> > > > +       p->duration_avg = 0;
> > > >  }
> > >
> > > Beginning life biased toward stacking?
> > >
> >
> > OK, so I should change the short_task() to skip the 0 duration_avg, to avoid
> > task stacking in the beginning.
> 
> Or something, definitely.
>    
> > > >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 41b58387023d..445877069fbf 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > >
> > > > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >  
> > > >  dequeue_throttle:
> > > >         util_est_update(&rq->cfs, p, task_sleep);
> > > > +       if (task_sleep)
> > > > +               dur_avg_update(p);
> > > > +
> > > >         hrtick_update(rq);
> > > >  }
> > > >
> > >
> > > That qualifier looks a bit dangerous.  Microbench components tend to
> > > have only one behavior, but the real world goes through all kinds of
> > > nutty gyrations, intentional and otherwise.
> > >
> >
> > Understand. Unfortunately I don't have access to production environment
> > so I have to rely on microbenchmarks and a OLTP to check the result. I
> > get feedback from Abel that the former version of this patch brought
> > benefit to some short tasks like Redis in the production environment[1].
> > https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/
> 
> Here's hoping you get a lot more.
>

We recently received a simulated benchmark for Meta. I'll conduct some tests
to see the results.

> > So consider 1 and 2, I'm thinking of moving the calculating of task duration from
> > dequeue_task_fair() to wait_woken(). The reason to update the task's duration in
> > wait_woken() rather than dequeue_task_fair() is that, the former is aware of the
> > scenario that the task is waiting for the real resource, rather than blocking
> > on a random sleeping lock. And the wait_woken() is widely used by the driver to
> > indicate that task is waiting for the resource. For example, the netperf calltrace:
> >
> >     schedule_timeout+222
> >     wait_woken+84
> >     sk_wait_data+378
> >     tcp_recvmsg_locked+507
> >     tcp_recvmsg+115
> >     inet_recvmsg+90
> >     sock_recvmsg+150
> >
> > In the future, if there is requirement other scenario could also invoke the newly
> > introduced update_cur_duration() when needed. For example, the pipe_read() could
> > use it if the task is going to sleep due to the empty pipe buffer. I change the
> > code as below, may I have your suggestion on this?
> 
> I don't have any suggestions that will help plug the holes, heck, I
> squabbled in this arena quite a bit some years ago, and did not win.
> Frankly I don't think the scheduler has the information necessary to do
> so, it'll always be a case of this will likely do less harm than good,
> but will certainly leave at least an occasional mark.
>

I agree. Unlike bug fixing, this kind of change usually involves trade-offs.
The attempt is to not do harm for most cases, and bring benefit for some cases.

Regarding the necessary information, non-scheduler components might have
better knowledge than the scheduler:

1. If the hint comes from user space, it could be something like
   sched_attr::sync_wakeup.
   It indicates that the task prefers sync-wakeup and is tolerant of task
   stacking. However, I'm unsure how this could be accepted by the community
   if we want to change the user space interface. What is the criteria to
   accept such change? Would the requirement from different production
   environment be considered as the endorsement?

2. If the hint comes from other components in the kernel, it could be the
   driver or others. seccomp in the current kernel code, enforces waking the
   wakee on the current CPU via the WF_CURRENT_CPU flag. However, WF_CURRENT_CPU
   seems a bit aggressive for ordinary tasks. So, wait_woken() could be used when
   needed (by the network component, for example) to indicate a possible
   cache/resource sensitivity of the wakee, and together with the task duration,
   to decide if the wakee can be placed on the current CPU.

> Just take a look at the high speed ping-pong thing (not a benchmark,
> that's a box full of tape measures, rather silly, but..).  TCP_RR IS
> 1:1, has as short a duration as network stack plus scheduler can
> possibly make it, and is nearly synchronous to boot, two halves of a
> whole, the ONLY thing you can certainly safely stack..

I agree, this is a limited scenario. 

> but a shared L2 box still takes a wee hit when you do so.

According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
share the same L2 cache, around 20% improvement was noticed (though not as much
as on the non-L2 shared platform). I haven't delved into the details yet, but my
understanding is that L1 cache-to-cache latency within the L2 domain might also
matter on large servers (which I need to investigate further).

> 1:N or M:N
> tasks can approach its wakeup frequency range, and there's nothing you can do
> about the very same cache to cache latency you're trying to duck, it
> just is what it is, and is considered perfectly fine as it is.  That's
> a bit of a red flag, but worse is the lack of knowledge wrt what tasks
> are actually up to at any given time.  We rashly presume that tasks
> waking one another implies a 1:1 relationship, we routinely call them
> buddies and generally get away with it.. but during any overlap they
> can be doing anything including N way data share, and regardless of
> what that is and section size, needless stacking flushes concurrency,
> injecting service latency in its place, cost unknown.
>

I believe this is a generic issue that the current scheduler faces, where
it attempts to predict the task's behavior based on its runtime. For instance,
task_hot() checks the task runtime to predict whether the task is cache-hot,
regardless of what the task does during its time slice. This is also the case
with WF_SYNC, which provides the scheduler with a hint to wake up on the current
CPU to potentially benefit from cache locality.

A thought occurred to me that one possible method to determine if the waker
and wakee share data could be to leverage the NUMA balance's numa_group data structure.
As numa balance periodically scans the task's VMA space and groups tasks accessing
the same physical page into one numa_group, we can infer that if the waker and wakee
are within the same numa_group, they are likely to share data, and it might be
appropriate to place the wakee on top of the waker.

CC Raghavendra here in case he has any insights.

> Intentional stacking can be jokingly equated to injecting just a wee
> bit of SMP kryptonite.. it'll be fine.. at least until it's not ;-)
>

I fully understand your concern, and this analogy is very interesting.
We will conduct additional tests and share the data/analysis later.

thanks,
Chenyu

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Raghavendra K T 1 year, 7 months ago


On 7/1/2024 8:27 PM, Chen Yu wrote:
> Hi Mike,
> 
> On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
>> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
>>> Hi Mike,
>>>
>>> Thanks for your time and giving the insights.
> 
> According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
> share the same L2 cache, around 20% improvement was noticed (though not as much
> as on the non-L2 shared platform). I haven't delved into the details yet, but my
> understanding is that L1 cache-to-cache latency within the L2 domain might also
> matter on large servers (which I need to investigate further).
> 
>> 1:N or M:N
>> tasks can approach its wakeup frequency range, and there's nothing you can do
>> about the very same cache to cache latency you're trying to duck, it
>> just is what it is, and is considered perfectly fine as it is.  That's
>> a bit of a red flag, but worse is the lack of knowledge wrt what tasks
>> are actually up to at any given time.  We rashly presume that tasks
>> waking one another implies a 1:1 relationship, we routinely call them
>> buddies and generally get away with it.. but during any overlap they
>> can be doing anything including N way data share, and regardless of
>> what that is and section size, needless stacking flushes concurrency,
>> injecting service latency in its place, cost unknown.
>>
> 
> I believe this is a generic issue that the current scheduler faces, where
> it attempts to predict the task's behavior based on its runtime. For instance,
> task_hot() checks the task runtime to predict whether the task is cache-hot,
> regardless of what the task does during its time slice. This is also the case
> with WF_SYNC, which provides the scheduler with a hint to wake up on the current
> CPU to potentially benefit from cache locality.
> 
> A thought occurred to me that one possible method to determine if the waker
> and wakee share data could be to leverage the NUMA balance's numa_group data structure.
> As numa balance periodically scans the task's VMA space and groups tasks accessing
> the same physical page into one numa_group, we can infer that if the waker and wakee
> are within the same numa_group, they are likely to share data, and it might be
> appropriate to place the wakee on top of the waker.
> 
> CC Raghavendra here in case he has any insights.
> 

Agree with your thought here,

So I imagine two possible things to explore here.

1) Use task1, task2 numa_group and check if they belong to same
numa_group, also check if there is a possibility of M:N relationship
by checking if t1/t2->numa_group->nr_tasks > 1 etc

2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
(threads) possibly interested in same VMA.
Latter one looks to be practically difficult because we don't want to
sweep across VMAs perhaps..

> thanks,
> Chenyu

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Chen Yu 1 year, 7 months ago

On 2024-07-03 at 14:04:47 +0530, Raghavendra K T wrote:
> 
> 
> On 7/1/2024 8:27 PM, Chen Yu wrote:
> > Hi Mike,
> > 
> > On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
> > > On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> > > > Hi Mike,
> > > > 
> > > > Thanks for your time and giving the insights.
> > 
> > According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
> > share the same L2 cache, around 20% improvement was noticed (though not as much
> > as on the non-L2 shared platform). I haven't delved into the details yet, but my
> > understanding is that L1 cache-to-cache latency within the L2 domain might also
> > matter on large servers (which I need to investigate further).
> > 
> > > 1:N or M:N
> > > tasks can approach its wakeup frequency range, and there's nothing you can do
> > > about the very same cache to cache latency you're trying to duck, it
> > > just is what it is, and is considered perfectly fine as it is.  That's
> > > a bit of a red flag, but worse is the lack of knowledge wrt what tasks
> > > are actually up to at any given time.  We rashly presume that tasks
> > > waking one another implies a 1:1 relationship, we routinely call them
> > > buddies and generally get away with it.. but during any overlap they
> > > can be doing anything including N way data share, and regardless of
> > > what that is and section size, needless stacking flushes concurrency,
> > > injecting service latency in its place, cost unknown.
> > > 
> > 
> > I believe this is a generic issue that the current scheduler faces, where
> > it attempts to predict the task's behavior based on its runtime. For instance,
> > task_hot() checks the task runtime to predict whether the task is cache-hot,
> > regardless of what the task does during its time slice. This is also the case
> > with WF_SYNC, which provides the scheduler with a hint to wake up on the current
> > CPU to potentially benefit from cache locality.
> > 
> > A thought occurred to me that one possible method to determine if the waker
> > and wakee share data could be to leverage the NUMA balance's numa_group data structure.
> > As numa balance periodically scans the task's VMA space and groups tasks accessing
> > the same physical page into one numa_group, we can infer that if the waker and wakee
> > are within the same numa_group, they are likely to share data, and it might be
> > appropriate to place the wakee on top of the waker.
> > 
> > CC Raghavendra here in case he has any insights.
> > 
> 
> Agree with your thought here,
>

Thanks for taking a look at this, Raghavendra.

> So I imagine two possible things to explore here.
> 
> 1) Use task1, task2 numa_group and check if they belong to same
> numa_group, also check if there is a possibility of M:N relationship
> by checking if t1/t2->numa_group->nr_tasks > 1 etc
>

I see, so do you mean if it is M:N rather than 1:1, we should avoid the
task been woken up on current CPU to avoid task stacking?
 
> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
> (threads) possibly interested in same VMA.
> Latter one looks to be practically difficult because we don't want to
> sweep across VMAs perhaps..
>

Yeah, we might have to scan the whole VMAs to gather the PID information which
might be a little costly(or maybe subset of the VMAs). And the pids_active[] is
for threads rather than process, stacking the threads seem to not be good
enough(per Mike's comments)

Anyway, I'm preparing for some full tests to see if there is overall benefit
from current version. Later let's investigate if numa balance information could
help here.

thanks,
Chenyu

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Raghavendra K T 1 year, 7 months ago


On 7/3/2024 6:42 PM, Chen Yu wrote:
> On 2024-07-03 at 14:04:47 +0530, Raghavendra K T wrote:
>>
>>
>> On 7/1/2024 8:27 PM, Chen Yu wrote:
>>> Hi Mike,
>>>
>>> On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
>>>> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
>>>>> Hi Mike,
>>>>>
>>>>> Thanks for your time and giving the insights.
>>>
>>> According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
>>> share the same L2 cache, around 20% improvement was noticed (though not as much
>>> as on the non-L2 shared platform). I haven't delved into the details yet, but my
>>> understanding is that L1 cache-to-cache latency within the L2 domain might also
>>> matter on large servers (which I need to investigate further).
>>>
>>>> 1:N or M:N
>>>> tasks can approach its wakeup frequency range, and there's nothing you can do
>>>> about the very same cache to cache latency you're trying to duck, it
>>>> just is what it is, and is considered perfectly fine as it is.  That's
>>>> a bit of a red flag, but worse is the lack of knowledge wrt what tasks
>>>> are actually up to at any given time.  We rashly presume that tasks
>>>> waking one another implies a 1:1 relationship, we routinely call them
>>>> buddies and generally get away with it.. but during any overlap they
>>>> can be doing anything including N way data share, and regardless of
>>>> what that is and section size, needless stacking flushes concurrency,
>>>> injecting service latency in its place, cost unknown.
>>>>
>>>
>>> I believe this is a generic issue that the current scheduler faces, where
>>> it attempts to predict the task's behavior based on its runtime. For instance,
>>> task_hot() checks the task runtime to predict whether the task is cache-hot,
>>> regardless of what the task does during its time slice. This is also the case
>>> with WF_SYNC, which provides the scheduler with a hint to wake up on the current
>>> CPU to potentially benefit from cache locality.
>>>
>>> A thought occurred to me that one possible method to determine if the waker
>>> and wakee share data could be to leverage the NUMA balance's numa_group data structure.
>>> As numa balance periodically scans the task's VMA space and groups tasks accessing
>>> the same physical page into one numa_group, we can infer that if the waker and wakee
>>> are within the same numa_group, they are likely to share data, and it might be
>>> appropriate to place the wakee on top of the waker.
>>>
>>> CC Raghavendra here in case he has any insights.
>>>
>>
>> Agree with your thought here,
>>
> 
> Thanks for taking a look at this, Raghavendra.
> 
>> So I imagine two possible things to explore here.
>>
>> 1) Use task1, task2 numa_group and check if they belong to same
>> numa_group, also check if there is a possibility of M:N relationship
>> by checking if t1/t2->numa_group->nr_tasks > 1 etc
>>
> 
> I see, so do you mean if it is M:N rather than 1:1, we should avoid the
> task been woken up on current CPU to avoid task stacking?

Not sure actually, perhaps depends on usecase. But atleast gives an idea
on the relationship.
Problem here is we only know that they belong to same numa_group,
But we cannot deduce actual relationship (we only know it is > 1)
(1:N or M:N is not known).

>   
>> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
>> (threads) possibly interested in same VMA.
>> Latter one looks to be practically difficult because we don't want to
>> sweep across VMAs perhaps..
>>
> 
> Yeah, we might have to scan the whole VMAs to gather the PID information which
> might be a little costly(or maybe subset of the VMAs). And the pids_active[] is
> for threads rather than process, stacking the threads seem to not be good
> enough(per Mike's comments)
> 
> Anyway, I'm preparing for some full tests to see if there is overall benefit
> from current version. Later let's investigate if numa balance information could
> help here.
>

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Mike Galbraith 1 year, 7 months ago

On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote:
>
>
> On 7/1/2024 8:27 PM, Chen Yu wrote:
> >
> > A thought occurred to me that one possible method to determine if the waker
> > and wakee share data could be to leverage the NUMA balance's numa_group data structure.
> > As numa balance periodically scans the task's VMA space and groups tasks accessing
> > the same physical page into one numa_group, we can infer that if the waker and wakee
> > are within the same numa_group, they are likely to share data, and it might be
> > appropriate to place the wakee on top of the waker.
> >
> > CC Raghavendra here in case he has any insights.
> >
>
> Agree with your thought here,
>
> So I imagine two possible things to explore here.
>
> 1) Use task1, task2 numa_group and check if they belong to same
> numa_group, also check if there is a possibility of M:N relationship
> by checking if t1/t2->numa_group->nr_tasks > 1 etc
>
> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
> (threads) possibly interested in same VMA.
> Latter one looks to be practically difficult because we don't want to
> sweep across VMAs perhaps..

Oooh dear.. as soon as you mention threads, the question of who's
wheelhouse is this in springs to mind, ie should the kernel be
overriding userspace by targeting bits of threaded programs for forced
serialization?

Bah, think I'll just bugger off and let you guys have a go at making
this stacking business do less harm than good.

	-Mike

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Raghavendra K T 1 year, 7 months ago


On 7/3/2024 5:27 PM, Mike Galbraith wrote:
> On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote:
>>
>>
>> On 7/1/2024 8:27 PM, Chen Yu wrote:
>>>
>>> A thought occurred to me that one possible method to determine if the waker
>>> and wakee share data could be to leverage the NUMA balance's numa_group data structure.
>>> As numa balance periodically scans the task's VMA space and groups tasks accessing
>>> the same physical page into one numa_group, we can infer that if the waker and wakee
>>> are within the same numa_group, they are likely to share data, and it might be
>>> appropriate to place the wakee on top of the waker.
>>>
>>> CC Raghavendra here in case he has any insights.
>>>
>>
>> Agree with your thought here,
>>
>> So I imagine two possible things to explore here.
>>
>> 1) Use task1, task2 numa_group and check if they belong to same
>> numa_group, also check if there is a possibility of M:N relationship
>> by checking if t1/t2->numa_group->nr_tasks > 1 etc
>>
>> 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
>> (threads) possibly interested in same VMA.
>> Latter one looks to be practically difficult because we don't want to
>> sweep across VMAs perhaps..
> 
> Oooh dear.. as soon as you mention threads, the question of who's
> wheelhouse is this in springs to mind, ie should the kernel be
> overriding userspace by targeting bits of threaded programs for forced
> serialization?
> 

Yes.. There is no ROI on this option (mentioned only for completeness).
also we are not looking beyond process. Rather than "Practically
  difficult" I should have rephrased as Practically not an option.

> Bah, think I'll just bugger off and let you guys have a go at making
> this stacking business do less harm than good.
> 
> 	-Mike

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Chen Yu 1 year, 7 months ago

Hi Mike,

On 2024-07-03 at 13:57:19 +0200, Mike Galbraith wrote:
> On Wed, 2024-07-03 at 14:04 +0530, Raghavendra K T wrote:
> >
> >
> > On 7/1/2024 8:27 PM, Chen Yu wrote:
> > >
> > > A thought occurred to me that one possible method to determine if the waker
> > > and wakee share data could be to leverage the NUMA balance's numa_group data structure.
> > > As numa balance periodically scans the task's VMA space and groups tasks accessing
> > > the same physical page into one numa_group, we can infer that if the waker and wakee
> > > are within the same numa_group, they are likely to share data, and it might be
> > > appropriate to place the wakee on top of the waker.
> > >
> > > CC Raghavendra here in case he has any insights.
> > >
> >
> > Agree with your thought here,
> >
> > So I imagine two possible things to explore here.
> >
> > 1) Use task1, task2 numa_group and check if they belong to same
> > numa_group, also check if there is a possibility of M:N relationship
> > by checking if t1/t2->numa_group->nr_tasks > 1 etc
> >
> > 2) Given a VMA we can use vma_numab_state pids_active[] if task1, task2
> > (threads) possibly interested in same VMA.
> > Latter one looks to be practically difficult because we don't want to
> > sweep across VMAs perhaps..
> 
> Oooh dear.. as soon as you mention threads, the question of who's
> wheelhouse is this in springs to mind, ie should the kernel be
> overriding userspace by targeting bits of threaded programs for forced
> serialization?
>

Do you mean the threads within the same process should run in parallel
as much as possible, regardless of sharing the same data, because the thread
is designed to do so? If so then we should probably skip the 1:1 task stacking
if the waker and the wakee are in the same process.
 
thanks,
Chenyu

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Mike Galbraith 1 year, 7 months ago

On Wed, 2024-07-03 at 21:23 +0800, Chen Yu wrote:
> On 2024-07-03 at 13:57:19 +0200, Mike Galbraith wrote:
> >
> > Oooh dear.. as soon as you mention threads, the question of who's
> > wheelhouse is this in springs to mind, ie should the kernel be
> > overriding userspace by targeting bits of threaded programs for forced
> > serialization?
>
> Do you mean the threads within the same process should run in parallel
> as much as possible, regardless of sharing the same data, because the thread
> is designed to do so? If so then we should probably skip the 1:1 task stacking
> if the waker and the wakee are in the same process.

That's seems like the obviously correct answer.. except for that
leaving virtually the entire issue you're squabbling with lying on the
floor.  Grr, how annoying.

	-Mike

Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Posted by Mike Galbraith 1 year, 7 months ago

On Mon, 2024-07-01 at 22:57 +0800, Chen Yu wrote:
> > Just take a look at the high speed ping-pong thing (not a benchmark,
> > that's a box full of tape measures, rather silly, but..).  TCP_RR IS
> > 1:1, has as short a duration as network stack plus scheduler can
> > possibly make it, and is nearly synchronous to boot, two halves of a
> > whole, the ONLY thing you can certainly safely stack..
>
> I agree, this is a limited scenario.
>
> > but a shared L2 box still takes a wee hit when you do so.
>
> According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
> share the same L2 cache, around 20% improvement was noticed (though not as much
> as on the non-L2 shared platform).

This dinky box doesn't have 500 cores, but it's.. aw, adorable :)

rpi4:/root # ONLY=TCP_RR netperf.sh
TCP_RR-1        unbound    Avg:  31754  Sum:    31754
TCP_RR-1        stacked    Avg:  26625  Sum:    26625
TCP_RR-1        cross-core Avg:  32325  Sum:    32325

rpi4:/root # tbench.sh 1 30 2>&1|grep Throughput
Throughput 139.024 MB/sec  1 clients  1 procs  max_latency=1.116 ms
rpi4:/root # taskset -c 3 tbench.sh 1 30 2>&1|grep Throughput
Throughput 116.765 MB/sec  1 clients  1 procs  max_latency=0.340 ms
rpi4:/root #

This little box running its stock 6.6.33 distro kernel pulls out a
cross-core win for both maximally synchronous TCP_RR and the a bit
lesser so but still pretty close tbench.  The numbers mean little
though, one propagation speed is lovely, but were there more, I'd be as
stuck with them as I am with rpi4's one-speed (all ahead slow) gearbox.

	-Mike