The default preemption policy for voluntary preemption under
PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
class, and lazily for well-behaved, non-idle tasks.
This is the same policy as preempt=none, but with eager handling of
higher-priority scheduling classes.
So, compare a SCHED_DEADLINE workload (stress-ng --cyclic) with a
background kernel load of 'stress-ng --mmap':
# stress-ng --mmap 0 &
# stress-ng --cyclic 1 --timeout 300
PREEMPT_AUTO, preempt=none
stress-ng: info: [8827] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [8827] dispatching hogs: 1 cyclic
stress-ng: info: [8828] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [8828] cyclic: mean: 23829.70 ns, mode: 3317 ns
stress-ng: info: [8828] cyclic: min: 2688 ns, max: 5701735 ns, std.dev. 123502.57
stress-ng: info: [8828] cyclic: latency percentiles:
stress-ng: info: [8828] cyclic: 25.00%: 6289 ns
stress-ng: info: [8828] cyclic: 50.00%: 13945 ns
stress-ng: info: [8828] cyclic: 75.00%: 25335 ns
stress-ng: info: [8828] cyclic: 90.00%: 34500 ns
stress-ng: info: [8828] cyclic: 95.40%: 41470 ns
stress-ng: info: [8828] cyclic: 99.00%: 85602 ns
stress-ng: info: [8828] cyclic: 99.50%: 301099 ns
stress-ng: info: [8828] cyclic: 99.90%: 1798633 ns
stress-ng: info: [8828] cyclic: 99.99%: 5701735 ns
stress-ng: info: [8827] successful run completed in 300.01s (5 mins, 0.01 secs)
PREEMPT_AUTO, preempt=voluntary
stress-ng: info: [8883] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [8883] dispatching hogs: 1 cyclic
stress-ng: info: [8884] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [8884] cyclic: mean: 14169.08 ns, mode: 3355 ns
stress-ng: info: [8884] cyclic: min: 2570 ns, max: 2234939 ns, std.dev. 66056.95
stress-ng: info: [8884] cyclic: latency percentiles:
stress-ng: info: [8884] cyclic: 25.00%: 3665 ns
stress-ng: info: [8884] cyclic: 50.00%: 5409 ns
stress-ng: info: [8884] cyclic: 75.00%: 16009 ns
stress-ng: info: [8884] cyclic: 90.00%: 24392 ns
stress-ng: info: [8884] cyclic: 95.40%: 28827 ns
stress-ng: info: [8884] cyclic: 99.00%: 39153 ns
stress-ng: info: [8884] cyclic: 99.50%: 168643 ns
stress-ng: info: [8884] cyclic: 99.90%: 1186267 ns
stress-ng: info: [8884] cyclic: 99.99%: 2234939 ns
stress-ng: info: [8883] successful run completed in 300.01s (5 mins, 0.01 secs)
So, the latency improves significantly -- even at the 25th percentile
threshold.
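[Not part of the patch: the size of the improvement can be recomputed directly from the two percentile tables above. A standalone sketch, with the latency values copied verbatim from the logs:]

```c
#include <assert.h>
#include <stdio.h>

/* Percentile latencies (ns) from the runs above:
 * PREEMPT_AUTO preempt=none vs PREEMPT_AUTO preempt=voluntary. */
#define NPCT 6
static const double pct[NPCT]     = { 25.0,  50.0,  75.0,  90.0,  95.4,  99.0 };
static const double none_ns[NPCT] = { 6289, 13945, 25335, 34500, 41470, 85602 };
static const double vol_ns[NPCT]  = { 3665,  5409, 16009, 24392, 28827, 39153 };

/* ratio of preempt=none latency to preempt=voluntary latency */
static double improvement(int i)
{
	return none_ns[i] / vol_ns[i];
}

static void print_table(void)
{
	for (int i = 0; i < NPCT; i++)
		printf("p%05.2f: %6.0f ns -> %6.0f ns (%.2fx)\n",
		       pct[i], none_ns[i], vol_ns[i], improvement(i));
}
```

Every percentile shown improves, roughly 1.4x at p90 up to about 2.6x at the median.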
This configuration also compares quite favourably to voluntary
preemption under PREEMPT_DYNAMIC.
PREEMPT_DYNAMIC, preempt=voluntary
stress-ng: info: [12252] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [12252] dispatching hogs: 1 cyclic
stress-ng: info: [12253] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [12253] cyclic: mean: 19973.46 ns, mode: 3560 ns
stress-ng: info: [12253] cyclic: min: 2541 ns, max: 2751830 ns, std.dev. 68891.71
stress-ng: info: [12253] cyclic: latency percentiles:
stress-ng: info: [12253] cyclic: 25.00%: 4800 ns
stress-ng: info: [12253] cyclic: 50.00%: 12458 ns
stress-ng: info: [12253] cyclic: 75.00%: 25220 ns
stress-ng: info: [12253] cyclic: 90.00%: 35404 ns
stress-ng: info: [12253] cyclic: 95.40%: 43135 ns
stress-ng: info: [12253] cyclic: 99.00%: 61053 ns
stress-ng: info: [12253] cyclic: 99.50%: 98159 ns
stress-ng: info: [12253] cyclic: 99.90%: 1164407 ns
stress-ng: info: [12253] cyclic: 99.99%: 2751830 ns
stress-ng: info: [12252] successful run completed in 300.01s (5 mins, 0.01 secs)
And, as you would expect, we perform identically to preempt=full
with PREEMPT_DYNAMIC (ignoring the outliers at 99.99%).
PREEMPT_DYNAMIC, preempt=full
stress-ng: info: [12206] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [12206] dispatching hogs: 1 cyclic
stress-ng: info: [12207] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [12207] cyclic: mean: 16647.05 ns, mode: 3535 ns
stress-ng: info: [12207] cyclic: min: 2548 ns, max: 4093382 ns, std.dev. 116825.95
stress-ng: info: [12207] cyclic: latency percentiles:
stress-ng: info: [12207] cyclic: 25.00%: 3568 ns
stress-ng: info: [12207] cyclic: 50.00%: 4755 ns
stress-ng: info: [12207] cyclic: 75.00%: 15187 ns
stress-ng: info: [12207] cyclic: 90.00%: 24083 ns
stress-ng: info: [12207] cyclic: 95.40%: 29314 ns
stress-ng: info: [12207] cyclic: 99.00%: 40102 ns
stress-ng: info: [12207] cyclic: 99.50%: 366571 ns
stress-ng: info: [12207] cyclic: 99.90%: 2752037 ns
stress-ng: info: [12207] cyclic: 99.99%: 4093382 ns
stress-ng: info: [12206] successful run completed in 300.01s (5 mins, 0.01 secs)
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/sched/core.c | 12 ++++++++----
kernel/sched/sched.h | 6 ++++++
2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aaa87d5fecdd..aa31f23f43f4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1056,6 +1056,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
 	if (preempt_model_preemptible())
 		return NR_now;
 
+	if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
+		return NR_now;
+
 	if (is_idle_task(curr))
 		return NR_now;
@@ -2297,7 +2300,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
 	if (p->sched_class == rq->curr->sched_class)
 		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
 	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
-		resched_curr(rq);
+		resched_curr_priority(rq);
 
 	/*
 	 * A queue event has occurred, and we're going to schedule.  In
@@ -8974,11 +8977,11 @@ static void __sched_dynamic_update(int mode)
 	case preempt_dynamic_none:
 		if (mode != preempt_dynamic_mode)
 			pr_info("%s: none\n", PREEMPT_MODE);
-		preempt_dynamic_mode = mode;
 		break;
 
 	case preempt_dynamic_voluntary:
-		preempt_dynamic_mode = preempt_dynamic_undefined;
+		if (mode != preempt_dynamic_mode)
+			pr_info("%s: voluntary\n", PREEMPT_MODE);
 		break;
 
 	case preempt_dynamic_full:
@@ -8988,9 +8991,10 @@ static void __sched_dynamic_update(int mode)
 		if (mode != preempt_dynamic_mode)
 			pr_info("%s: full\n", PREEMPT_MODE);
-		preempt_dynamic_mode = mode;
 		break;
 	}
+
+	preempt_dynamic_mode = mode;
 }
 
 #endif /* CONFIG_PREEMPT_AUTO */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7e7acab1022..197c038b87c6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2466,6 +2466,7 @@ enum resched_opt {
 	RESCHED_DEFAULT,
 	RESCHED_FORCE,
 	RESCHED_TICK,
+	RESCHED_PRIORITY,
 };
 
 extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2480,6 +2481,11 @@ static inline void resched_curr_tick(struct rq *rq)
 	__resched_curr(rq, RESCHED_TICK);
 }
 
+static inline void resched_curr_priority(struct rq *rq)
+{
+	__resched_curr(rq, RESCHED_PRIORITY);
+}
+
 extern void resched_cpu(int cpu);
 
 extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1
Hi Ankur,

On Mon, Feb 12, 2024 at 09:55:50PM -0800, Ankur Arora wrote:
> The default preemption policy for voluntary preemption under
> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
> class, and lazily for well-behaved, non-idle tasks.
>
> This is the same policy as preempt=none, with an eager handling of
> higher priority scheduling classes.

AFAICS, the meaning of the word 'voluntary' has changed versus the old
CONFIG_PREEMPT_VOLUNTARY, with this patch.

So the word voluntary does not completely make sense in this context.
What is VOLUNTARY about choosing a higher scheduling class?

For instance, even in the same scheduling class, there is a notion of
higher priority, not just between classes. Example, higher RT priority
within RT, or earlier deadline within EEVDF (formerly CFS).

IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
'voluntary' business because
1. The behavior vs =none is to allow higher scheduling class to preempt,
   it is not about the old voluntary.
2. you are also planning to remove cond_resched()s via this series and
   leave it to the scheduler right?

Or call it preempt=higher, or something? No one is going to understand
the meaning of voluntary the way it is implied here IMHO.

thanks,

 - Joel
Joel Fernandes <joel@joelfernandes.org> writes:
> Hi Ankur,
>
> On Mon, Feb 12, 2024 at 09:55:50PM -0800, Ankur Arora wrote:
>> The default preemption policy for voluntary preemption under
>> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
>> class, and lazily for well-behaved, non-idle tasks.
>>
>> This is the same policy as preempt=none, with an eager handling of
>> higher priority scheduling classes.
>
> AFAICS, the meaning of the word 'voluntary' has changed versus the old
> CONFIG_PREEMPT_VOLUNTARY, with this patch.
>
> So the word voluntary does not completely make sense in this context. What is
> VOLUNTARY about choosing a higher scheduling class?
>
> For instance, even in the same scheduling class, there is a notion of higher
> priority, not just between classes. Example, higher RT priority within RT, or
> earlier deadline within EEVDF (formerly CFS).
Agreed. The higher scheduling class line is pretty fuzzy and, after the
discussion with Juri, almost non-existent:
https://lore.kernel.org/lkml/ZeBPXNFkipU9yytp@localhost.localdomain/.
> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> 'voluntary' business because
> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> is not about the old voluntary.
What do you think about folding the higher scheduling class preemption logic
into preempt=none? As Juri pointed out, prioritization of at least the leftmost
deadline task needs to be done for correctness.
(That'll get rid of the current preempt=voluntary model, at least until
there's a separate use for it.)
> 2. you are also planning to remove cond_resched()s via this series and leave
> it to the scheduler right?
Yeah, under PREEMPT_AUTO, cond_resched() will /almost/ not be there. Gets
defined to:
static inline int _cond_resched(void)
{
	klp_sched_try_switch();
	return 0;
}
Right now, we need cond_resched() to make timely forward progress while
doing live-patching.
> Or call it preempt=higher, or something? No one is going to understand the
> meaning of voluntary the way it is implied here IMHO.
I don't think there's enough to make it worth adding a new model. For
now I'm tending towards moving the correctness parts to preempt=none and
making preempt=voluntary identical to preempt=none.
Thanks for the review.
--
ankur
Hi Ankur,
On 3/5/2024 3:11 AM, Ankur Arora wrote:
>
> Joel Fernandes <joel@joelfernandes.org> writes:
>
[..]
>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> 'voluntary' business because
>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> is not about the old voluntary.
>
> What do you think about folding the higher scheduling class preemption logic
> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> deadline task needs to be done for correctness.
>
> (That'll get rid of the current preempt=voluntary model, at least until
> there's a separate use for it.)
Yes I am all in support for that. It's less confusing for the user as
well, and scheduling higher priority class at the next tick for
preempt=none sounds good to me. That is still an improvement for folks
using SCHED_DEADLINE for whatever reason, with a vanilla
CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode that is more
aggressive, it could be added in the future.
>> 2. you are also planning to remove cond_resched()s via this series and leave
>> it to the scheduler right?
>
> Yeah, under PREEMPT_AUTO, cond_resched() will /almost/ be not there. Gets
> defined to:
>
> static inline int _cond_resched(void)
> {
> 	klp_sched_try_switch();
> 	return 0;
> }
>
> Right now, we need cond_resched() to make timely forward progress while
> doing live-patching.
Cool, got it!
>> Or call it preempt=higher, or something? No one is going to understand the
>> meaning of voluntary the way it is implied here IMHO.
>
> I don't think there's enough to make it worth adding a new model. For
> now I'm tending towards moving the correctness parts to preempt=none and
> making preempt=voluntary identical to preempt=none.
Got it, sounds good.
> Thanks for the review.
Sure! Thanks for this work. Looking forward to the next series,
- Joel
On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
> Hi Ankur,
>
> On 3/5/2024 3:11 AM, Ankur Arora wrote:
> >
> > Joel Fernandes <joel@joelfernandes.org> writes:
> >
> [..]
> >> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> >> 'voluntary' business because
> >> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> >> is not about the old voluntary.
> >
> > What do you think about folding the higher scheduling class preemption logic
> > into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> > deadline task needs to be done for correctness.
> >
> > (That'll get rid of the current preempt=voluntary model, at least until
> > there's a separate use for it.)
>
> Yes I am all in support for that. Its less confusing for the user as well, and
> scheduling higher priority class at the next tick for preempt=none sounds good
> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
> that is more aggressive, it could be added in the future.
This would be something that happens only after removing cond_resched()
might_sleep() functionality from might_sleep(), correct?
Thanx, Paul
> >> 2. you are also planning to remove cond_resched()s via this series and leave
> >> it to the scheduler right?
> >
> > Yeah, under PREEMPT_AUTO, cond_resched() will /almost/ be not there. Gets
> > defined to:
> >
> > static inline int _cond_resched(void)
> > {
> > klp_sched_try_switch();
> > return 0;
> > }
> >
> > Right now, we need cond_resched() to make timely forward progress while
> > doing live-patching.
>
> Cool, got it!
>
> >> Or call it preempt=higher, or something? No one is going to understand the
> >> meaning of voluntary the way it is implied here IMHO.
> >
> > I don't think there's enough to make it worth adding a new model. For
> > now I'm tending towards moving the correctness parts to preempt=none and
> > making preempt=voluntary identical to preempt=none.
>
> Got it, sounds good.
>
> > Thanks for the review.
>
> Sure! Thanks for this work. Looking forward to the next series,
>
> - Joel
>
On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
> On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> Hi Ankur,
>>
>> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>>>
>>> Joel Fernandes <joel@joelfernandes.org> writes:
>>>
>> [..]
>>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>>>> 'voluntary' business because
>>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>>>> is not about the old voluntary.
>>>
>>> What do you think about folding the higher scheduling class preemption logic
>>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>>> deadline task needs to be done for correctness.
>>>
>>> (That'll get rid of the current preempt=voluntary model, at least until
>>> there's a separate use for it.)
>>
>> Yes I am all in support for that. Its less confusing for the user as well, and
>> scheduling higher priority class at the next tick for preempt=none sounds good
>> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> that is more aggressive, it could be added in the future.
>
> This would be something that happens only after removing cond_resched()
> might_sleep() functionality from might_sleep(), correct?

Firstly, maybe I misunderstood Ankur completely. Re-reading his comments
above, he seems to be suggesting preempting instantly for higher
scheduling CLASSES even for preempt=none mode, without having to wait
till the next scheduling-clock interrupt. Not sure if that makes sense
to me, I was asking not to treat "higher class" any differently than
"higher priority" for preempt=none.

And if SCHED_DEADLINE has a problem with that, then it already happens
so with CONFIG_PREEMPT_NONE=y kernels, so no need for special treatment
of higher class any more than the treatment given to higher priority
within the same class. Ankur/Juri?

Re: cond_resched(), I did not follow you Paul, why does removing the
proposed preempt=voluntary mode (i.e. dropping this patch) have to
happen only after cond_resched()/might_sleep() modifications?

thanks,

 - Joel
Joel Fernandes <joel@joelfernandes.org> writes:

> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
[..]
> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> he seems to be suggesting preempting instantly for higher scheduling CLASSES
> even for preempt=none mode, without having to wait till the next
> scheduling-clock interrupt.

Yes, that's what I was suggesting.

> Not sure if that makes sense to me, I was asking not
> to treat "higher class" any differently than "higher priority" for preempt=none.

Ah. Understood.

> And if SCHED_DEADLINE has a problem with that, then it already happens so with
> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> more than the treatment given to higher priority within same class. Ankur/Juri?

No. I think that behaviour might be worse for PREEMPT_AUTO.

PREEMPT_NONE=y (or PREEMPT_VOLUNTARY=y for that matter) don't
quite have a policy around when preemption happens. Preemption
might happen quickly, might happen slowly based on when the next
preemption point is found.

The PREEMPT_AUTO, preempt=none policy in this series will always
cause preemption to be at user exit or the next tick. Seems like
it would be worse for higher scheduling classes more often.

But, I wonder what Juri thinks about this.

--
ankur
On 07/03/24 19:49, Ankur Arora wrote:
> Joel Fernandes <joel@joelfernandes.org> writes:

...

>> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
>> he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> even for preempt=none mode, without having to wait till the next
>> scheduling-clock interrupt.
>
> Yes, that's what I was suggesting.
>
>> Not sure if that makes sense to me, I was asking not
>> to treat "higher class" any differently than "higher priority" for preempt=none.
>
> Ah. Understood.
>
>> And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
>> more than the treatment given to higher priority within same class. Ankur/Juri?
>
> No. I think that behaviour might be worse for PREEMPT_AUTO.
>
> PREEMPT_NONE=y (or PREEMPT_VOLUNTARY=y for that matter) don't
> quite have a policy around when preemption happens. Preemption
> might happen quickly, might happen slowly based on when the next
> preemption point is found.
>
> The PREEMPT_AUTO, preempt=none policy in this series will always
> cause preemption to be at user exit or the next tick. Seems like
> it would be worse for higher scheduling classes more often.
>
> But, I wonder what Juri thinks about this.

As I was saying in my last comment in the other discussion, I'm honestly
not sure, mostly because I currently fail to see what type of users
would choose preempt=none and have tasks scheduled with SCHED_DEADLINE
(please suggest example usecases, as I'm pretty sure I'm missing
something :).

With that said, if the purpose of preempt=none is to have a model which
is super conservative wrt preemptions, having to wait one tick to
possibly schedule a DEADLINE task still seems kind of broken for
DEADLINE, but at least is predictably broken (guess one needs to account
for that somehow when coming up with parameters :).

Thanks,
Juri
Juri Lelli <juri.lelli@redhat.com> writes:

> On 07/03/24 19:49, Ankur Arora wrote:
[..]
> As I was saying in my last comment in the other discussion, I'm honestly
> not sure, mostly because I currently fail to see what type of users
> would choose preempt=none and have tasks scheduled with SCHED_DEADLINE
> (please suggest example usecases, as I'm pretty sure I'm missing
> something :).
>
> With that said, if the purpose of preempt=none is to have a model which
> is super conservative wrt preemptions, having to wait one tick to
> possibly schedule a DEADLINE task still seems kind of broken for
> DEADLINE, but at least is predictably broken (guess one needs to account
> for that somehow when coming up with parameters :).

True :). Let me do some performance comparisons between (preempt=none +
the leftmost logic) and whatever is left off in the preempt=voluntary
patch.

Thanks

--
ankur
On Thu, Mar 7, 2024 at 10:49 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> The PREEMPT_AUTO, preempt=none policy in this series will always
> cause preemption to be at user exit or the next tick. Seems like
> it would be worse for higher scheduling classes more often.

Ok, sounds good. I believe we are on the same page then. Thanks.
On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
[..]
> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> he seems to be suggesting preempting instantly for higher scheduling CLASSES
> even for preempt=none mode, without having to wait till the next
> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
> to treat "higher class" any differently than "higher priority" for preempt=none.
>
> And if SCHED_DEADLINE has a problem with that, then it already happens so with
> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> more than the treatment given to higher priority within same class. Ankur/Juri?
>
> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
> cond_resched()/might_sleep() modifications?

Because right now, one large difference between CONFIG_PREEMPT_NONE
and CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
preemption point, but not for the former.

But if might_sleep() becomes debug-only, then there will no longer be
this difference.

							Thanx, Paul
Paul E. McKenney <paulmck@kernel.org> writes:
> On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
>>
>>
>> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> >> Hi Ankur,
>> >>
>> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>> >>>
>> >>> Joel Fernandes <joel@joelfernandes.org> writes:
>> >>>
>> >> [..]
>> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> >>>> 'voluntary' business because
>> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> >>>> is not about the old voluntary.
>> >>>
>> >>> What do you think about folding the higher scheduling class preemption logic
>> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>> >>> deadline task needs to be done for correctness.
>> >>>
>> >>> (That'll get rid of the current preempt=voluntary model, at least until
>> >>> there's a separate use for it.)
>> >>
>> >> Yes I am all in support for that. Its less confusing for the user as well, and
>> >> scheduling higher priority class at the next tick for preempt=none sounds good
>> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> >> that is more aggressive, it could be added in the future.
>> >
>> > This would be something that happens only after removing cond_resched()
>> > might_sleep() functionality from might_sleep(), correct?
>>
>> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
>> he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> even for preempt=none mode, without having to wait till the next
>> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
>> to treat "higher class" any differently than "higher priority" for preempt=none.
>>
>> And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
>> more than the treatment given to higher priority within same class. Ankur/Juri?
>>
>> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
>> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
>> cond_resched()/might_sleep() modifications?
>
> Because right now, one large difference between CONFIG_PREEMPT_NONE
> and CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
> preemption point, but not for the former.
True. But, there is no difference between either of those with
PREEMPT_AUTO=y (at least right now).
For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
might_sleep() is:
# define might_resched() do { } while (0)
# define might_sleep() \
do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
DEBUG_ATOMIC_SLEEP=y):
static inline int _cond_resched(void)
{
	klp_sched_try_switch();
	return 0;
}

#define cond_resched() ({				\
	__might_resched(__FILE__, __LINE__, 0);		\
	_cond_resched();				\
})
And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
Thanks
Ankur
> But if might_sleep() becomes debug-only, then there will no longer be
> this difference.
On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
>
> Paul E. McKenney <paulmck@kernel.org> writes:
>
> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
> >>
> >>
> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
> >> >> Hi Ankur,
> >> >>
> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
> >> >>>
> >> >>> Joel Fernandes <joel@joelfernandes.org> writes:
> >> >>>
> >> >> [..]
> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> >> >>>> 'voluntary' business because
> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> >> >>>> is not about the old voluntary.
> >> >>>
> >> >>> What do you think about folding the higher scheduling class preemption logic
> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> >> >>> deadline task needs to be done for correctness.
> >> >>>
> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
> >> >>> there's a separate use for it.)
> >> >>
> >> >> Yes I am all in support for that. Its less confusing for the user as well, and
> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
> >> >> that is more aggressive, it could be added in the future.
> >> >
> >> > This would be something that happens only after removing cond_resched()
> >> > might_sleep() functionality from might_sleep(), correct?
> >>
> >> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
> >> even for preempt=none mode, without having to wait till the next
> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
> >> to treat "higher class" any differently than "higher priority" for preempt=none.
> >>
> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
> >> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> >> more than the treatment given to higher priority within same class. Ankur/Juri?
> >>
> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
> >> cond_resched()/might_sleep() modifications?
> >
> > Because right now, one large difference between CONFIG_PREEMPT_NONE
> > and CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
> > preemption point, but not for the former.
>
> True. But, there is no difference between either of those with
> PREEMPT_AUTO=y (at least right now).
>
> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
> might_sleep() is:
>
> # define might_resched() do { } while (0)
> # define might_sleep() \
> do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
>
> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
> DEBUG_ATOMIC_SLEEP=y):
>
> static inline int _cond_resched(void)
> {
> klp_sched_try_switch();
> return 0;
> }
> #define cond_resched() ({ \
> __might_resched(__FILE__, __LINE__, 0); \
> _cond_resched(); \
> })
>
> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
As long as it is easy to restore the prior cond_resched() functionality
for testing in the meantime, I should be OK. For example, it would
be great to have the commit removing the old functionality from
cond_resched() at the end of the series,
Thanx, Paul
Paul E. McKenney <paulmck@kernel.org> writes:
> On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
>>
>> Paul E. McKenney <paulmck@kernel.org> writes:
>>
>> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
>> >>
>> >>
>> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> >> >> Hi Ankur,
>> >> >>
>> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>> >> >>>
>> >> >>> Joel Fernandes <joel@joelfernandes.org> writes:
>> >> >>>
>> >> >> [..]
>> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> >> >>>> 'voluntary' business because
>> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> >> >>>> is not about the old voluntary.
>> >> >>>
>> >> >>> What do you think about folding the higher scheduling class preemption logic
>> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>> >> >>> deadline task needs to be done for correctness.
>> >> >>>
>> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
>> >> >>> there's a separate use for it.)
>> >> >>
>> >> >> Yes I am all in support for that. It's less confusing for the user as well, and
>> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
>> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> >> >> that is more aggressive, it could be added in the future.
>> >> >
>> >> > This would be something that happens only after removing the
>> >> > cond_resched() functionality from might_sleep(), correct?
>> >>
>> >> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
>> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> >> even for preempt=none mode, without having to wait till the next
>> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
>> >> to treat "higher class" any differently than "higher priority" for preempt=none.
>> >>
>> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> >> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
>> >> more than the treatment given to higher priority within same class. Ankur/Juri?
>> >>
>> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
>> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
>> >> cond_resched()/might_sleep() modifications?
>> >
>> > Because right now, one large difference between CONFIG_PREEMPT_NONE
>> > and CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
>> > preemption point, but not for the former.
>>
>> True. But, there is no difference between either of those with
>> PREEMPT_AUTO=y (at least right now).
>>
>> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
>> might_sleep() is:
>>
>> # define might_resched() do { } while (0)
>> # define might_sleep() \
>> do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
>>
>> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
>> DEBUG_ATOMIC_SLEEP=y):
>>
>> static inline int _cond_resched(void)
>> {
>> klp_sched_try_switch();
>> return 0;
>> }
>> #define cond_resched() ({ \
>> __might_resched(__FILE__, __LINE__, 0); \
>> _cond_resched(); \
>> })
>>
>> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
>
> As long as it is easy to restore the prior cond_resched() functionality
> for testing in the meantime, I should be OK. For example, it would
> be great to have the commit removing the old functionality from
> cond_resched() at the end of the series,
I would, of course, be happy to make any changes that helps testing,
but I think I'm missing something that you are saying wrt
cond_resched()/might_sleep().
There's no commit explicitly removing the core cond_resched()
functionality: PREEMPT_AUTO explicitly selects PREEMPT_BUILD and
deselects PREEMPT_{NONE,VOLUNTARY}_BUILD.
(That's patch-1 "preempt: introduce CONFIG_PREEMPT_AUTO".)
For the rest it just piggybacks on the CONFIG_PREEMPT_DYNAMIC work,
keying off (!CONFIG_PREEMPT_DYNAMIC && CONFIG_PREEMPTION):
#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
/* ... */
#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
/* ... */
#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
/* ... */
#else /* !CONFIG_PREEMPTION */
/* ... */
#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
static inline int _cond_resched(void)
{
	klp_sched_try_switch();
	return 0;
}
#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
Same for might_sleep() (which really amounts to might_resched()):
#ifdef CONFIG_PREEMPT_VOLUNTARY_BUILD
/* ... */
#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
/* ... */
#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
/* ... */
#else
# define might_resched() do { } while (0)
#endif /* CONFIG_PREEMPT_* */
But, I doubt that I'm telling you anything new. So, what am I missing?
--
ankur
On Sun, Mar 10, 2024 at 09:50:33PM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <paulmck@kernel.org> writes:
>
> > On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
> >>
> >> Paul E. McKenney <paulmck@kernel.org> writes:
> >>
> >> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
> >> >>
> >> >>
> >> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
> >> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
> >> >> >> Hi Ankur,
> >> >> >>
> >> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
> >> >> >>>
> >> >> >>> Joel Fernandes <joel@joelfernandes.org> writes:
> >> >> >>>
> >> >> >> [..]
> >> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> >> >> >>>> 'voluntary' business because
> >> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> >> >> >>>> is not about the old voluntary.
> >> >> >>>
> >> >> >>> What do you think about folding the higher scheduling class preemption logic
> >> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> >> >> >>> deadline task needs to be done for correctness.
> >> >> >>>
> >> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
> >> >> >>> there's a separate use for it.)
> >> >> >>
> >> >> >> Yes I am all in support for that. It's less confusing for the user as well, and
> >> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
> >> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
> >> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
> >> >> >> that is more aggressive, it could be added in the future.
> >> >> >
> >> >> > This would be something that happens only after removing the
> >> >> > cond_resched() functionality from might_sleep(), correct?
> >> >>
> >> >> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> >> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
> >> >> even for preempt=none mode, without having to wait till the next
> >> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
> >> >> to treat "higher class" any differently than "higher priority" for preempt=none.
> >> >>
> >> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
> >> >> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> >> >> more than the treatment given to higher priority within same class. Ankur/Juri?
> >> >>
> >> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
> >> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
> >> >> cond_resched()/might_sleep() modifications?
> >> >
> >> > Because right now, one large difference between CONFIG_PREEMPT_NONE
> >> > and CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
> >> > preemption point, but not for the former.
> >>
> >> True. But, there is no difference between either of those with
> >> PREEMPT_AUTO=y (at least right now).
> >>
> >> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
> >> might_sleep() is:
> >>
> >> # define might_resched() do { } while (0)
> >> # define might_sleep() \
> >> do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
> >>
> >> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
> >> DEBUG_ATOMIC_SLEEP=y):
> >>
> >> static inline int _cond_resched(void)
> >> {
> >> klp_sched_try_switch();
> >> return 0;
> >> }
> >> #define cond_resched() ({ \
> >> __might_resched(__FILE__, __LINE__, 0); \
> >> _cond_resched(); \
> >> })
> >>
> >> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
> >
> > As long as it is easy to restore the prior cond_resched() functionality
> > for testing in the meantime, I should be OK. For example, it would
> > be great to have the commit removing the old functionality from
> > cond_resched() at the end of the series,
>
> I would, of course, be happy to make any changes that helps testing,
> but I think I'm missing something that you are saying wrt
> cond_resched()/might_sleep().
>
> There's no commit explicitly removing the core cond_resched()
> functionality: PREEMPT_AUTO explicitly selects PREEMPT_BUILD and
> deselects PREEMPT_{NONE,VOLUNTARY}_BUILD.
> (That's patch-1 "preempt: introduce CONFIG_PREEMPT_AUTO".)
>
> For the rest it just piggybacks on the CONFIG_PREEMPT_DYNAMIC work
> and just piggybacks on (!CONFIG_PREEMPT_DYNAMIC && CONFIG_PREEMPTION):
>
> #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
> /* ... */
> #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> /* ... */
> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> /* ... */
> #else /* !CONFIG_PREEMPTION */
> /* ... */
> #endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>
> #else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
> static inline int _cond_resched(void)
> {
> klp_sched_try_switch();
> return 0;
> }
> #endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
>
> Same for might_sleep() (which really amounts to might_resched()):
>
> #ifdef CONFIG_PREEMPT_VOLUNTARY_BUILD
> /* ... */
> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> /* ... */
> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> /* ... */
> #else
> # define might_resched() do { } while (0)
> #endif /* CONFIG_PREEMPT_* */
>
> But, I doubt that I'm telling you anything new. So, what am I missing?
It is really a choice at your end.
Suppose we enable CONFIG_PREEMPT_AUTO on our fleet, and find that there
was some small set of cond_resched() calls that provided sub-jiffy
preemption that matters to some of our workloads. At that point, what
are our options?
1. Revert CONFIG_PREEMPT_AUTO.
2. Revert only the part that disables the voluntary preemption
semantics of cond_resched(). Which, as you point out, ends up
being the same as #1 above.
3. Hotwire a voluntary preemption into the required locations.
Which we would avoid doing due to upstream-acceptance concerns.
So, how easy would you like to make it for us to use as much of
CONFIG_PREEMPT_AUTO=y under various possible problem scenarios?
Yes, in a perfect world, we would have tested this already, but I
am still chasing down problems induced by simple rcutorture testing.
Cowardly of us, isn't it? ;-)
Thanx, Paul
Paul E. McKenney <paulmck@kernel.org> writes:
> On Sun, Mar 10, 2024 at 09:50:33PM -0700, Ankur Arora wrote:
>>
>> Paul E. McKenney <paulmck@kernel.org> writes:
>>
>> > On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
>> >>
>> >> Paul E. McKenney <paulmck@kernel.org> writes:
>> >>
>> >> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
>> >> >>
>> >> >>
>> >> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> >> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> >> >> >> Hi Ankur,
>> >> >> >>
>> >> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>> >> >> >>>
>> >> >> >>> Joel Fernandes <joel@joelfernandes.org> writes:
>> >> >> >>>
>> >> >> >> [..]
>> >> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> >> >> >>>> 'voluntary' business because
>> >> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> >> >> >>>> is not about the old voluntary.
>> >> >> >>>
>> >> >> >>> What do you think about folding the higher scheduling class preemption logic
>> >> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>> >> >> >>> deadline task needs to be done for correctness.
>> >> >> >>>
>> >> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
>> >> >> >>> there's a separate use for it.)
>> >> >> >>
>> >> >> >> Yes I am all in support for that. It's less confusing for the user as well, and
>> >> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
>> >> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> >> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> >> >> >> that is more aggressive, it could be added in the future.
>> >> >> >
>> >> >> > This would be something that happens only after removing the
>> >> >> > cond_resched() functionality from might_sleep(), correct?
>> >> >>
>> >> >> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
>> >> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> >> >> even for preempt=none mode, without having to wait till the next
>> >> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
>> >> >> to treat "higher class" any differently than "higher priority" for preempt=none.
>> >> >>
>> >> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> >> >> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
>> >> >> more than the treatment given to higher priority within same class. Ankur/Juri?
>> >> >>
>> >> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
>> >> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
>> >> >> cond_resched()/might_sleep() modifications?
>> >> >
>> >> > Because right now, one large difference between CONFIG_PREEMPT_NONE
>> >> > and CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
>> >> > preemption point, but not for the former.
>> >>
>> >> True. But, there is no difference between either of those with
>> >> PREEMPT_AUTO=y (at least right now).
>> >>
>> >> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
>> >> might_sleep() is:
>> >>
>> >> # define might_resched() do { } while (0)
>> >> # define might_sleep() \
>> >> do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
>> >>
>> >> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
>> >> DEBUG_ATOMIC_SLEEP=y):
>> >>
>> >> static inline int _cond_resched(void)
>> >> {
>> >> klp_sched_try_switch();
>> >> return 0;
>> >> }
>> >> #define cond_resched() ({ \
>> >> __might_resched(__FILE__, __LINE__, 0); \
>> >> _cond_resched(); \
>> >> })
>> >>
>> >> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
>> >
>> > As long as it is easy to restore the prior cond_resched() functionality
>> > for testing in the meantime, I should be OK. For example, it would
>> > be great to have the commit removing the old functionality from
>> > cond_resched() at the end of the series,
>>
>> I would, of course, be happy to make any changes that helps testing,
>> but I think I'm missing something that you are saying wrt
>> cond_resched()/might_sleep().
>>
>> There's no commit explicitly removing the core cond_resched()
>> functionality: PREEMPT_AUTO explicitly selects PREEMPT_BUILD and
>> deselects PREEMPT_{NONE,VOLUNTARY}_BUILD.
>> (That's patch-1 "preempt: introduce CONFIG_PREEMPT_AUTO".)
>>
>> For the rest it just piggybacks on the CONFIG_PREEMPT_DYNAMIC work
>> and just piggybacks on (!CONFIG_PREEMPT_DYNAMIC && CONFIG_PREEMPTION):
>>
>> #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
>> /* ... */
>> #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
>> /* ... */
>> #else /* !CONFIG_PREEMPTION */
>> /* ... */
>> #endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>>
>> #else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
>> static inline int _cond_resched(void)
>> {
>> klp_sched_try_switch();
>> return 0;
>> }
>> #endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
>>
>> Same for might_sleep() (which really amounts to might_resched()):
>>
>> #ifdef CONFIG_PREEMPT_VOLUNTARY_BUILD
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
>> /* ... */
>> #else
>> # define might_resched() do { } while (0)
>> #endif /* CONFIG_PREEMPT_* */
>>
>> But, I doubt that I'm telling you anything new. So, what am I missing?
>
> It is really a choice at your end.
>
> Suppose we enable CONFIG_PREEMPT_AUTO on our fleet, and find that there
> was some small set of cond_resched() calls that provided sub-jiffy
> preemption that matters to some of our workloads. At that point, what
> are our options?
>
> 1. Revert CONFIG_PREEMPT_AUTO.
>
> 2. Revert only the part that disables the voluntary preemption
> semantics of cond_resched(). Which, as you point out, ends up
> being the same as #1 above.
>
> 3. Hotwire a voluntary preemption into the required locations.
> Which we would avoid doing due to upstream-acceptance concerns.
>
> So, how easy would you like to make it for us to use as much of
> CONFIG_PREEMPT_AUTO=y under various possible problem scenarios?
Ah, I see your point. Basically, keep the lazy semantics but -- in
addition -- also provide the ability to dynamically toggle
cond_resched(), might_resched() as a feature to help move this along
further.
So, as I mentioned earlier, the callsites are already present, and
removing them needs work (with livepatch and more generally to ensure
PREEMPT_AUTO is good enough for the current PREEMPT_* scenarios so
we can ditch cond_resched()).
I honestly don't see any reason not to do this -- I would prefer
this be a temporary thing to help beat PREEMPT_AUTO into shape. And,
this provides an insurance policy for using PREEMPT_AUTO.
That said, I would like Thomas' opinion on this.
> 3. Hotwire a voluntary preemption into the required locations.
> Which we would avoid doing due to upstream-acceptance concerns.
Apropos of this, how would you determine which are the locations
where we specifically need voluntary preemption?
> Yes, in a perfect world, we would have tested this already, but I
> am still chasing down problems induced by simple rcutorture testing.
> Cowardly of us, isn't it? ;-)
Cowards are us :).
--
ankur
On Mon, 11 Mar 2024 at 13:10, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Ah, I see your point. Basically, keep the lazy semantics but -- in
> addition -- also provide the ability to dynamically toggle
> cond_resched(), might_resched() as a feature to help move this along
> further.
Please, let's not make up any random hypotheticals.
Honestly, if we ever hit the hypothetical scenario that Paul outlined, let's
(a) deal with it THEN, when we actually know what the situation is
(b) learn and document what it is that actually causes the odd behavior
IOW, instead of assuming that some "cond_resched()" case would even be
the right thing to do, maybe there are other issues going on? Let's
not paper over them by keeping some hack around - and *if* some
cond_resched() model is actually the right model in some individual
place, let's make it the rule that *when* we hit that case, we
document it.
And we should absolutely not have some hypothetical case keep us from
just doing the right thing and getting rid of the existing
cond_resched().
Because any potential future case is *not* going to be the same
cond_resched() that the current case is anyway. It is going to have
some very different cause.
Linus
On Mon, Mar 11, 2024 at 01:23:09PM -0700, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 13:10, Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > Ah, I see your point. Basically, keep the lazy semantics but -- in
> > addition -- also provide the ability to dynamically toggle
> > cond_resched(), might_resched() as a feature to help move this along
> > further.
>
> Please, let's not make up any random hypotheticals.
>
> Honestly, if we ever hit the hypothetical scenario that Paul outlined, let's
>
>  (a) deal with it THEN, when we actually know what the situation is
>
>  (b) learn and document what it is that actually causes the odd behavior
>
> IOW, instead of assuming that some "cond_resched()" case would even be
> the right thing to do, maybe there are other issues going on? Let's
> not paper over them by keeping some hack around - and *if* some
> cond_resched() model is actually the right model in some individual
> place, let's make it the rule that *when* we hit that case, we
> document it.
>
> And we should absolutely not have some hypothetical case keep us from
> just doing the right thing and getting rid of the existing
> cond_resched().
>
> Because any potential future case is *not* going to be the same
> cond_resched() that the current case is anyway. It is going to have
> some very different cause.

Fair enough, and that approach just means that we will be reaching out
to Ankur and Thomas sooner rather than later if something goes sideways
latency-wise. ;-)

Thanx, Paul
On Mon, Mar 11 2024 at 17:03, Paul E. McKenney wrote:
> On Mon, Mar 11, 2024 at 01:23:09PM -0700, Linus Torvalds wrote:
>> Because any potential future case is *not* going to be the same
>> cond_resched() that the current case is anyway. It is going to have
>> some very different cause.
>
> Fair enough, and that approach just means that we will be reaching out
> to Ankur and Thomas sooner rather than later if something goes sideways
> latency-wise. ;-)

You can't make an omelette without breaking eggs.
On Tue, Mar 12, 2024 at 01:14:37PM +0100, Thomas Gleixner wrote:
> On Mon, Mar 11 2024 at 17:03, Paul E. McKenney wrote:
> > On Mon, Mar 11, 2024 at 01:23:09PM -0700, Linus Torvalds wrote:
> >> Because any potential future case is *not* going to be the same
> >> cond_resched() that the current case is anyway. It is going to have
> >> some very different cause.
> >
> > Fair enough, and that approach just means that we will be reaching out
> > to Ankur and Thomas sooner rather than later if something goes sideways
> > latency-wise. ;-)
>
> You can't make an omelette without breaking eggs.

Precisely! ;-)

Thanx, Paul
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 11 Mar 2024 at 13:10, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> Ah, I see your point. Basically, keep the lazy semantics but -- in
>> addition -- also provide the ability to dynamically toggle
>> cond_resched(), might_resched() as a feature to help move this along
>> further.
>
> Please, let's not make up any random hypotheticals.
>
> Honestly, if we ever hit the hypothetical scenario that Paul outlined, let's
>
>  (a) deal with it THEN, when we actually know what the situation is
>
>  (b) learn and document what it is that actually causes the odd behavior
>
> IOW, instead of assuming that some "cond_resched()" case would even be
> the right thing to do, maybe there are other issues going on? Let's
> not paper over them by keeping some hack around - and *if* some
> cond_resched() model is actually the right model in some individual
> place, let's make it the rule that *when* we hit that case, we
> document it.
>
> And we should absolutely not have some hypothetical case keep us from
> just doing the right thing and getting rid of the existing
> cond_resched().
>
> Because any potential future case is *not* going to be the same
> cond_resched() that the current case is anyway. It is going to have
> some very different cause.

Ack that. And, thanks that makes sense to me.

--
ankur