[PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

Ankur Arora posted 9 patches 2 years, 3 months ago
[PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
On preempt_model_none() or preempt_model_voluntary() configurations
rescheduling of kernel threads happens only when they allow it, and
only at explicit preemption points, via calls to cond_resched() or
similar.

That leaves out contexts where it is not convenient to periodically
call cond_resched() -- for instance when executing a potentially long
running primitive (such as REP; STOSB.)

This means that we either suffer high scheduling latency or avoid
certain constructs.

Define TIF_ALLOW_RESCHED to demarcate such sections.
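
For example (illustrative only; the clearing primitive below stands in
for any long running operation and is not part of this patch):

	allow_resched();
	/* e.g. one long REP; STOSB instead of chunking + cond_resched() */
	clear_pages(addr, npages);
	disallow_resched();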

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/thread_info.h |  2 ++
 include/linux/sched.h              | 30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index d63b02940747..fc6f4121b412 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -100,6 +100,7 @@ struct thread_info {
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
 #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
+#define TIF_RESCHED_ALLOW	30	/* reschedule if needed */
 
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
@@ -122,6 +123,7 @@ struct thread_info {
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
+#define _TIF_RESCHED_ALLOW	(1 << TIF_RESCHED_ALLOW)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 177b3f3676ef..4dd3d91d990f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2245,6 +2245,36 @@ static __always_inline bool need_resched(void)
 	return unlikely(tif_need_resched());
 }
 
+#ifdef TIF_RESCHED_ALLOW
+/*
+ * allow_resched() .. disallow_resched() demarcate a preemptible section.
+ *
+ * Used around primitives where it might not be convenient to periodically
+ * call cond_resched().
+ */
+static inline void allow_resched(void)
+{
+	might_sleep();
+	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+}
+
+static inline void disallow_resched(void)
+{
+	clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
+}
+
+static __always_inline bool resched_allowed(void)
+{
+	return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
+}
+
+#else
+static __always_inline bool resched_allowed(void)
+{
+	return false;
+}
+#endif /* TIF_RESCHED_ALLOW */
+
 /*
  * Wrappers for p->thread_info->cpu access. No-op on UP.
  */
-- 
2.31.1
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Andy Lutomirski 2 years, 3 months ago
On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:
> On preempt_model_none() or preempt_model_voluntary() configurations
> rescheduling of kernel threads happens only when they allow it, and
> only at explicit preemption points, via calls to cond_resched() or
> similar.
>
> That leaves out contexts where it is not convenient to periodically
> call cond_resched() -- for instance when executing a potentially long
> running primitive (such as REP; STOSB.)
>

So I said this not too long ago in the context of Xen PV, but maybe it's time to ask it in general:

Why do we support anything other than full preempt?  I can think of two reasons, neither of which I think is very good:

1. Once upon a time, tracking preempt state was expensive.  But we fixed that.

2. Folklore suggests that there's a latency vs throughput tradeoff, and serious workloads, for some definition of serious, want throughput, so they should run without full preemption.

I think #2 is a bit silly.  If you want throughput, and you're busy waiting for a CPU that wants to run you but isn't, because it's running some low-priority non-preemptible thing (because preempt is set to none or voluntary), you're not getting throughput.  If you want to keep some I/O resource busy to get throughput, but you have excessive latency getting scheduled, you don't get throughput.

If the actual problem is that there's a workload that performs better when scheduling is delayed (which preempt=none and preempt=voluntary do, essentially at random), then maybe someone should identify that workload and fix the scheduler.

So maybe we should just very strongly encourage everyone to run with full preempt and simplify the kernel?
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Thomas Gleixner 2 years, 3 months ago
On Mon, Sep 18 2023 at 20:21, Andy Lutomirski wrote:
> On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:

> Why do we support anything other than full preempt?  I can think of
> two reasons, neither of which I think is very good:
>
> 1. Once upon a time, tracking preempt state was expensive.  But we fixed that.
>
> 2. Folklore suggests that there's a latency vs throughput tradeoff,
>    and serious workloads, for some definition of serious, want
>    throughput, so they should run without full preemption.

It's absolutely not folklore. Run to completion has well known
benefits as it avoids contention and avoids the overhead of scheduling
in a large number of scenarios.

We've seen that painfully in PREEMPT_RT before we came up with the
concept of lazy preemption for throughput oriented tasks.

Thanks,

        tglx
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ingo Molnar 2 years, 3 months ago
* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Mon, Sep 18 2023 at 20:21, Andy Lutomirski wrote:
> > On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote:
> 
> > Why do we support anything other than full preempt?  I can think of
> > two reasons, neither of which I think is very good:
> >
> > 1. Once upon a time, tracking preempt state was expensive.  But we fixed that.
> >
> > 2. Folklore suggests that there's a latency vs throughput tradeoff,
> >    and serious workloads, for some definition of serious, want
> >    throughput, so they should run without full preemption.
> 
> It's absolutely not folklore. Run to completion is has well known 
> benefits as it avoids contention and avoids the overhead of scheduling 
> for a large amount of scenarios.
> 
> We've seen that painfully in PREEMPT_RT before we came up with the 
> concept of lazy preemption for throughput oriented tasks.

Yeah, for a large majority of workloads a reduction in preemption increases
batching and improves cache locality. Most scalability-conscious enterprise
users want longer timeslices & better cache locality, not shorter
timeslices with spread out cache use.

There's microbenchmarks that fit mostly in cache that benefit if work is 
immediately processed by freshly woken tasks - but that's not true for most 
workloads with a substantial real-life cache footprint.

Thanks,

	Ingo
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Peter Zijlstra 2 years, 3 months ago
On Wed, Aug 30, 2023 at 11:49:56AM -0700, Ankur Arora wrote:

> +#ifdef TIF_RESCHED_ALLOW
> +/*
> + * allow_resched() .. disallow_resched() demarcate a preemptible section.
> + *
> + * Used around primitives where it might not be convenient to periodically
> + * call cond_resched().
> + */
> +static inline void allow_resched(void)
> +{
> +	might_sleep();
> +	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);

So the might_sleep() ensures we're not currently having preemption
disabled; but there's nothing that ensures we don't do stupid things
like:

	allow_resched();
	spin_lock();
	...
	spin_unlock();
	disallow_resched();

Which on a PREEMPT_COUNT=n build will cause preemption while holding the
spinlock. I think something like the below will cause sufficient
warnings to avoid growing patterns like that.


Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -5834,6 +5834,13 @@ void preempt_count_add(int val)
 {
 #ifdef CONFIG_DEBUG_PREEMPT
 	/*
+	 * Disabling preemption under TIF_RESCHED_ALLOW doesn't
+	 * work for PREEMPT_COUNT=n builds.
+	 */
+	if (WARN_ON(resched_allowed()))
+		return;
+
+	/*
 	 * Underflow?
 	 */
 	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Aug 30, 2023 at 11:49:56AM -0700, Ankur Arora wrote:
>
>> +#ifdef TIF_RESCHED_ALLOW
>> +/*
>> + * allow_resched() .. disallow_resched() demarcate a preemptible section.
>> + *
>> + * Used around primitives where it might not be convenient to periodically
>> + * call cond_resched().
>> + */
>> +static inline void allow_resched(void)
>> +{
>> +	might_sleep();
>> +	set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
>
> So the might_sleep() ensures we're not currently having preemption
> disabled; but there's nothing that ensures we don't do stupid things
> like:
>
> 	allow_resched();
> 	spin_lock();
> 	...
> 	spin_unlock();
> 	disallow_resched();
>
> Which on a PREEMPT_COUNT=n build will cause preemption while holding the
> spinlock. I think something like the below will cause sufficient
> warnings to avoid growing patterns like that.

Yeah, I agree this is a problem. I'll expand on the comment above
allow_resched() detailing this scenario.

> Index: linux-2.6/kernel/sched/core.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/core.c
> +++ linux-2.6/kernel/sched/core.c
> @@ -5834,6 +5834,13 @@ void preempt_count_add(int val)
>  {
>  #ifdef CONFIG_DEBUG_PREEMPT
>  	/*
> +	 * Disabling preemption under TIF_RESCHED_ALLOW doesn't
> +	 * work for PREEMPT_COUNT=n builds.
> +	 */
> +	if (WARN_ON(resched_allowed()))
> +		return;
> +
> +	/*
>  	 * Underflow?
>  	 */
>  	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))

And, maybe something like this to guard against __this_cpu_read()
etc:

diff --git a/lib/smp_processor_id.c b/lib/smp_processor_id.c
index a2bb7738c373..634788f16e9e 100644
--- a/lib/smp_processor_id.c
+++ b/lib/smp_processor_id.c
@@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
 {
        int this_cpu = raw_smp_processor_id();

+       if (unlikely(resched_allowed()))
+               goto out_error;
+
        if (likely(preempt_count()))
                goto out;

@@ -33,6 +36,7 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
        if (system_state < SYSTEM_SCHEDULING)
                goto out;

+out_error:
        /*
         * Avoid recursion:
         */

--
ankur
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Sat, 9 Sept 2023 at 13:16, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> > +     if (WARN_ON(resched_allowed()))
> > +             return;
>
> And, maybe something like this to guard against __this_cpu_read()
> etc:
>
> +++ b/lib/smp_processor_id.c
> @@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
>  {
>         int this_cpu = raw_smp_processor_id();
>
> +       if (unlikely(resched_allowed()))
> +               goto out_error;

Again, both of those checks are WRONG.

They'll error out even in exceptions / interrupts, when we have a
preempt count already from the exception itself.

So testing "resched_allowed()" that only tests the TIF_RESCHED_ALLOW
bit is wrong, wrong, wrong.

These situations aren't errors if we already had a preemption count
for other reasons. Only trying to disable preemption when in process
context (while TIF_RESCHED_ALLOW is set) is a problem. Your patch is missing
the check for the "are we in a process context" part.

                Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, 9 Sept 2023 at 13:16, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> > +     if (WARN_ON(resched_allowed()))
>> > +             return;
>>
>> And, maybe something like this to guard against __this_cpu_read()
>> etc:
>>
>> +++ b/lib/smp_processor_id.c
>> @@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2)
>>  {
>>         int this_cpu = raw_smp_processor_id();
>>
>> +       if (unlikely(resched_allowed()))
>> +               goto out_error;
>
> Again, both of those checks are WRONG.
>
> They'll error out even in exceptions / interrupts, when we have a
> preempt count already from the exception itself.
>
> So testing "resched_allowed()" that only tests the TIF_RESCHED_ALLOW
> bit is wrong, wrong, wrong.

Yeah, you are right.

I think we can keep these checks, but with this fixed definition of
resched_allowed(). This might be better:

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)

 static __always_inline bool resched_allowed(void)
 {
-       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
+       return unlikely(!preempt_count() &&
+                        test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
 }

Ankur

> These situations aren't errors if we already had a preemption count
> for other reasons. Only trying to disable preemption when in process
> context (while TIF_RESCHED_ALLOW) is a problem. Your patch is missing
> the check for "are we in a process context" part.
>
>                 Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> I think we can keep these checks, but with this fixed definition of
> resched_allowed(). This might be better:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
>
>  static __always_inline bool resched_allowed(void)
>  {
> -       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> +       return unlikely(!preempt_count() &&
> +                        test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>  }

I'm not convinced (at all) that the preempt count is the right thing.

It works for interrupts, yes, because interrupts will increment the
preempt count even on non-preempt kernels (since the preempt count is
also the interrupt context level).

But what about any synchronous trap handling?

In other words, just something like a page fault? A page fault doesn't
increment the preemption count (and in fact many page faults _will_
obviously re-schedule as part of waiting for IO).

A page fault can *itself* say "feel free to preempt me", and that's one thing.

But a page fault can also *interrupt* something that said "feel free to
preempt me", and that's a completely *different* thing.

So I feel like the "tsk_thread_flag" was sadly completely the wrong
place to add this bit to, and the wrong place to test it in. What we
really want is "current kernel entry context".

So the right thing to do would basically be to put it in the stack
frame at kernel entry - whether that kernel entry was a system call
(which is doing some big copy that should be preemptible without us
having to add "cond_resched()" in places), or is a page fault (which
will also do things like big page clearings for hugepages).

And we don't have that, do we?

We have "on_thread_stack()", which checks for "are we on the system
call stack". But that doesn't work for page faults.

PeterZ - I feel like I might be either very confused, or missing
something. You probably go "Duh, Linus, you're off on one of your crazy
tangents, and none of this is relevant because..."

                 Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Steven Rostedt 2 years, 3 months ago
On Sat, 9 Sep 2023 21:35:54 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >
> > I think we can keep these checks, but with this fixed definition of
> > resched_allowed(). This might be better:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
> >
> >  static __always_inline bool resched_allowed(void)
> >  {
> > -       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> > +       return unlikely(!preempt_count() &&
> > +                        test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
> >  }  
> 
> I'm not convinced (at all) that the preempt count is the right thing.
> 
> It works for interrupts, yes, because interrupts will increment the
> preempt count even on non-preempt kernels (since the preempt count is
> also the interrupt context level).
> 
> But what about any synchronous trap handling?
> 
> In other words, just something like a page fault? A page fault doesn't
> increment the preemption count (and in fact many page faults _will_
> obviously re-schedule as part of waiting for IO).

I wonder if we should make it a rule to not allow page faults when
RESCHED_ALLOW is set? Yeah, we can preempt in page faults, but that's not
what the allow_resched() is about. The main purpose of that function,
according to the change log, is for kernel threads. Do kernel threads page
fault? (perhaps for vmalloc? but do we take locks in those cases?).

That is, treat allow_resched() like preempt_disable(). If we page fault
with "preempt_disable()" we usually complain about that (unless we do some
magic with *_nofault() functions).

Then we could just add checks in the page fault handlers to see if
allow_resched() is set, and if so, complain about it like we do with
preempt_disable in the might_fault() function.
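
For instance, something like this (just a sketch, mirroring how we already
complain about faulting with preemption disabled):

	/* e.g. in __might_fault() / the arch fault path (illustrative) */
	WARN_ON_ONCE(resched_allowed());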


-- Steve
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Mon, 11 Sept 2023 at 09:48, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I wonder if we should make it a rule to not allow page faults when
> RESCHED_ALLOW is set?

I really think that user copies might actually be one of the prime targets.

Right now we special-case big user copies - see for example
copy_chunked_from_user().

But that's an example of exactly the problem this code has - we
literally make more complex - and objectively *WORSE* - code just to
deal with "I want this to be interruptible".

So yes, we could limit RESCHED_ALLOW to not allow page faults, but big
user copies literally are one of the worst problems.

Another example of this is just plain read/write. It's not a
problem in practice right now, because large pages are effectively
never used.

But just imagine what happens once filemap_read() actually does big folios?

Do you really want this code:

        copied = copy_folio_to_iter(folio, offset, bytes, iter);

to forever use the artificial chunking it does now?

And yes, right now it will still do things in one-page chunks in
copy_page_to_iter(). It doesn't even have cond_resched() - it's
currently in the caller, in filemap_read().

But just think about possible futures.

Now, one option really is to do what I think PeterZ kind of alluded to
- start deprecating PREEMPT_VOLUNTARY and PREEMPT_NONE entirely.

Except we've actually been *adding* to this whole mess, rather than
removing it. So we have actively *expanded* on that preemption choice
with PREEMPT_DYNAMIC.

That's actually reasonably recent, implying that distros really want
to still have the option.

And it seems like it's actually server people who want the "no
preemption" (and presumably avoid all the preempt count stuff entirely
- it's not necessarily the *preemption* that is the cost, it's the
incessant preempt count updates)

                            Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Matthew Wilcox 2 years, 3 months ago
On Mon, Sep 11, 2023 at 01:50:53PM -0700, Linus Torvalds wrote:
> Another example of this this is just plain read/write. It's not a
> problem in practice right now, because large pages are effectively
> never used.
> 
> But just imagine what happens once filemap_read() actually does big folios?
> 
> Do you really want this code:
> 
>         copied = copy_folio_to_iter(folio, offset, bytes, iter);
> 
> to forever use the artificial chunking it does now?
> 
> And yes, right now it will still do things in one-page chunks in
> copy_page_to_iter(). It doesn't even have cond_resched() - it's
> currently in the caller, in filemap_read().

Ah, um.  If you take a look in fs/iomap/buffered-io.c, you'll
see ...

iomap_write_iter:
        size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
                struct folio *folio;
                bytes = min(chunk - offset, iov_iter_count(i));
                if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
                copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

So we do still cond_resched(), but we might go up to PMD_SIZE
between calls.  This is new code in 6.6 so it hasn't seen use by too
many users yet, but it's certainly bigger than the 16 pages used by
copy_chunked_from_user().  I honestly hadn't thought about preemption
latency.
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Mon, 11 Sept 2023 at 20:27, Matthew Wilcox <willy@infradead.org> wrote:
>
> So we do still cond_resched(), but we might go up to PMD_SIZE
> between calls.  This is new code in 6.6 so it hasn't seen use by too
> many users yet, but it's certainly bigger than the 16 pages used by
> copy_chunked_from_user().  I honestly hadn't thought about preemption
> latency.

The thing about cond_resched() is that you literally won't get anybody
who complains until the big page case is common enough that it hits
special people.

This is also a large part of why I dislike cond_resched() a lot. It's
not just that it's sprinkled randomly in our code-base, it's that it's
*found* and added so randomly.

Some developers will look at code and say "this may be a long loop"
and add it without any numbers. It's rare, but it happens.

And other than that it usually is something like the RT people who
have the latency trackers, and one particular load that they use for
testing.

Oh well. Enough kvetching. I'm not happy about it, but in the end it's
a small annoyance, not a big issue.

                Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Steven Rostedt 2 years, 3 months ago
On Mon, 11 Sep 2023 13:50:53 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And it seems like it's actually server people who want the "no
> preemption" (and presumably avoid all the preempt count stuff entirely
> - it's not necessarily the *preemption* that is the cost, it's the
> incessant preempt count updates)

I'm sure there's some overhead with the preemption itself. With the
meltdown/spectre mitigations, going into and out of the kernel does add some
more overhead. And finishing a system call before being preempted may give
some performance benefits for some micro benchmark out there.

Going out on a crazy idea, I wonder if we could get the compiler to help us
here. As all preempt disabled locations are static, and as for functions,
they can be called with preemption enabled or disabled. Would it be
possible for the compiler to mark all locations that need preemption disabled?

If a function is called in a preempt disabled section and also called in a
preempt enable section, it could make two versions of the function (one
where preemption is disabled and one where it is enabled).

Then all we would need is a look up table to know if preemption is safe or
not by looking at the instruction pointer.

Yes, I know this is kind of a wild idea, but I do believe it is possible.

The compiler wouldn't need to know of the concept of "preemption", just a
"make this location special, and keep functions called by that location
special and duplicate them if they are called outside of this special
section".

 ;-)

-- Steve
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Mon, 11 Sept 2023 at 15:20, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Going out on a crazy idea, I wonder if we could get the compiler to help us
> here. As all preempt disabled locations are static, and as for functions,
> they can be called with preemption enabled or disabled. Would it be
> possible for the compiler to mark all locations that need preemption disabled?

Definitely not.

Those preempt-disabled areas aren't static, for one thing. Any time
you take any exception in kernel space, your exception handler is
dynamically preemptible or not (possibly depending on architecture
details).

Yes, most exception handlers then have magic rules: page faults won't
get past a particular point if they happened while not preemptible,
for example. And interrupts will disable preemption themselves.

But we have a ton of code that runs lots of subtle code in exception
handlers that is very architecture-dependent, whether it is things
like unaligned fixups, or instruction rewriting things for dynamic
calls, or a lot of very grotty code.

Most (all?) of it could probably be made to be non-preemptible, but
it's a lot of code for a lot of architectures, and it's not the
trivial kind.

And that's ignoring all the code that is run in just regular process
context with no exceptions that is sometimes run under spinlocks, and
sometimes not. There's a *lot* of it. Think something as trivial as
memcpy(), but also kmalloc() or any number of stuff that is just
random support code that can be used from atomic (non-preemptible)
context.

So even if we could rely on magic compiler support that doesn't exist
- which we can't - it's not even *remotely* as static as you seem to
think it is.

                 Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Steven Rostedt <rostedt@goodmis.org> writes:

> On Mon, 11 Sep 2023 13:50:53 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> And it seems like it's actually server people who want the "no
>> preemption" (and presumably avoid all the preempt count stuff entirely
>> - it's not necessarily the *preemption* that is the cost, it's the
>> incessant preempt count updates)
>
> I'm sure there's some overhead with the preemption itself. With the
> meltdown/spectre mitigations going into and out of the kernel does add some
> more overhead. And finishing a system call before being preempted may give
> some performance benefits for some micro benchmark out there.
>
> Going out on a crazy idea, I wonder if we could get the compiler to help us
> here. As all preempt disabled locations are static, and as for functions,
> they can be called with preemption enabled or disabled. Would it be
> possible for the compiler to mark all locations that need preemption disabled?

An even crazier version of that idea would be to have
preempt_disable/enable() demarcate regions, and have the compiler put all
of the preemption-disabled regions out-of-line into a special section.
Seems to me that then we could do away with preempt_enable/disable()?
(Ignoring the preempt_count used in hardirq etc.)

This would allow preemption always, unless executing in the
preemption-disabled section.

Though I don't have any intuition for how much extra call overhead this
would add.

Ankur

> If a function is called in a preempt disabled section and also called in a
> preempt enable section, it could make two versions of the function (one
> where preemption is disabled and one where it is enabled).
>
> Then all we would need is a look up table to know if preemption is safe or
> not by looking at the instruction pointer.
>
> Yes, I know this is kind of a wild idea, but I do believe it is possible.
>
> The compiler wouldn't need to know of the concept of "preemption" just a
> "make this location special, and keep functions called by that location
> special and duplicate them if they are are called outside of this special
> section".
>
>  ;-)
>
> -- Steve
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Steven Rostedt 2 years, 3 months ago
On Mon, 11 Sep 2023 16:10:31 -0700
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> An even crazier version of that idea would be to have
> preempt_disable/enable() demarcate regions, and the compiler putting all
> of the preemption disabled region out-of-line to a special section.
> Seems to me, that then we could do away to preempt_enable/disable()?
> (Ignoring the preempt_count used in hardirq etc.)
> 

I thought about this too, but wasn't sure if it would be easier or harder
to implement. This would still require the duplicate functions (which I
guess would be the most difficult part).

> This would allow preemption always, unless executing in the
> preemption-disabled section.
> 
> Though I don't have any intuition for how much extra call overhead this
> would add.

I don't think this version would have as high of an overhead. You would get
a direct jump (which isn't bad as all speculation knows exactly where to
look), and it would improve the look up. No table, just a simple range
check.

-- Steve
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Mon, 11 Sept 2023 at 13:50, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Except we've actually been *adding* to this whole mess, rather than
> removing it. So we have actively *expanded* on that preemption choice
> with PREEMPT_DYNAMIC.

Actually, that config option makes no sense.

It makes the cond_resched() behavior conditional with a static call.

But all the *real* overhead is still there and unconditional (ie all
the preempt count updates and the "did it go down to zero and we need
to check" code).

That just seems stupid. It seems to have all the overhead of a
preemptible kernel, just not doing the preemption.

So I must be mis-reading this, or just missing something important.

The real cost seems to be

   PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT

and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both
will end up with that, and thus both cases will have all the spinlock
preempt count stuff.

There must be some non-preempt_count cost that people worry about.

Or maybe I'm just mis-reading the Kconfig stuff entirely. That's
possible, because this seems *so* pointless to me.

Somebody please hit me with a clue-bat to the noggin.

                Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Peter Zijlstra 2 years, 3 months ago
On Mon, Sep 11, 2023 at 02:16:18PM -0700, Linus Torvalds wrote:
> On Mon, 11 Sept 2023 at 13:50, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Except we've actually been *adding* to this whole mess, rather than
> > removing it. So we have actively *expanded* on that preemption choice
> > with PREEMPT_DYNAMIC.
> 
> Actually, that config option makes no sense.
> 
> It makes the sched_cond() behavior conditional with a static call.
> 
> But all the *real* overhead is still there and unconditional (ie all
> the preempt count updates and the "did it go down to zero and we need
> to check" code).
> 
> That just seems stupid. It seems to have all the overhead of a
> preemptible kernel, just not doing the preemption.
> 
> So I must be mis-reading this, or just missing something important.
> 
> The real cost seems to be
> 
>    PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT
> 
> and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both
> will end up with that, and thus both cases will have all the spinlock
> preempt count stuff.
> 
> There must be some non-preempt_count cost that people worry about.
> 
> Or maybe I'm just mis-reading the Kconfig stuff entirely. That's
> possible, because this seems *so* pointless to me.
> 
> Somebody please hit me with a clue-bat to the noggin.

Well, I was about to reply to your previous email explaining this, but
this one time I did read more email..

Yes, PREEMPT_DYNAMIC has all the preempt count twiddling and only nops
out the schedule()/cond_resched() calls where appropriate.
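
(Roughly, and simplified: cond_resched() goes through a static call that is
patched to a return-0 NOP when full preemption is selected at boot, while
the preempt_count() accounting itself stays unconditional:)

	DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);

	/* cond_resched() then effectively becomes: */
	static __always_inline int _cond_resched(void)
	{
		return static_call_mod(cond_resched)();
	}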

This work was done by a distro (SuSE) and if they're willing to ship
this I'm thinking the overheads are acceptable to them.

For a significant number of workloads the real overhead is the extra
preemptions themselves more than the counting -- but yes, the counting
is measurable, but probably in the noise compared to some of the
other horrible things we have done over the past years.

Anyway, if distros are fine shipping with PREEMPT_DYNAMIC, then yes,
deleting the other options is definitely an option.
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ingo Molnar 2 years, 3 months ago
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 11, 2023 at 02:16:18PM -0700, Linus Torvalds wrote:
> > On Mon, 11 Sept 2023 at 13:50, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Except we've actually been *adding* to this whole mess, rather than
> > > removing it. So we have actively *expanded* on that preemption choice
> > > with PREEMPT_DYNAMIC.
> > 
> > Actually, that config option makes no sense.
> > 
> > It makes the sched_cond() behavior conditional with a static call.
> > 
> > But all the *real* overhead is still there and unconditional (ie all
> > the preempt count updates and the "did it go down to zero and we need
> > to check" code).
> > 
> > That just seems stupid. It seems to have all the overhead of a
> > preemptible kernel, just not doing the preemption.
> > 
> > So I must be mis-reading this, or just missing something important.
> > 
> > The real cost seems to be
> > 
> >    PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT
> > 
> > and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both
> > will end up with that, and thus both cases will have all the spinlock
> > preempt count stuff.
> > 
> > There must be some non-preempt_count cost that people worry about.
> > 
> > Or maybe I'm just mis-reading the Kconfig stuff entirely. That's
> > possible, because this seems *so* pointless to me.
> > 
> > Somebody please hit me with a clue-bat to the noggin.
> 
> Well, I was about to reply to your previous email explaining this, but 
> this one time I did read more email..
> 
> Yes, PREEMPT_DYNAMIC has all the preempt count twiddling and only nops 
> out the schedule()/cond_resched() calls where appropriate.
> 
> This work was done by a distro (SuSE) and if they're willing to ship this 
> I'm thinking the overheads are acceptable to them.
> 
> For a significant number of workloads the real overhead is the extra 
> preepmtions themselves more than the counting -- but yes, the counting is 
> measurable, but probably in the noise compared to other some of the other 
> horrible things we have done the past years.
> 
> Anyway, if distros are fine shipping with PREEMPT_DYNAMIC, then yes, 
> deleting the other options are definitely an option.

Yes, so my understanding is that distros generally worry more about
macro-overhead, for example material changes to a random subset of key
benchmarks that specific enterprise customers care about, and distros are
not nearly as sensitive to the micro-overhead that preempt_count()
maintenance causes.

PREEMPT_DYNAMIC is basically a reflection of that: the desire to have only 
a single kernel image, but a boot-time toggle to differentiate between 
desktop and server loads and have CONFIG_PREEMPT (desktop) but also 
PREEMPT_VOLUNTARY behavior (server).

There's also the view that PREEMPT kernels are a bit more QA-friendly,
because atomic code sequences are much better defined & enforced via kernel
warnings. Without preempt_count we only have irqs-off warnings, which cover
only a small fraction of all critical sections in the kernel.

Ideally we'd be able to patch out most of the preempt_count maintenance 
overhead too - OTOH these days it's little more than noise on most CPUs, 
considering the kind of horrible security-workaround overhead we have on 
almost all x86 CPU types ... :-/

Thanks,

	Ingo
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> I think we can keep these checks, but with this fixed definition of
>> resched_allowed(). This might be better:
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
>>
>>  static __always_inline bool resched_allowed(void)
>>  {
>> -       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>> +       return unlikely(!preempt_count() &&
>> +                        test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>>  }
>
> I'm not convinced (at all) that the preempt count is the right thing.
>
> It works for interrupts, yes, because interrupts will increment the
> preempt count even on non-preempt kernels (since the preempt count is
> also the interrupt context level).
>
> But what about any synchronous trap handling?
>
> In other words, just something like a page fault? A page fault doesn't
> increment the preemption count (and in fact many page faults _will_
> obviously re-schedule as part of waiting for IO).
>
> A page fault can *itself* say "feel free to preempt me", and that's one thing.
>
> But a page fault can also *interupt* something that said "feel free to
> preempt me", and that's a completely *different* thing.
>
> So I feel like the "tsk_thread_flag" was sadly completely the wrong
> place to add this bit to, and the wrong place to test it in. What we
> really want is "current kernel entry context".

So, what we want allow_resched() to say is: feel free to reschedule
if in a reschedulable context.

The problem with doing that with an allow_resched tsk_thread_flag is
that the flag is really only valid while the task is executing in the
context in which it was set.
And, trying to validate the flag by checking the preempt_count() makes
it pretty fragile, given that we are now tying it to the specifics of
whether the handling of arbitrary interrupts bumps up the
preempt_count() or not.

> So the right thing to do would basically be to put it in the stack
> frame at kernel entry - whether that kernel entry was a system call
> (which is doing some big copy that should be preemptible without us
> having to add "cond_resched()" in places), or is a page fault (which
> will also do things like big page clearings for hugepages)

Seems to me that associating an allow_resched flag with the stack also
has a similar issue. Couldn't the context level change while we are on the
same stack?

I guess the problem is that allow_resched()/disallow_resched() really
need to demarcate a section of code having some property, but instead
set up state that has much wider scope.

Maybe code that allows resched can be in a new .section ".text.resched"
or whatever, and we could use something like this as a check:

  int resched_allowed(struct pt_regs *regs)
  {
        return !preempt_count() && in_resched_function(regs->rip);
  }

(allow_resched()/disallow_resched() shouldn't be needed except for
debug checks.)
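
in_resched_function() could then just be a bounds check against that
section -- rough sketch, with the section symbol names assumed:

	extern char __resched_text_start[], __resched_text_end[];

	static inline bool in_resched_function(unsigned long ip)
	{
		return ip >= (unsigned long)__resched_text_start &&
		       ip <  (unsigned long)__resched_text_end;
	}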

We still need the !preempt_count() check, but now both the conditions
in the test express two orthogonal ideas:

  - !preempt_count(): preemption is safe in the current context
  - in_resched_function(regs->rip): okay to reschedule here

So in this example, it should allow scheduling inside both the
clear_pages_reschedulable() calls:

  -> page_fault()
     clear_page_reschedulable();
     -> page_fault()
        clear_page_reschedulable();

Here though, rescheduling could happen only in the first call to
clear_page_reschedulable():

  -> page_fault()
     clear_page_reschedulable();
     -> hardirq()
         -> page_fault()
            clear_page_reschedulable();

Does that make any sense, or am I just talking through my hat?

--
ankur
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Sun, 10 Sept 2023 at 03:01, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Seems to me that associating an allow_resched flag with the stack also
> has similar issue. Couldn't the context level change while we are on the
> same stack?

On x86-64 no, but in other situations yes.

> I guess the problem is that allow_resched()/disallow_resched() really
> need to demarcate a section of code having some property, but instead
> set up state that has much wider scope.
>
> Maybe code that allows resched can be in a new .section ".text.resched"
> or whatever, and we could use something like this as a check:

Yes. I'm starting to think that the only sane solution is to
limit cases that can do this a lot, and the "instruction pointer
region" approach would certainly work.

At the same time I really hate that, because I was hoping we'd be able
to use this to not have so many of those annoying and random
"cond_resched()" calls.

I literally merged another batch of "random stupid crap" an hour ago:
commit 3d7d72a34e05 ("x86/sgx: Break up long non-preemptible delays in
sgx_vepc_release()") literally just adds manual 'cond_resched()' calls
in random places.

I was hoping that we'd have some generic way to deal with this where
we could just say "this thing is reschedulable", and get rid of - or
at least not increasingly add to - the cond_resched() mess.

Of course, that was probably always unrealistic, and those random
cond_resched() calls we just added probably couldn't just be replaced
by "you can reschedule me" simply because the functions quite possibly
end up taking some lock hidden in one of the xa_xyz() functions.

For the _particular_ case of "give me a preemptible big memory copy /
clear", the section model seems fine. It's just that we do have quite
a bit of code where we can end up with long loops that want that
cond_resched() too that I was hoping we'd _also_ be able to solve.

                   Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Peter Zijlstra 2 years, 3 months ago
On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:

> I was hoping that we'd have some generic way to deal with this where
> we could just say "this thing is reschedulable", and get rid of - or
> at least not increasingly add to - the cond_resched() mess.

Isn't that called PREEMPT=y ? That tracks precisely all the constraints
required to know when/if we can preempt.

The whole voluntary preempt model is basically the traditional
co-operative preemption model and that fully relies on manual yields.

The problem with the REP prefix (and Xen hypercalls) is that
they're long running instructions and it becomes fundamentally
impossible to put a cond_resched() in.

> Yes. I'm starting to think that that the only sane solution is to
> limit cases that can do this a lot, and the "instruciton pointer
> region" approach would certainly work.

From a code locality / I-cache POV, I think a sorted list of
(non overlapping) ranges might be best.
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Peter Zijlstra <peterz@infradead.org> writes:

> On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:
>
>> I was hoping that we'd have some generic way to deal with this where
>> we could just say "this thing is reschedulable", and get rid of - or
>> at least not increasingly add to - the cond_resched() mess.
>
> Isn't that called PREEMPT=y ? That tracks precisely all the constraints
> required to know when/if we can preempt.
>
> The whole voluntary preempt model is basically the traditional
> co-operative preemption model and that fully relies on manual yields.

Yeah, but as Linus says, this means a lot of code is just full of
cond_resched(). For instance, a loop in process_huge_page() uses
this pattern:

   for (...) {
       cond_resched();
       clear_page(i);

       cond_resched();
       clear_page(j);
   }

> The problem with the REP prefix (and Xen hypercalls) is that
> they're long running instructions and it becomes fundamentally
> impossible to put a cond_resched() in.
>
>> Yes. I'm starting to think that that the only sane solution is to
>> limit cases that can do this a lot, and the "instruciton pointer
>> region" approach would certainly work.
>
> From a code locality / I-cache POV, I think a sorted list of
> (non overlapping) ranges might be best.

Yeah, agreed. There are a few problems with doing that though.

I was thinking of using a check of this kind to schedule out when
it is executing in this "reschedulable" section:
        !preempt_count() && in_resched_function(regs->rip);

For preemption=full, this should mostly work.
For preemption=voluntary, though this'll only work with out-of-line
locks, not if the lock is inlined.

(Both should have problems with __this_cpu_* and the like, but
maybe we can handwave that away with sparse/objtool etc.)

How expensive would be always having PREEMPT_COUNT=y?

--
ankur
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Peter Zijlstra 2 years, 3 months ago
On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
> 
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:
> >
> >> I was hoping that we'd have some generic way to deal with this where
> >> we could just say "this thing is reschedulable", and get rid of - or
> >> at least not increasingly add to - the cond_resched() mess.
> >
> > Isn't that called PREEMPT=y ? That tracks precisely all the constraints
> > required to know when/if we can preempt.
> >
> > The whole voluntary preempt model is basically the traditional
> > co-operative preemption model and that fully relies on manual yields.
> 
> Yeah, but as Linus says, this means a lot of code is just full of
> cond_resched(). For instance a loop the process_huge_page() uses
> this pattern:
> 
>    for (...) {
>        cond_resched();
>        clear_page(i);
> 
>        cond_resched();
>        clear_page(j);
>    }

Yeah, that's what co-operative preemption gets you.

> > The problem with the REP prefix (and Xen hypercalls) is that
> > they're long running instructions and it becomes fundamentally
> > impossible to put a cond_resched() in.
> >
> >> Yes. I'm starting to think that that the only sane solution is to
> >> limit cases that can do this a lot, and the "instruciton pointer
> >> region" approach would certainly work.
> >
> > From a code locality / I-cache POV, I think a sorted list of
> > (non overlapping) ranges might be best.
> 
> Yeah, agreed. There are a few problems with doing that though.
> 
> I was thinking of using a check of this kind to schedule out when
> it is executing in this "reschedulable" section:
>         !preempt_count() && in_resched_function(regs->rip);
> 
> For preemption=full, this should mostly work.
> For preemption=voluntary, though this'll only work with out-of-line
> locks, not if the lock is inlined.
> 
> (Both, should have problems with __this_cpu_* and the like, but
> maybe we can handwave that away with sparse/objtool etc.)

So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
thing, and then only search the range when TIF flag is set.

And I'm thinking it might be a good idea to have objtool validate that the
range only contains simple instructions; the moment it contains control
flow I'm thinking it's too complicated.
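
Something like this on the interrupt-return-to-kernel path, perhaps
(sketch only; in_resched_range() is the assumed range lookup):

	/* only consult the sorted range table when the TIF bit is set */
	if (need_resched() && !preempt_count() &&
	    test_thread_flag(TIF_RESCHED_ALLOW) &&
	    in_resched_range(instruction_pointer(regs)))
		preempt_schedule_irq();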

> How expensive would be always having PREEMPT_COUNT=y?

Effectively I think that is true today. At the very least Debian and
SuSE (I can't find a RHEL .config in a hurry but I would think they too)
ship with PREEMPT_DYNAMIC=y.

Mel, I'm sure you ran numbers at the time (you always do), what, if any,
was the measured overhead from PREEMPT_DYNAMIC vs 'regular' voluntary
preemption?
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Thomas Gleixner 2 years, 3 months ago
On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>> > The problem with the REP prefix (and Xen hypercalls) is that
>> > they're long running instructions and it becomes fundamentally
>> > impossible to put a cond_resched() in.
>> >
>> >> Yes. I'm starting to think that that the only sane solution is to
>> >> limit cases that can do this a lot, and the "instruciton pointer
>> >> region" approach would certainly work.
>> >
>> > From a code locality / I-cache POV, I think a sorted list of
>> > (non overlapping) ranges might be best.
>> 
>> Yeah, agreed. There are a few problems with doing that though.
>> 
>> I was thinking of using a check of this kind to schedule out when
>> it is executing in this "reschedulable" section:
>>         !preempt_count() && in_resched_function(regs->rip);
>> 
>> For preemption=full, this should mostly work.
>> For preemption=voluntary, though this'll only work with out-of-line
>> locks, not if the lock is inlined.
>> 
>> (Both, should have problems with __this_cpu_* and the like, but
>> maybe we can handwave that away with sparse/objtool etc.)
>
> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
> thing, and then only search the range when TIF flag is set.
>
> And I'm thinking it might be a good idea to have objtool validate the
> range only contains simple instructions, the moment it contains control
> flow I'm thinking it's too complicated.

Can we take a step back and look at the problem from a scheduling
perspective?

The basic operation of a non-preemptible kernel is time slice
scheduling, which means that a task can run more or less undisturbed for
a full time slice once it gets on the CPU unless it schedules away
voluntarily via a blocking operation.

This works pretty well as long as everything runs in userspace as the
preemption points in the return to user space path are independent of
the preemption model.

These preemption points handle both time slice exhaustion and priority
based preemption.

With PREEMPT=NONE these are the only available preemption points.

That means that kernel code can run more or less indefinitely until it
schedules out or returns to user space, which is obviously not possible
for kernel threads.

To prevent starvation the kernel gained voluntary preemption points,
i.e. cond_resched(), which has to be added manually to code as a
developer sees fit.

Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
additional preemption points. might_resched() utilizes the existing
might_sleep() debug points, which are in code paths which might block on
a contended resource. These debug points are mostly in core and
infrastructure code and are in code paths which can block anyway. The
only difference is that they allow preemption even when the resource is
uncontended.

Additionally we have PREEMPT=FULL which utilizes every zero transition
of preempt_count as a potential preemption point.

Now we have the situation of long running data copies or data clear
operations which run fully in hardware, but can be interrupted. As the
interrupt return to kernel mode does not preempt in the NONE and
VOLUNTARY cases, new workarounds emerged, mostly by defining a data
chunk size and adding cond_resched() again.

That's ugly and does not work for long lasting hardware operations, so we
ended up with the suggestion of TIF_ALLOW_RESCHED to work around
that. But again this needs to be manually annotated in the same way as an
IP range based preemption scheme requires annotation.

TBH. I detest all of this.

Both cond_resched() and might_sleep/sched() are completely random
mechanisms as seen from time slice operation, and the data chunk based
mechanism is just heuristics which work as well as heuristics tend to
work. allow_resched() is not any different, and IP based preemption
mechanisms are not going to be any better.

The approach here is: Prevent the scheduler to make decisions and then
mitigate the fallout with heuristics.

That's just backwards as it moves resource control out of the scheduler
into random code which has absolutely no business to do resource
control.

We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
held sections became preemptible caused even more preemption activity
than on a PREEMPT=FULL kernel. The worst side effect of that was
extensive lock contention.

The way we addressed that was to add a lazy preemption mode, which
tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
preempt tasks which all belong to the SCHED_OTHER scheduling class. This
works pretty well and gains back a massive amount of performance for the
non-realtime throughput oriented tasks without affecting the
schedulability of real-time tasks at all. IOW, it does not take control
away from the scheduler. It cooperates with the scheduler and leaves the
ultimate decisions to it.

I think we can do something similar for the problem at hand, which
avoids most of these heuristic horrors and control boundary violations.

The main issue is that long running operations do not honour the time
slice and we work around that with cond_resched() and now have ideas
with this new TIF bit and IP ranges.

None of that is really well defined with respect to time slices. In fact
it's not defined at all versus any aspect of scheduling behaviour.

What about the following:

   1) Keep preemption count and the real preemption points enabled
      unconditionally. That's not more overhead than the current
      PREEMPT_DYNAMIC mechanism as long as the preemption count does not
      go to zero, i.e. the folded NEED_RESCHED bit stays set.

      From earlier experiments I know that the overhead of preempt_count
      is minimal and only really observable with micro benchmarks.
      Otherwise it ends up in the noise as long as the slow path is not
      taken.

      I did a quick check comparing a plain inc/dec pair vs. the
      PREEMPT_DYNAMIC inc/dec_and_test+NOOP mechanism and the delta is
      in the non-conclusive noise.

      20 years ago this was a real issue because we did not have:

       - the folding of NEED_RESCHED into the preempt count
       
       - the cacheline optimizations which make the preempt count cache
         pretty much always cache hot

       - the hardware was way less capable

      I'm not saying that preempt_count is completely free today as it
      obviously adds more text and affects branch predictors, but as the
      major distros ship with PREEMPT_DYNAMIC enabled it is obviously an
      acceptable and tolerable tradeoff.

   2) When the scheduler wants to set NEED_RESCHED, it sets
      NEED_RESCHED_LAZY instead, which is only evaluated in the return to
      user space preemption points.

      As NEED_RESCHED_LAZY is not folded into the preemption count the
      preemption count won't become zero, so the task can continue until
      it hits return to user space.

      That preserves the existing behaviour.

   3) When the scheduler tick observes that the time slice is exhausted,
      then it folds the NEED_RESCHED bit into the preempt count which
      causes the real preemption points to actually preempt including
      the return from interrupt to kernel path.

      That even allows the scheduler to enforce preemption for e.g. RT
      class tasks without changing anything else.

      I'm pretty sure that this gets rid of cond_resched(), which is an
      impressive list of instances:

	./drivers	 392
	./fs		 318
	./mm		 189
	./kernel	 184
	./arch		  95
	./net		  83
	./include	  46
	./lib		  36
	./crypto	  16
	./sound		  16
	./block		  11
	./io_uring	  13
	./security	  11
	./ipc		   3

      That list clearly documents that the majority of these
      cond_resched() invocations is in code which neither should care
      nor should have any influence on the core scheduling decision
      machinery.
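
Taken together, 2) and 3) roughly amount to the following split (just a
sketch; the flag name and the folding details are assumed):

	/* return to user space: honour both bits */
	if (test_thread_flag(TIF_NEED_RESCHED) ||
	    test_thread_flag(TIF_NEED_RESCHED_LAZY))
		schedule();

	/*
	 * In-kernel preemption points: only react to the real bit, which
	 * the tick folds in once the time slice is exhausted.
	 */
	if (!preempt_count() && test_thread_flag(TIF_NEED_RESCHED))
		preempt_schedule();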

I think it's worth a try as it just fits into the existing preemption
scheme, solves the issue of long running kernel functions, prevents
invalid preemption and can utilize the existing instrumentation and
debug infrastructure.

Most importantly it gives control back to the scheduler and does not
make it depend on the mercy of cond_resched(), allow_resched() or
whatever heuristics sprinkled all over the kernel.

To me this makes a lot of sense, but I might be on the completely wrong
track. So feel free to tell me that I'm completely nuts and/or just not
seeing the obvious.

Thanks,

        tglx
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Steven Rostedt 2 years, 1 month ago
On Tue, 19 Sep 2023 01:42:03 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

>    2) When the scheduler wants to set NEED_RESCHED due it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.
> 
>       As NEED_RESCHED_LAZY is not folded into the preemption count the
>       preemption count won't become zero, so the task can continue until
>       it hits return to user space.
> 
>       That preserves the existing behaviour.

I'm looking into extending this concept to user space and to VMs.

I'm calling this the "extended scheduler time slice" (ESTS pronounced "estis")

The idea is this. Have VMs/user space share a memory region with the
kernel that is per thread/vCPU. This would be registered via a syscall or
ioctl on some defined file or whatever. Then, when entering user space /
VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set, it
checks if the thread has this memory region and a special bit in it is
set, and if it does, it does not schedule. It will treat it like a long
kernel system call.

The kernel will then set another bit in the shared memory region that will
tell user space / VM that the kernel wanted to schedule, but is allowing it
to finish its critical section. When user space / VM is done with the
critical section, it will check the bit that may be set by the kernel and
if it is set, it should do a sched_yield() or VMEXIT so that the kernel can
now schedule it.

What about DoS, you say? It's no different than running a long system call.
No task can run forever. It's not a "preempt disable", it's just "give me
some more time". A "NEED_RESCHED" will always schedule, just like a kernel
system call that takes a long time. The goal is to allow user space to get
out of critical sections that we know can cause problems if they get
preempted. Usually it's that a user space / VM lock is held, or maybe a VM
interrupt handler that needs to wake up a task on another vCPU.

If we are worried about abuse, we could even punish tasks that don't call
sched_yield() by the time their extended time slice is up. Even without
that punishment, if we have EEVDF, this extension will make them less
eligible the next time around.

The goal is to prevent a thread / vCPU being preempted while holding a lock
or resource that other threads / vCPUs will want. That is, prevent
contention, as that's usually the biggest issue with performance in user
space and VMs.

I'm going to work on a POC, and see if I can get some benchmarks on how
much this could help tasks like databases and VMs in general.

-- Steve
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Sergey Senozhatsky 2 years, 1 month ago
On (23/10/24 10:34), Steven Rostedt wrote:
> On Tue, 19 Sep 2023 01:42:03 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> >    2) When the scheduler wants to set NEED_RESCHED due it sets
> >       NEED_RESCHED_LAZY instead which is only evaluated in the return to
> >       user space preemption points.
> > 
> >       As NEED_RESCHED_LAZY is not folded into the preemption count the
> >       preemption count won't become zero, so the task can continue until
> >       it hits return to user space.
> > 
> >       That preserves the existing behaviour.
> 
> I'm looking into extending this concept to user space and to VMs.
> 
> I'm calling this the "extended scheduler time slice" (ESTS pronounced "estis")
> 
> The ideas is this. Have VMs/user space share a memory region with the
> kernel that is per thread/vCPU. This would be registered via a syscall or
> ioctl on some defined file or whatever. Then, when entering user space /
> VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set, it
> checks if the thread has this memory region and a special bit in it is
> set, and if it does, it does not schedule. It will treat it like a long
> kernel system call.
> 
> The kernel will then set another bit in the shared memory region that will
> tell user space / VM that the kernel wanted to schedule, but is allowing it
> to finish its critical section. When user space / VM is done with the
> critical section, it will check the bit that may be set by the kernel and
> if it is set, it should do a sched_yield() or VMEXIT so that the kernel can
> now schedule it.
> 
> What about DOS you say? It's no different than running a long system call.
> No task can run forever. It's not a "preempt disable", it's just "give me
> some more time". A "NEED_RESCHED" will always schedule, just like a kernel
> system call that takes a long time. The goal is to allow user space to get
> out of critical sections that we know can cause problems if they get
> preempted. Usually it's a user space / VM lock is held or maybe a VM
> interrupt handler that needs to wake up a task on another vCPU.
> 
> If we are worried about abuse, we could even punish tasks that don't call
> sched_yield() by the time its extended time slice is taken. Even without
> that punishment, if we have EEVDF, this extension will make it less
> eligible the next time around.
> 
> The goal is to prevent a thread / vCPU being preempted while holding a lock
> or resource that other threads / vCPUs will want. That is, prevent
> contention, as that's usually the biggest issue with performance in user
> space and VMs.

I think some time ago we tried to check guest's preempt count on each vm-exit
and we'd vm-enter if guest exited from a critical section (those that bump
preempt count) so that it can hopefully finish whatever it was going to
do and vmexit again. We didn't look into covering guest's RCU read-side
critical sections.

Can you educate me, is your PoC significantly different from guest preempt
count check?
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Steven Rostedt 2 years, 1 month ago
On Thu, 26 Oct 2023 16:50:16 +0900
Sergey Senozhatsky <senozhatsky@chromium.org> wrote:

> > The goal is to prevent a thread / vCPU being preempted while holding a lock
> > or resource that other threads / vCPUs will want. That is, prevent
> > contention, as that's usually the biggest issue with performance in user
> > space and VMs.  
> 
> I think some time ago we tried to check guest's preempt count on each vm-exit
> and we'd vm-enter if guest exited from a critical section (those that bump
> preempt count) so that it can hopefully finish whatever is was going to
> do and vmexit again. We didn't look into covering guest's RCU read-side
> critical sections.
> 
> Can you educate me, is your PoC significantly different from guest preempt
> count check?

No, it's probably very similar. Just the mechanism to allow it to run
longer may be different.

-- Steve
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Steven Rostedt 2 years, 1 month ago
On Tue, 24 Oct 2023 10:34:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'm going to work on a POC, and see if I can get some benchmarks on how
> much this could help tasks like databases and VMs in general.

And that was much easier than I thought it would be. It also shows some
great results!

I started with Thomas's PREEMPT_AUTO.patch from the rt-devel tree:

 https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches

So you need to select:

  CONFIG_PREEMPT_AUTO

The below is my proof of concept patch. It still has debugging in it, and
I'm sure the interface will need to be changed.

There's now a new file:  /sys/kernel/extend_sched

Attached is a program that tests it. It mmaps that file, with:

 struct extend_map {
	unsigned long		flags;
 };

 static __thread struct extend_map *extend_map;

That is, there's this structure for every thread. It's assigned with:

	fd = open("/sys/kernel/extend_sched", O_RDWR);
	extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

I don't actually like this interface, as it wastes a full page for just two
bits :-p

Anyway, to tell the kernel to "extend" the time slice if possible because
it's in a critical section, we have:

 static void extend(void)
 {
	if (!extend_map)
		return;

	extend_map->flags = 1;
 }

And to say that it's done:

 static void unextend(void)
 {
	unsigned long prev;

	if (!extend_map)
		return;

	prev = xchg(&extend_map->flags, 0);
	if (prev & 2)
		sched_yield();
 }

So, bit 1 is for user space to tell the kernel "please extend me", and bit
2 is for the kernel to tell user space "OK, I extended you, but call
sched_yield() when done".

This test program creates 1 + number-of-CPUs threads that run in a loop
for 5 seconds. Each thread will grab a user space spin lock (not a futex,
but just shared memory). Before grabbing the lock it will call "extend()";
if it fails to grab the lock, it calls "unextend()" and spins on the lock
until it's free, then tries again. After it gets the lock, it
will update a counter and release the lock, calling "unextend()" as well.
Then it will spin on the counter until it increments again, to allow another
task to get into the critical section.
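
In rough pseudo-C, each iteration of the loop looks like this (paraphrasing
the attached program; grab_lock()/lock_busy()/release_lock() stand in for
the shared-memory spinlock operations):

	extend();			/* "please extend my time slice" */
	while (!grab_lock()) {
		unextend();		/* don't hold the extension while spinning */
		while (lock_busy())
			;
		extend();
	}
	counter++;			/* the critical section */
	release_lock();
	unextend();			/* calls sched_yield() if the kernel asked */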

With the init of the extend_map disabled, so that it doesn't use the extend
code, it ends with:

 Ran for 3908165 times
 Total wait time: 33.965654

I can give you stdev and all that too, but the above is pretty much the
same after several runs.

After enabling the extend code, it has:

 Ran for 4829340 times
 Total wait time: 32.635407

It was able to get into the critical section almost 1 million times more in
those 5 seconds! That's a 23% improvement!

The wait time for getting into the critical section also dropped by a
total of over a second (4% improvement).

I ran a traceeval tool on it (still a work in progress, but I can post it
when it's done), using the following trace command and the writes to
trace_marker (tracefs_printf):

 trace-cmd record -e sched_switch ./extend-sched

It showed that without the extend, each task was preempted while holding
the lock around 200 times. With the extend, only one task was ever
preempted while holding the lock, and it only happened once!

Below is my patch (with debugging and on top of Thomas's PREEMPT_AUTO.patch):

Attached is the program I tested it with. It uses libtracefs to write to
the trace_marker file, and builds with:

  gcc -o extend-sched extend-sched.c `pkg-config --libs --cflags libtracefs` -lpthread

If you don't want to build it with libtracefs, you can just do:

 grep -v tracefs extend-sched.c > extend-sched-notracefs.c

And build that.

-- Steve

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9b13b7d4f1d3..fb540dd0dec0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -740,6 +740,10 @@ struct kmap_ctrl {
 #endif
 };
 
+struct extend_map {
+	long				flags;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -802,6 +806,8 @@ struct task_struct {
 	unsigned int			core_occupation;
 #endif
 
+	struct extend_map		*extend_map;
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c1f706038637..21d0e4d81d33 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -147,17 +147,32 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 					    unsigned long ti_work)
 {
+	unsigned long ignore_mask;
+
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
 	while (ti_work & EXIT_TO_USER_MODE_WORK) {
+		ignore_mask = 0;
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+		if (ti_work & _TIF_NEED_RESCHED) {
 			schedule();
 
+		} else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+			if (!current->extend_map ||
+			    !(current->extend_map->flags & 1)) {
+				schedule();
+			} else {
+				trace_printk("Extend!\n");
+				/* Allow to leave with NEED_RESCHED_LAZY still set */
+				ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+				current->extend_map->flags |= 2;
+			}
+		}
+
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
 
@@ -184,6 +199,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
+		ti_work &= ~ignore_mask;
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/exit.c b/kernel/exit.c
index edb50b4c9972..ddf89ec9ab62 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -906,6 +906,13 @@ void __noreturn do_exit(long code)
 	if (tsk->io_context)
 		exit_io_context(tsk);
 
+	if (tsk->extend_map) {
+		unsigned long addr = (unsigned long)tsk->extend_map;
+
+		virt_to_page(addr)->mapping = NULL;
+		free_page(addr);
+	}
+
 	if (tsk->splice_pipe)
 		free_pipe_info(tsk->splice_pipe);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..da2214082d25 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1166,6 +1166,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->wake_q.next = NULL;
 	tsk->worker_private = NULL;
 
+	tsk->extend_map = NULL;
+
 	kcov_task_init(tsk);
 	kmsan_task_create(tsk);
 	kmap_local_fork(tsk);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 976092b7bd45..297061cfa08d 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -32,3 +32,4 @@ obj-y += core.o
 obj-y += fair.o
 obj-y += build_policy.o
 obj-y += build_utility.o
+obj-y += extend.o
diff --git a/kernel/sched/extend.c b/kernel/sched/extend.c
new file mode 100644
index 000000000000..a632e1a8f57b
--- /dev/null
+++ b/kernel/sched/extend.c
@@ -0,0 +1,90 @@
+#include <linux/kobject.h>
+#include <linux/pagemap.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+
+#ifdef CONFIG_SYSFS
+static ssize_t extend_sched_read(struct file *file,  struct kobject *kobj,
+				 struct bin_attribute *bin_attr,
+				 char *buf, loff_t off, size_t len)
+{
+	static const char output[] = "Extend scheduling time slice\n";
+
+	printk("%s:%d\n", __func__, __LINE__);
+	if (off >= sizeof(output))
+		return 0;
+
+	strscpy(buf, output + off, len);
+	return min((ssize_t)len, sizeof(output) - off - 1);
+}
+
+static ssize_t extend_sched_write(struct file *file, struct kobject *kobj,
+				  struct bin_attribute *bin_attr,
+				  char *buf, loff_t off, size_t len)
+{
+	printk("%s:%d\n", __func__, __LINE__);
+	return -EINVAL;
+}
+
+static vm_fault_t extend_sched_mmap_fault(struct vm_fault *vmf)
+{
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+
+	trace_printk("%s:%d\n", __func__, __LINE__);
+	/* Only has one page */
+	if (vmf->pgoff || !current->extend_map)
+		return ret;
+
+	vmf->page = virt_to_page(current->extend_map);
+
+	get_page(vmf->page);
+	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+	vmf->page->index   = vmf->pgoff;
+
+	return 0;
+}
+
+static void extend_sched_mmap_open(struct vm_area_struct *vma)
+{
+	printk("%s:%d\n", __func__, __LINE__);
+	WARN_ON(!current->extend_map);
+}
+
+static const struct vm_operations_struct extend_sched_vmops = {
+	.open		= extend_sched_mmap_open,
+	.fault		= extend_sched_mmap_fault,
+};
+
+static int extend_sched_mmap(struct file *file, struct kobject *kobj,
+			     struct bin_attribute *attr,
+			     struct vm_area_struct *vma)
+{
+	if (current->extend_map)
+		return -EBUSY;
+
+	current->extend_map = page_to_virt(alloc_page(GFP_USER | __GFP_ZERO));
+	if (!current->extend_map)
+		return -ENOMEM;
+
+	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP | VM_MAYWRITE, 0);
+	vma->vm_ops = &extend_sched_vmops;
+
+	return 0;
+}
+
+static struct bin_attribute extend_sched_attr = {
+	.attr = {
+		.name = "extend_sched",
+		.mode = 0777,
+	},
+	.read = &extend_sched_read,
+	.write = &extend_sched_write,
+	.mmap = &extend_sched_mmap,
+};
+
+static __init int extend_init(void)
+{
+	return sysfs_create_bin_file(kernel_kobj, &extend_sched_attr);
+}
+late_initcall(extend_init);
+#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700b140ac1bb..17ca22e80384 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -993,9 +993,10 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool
 		resched_curr(rq);
 	} else {
 		/* Did the task ignore the lazy reschedule request? */
-		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
+		if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) {
+			trace_printk("Force resched?\n");
 			resched_curr(rq);
-		else
+		} else
 			resched_curr_lazy(rq);
 	}
 	clear_buddies(cfs_rq, se);
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
>> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>>> > The problem with the REP prefix (and Xen hypercalls) is that
>>> > they're long running instructions and it becomes fundamentally
>>> > impossible to put a cond_resched() in.
>>> >
>>> >> Yes. I'm starting to think that that the only sane solution is to
>>> >> limit cases that can do this a lot, and the "instruciton pointer
>>> >> region" approach would certainly work.
>>> >
>>> > From a code locality / I-cache POV, I think a sorted list of
>>> > (non overlapping) ranges might be best.
>>>
>>> Yeah, agreed. There are a few problems with doing that though.
>>>
>>> I was thinking of using a check of this kind to schedule out when
>>> it is executing in this "reschedulable" section:
>>>         !preempt_count() && in_resched_function(regs->rip);
>>>
>>> For preemption=full, this should mostly work.
>>> For preemption=voluntary, though this'll only work with out-of-line
>>> locks, not if the lock is inlined.
>>>
>>> (Both, should have problems with __this_cpu_* and the like, but
>>> maybe we can handwave that away with sparse/objtool etc.)
>>
>> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
>> thing, and then only search the range when TIF flag is set.
>>
>> And I'm thinking it might be a good idea to have objtool validate the
>> range only contains simple instructions, the moment it contains control
>> flow I'm thinking it's too complicated.
>
> Can we take a step back and look at the problem from a scheduling
> perspective?
>
> The basic operation of a non-preemptible kernel is time slice
> scheduling, which means that a task can run more or less undisturbed for
> a full time slice once it gets on the CPU unless it schedules away
> voluntary via a blocking operation.
>
> This works pretty well as long as everything runs in userspace as the
> preemption points in the return to user space path are independent of
> the preemption model.
>
> These preemption points handle both time slice exhaustion and priority
> based preemption.
>
> With PREEMPT=NONE these are the only available preemption points.
>
> That means that kernel code can run more or less indefinitely until it
> schedules out or returns to user space, which is obviously not possible
> for kernel threads.
>
> To prevent starvation the kernel gained voluntary preemption points,
> i.e. cond_resched(), which has to be added manually to code as a
> developer sees fit.
>
> Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as
> additional preemption points. might_resched() utilizes the existing
> might_sleep() debug points, which are in code paths which might block on
> a contended resource. These debug points are mostly in core and
> infrastructure code and are in code paths which can block anyway. The
> only difference is that they allow preemption even when the resource is
> uncontended.
>
> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preempt_count as a potential preemption point.
>
> Now we have the situation of long running data copies or data clear
> operations which run fully in hardware, but can be interrupted. As the
> interrupt return to kernel mode does not preempt in the NONE and
> VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
> chunk size and adding cond_resched() again.
>
> That's ugly and does not work for long lasting hardware operations so we
> ended up with the suggestion of TIF_ALLOW_RESCHED to work around
> that. But again this needs to be manually annotated in the same way as an
> IP range based preemption scheme requires annotation.
>
> TBH. I detest all of this.
>
> Both cond_resched() and might_sleep/sched() are completely random
> mechanisms as seen from time slice operation and the data chunk based
> mechanism is just heuristics which works as good as heuristics tend to
> work. allow_resched() is not any different and IP based preemption
> mechanism are not going to be any better.

Agreed. I was looking at how to add resched sections etc., and in
addition to the randomness, the choice of where exactly to add them seemed
to be quite fuzzy. A recipe for future cruft.

> The approach here is: Prevent the scheduler to make decisions and then
> mitigate the fallout with heuristics.
>
> That's just backwards as it moves resource control out of the scheduler
> into random code which has absolutely no business to do resource
> control.
>
> We have the reverse issue observed in PREEMPT_RT. The fact that spinlock
> held sections became preemptible caused even more preemption activity
> than on a PREEMPT=FULL kernel. The worst side effect of that was
> extensive lock contention.
>
> The way how we addressed that was to add a lazy preemption mode, which
> tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to
> preempt tasks which all belong to the SCHED_OTHER scheduling class. This
> works pretty well and gains back a massive amount of performance for the
> non-realtime throughput oriented tasks without affecting the
> schedulability of real-time tasks at all. IOW, it does not take control
> away from the scheduler. It cooperates with the scheduler and leaves the
> ultimate decisions to it.
>
> I think we can do something similar for the problem at hand, which
> avoids most of these heuristic horrors and control boundary violations.
>
> The main issue is that long running operations do not honour the time
> slice and we work around that with cond_resched() and now have ideas
> with this new TIF bit and IP ranges.
>
> None of that is really well defined in respect to time slices. In fact
> its not defined at all versus any aspect of scheduling behaviour.
>
> What about the following:
>
>    1) Keep preemption count and the real preemption points enabled
>       unconditionally. That's not more overhead than the current
>       DYNAMIC_PREEMPT mechanism as long as the preemption count does not
>       go to zero, i.e. the folded NEED_RESCHED bit stays set.
>
>       From earlier experiments I know that the overhead of preempt_count
>       is minimal and only really observable with micro benchmarks.
>       Otherwise it ends up in the noise as long as the slow path is not
>       taken.
>
>       I did a quick check comparing a plain inc/dec pair vs. the
>       DYNAMIC_PREEMPT inc/dec_and_test+NOOP mechanism and the delta is
>       in the non-conclusive noise.
>
>       20 years ago this was a real issue because we did not have:
>
>        - the folding of NEED_RESCHED into the preempt count
>
>        - the cacheline optimizations which make the preempt count cache
>          pretty much always cache hot
>
>        - the hardware was way less capable
>
>       I'm not saying that preempt_count is completely free today as it
>       obviously adds more text and affects branch predictors, but as the
>       major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
>       acceptable and tolerable tradeoff.
>
>    2) When the scheduler wants to set NEED_RESCHED due it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.
>
>       As NEED_RESCHED_LAZY is not folded into the preemption count the
>       preemption count won't become zero, so the task can continue until
>       it hits return to user space.
>
>       That preserves the existing behaviour.
>
>    3) When the scheduler tick observes that the time slice is exhausted,
>       then it folds the NEED_RESCHED bit into the preempt count which
>       causes the real preemption points to actually preempt including
>       the return from interrupt to kernel path.

Right, and currently we check cond_resched() all the time in expectation
that something might need a resched.

Folding it in, with the scheduler determining when the next preemption happens,
seems to make a lot of sense to me.


Thanks
Ankur

>       That even allows the scheduler to enforce preemption for e.g. RT
>       class tasks without changing anything else.
>
>       I'm pretty sure that this gets rid of cond_resched(), which is an
>       impressive list of instances:
>
> 	./drivers	 392
> 	./fs		 318
> 	./mm		 189
> 	./kernel	 184
> 	./arch		  95
> 	./net		  83
> 	./include	  46
> 	./lib		  36
> 	./crypto	  16
> 	./sound		  16
> 	./block		  11
> 	./io_uring	  13
> 	./security	  11
> 	./ipc		   3
>
>       That list clearly documents that the majority of these
>       cond_resched() invocations is in code which neither should care
>       nor should have any influence on the core scheduling decision
>       machinery.
>
> I think it's worth a try as it just fits into the existing preemption
> scheme, solves the issue of long running kernel functions, prevents
> invalid preemption and can utilize the existing instrumentation and
> debug infrastructure.
>
> Most importantly it gives control back to the scheduler and does not
> make it depend on the mercy of cond_resched(), allow_resched() or
> whatever heuristics sprinkled all over the kernel.

> To me this makes a lot of sense, but I might be on the completely wrong
> track. Se feel free to tell me that I'm completely nuts and/or just not
> seeing the obvious.
>
> Thanks,
>
>         tglx


--
ankur
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ingo Molnar 2 years, 3 months ago
* Thomas Gleixner <tglx@linutronix.de> wrote:

> Additionally we have PREEMPT=FULL which utilizes every zero transition
> of preempt_count as a potential preemption point.

Just to complete this nice new entry to Documentation/sched/: in 
PREEMPT=FULL there's also IRQ-return driven preemption of kernel-mode code, 
at almost any instruction boundary the hardware allows, in addition to the 
preemption driven by regular zero transition of preempt_count in 
syscall/kthread code.

Thanks,

	Ingo
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> What about the following:
>
>    1) Keep preemption count and the real preemption points enabled
>       unconditionally.

Well, it's certainly the simplest solution, and gets rid of not just
the 'rep string' issue, but gets rid of all the cond_resched() hackery
entirely.

>       20 years ago this was a real issue because we did not have:
>
>        - the folding of NEED_RESCHED into the preempt count
>
>        - the cacheline optimizations which make the preempt count cache
>          pretty much always cache hot
>
>        - the hardware was way less capable
>
>       I'm not saying that preempt_count is completely free today as it
>       obviously adds more text and affects branch predictors, but as the
>       major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
>       acceptable and tolerable tradeoff.

Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in
most distros does speak for just admitting that the PREEMPT_NONE /
VOLUNTARY approach isn't actually used, and is only causing pain.

>    2) When the scheduler wants to set NEED_RESCHED due it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.

Is this just to try to emulate the existing PREEMPT_NONE behavior?

If the new world order is that the time slice is always honored, then
the "this might be a latency issue" goes away. Good.

And we'd also get better coverage for the *debug* aim of
"might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
PREEMPT_COUNT always existing.

But because the latency argument is gone, the "might_resched()" should
then just be removed entirely from "might_sleep()", so that
might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.

That argues for your suggestion too, since we had a performance issue
due to "might_sleep()" _not_ being just a debug thing, and pointlessly
causing a reschedule in a place where reschedules were _allowed_, but
certainly much less than optimal.

Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
'might_sleep()' in get_mmap_lock_carefully()").

However, that does bring up an issue: even with full preemption, there
are certainly places where we are *allowed* to schedule (when the
preempt count is zero), but there are also some places that are
*better* than other places to schedule (for example, when we don't
hold any other locks).

So, I do think that if we just decide to go "let's just always be
preemptible", we might still have points in the kernel where
preemption might be *better* than in others points.

But none of might_resched(), might_sleep() _or_ cond_resched() are
necessarily that kind of "this is a good point" thing. They come from
a different background.

So what I think you are saying is that we'd have the following situation:

 - scheduling at "return to user space" is presumably always a good thing.

A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
whatever) would cover that, and would give us basically the existing
CONFIG_PREEMPT_NONE behavior.

So a config variable (either compile-time with PREEMPT_NONE or a
dynamic one with DYNAMIC_PREEMPT set to none) would make any external
wakeup only set that bit.

And then a "fully preemptible low-latency desktop" would set the
preempt-count bit too.

 - but the "timeslice over" case would always set the
preempt-count-bit, regardless of any config, and would guarantee that
we have reasonable latencies.

This all makes cond_resched() (and might_resched()) pointless, and
they can just go away.

Then the question becomes whether we'd want to introduce a *new*
concept, which is a "if you are going to schedule, do it now rather
than later, because I'm taking a lock, and while it's a preemptible
lock, I'd rather not sleep while holding this resource".

I suspect we want to avoid that for now, on the assumption that it's
hopefully not a problem in practice (the recently addressed problem
with might_sleep() was that it actively *moved* the scheduling point
to a bad place, not that scheduling could happen there, so instead of
optimizing scheduling, it actively pessimized it). But I thought I'd
mention it.

Anyway, I'm definitely not opposed. We'd get rid of a config option
that is presumably not very widely used, and we'd simplify a lot of
issues, and get rid of all these badly defined "cond_preempt()"
things.

                Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ingo Molnar 2 years, 3 months ago
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > What about the following:
> >
> >    1) Keep preemption count and the real preemption points enabled
> >       unconditionally.
> 
> Well, it's certainly the simplest solution, and gets rid of not just
> the 'rep string' issue, but gets rid of all the cond_resched() hackery
> entirely.
> 
> >       20 years ago this was a real issue because we did not have:
> >
> >        - the folding of NEED_RESCHED into the preempt count
> >
> >        - the cacheline optimizations which make the preempt count cache
> >          pretty much always cache hot
> >
> >        - the hardware was way less capable
> >
> >       I'm not saying that preempt_count is completely free today as it
> >       obviously adds more text and affects branch predictors, but as the
> >       major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> >       acceptable and tolerable tradeoff.
> 
> Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most 
> distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY 
> approach isn't actually used, and is only causing pain.

The macro-behavior of NONE/VOLUNTARY is still used & relied upon in server 
distros - and that's the behavior that enterprise distros truly cared 
about.

Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the 'noise' 
category for all major distros I'd say.

And that's what Thomas's proposal achieves: keep the nicely execution-batched 
NONE/VOLUNTARY scheduling behavior for SCHED_OTHER tasks, while having the 
latency advantages of fully-preemptible kernel code for RT and critical 
tasks.

So I'm fully on board with this. It would reduce the number of preemption 
variants to just two: regular kernel and PREEMPT_RT. Yummie!

> >    2) When the scheduler wants to set NEED_RESCHED due it sets
> >       NEED_RESCHED_LAZY instead which is only evaluated in the return to
> >       user space preemption points.
> 
> Is this just to try to emulate the existing PREEMPT_NONE behavior?

Yes: I'd guesstimate that the batching caused by timeslice-laziness that is 
naturally part of NONE/VOLUNTARY resolves ~90%+ of observable 
macro-performance regressions between NONE/VOLUNTARY and PREEMPT/RT.

> If the new world order is that the time slice is always honored, then the 
> "this might be a latency issue" goes away. Good.
> 
> And we'd also get better coverage for the *debug* aim of "might_sleep()" 
> and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on PREEMPT_COUNT always 
> existing.
> 
> But because the latency argument is gone, the "might_resched()" should 
> then just be removed entirely from "might_sleep()", so that might_sleep() 
> would *only* be that DEBUG_ATOMIC_SLEEP thing.

Correct. And that's even a minor code generation advantage, as we wouldn't 
have these additional hundreds of random/statistical preemption checks.

> That argues for your suggestion too, since we had a performance issue due 
> to "might_sleep()" _not_ being just a debug thing, and pointlessly 
> causing a reschedule in a place where reschedules were _allowed_, but 
> certainly much less than optimal.
> 
> Which then caused that fairly recent commit 4542057e18ca ("mm: avoid 
> 'might_sleep()' in get_mmap_lock_carefully()").

4542057e18ca is arguably kind of a workaround though - and with the 
preempt_count + NEED_RESCHED_LAZY approach we'd have both the latency 
advantages *and* the execution-batching performance advantages of 
NONE/VOLUNTARY that 4542057e18ca exposed.

> However, that does bring up an issue: even with full preemption, there 
> are certainly places where we are *allowed* to schedule (when the preempt 
> count is zero), but there are also some places that are *better* than 
> other places to schedule (for example, when we don't hold any other 
> locks).
> 
> So, I do think that if we just decide to go "let's just always be 
> preemptible", we might still have points in the kernel where preemption 
> might be *better* than in others points.

So in the broadest sense we have 3 stages of pending preemption:

   NEED_RESCHED_LAZY
   NEED_RESCHED_SOON
   NEED_RESCHED_NOW

And we'd transition:

  - from    0 -> SOON when an eligible task is woken up,
  - from LAZY -> SOON when current timeslice is exhausted,
  - from SOON -> NOW  when no locks/resources are held.

  [ With a fast-track for RT or other urgent tasks to enter NOW immediately. ]
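
A sketch of those transitions (illustrative only; the SOON/NOW stages do
not exist as flags today):

	enum resched_stage { RESCHED_0, RESCHED_LAZY, RESCHED_SOON, RESCHED_NOW };

	static enum resched_stage advance(enum resched_stage s, bool wakeup,
					  bool slice_exhausted, bool locks_held,
					  bool urgent)
	{
		if (urgent)				/* RT etc. fast-track */
			return RESCHED_NOW;
		if (s == RESCHED_0 && wakeup)
			return RESCHED_SOON;
		if (s == RESCHED_LAZY && slice_exhausted)
			return RESCHED_SOON;
		if (s == RESCHED_SOON && !locks_held)
			return RESCHED_NOW;
		return s;
	}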

On the regular kernels it's probably not worth modeling the SOON/NOW split, 
as we'd have to track the depth of sleeping locks as well, which we don't 
do right now.

On PREEMPT_RT the SOON/NOW distinction possibly makes sense, as there we 
are aware of locking depth already and it would be relatively cheap to 
check for it on natural 0-preempt_count boundaries.


> But none of might_resched(), might_sleep() _or_ cond_resched() are 
> necessarily that kind of "this is a good point" thing. They come from a 
> different background.

Correct, they come from two sources:

 - There are hundreds of points that we know are 'technically correct' 
   preemption points, and they break up ~90% of long latencies by brute 
   force & chance.

 - Explicitly identified problem points that added a cond_resched() or its 
   equivalent. These are rare and also tend to bitrot, because *removing* 
   them is always more risky than adding them, so they tend to accumulate.

> So what I think what you are saying is that we'd have the following 
> situation:
> 
>  - scheduling at "return to user space" is presumably always a good thing.
> 
> A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
> whatever) would cover that, and would give us basically the existing
> CONFIG_PREEMPT_NONE behavior.
> 
> So a config variable (either compile-time with PREEMPT_NONE or a
> dynamic one with DYNAMIC_PREEMPT set to none) would make any external
> wakeup only set that bit.
> 
> And then a "fully preemptible low-latency desktop" would set the
> preempt-count bit too.

I'd even argue that we only need two preemption modes, and that 'fully 
preemptible low-latency desktop' is an artifact of poor latencies on 
PREEMPT_NONE.

Ie. in the long run - after a careful period of observing performance 
regressions and other dragons - we'd only have *two* preemption modes left:

   !PREEMPT_RT     # regular kernel. Single default behavior.
   PREEMPT_RT=y    # -rt kernel, because rockets, satellites & cars matter.

Any other application level preemption preferences can be expressed via 
scheduling policies & priorities.

Nothing else. We don't need PREEMPT_DYNAMIC, PREEMPT_VOLUNTARY or 
PREEMPT_NONE in any of their variants, probably not even as runtime knobs.

People who want shorter timeslices can set shorter timeslices, and people 
who want immediate preemption of certain tasks can manage priorities.

>  - but the "timeslice over" case would always set the preempt-count-bit, 
> regardless of any config, and would guarantee that we have reasonable 
> latencies.

Yes. Probably a higher nice-priority task becoming runnable would cause 
immediate preemption too, in addition to RT tasks.

Ie. the execution batching would be for same-priority groups of SCHED_OTHER 
tasks.

> This all makes cond_resched() (and might_resched()) pointless, and
> they can just go away.

Yep.

> Then the question becomes whether we'd want to introduce a *new* concept, 
> which is a "if you are going to schedule, do it now rather than later, 
> because I'm taking a lock, and while it's a preemptible lock, I'd rather 
> not sleep while holding this resource".

Something close to this concept is naturally available on PREEMPT_RT 
kernels, which only use a single central lock primitive (rt_mutex), but it 
would have to be added explicitly for regular kernels.

We could do the following intermediate step:

 - Remove all the random cond_resched() points such as might_sleep()
 - Turn all explicit cond_resched() points into 'ideal point to reschedule'.

 - Maybe even rename it from cond_resched() to resched_point(), to signal 
   the somewhat different role.

While cond_resched() and resched_point() are not 100% matches, they are 
close enough, as most existing cond_resched() points were added to places 
that cause the least amount of disruption with held resources.

But I think it would be better to add resched_point() as a new API, and add 
it to places where there's a performance benefit. Clean slate, 
documentation, and all that.

> I suspect we want to avoid that for now, on the assumption that it's 
> hopefully not a problem in practice (the recently addressed problem with 
> might_sleep() was that it actively *moved* the scheduling point to a bad 
> place, not that scheduling could happen there, so instead of optimizing 
> scheduling, it actively pessimized it). But I thought I'd mention it.
> 
> Anyway, I'm definitely not opposed. We'd get rid of a config option that 
> is presumably not very widely used, and we'd simplify a lot of issues, 
> and get rid of all these badly defined "cond_preempt()" things.

I think we can get rid of *all* the preemption model Kconfig knobs, except 
PREEMPT_RT. :-)

Thanks,

	Ingo
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Thomas Gleixner 2 years, 3 months ago
Ingo!

On Tue, Sep 19 2023 at 10:03, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> Then the question becomes whether we'd want to introduce a *new* concept, 
>> which is a "if you are going to schedule, do it now rather than later, 
>> because I'm taking a lock, and while it's a preemptible lock, I'd rather 
>> not sleep while holding this resource".
>
> Something close to this concept is naturally available on PREEMPT_RT 
> kernels, which only use a single central lock primitive (rt_mutex), but it 
> would have be added explicitly for regular kernels.
>
> We could do the following intermediate step:
>
>  - Remove all the random cond_resched() points such as might_sleep()
>  - Turn all explicit cond_resched() points into 'ideal point to reschedule'.
>
>  - Maybe even rename it from cond_resched() to resched_point(), to signal 
>    the somewhat different role.
>
> While cond_resched() and resched_point() are not 100% matches, they are 
> close enough, as most existing cond_resched() points were added to places 
> that cause the least amount of disruption with held resources.
>
> But I think it would be better to add resched_point() as a new API, and add 
> it to places where there's a performance benefit. Clean slate, 
> documentation, and all that.

Let's not go there. You just replace one magic mushroom with a different
flavour. We want to get rid of them completely.

The whole point is to let the scheduler decide and give it enough
information to make informed decisions.

So with the LAZY scheme in effect, there is no real reason to have these
extra points, and I'd rather add task::sleepable_locks_held and do that
accounting in the relevant lock/unlock paths. Based on that, the
scheduler can decide whether it grants a time slice expansion or just
says no.

That's extremely cheap and well defined.
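
As a sketch (the field and the scheduler-side hook are invented names,
nothing of this exists yet):

	/* Lock/unlock paths of sleeping locks: */
	static inline void sleepable_lock_acquire(void)
	{
		current->sleepable_locks_held++;
	}

	static inline void sleepable_lock_release(void)
	{
		current->sleepable_locks_held--;
	}

	/* Scheduler side, when deciding whether to grant an extension: */
	static bool grant_slice_extension(struct task_struct *p)
	{
		return p->sleepable_locks_held > 0;
	}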

You can document the hell out of resched_point(), but it won't be any
different from the existing ones: always subject to personal
preference and goals, and it's going to be sprinkled all over the place
just like the existing ones. So where is the gain?

Thanks,

        tglx
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ingo Molnar 2 years, 3 months ago
* Ingo Molnar <mingo@kernel.org> wrote:

> > Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most 
> > distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY 
> > approach isn't actually used, and is only causing pain.
> 
> The macro-behavior of NONE/VOLUNTARY is still used & relied upon in 
> server distros - and that's the behavior that enterprise distros truly 
> cared about.
> 
> Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the 
> 'noise' category for all major distros I'd say.
> 
> And that's what Thomas's proposal achieves: keep the nicely 
> execution-batched NONE/VOLUNTARY scheduling behavior for SCHED_OTHER 
> tasks, while having the latency advantages of fully-preemptible kernel 
> code for RT and critical tasks.
> 
> So I'm fully on board with this. It would reduce the number of preemption 
> variants to just two: regular kernel and PREEMPT_RT. Yummie!

As an additional side note: with various changes such as EEVDF the 
scheduler is a lot less preemption-happy these days, without wrecking 
latencies & timeslice distribution.

So in principle we might not even need the NEED_RESCHED_LAZY extra bit, 
which -rt uses as a kind of additional layer to make sure they don't change 
scheduling policy.

Ie. a modern scheduler might have mooted much of this change:

   4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()")

... because now we'll only reschedule on timeslice exhaustion, or if a task 
comes in with a big deadline deficit.

And even the deadline-deficit wakeup preemption can be turned off further 
with:

    $ echo NO_WAKEUP_PREEMPTION > /debug/sched/features

And we are considering making that the default behavior for same-prio tasks 
- basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which 
should be quite similar to what NEED_RESCHED_LAZY achieves on -rt.

Thanks,

	Ingo
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Thomas Gleixner 2 years, 3 months ago
On Tue, Sep 19 2023 at 10:43, Ingo Molnar wrote:
> * Ingo Molnar <mingo@kernel.org> wrote:
> Ie. a modern scheduler might have mooted much of this change:
>
>    4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()")
>
> ... because now we'll only reschedule on timeslice exhaustion, or if a task 
> comes in with a big deadline deficit.
>
> And even the deadline-deficit wakeup preemption can be turned off further 
> with:
>
>     $ echo NO_WAKEUP_PREEMPTION > /debug/sched/features
>
> And we are considering making that the default behavior for same-prio tasks 
> - basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which 
> should be quite similar to what NEED_RESCHED_LAZY achieves on -rt.

I don't think that you can get rid of NEED_RESCHED_LAZY for !RT because
there is a clear advantage of having the return to user preemption
point.

It spares having the kernel/user transition just to get the task back
via the timeslice interrupt. I experimented with that on RT and the
result was definitely worse.

We surely can revisit that, but I'd really start with the straightforward
mappable LAZY bit approach, and if experimentation turns out to
provide good enough results without setting that bit at all, then we
can still do so without changing anything except the core scheduler
decision logic.

It's again a cheap thing due to the way the return-to-user TIF
handling works:

	ti_work = read_thread_flags();
	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
		ti_work = exit_to_user_mode_loop(regs, ti_work);

TIF_LAZY_RESCHED is part of EXIT_TO_USER_MODE_WORK, so the non-work case
does not become more expensive than today. If any of the bits is set,
then the slowpath won't get measurably different performance whether the bit
is evaluated or not in exit_to_user_mode_loop().

As we really want TIF_LAZY_RESCHED for RT, we just keep all of this
consistent in terms of code and make it purely a scheduler decision whether
it utilizes it or not. As a consequence PREEMPT_RT is no longer special in
that regard and the main RT difference becomes the lock substitution and
forced interrupt threading.

For the magic 'spare me the extra conditional' optimization of
exit_to_user_mode_loop() if LAZY can be optimized out for !RT because
the scheduler is sooo clever (which I doubt), we can just use the same
approach as for other TIF bits and define them to 0 :)
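
I.e. the usual TIF trick, roughly (the config symbol below is made up for
the sketch):

	#ifdef CONFIG_WANT_NEED_RESCHED_LAZY		/* made-up symbol */
	# define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
	#else
	# define _TIF_NEED_RESCHED_LAZY	0	/* checks compile away */
	#endif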

So let's start consistent and optimize on top if really required.

Thanks,

        tglx
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Matthew Wilcox 2 years, 3 months ago
On Tue, Sep 12, 2023 at 10:26:06AM +0200, Peter Zijlstra wrote:
> > How expensive would be always having PREEMPT_COUNT=y?
> 
> Effectively I think that is true today. At the very least Debian and
> SuSE (I can't find a RHEL .config in a hurry but I would think they too)
> ship with PREEMPT_DYNAMIC=y.

$ grep PREEMPT uek-rpm/ol9/config-x86_64
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

$ grep PREEMPT uek-rpm/ol9/config-aarch64
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Phil Auld 2 years, 3 months ago
On Tue, Sep 12, 2023 at 10:26:06AM +0200 Peter Zijlstra wrote:
> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
> > 
> > How expensive would be always having PREEMPT_COUNT=y?
> 
> Effectively I think that is true today. At the very least Debian and
> SuSE (I can't find a RHEL .config in a hurry but I would think they too)
> ship with PREEMPT_DYNAMIC=y.
>

Yes, RHEL too.

Cheers,
Phil

--
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by andrew.cooper3@citrix.com 2 years, 3 months ago
On 11/09/2023 4:04 pm, Peter Zijlstra wrote:
> On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote:
>
>> I was hoping that we'd have some generic way to deal with this where
>> we could just say "this thing is reschedulable", and get rid of - or
>> at least not increasingly add to - the cond_resched() mess.
> Isn't that called PREEMPT=y ? That tracks precisely all the constraints
> required to know when/if we can preempt.
>
> The whole voluntary preempt model is basically the traditional
> co-operative preemption model and that fully relies on manual yields.
>
> The problem with the REP prefix (and Xen hypercalls) is that
> they're long running instructions and it becomes fundamentally
> impossible to put a cond_resched() in.

Any VMM - Xen isn't special here.

And if we're talking about instructions, then CPUID, GETSEC and
ENCL{S,U} and plenty of {RD,WR}MSRs are in a similar category, being
effectively blocking RPC operations to something else in the platform.

The Xen evtchn upcall logic in Linux does cond_resched() when possible. 
i.e. long-running hypercalls issued with interrupts enabled can
reschedule if an interrupt occurs, which is pretty close to how REP
works too.

~Andrew
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Which on a PREEMPT_COUNT=n build will cause preemption while holding the
> spinlock. I think something like the below will cause sufficient
> warnings to avoid growing patterns like that.

Hmm. I don't think that warning is valid.

Disabling preemption is actually fine if it's done in an interrupt,
iow if we have

        allow_resched();
           -> irq happens
                spin_lock();  // Ok and should *not* complain
                ...
                spin_unlock();
            <- irq return (and preemption)

which actually makes me worry about the nested irq case, because this
would *not* be ok:

        allow_resched();
           -> irq happens
                -> *nested* irq happens
                <- nested irq return (and preemption)

ie the allow_resched() needs to still honor the irq count, and a
nested irq return obviously must not cause any preemption.

I've lost sight of the original patch series, and I assume / hope that
the above isn't actually an issue, but exactly because I've lost sight
of the original patches and only have this one in my mailbox I wanted
to check.

            Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Which on a PREEMPT_COUNT=n build will cause preemption while holding the
>> spinlock. I think something like the below will cause sufficient
>> warnings to avoid growing patterns like that.
>
> Hmm. I don't think that warning is valid.
>
> Disabling preemption is actually fine if it's done in an interrupt,
> iow if we have
>
>         allow_resched();
>            -> irq happens
>                 spin_lock();  // Ok and should *not* complain
>                 ...
>                 spin_unlock();
>             <- irq return (and preemption)
>
> which actually makes me worry about the nested irq case, because this
> would *not* be ok:
>
>         allow_resched();
>            -> irq happens
>                 -> *nested* irq happens
>                 <- nested irq return (and preemption)
>
> ie the allow_resched() needs to still honor the irq count, and a
> nested irq return obviously must not cause any preemption.

IIUC, this should be equivalent to:

01         allow_resched();
02            -> irq happens
03               preempt_count_add(HARDIRQ_OFFSET);
04                -> nested irq happens
05                   preempt_count_add(HARDIRQ_OFFSET);
06
07                   preempt_count_sub(HARDIRQ_OFFSET);
08                 <- nested irq return
09               preempt_count_sub(HARDIRQ_OFFSET);

So, even if there were nested interrupts, then the !preempt_count()
check in raw_irqentry_exit_cond_resched() should ensure that no
preemption happens until after line 09.

> I've lost sight of the original patch series, and I assume / hope that
> the above isn't actually an issue, but exactly because I've lost sight
> of the original patches and only have this one in my mailbox I wanted
> to check.

Yeah, sorry about that. The irqentry_exit_allow_resched() is pretty much
this:

+void irqentry_exit_allow_resched(void)
+{
+	if (resched_allowed())
+		raw_irqentry_exit_cond_resched();
+}

So, as long as raw_irqentry_exit_cond_resched() won't allow early
preemption, having allow_resched() set shouldn't either.
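
For reference, the gate in question is roughly this (simplified from
kernel/entry/common.c, which also has the RCU and stack sanity checks):

	void raw_irqentry_exit_cond_resched(void)
	{
		if (!preempt_count()) {		/* no HARDIRQ_OFFSET etc. pending */
			if (need_resched())
				preempt_schedule_irq();
		}
	}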

--
ankur
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Peter Zijlstra 2 years, 3 months ago
On Fri, Sep 08, 2023 at 10:30:57PM -0700, Ankur Arora wrote:

> > which actually makes me worry about the nested irq case, because this
> > would *not* be ok:
> >
> >         allow_resched();
> >            -> irq happens
> >                 -> *nested* irq happens
> >                 <- nested irq return (and preemption)
> >
> > ie the allow_resched() needs to still honor the irq count, and a
> > nested irq return obviously must not cause any preemption.
> 
> IIUC, this should be equivalent to:
> 
> 01         allow_resched();
> 02            -> irq happens
> 03               preempt_count_add(HARDIRQ_OFFSET);
> 04                -> nested irq happens
> 05                   preempt_count_add(HARDIRQ_OFFSET);
> 06
> 07                   preempt_count_sub(HARDIRQ_OFFSET);
> 08                 <- nested irq return
> 09               preempt_count_sub(HARDIRQ_OFFSET);
> 
> So, even if there were nested interrupts, then the !preempt_count()
> check in raw_irqentry_exit_cond_resched() should ensure that no
> preemption happens until after line 09.

Yes, this.
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Peter Zijlstra 2 years, 3 months ago
On Fri, Sep 08, 2023 at 10:15:07AM -0700, Linus Torvalds wrote:
> On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Which on a PREEMPT_COUNT=n build will cause preemption while holding the
> > spinlock. I think something like the below will cause sufficient
> > warnings to avoid growing patterns like that.
> 
> Hmm. I don't think that warning is valid.
> 
> Disabling preemption is actually fine if it's done in an interrupt,
> iow if we have
> 
>         allow_resched();
>            -> irq happens
>                 spin_lock();  // Ok and should *not* complain
>                 ...
>                 spin_unlock();
>             <- irq return (and preemption)

Indeed.

> 
> which actually makes me worry about the nested irq case, because this
> would *not* be ok:
> 
>         allow_resched();
>            -> irq happens
>                 -> *nested* irq happens
>                 <- nested irq return (and preemption)
> 
> ie the allow_resched() needs to still honor the irq count, and a
> nested irq return obviously must not cause any preemption.

I think we killed nested interrupts a fair number of years ago, but I'll
recheck -- but not today, sleep is imminent.
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Linus Torvalds 2 years, 3 months ago
On Fri, 8 Sept 2023 at 15:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > which actually makes me worry about the nested irq case, because this
> > would *not* be ok:
> >
> >         allow_resched();
> >            -> irq happens
> >                 -> *nested* irq happens
> >                 <- nested irq return (and preemption)
> >
> > ie the allow_resched() needs to still honor the irq count, and a
> > nested irq return obviously must not cause any preemption.
>
> I think we killed nested interrupts a fair number of years ago, but I'll
> recheck -- but not today, sleep is imminent.

I don't think it has to be an interrupt. I think the TIF_ALLOW_RESCHED
thing needs to look out for any nested exception (ie only ever trigger
if it's returning to the kernel "task" stack).

Because I could easily see us wanting to do "I'm doing a big user
copy, it should do TIF_ALLOW_RESCHED, and I don't have preemption on",
and then instead of that first "irq happens", you have "page fault
happens" instead.

And inside that page fault handling you may well have critical
sections (like a spinlock) that is fine - but the fact that the
"process context" had TIF_ALLOW_RESCHED most certainly does *not* mean
that the page fault handler can reschedule.

Maybe it already does. As mentioned, I lost sight of the patch series,
even though I saw it originally (and liked it - only realizing on your
complaint that it might be more dangerous than I thought).

Basically, the "allow resched" should be a marker for a single context
level only. Kind of like a register state bit that gets saved on the
exception stack. Not an "anything happening within this process is now
preemptible".

I'm hoping Ankur will just pipe in and say "of course I already
implemented it that way, see XYZ".

              Linus
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, 8 Sept 2023 at 15:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > which actually makes me worry about the nested irq case, because this
>> > would *not* be ok:
>> >
>> >         allow_resched();
>> >            -> irq happens
>> >                 -> *nested* irq happens
>> >                 <- nested irq return (and preemption)
>> >
>> > ie the allow_resched() needs to still honor the irq count, and a
>> > nested irq return obviously must not cause any preemption.
>>
>> I think we killed nested interrupts a fair number of years ago, but I'll
>> recheck -- but not today, sleep is imminent.
>
> I don't think it has to be an interrupt. I think the TIF_ALLOW_RESCHED
> thing needs to look out for any nested exception (ie only ever trigger
> if it's returning to the kernel "task" stack).
>
> Because I could easily see us wanting to do "I'm doing a big user
> copy, it should do TIF_ALLOW_RESCHED, and I don't have preemption on",
> and then instead of that first "irq happens", you have "page fault
> happens" instead.
>
> And inside that page fault handling you may well have critical
> sections (like a spinlock), which is fine - but the fact that the
> "process context" had TIF_ALLOW_RESCHED most certainly does *not* mean
> that the page fault handler can reschedule.
>
> Maybe it already does. As mentioned, I lost sight of the patch series,
> even though I saw it originally (and liked it - only realizing on your
> complaint that it might be more dangerous than I thought).
>
> Basically, the "allow resched" should be a marker for a single context
> level only. Kind of like a register state bit that gets saved on the
> exception stack. Not an "anything happening within this process is now
> preemptible".

Yeah, exactly. Though, not even a single context level, but a flag
attached to a single context at the process level only. Using
preempt_count() == 0 as the preemption boundary.

However, this has a problem with the PREEMPT_COUNT=n case because that
doesn't have a preemption boundary.

In the example that Peter gave:

        allow_resched();
        spin_lock();
            -> irq happens
            <- irq returns

            ---> preemption happens
        spin_unlock();
        disallow_resched();

So, here the !preempt_count() clause in raw_irqentry_exit_cond_resched()
won't protect us.

My thinking was to restrict allow_resched() to be used only around
primitive operations. But, I couldn't think of any way to enforce that.

I think the warning in preempt_count_add() as Peter suggested
upthread is a good idea. But, that's only for CONFIG_DEBUG_PREEMPT.
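
As a sketch of that idea (not Peter's exact diff; debug_resched_allowed_check()
is a made-up helper that preempt_count_add() would call):

/*
 * Only exists when preempt_count_add() is a real function, i.e. with
 * CONFIG_DEBUG_PREEMPT: warn if a task-context critical section starts
 * while TIF_RESCHED_ALLOW is set, since on a PREEMPT_COUNT=n build that
 * section would not actually hold off preemption.
 */
static inline void debug_resched_allowed_check(void)
{
        if (IS_ENABLED(CONFIG_DEBUG_PREEMPT) &&
            !(preempt_count() & (HARDIRQ_MASK | NMI_MASK)))
                WARN_ON_ONCE(resched_allowed());
}

That would at least keep the spin_lock()-under-allow_resched() pattern
from growing in debug builds, even though production PREEMPT_COUNT=n
kernels can't check for it.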


--
ankur
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Peter Zijlstra 2 years, 3 months ago
On Fri, Sep 08, 2023 at 11:39:47PM -0700, Ankur Arora wrote:

> Yeah, exactly. Though, not even a single context level, but a flag
> attached to a single context at the process level only. Using
> preempt_count() == 0 as the preemption boundary.
> 
> However, this has a problem with the PREEMPT_COUNT=n case because that
> doesn't have a preemption boundary.

So, with a little sleep, the nested exception/interrupt case should be
good: irqentry_enter() / irqentry_nmi_enter() unconditionally increment
preempt_count with HARDIRQ_OFFSET / NMI_OFFSET.

So while regular preempt_{dis,en}able() will turn into a NOP, the entry
code *will* continue to increment preempt_count.
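
Roughly, as a sketch (the example_* wrappers are made up; the real
increments live in the __irq_enter_raw() / irqentry_nmi_enter() paths):

static inline void example_irq_enter(void)
{
        /* maintained even when PREEMPT_COUNT=n elides preempt_disable() */
        __preempt_count_add(HARDIRQ_OFFSET);
}

static inline void example_irq_exit(void)
{
        /* mirrored on the exit path */
        __preempt_count_sub(HARDIRQ_OFFSET);
}

So the !preempt_count() test in raw_irqentry_exit_cond_resched() still
distinguishes a nested entry from a return to task context.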
Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Posted by Ankur Arora 2 years, 3 months ago
Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Sep 08, 2023 at 11:39:47PM -0700, Ankur Arora wrote:
>
>> Yeah, exactly. Though, not even a single context level, but a flag
>> attached to a single context at the process level only. Using
>> preempt_count() == 0 as the preemption boundary.
>>
>> However, this has a problem with the PREEMPT_COUNT=n case because that
>> doesn't have a preemption boundary.
>
> So, with a little sleep, the nested exception/interrupt case should be
> good: irqentry_enter() / irqentry_nmi_enter() unconditionally increment
> preempt_count with HARDIRQ_OFFSET / NMI_OFFSET.
>
> So while regular preempt_{dis,en}able() will turn into a NOP, the entry
> code *will* continue to increment preempt_count.

Right, I was talking about the regular preempt_disable()/_enable() that
will turn into a NOP with PREEMPT_COUNT=n.

Actually, let me reply to the mail where you had described this case.

--
ankur